
SORTING AND HASHING

Sorting by Selection - Sorting by Insertion - Sorting by Exchange - Sorting by Diminishing Increment - Heap Sort - Heaps - Maintaining the Heap Property - Building a Heap - Heap Sort Algorithm - Quick Sort - Description - Performance of Quick Sort - Analysis of Quick Sort.
Hashing - General Idea - Hash Functions - Separate Chaining - Open Addressing - Rehashing - Extendible Hashing

SORTING

Practically, all data processing activities require data to be in some order.


Ordering or sorting data in an increasing or decreasing fashion according to some linear
relationship among data items is of fundamental importance.

Sorting: Sorting is an operation of arranging data in some given order, such as increasing or decreasing with numerical data, or alphabetically with character data.

Let A be a list of n elements A1, A2, ..., An in memory. Sorting A refers to the operation of rearranging the contents of A so that they are increasing in order (numerically or lexicographically), that is,

A1 ≤ A2 ≤ A3 ≤ ... ≤ An

Sorting methods can be characterized into two broad categories:


Internal Sorting
External Sorting
Internal Sorting : Internal sorting methods are the methods that can be used when the
list to be sorted is small enough so that the entire sort can be carried out in main
memory.

The key principle of internal sorting is that all the data items to be sorted are
retained in the main memory and random access in this memory space can be effectively
used to sort the data items.

The various internal sorting methods are:


1. Bubble Sort
2. Selection Sort
3. Insertion Sort
4. Quick Sort
5. Merge Sort
6. Heap Sort

External Sorting : External sorting methods are the methods to be used when the list to
be sorted is large and cannot be accommodated entirely in the main memory. In this
case some of the data is present in the main memory and some is kept in auxiliary
memory such as hard disk, floppy disk, tape, etc.
The key principle of external sorting is to move data from secondary storage to
main memory in large blocks for ordering the data.

Criteria for the selection of a sorting method.

The important criteria for the selection of a sorting method for the given set of data items
are as follows:

1. Programming time of the sorting algorithm
2. Execution time of the program
3. Memory space needed for the programming environment

Objectives involved in design of sorting algorithms.

The main objectives involved in the design of sorting algorithm are:

1. Minimum number of exchanges
2. Movement of data in large blocks

This implies that the desired sorting algorithm must employ a minimum number of exchanges and should move data in large blocks, which in turn increases the efficiency of the sorting algorithm.

INTERNAL SORTING

Internal Sorting: These are the methods which are applied on the list of data items,
which are small enough to be accommodated in the main memory.

There are different types of internal sorting methods. The methods discussed here
sort the data in ascending order. With a minor change we can sort the data in descending
order.

SELECTION SORT

One of the easiest ways to sort a list is by selection. Beginning with the first
element in the list, a search is performed to locate the smallest element. When this
element is found, it is interchanged with the first element in the list. This interchange
places the smallest element first. A search for the second smallest key is then carried out.
This is accomplished by examining the list from second element onwards. The second
smallest element is interchanged with the element present in the second position of the
list. This process of searching for the smallest element and placing it in proper position
continues until all the elements are sorted in ascending order.

Principle: The Selection sort algorithm searches for the smallest element in the list and
places it in the first position. Then it searches for the second smallest element and places
that in the second position. This is repeated until all the elements are sorted.
Algorithm:

Procedure SELECTIONSORT(A, N)
// A is the array containing the list of data items
// N is the number of data items in the list

Last ← N − 1
Repeat For Pass = 0 to Last − 1 Step 1
    Min ← Pass
    Repeat For I = Pass + 1 to Last Step 1
        If A[I] < A[Min]
        Then
            Min ← I
        End If
    End Repeat
    If Min ≠ Pass
    Then
        A[Pass] ↔ A[Min]
    End If
End Repeat
End SELECTIONSORT
In Selection sort algorithm, Last is made to point to the last element and pass to
the first element. In every pass the min is made to point to where pass is pointing.
Therefore initially in every pass A[pass]=A[min]. Now A[min] is compared with rest of
the elements and if any of the element from the rest of the list is found lesser than
A[min], then min is made to point to that element. This process is continued till the last
element after which the min points to the smallest element. If min is not equal to pass
then A[min] and A[pass] are swapped. Now the smallest element comes to the first
position. Now pass is incremented and the same process is repeated to get the smallest
number which is placed in the second position. This is repeated until pass is moved to
the last element. Finally, a sorted list of elements is obtained.

Example:
N = 10 Number of elements in the list
L Last
P Pass
M Min
i = 0 i =1 i = 2 i = 3 i = 4 i = 5 i=6 i=7 i=8 i=9

M=3
42 23 74 11 65 58 94 36 99 87
P=0 P ≠ M swap A[P] and A[M] L=9

M=1
11 23 74 42 65 58 94 36 99 87
P=1 P = M No change L=9

M=7
11 23 74 42 65 58 94 36 99 87
P=2 P ≠ M swap A[P] and A[M] L=9

M=3
11 23 36 42 65 58 94 74 99 87
P=3 P = M No change L=9

M=5
11 23 36 42 65 58 94 74 99 87
P=4 P ≠ M swap A[P] and A[M] L=9

M=5
11 23 36 42 58 65 94 74 99 87
P=5 P = M No change L=9
M=7
11 23 36 42 58 65 94 74 99 87
P ≠ M swap A[P] and A[M] P=6 L=9
M=9
11 23 36 42 58 65 74 94 99 87
P ≠ M swap A[P] and A[M] P=7 L=9

M=9
11 23 36 42 58 65 74 87 99 94
P ≠ M swap A[P] and A[M] P=8 L=9

Sorted List:
11 23 36 42 58 65 74 87 94 99

Program:

void array::sort()
{
int temp, last=count-1, min;
for (int pass=0; pass<last;pass++)
{
min=pass;
for (int i=pass+1; i<=last;i++)
{
if (a[i]<a[min])
min=i;
}
if (min!=pass)
{
temp=a[min];
a[min]=a[pass];
a[pass]=temp;
}
}
}
In the above program, last is a variable to point to the last element and min is a
variable to point to the minimum number in the list. In each pass, the min is assigned
pass. Now a[min] is compared with all the other elements in the list and if any other
number is found to be minimum then min is made to point to that element. After every
pass it is checked if min is equal to pass. If not, then a[pass] is swapped with a[min] and
the next pass is continued. Finally a sorted list of elements is obtained.

Advantages:

1. It is simple and straightforward. The algorithm just selects the smallest element every time and places it in the correct position.
2. Reduces the number of exchanges and is hence efficient.
3. Faster than the Bubble sort algorithm.

Disadvantages:

1. Requires an extra variable to keep track of the minimum number in every pass.

INSERTION SORT

The main idea behind the insertion sort is to insert the ith element in its correct place in the ith pass. Suppose an array A with n elements A[1], A[2], ..., A[N] is in memory. The insertion sort algorithm scans A from A[1] to A[N], inserting each element A[K] into its proper position in the previously sorted subarray A[1], A[2], ..., A[K-1].

Principle: In the Insertion sort algorithm, each element A[K] in the list is compared with all the elements before it (A[1] to A[K-1]). If any element A[I] is found to be greater than A[K], then A[K] is inserted in the place of A[I]. This process is repeated till all the elements are sorted.

Algorithm:

Procedure INSERTIONSORT(A, N)
// A is the array containing the list of data items
// N is the number of data items in the list

Last ← N − 1
Repeat For Pass = 1 to Last Step 1
    Repeat For I = 0 to Pass − 1 Step 1
        If A[Pass] < A[I]
        Then
            Temp ← A[Pass]
            Repeat For J = Pass − 1 to I Step −1
                A[J + 1] ← A[J]
            End Repeat
            A[I] ← Temp
        End If
    End Repeat
End Repeat
End INSERTIONSORT

In the Insertion sort algorithm, Last is made to point to the last element in the list and Pass is made to point to the second element in the list. In every pass, Pass is incremented to point to the next element, and this continues till it reaches the last element. During each pass, A[Pass] is compared with all the elements before it. If A[Pass] is less than some A[I] in the list, then A[Pass] is inserted in position I. Finally, a sorted list is obtained.
For performing the insertion operation, a variable Temp is used to safely store A[Pass] in it, and then the elements from A[I] to A[Pass-1] are shifted one position right.

Example:

N = 10 Number of elements in the list


L Last
P Pass

i=0 i =1 i=2 i=3 i=4 i=5 i=6 i=7 i=8 i=9

42 23 74 11 65 58 94 36 99 87
P=1 A[P] < A[0] Insert A[P] at 0 L=9

23 42 74 11 65 58 94 36 99 87
P=2 L=9
A[P] is greater than all elements before it. Hence No Change

23 42 74 11 65 58 94 36 99 87
P=3 A[P] < A[0] Insert A[P] at 0 L=9

11 23 42 74 65 58 94 36 99 87
P=4 L=9
A[P] < A[3] Insert A[P] at 3

11 23 42 65 74 58 94 36 99 87
P=5 L=9
A[P] < A[3] Insert A[P] at 3

11 23 42 58 65 74 94 36 99 87
P=6 L=9
A[P] is greater than all elements before it. Hence No Change

11 23 42 58 65 74 94 36 99 87
P=7 L=9
A[P] < A[2] Insert A[P] at 2

11 23 36 42 58 65 74 94 99 87
P=8 L=9
A[P] is greater than all elements before it. Hence No Change

11 23 36 42 58 65 74 94 99 87
P=9 L=9
A[P] < A[7] Insert A[P] at 7

Sorted List:
11 23 36 42 58 65 74 87 94 99

Program:
void array::sort()
{
int temp, last=count-1;
for (int pass=1; pass<=last;pass++)
{
for (int i=0; i<pass; i++)
{
if (a[pass]<a[i])
{
temp=a[pass];
for (int j=pass-1;j>=i;j--)
a[j+1]=a[j];
a[i]=temp;
}
}
}
}
In the sort function, the integer variable last is used to point to the last element in
the list. The first pass starts with the variable pass pointing to the second element and
continues till pass reaches the last element. In each pass, a[pass] is compared with all the
elements before it and if a[pass] is lesser than a[i], then it is inserted in position i. Before
inserting it, the elements a[i] to a[pass-1] are shifted right using a temporary variable.

Advantages:
1. Sorts the list fast when the list has a small number of elements.
2. Efficient in cases where a new element has to be inserted into a sorted list.
Disadvantages:
1. Very slow for large values of n.
2. Poor performance if the list is almost in reverse order.
EXCHANGE SORT / BUBBLE SORT

This is the most commonly used sorting method. Exchange sorts attempt to
improve the ordering by comparing elements in pairs and interchanging them if they are
not in sorted order. This operation is repeated until the table is sorted. Algorithms differ in
how they systematically choose the two elements to be compared. The following method
compares adjacent elements and is known as Bubble Sort. The bubble sort derives its
name from the fact that the smallest data item bubbles up to the top of the sorted array.

Principle: The bubble sort method compares the two adjacent elements starting from the
start of the list and swaps the two if they are out of order. This is continued up to the last
element in the list and after each pass, a check is made to determine whether any
interchanges were made during the pass. If no interchanges occurred, then the list must
be sorted and no further passes are required.

Algorithm:

Procedure BUBBLESORT(A, N)
// A is the array containing the list of data items
// N is the number of data items in the list

Last ← N − 1

While Last > 0
    Exch ← 0
    Repeat For I = 0 to Last − 1 Step 1
        If A[I] > A[I+1]
        Then
            A[I] ↔ A[I+1]
            Exch ← 1
        End If
    End Repeat
    If Exch = 0
    Then
        Exit Loop
    Else
        Last ← Last − 1
    End If
End While

End BUBBLESORT

In the Bubble sort algorithm, initially Last is made to point to the last element of the list, and the Exch flag is set to 0 at the start of each pass. Starting from the first element, the adjacent elements in the list are compared. If they are found to be out of order then they are swapped immediately and the Exch flag is set to 1. This comparison is continued until the last two elements are compared. After this pass, the Exch flag is checked to determine whether any exchange has taken place. If no exchange has taken place then the control comes out of the loop and the procedure comes to an end, as the list is sorted. If any exchange has taken place during the pass, the Last pointer is decremented by 1 and the next pass is continued. This process continues until the list is sorted.

Example:
N = 10 Number of elements in the list
L Points to last element ( Last )

Pass 1

i=0 i =1 i=2 i=3 i=4 i=5 i=6 i=7 i=8 i=9

42 23 74 11 65 58 94 36 99 87
Out of order Swap L=9

23 42 74 11 65 58 94 36 99 87
Out of order Swap L=9

23 42 11 74 65 58 94 36 99 87
Out of order Swap L=9

23 42 11 65 74 58 94 36 99 87
Out of order Swap L=9

23 42 11 65 58 74 94 36 99 87
Out of order Swap L=9

23 42 11 65 58 74 36 94 99 87
Out of order Swap L=9

Pass 2

23 42 11 65 58 74 36 94 87 99
Out of order Swap L=8

23 11 42 65 58 74 36 94 87 99
Out of order Swap L=8

23 11 42 58 65 74 36 94 87 99
Out of order Swap L=8
23 11 42 58 65 36 74 94 87 99
Out of order Swap L=8

Pass 3

23 11 42 58 65 36 74 87 94 99
Out of order Swap L=7

23 11 42 58 65 36 74 87 94 99
Out of order Swap L=7

Pass 4

23 11 42 58 36 65 74 87 94 99
Out of order Swap L=6

11 23 42 58 36 65 74 87 94 99
Out of order Swap L=6

Pass 5

11 23 42 36 58 65 74 87 94 99
Out of order Swap L=5

Pass 6

Adjacent numbers are compared up to L=4, but no swapping takes place. As no swapping took place in this pass, the procedure comes to an end and we get a sorted list:

11 23 36 42 58 65 74 87 94 99

Program:

// Bubble sort

#include <iostream.h>
#include <conio.h>
const int MAX=20;

class array
{
private:
int a[MAX];
int count;
public:
array();
void add(int n);
void sort();
void display();
};
array::array()
{
count=0;
for (int i=0; i<MAX; i++)
a[i]=0;
}

void array::add(int n)
{
if (count<MAX)
{
a[count]=n;
count++;
}
else
cout<<"\nArray is full";
}
void array::sort()
{
int temp, exch,last=count-1;
while (last>0)
{
exch=0;
for (int i=0; i<last;i++)
{
if (a[i]>a[i+1])
{
temp=a[i];
a[i]=a[i+1];
a[i+1]=temp;
exch=1;
}
}
if (exch==0)
return;
else
last--;
}
}

void array::display()
{
for (int i=0;i<count;i++)
cout<<a[i]<<"\t";
}

void main()
{
array list;
int v;
clrscr();

for (int i=0; i<10; i++)


{
cout<<"Enter number:;
cin>>v;
list.add(v);
}
cout<<"\nArray before sorting:\n";
list.display();
list.sort();
cout<<"\nArray after sorting:\n";
list.display();
getch();
}

A class array is declared with integer data a[MAX] and count. a[MAX] is the
array containing the list of data items, MAX is an integer constant giving the maximum
limit of the array and count is a variable to keep track of how many data items are stored
in the array by the user.

The member functions of the class array are add( ), sort( ), display( ) and the
constructor array( ). The constructor takes care of the initialization of the array elements
and the count variable. The add( ) function gets a data item from the user and adds it to the existing array of elements. Each time the user enters a data item the count variable
is incremented once to keep track of how many elements the user is entering into the list.
Before adding the data to the array, it is checked whether count is less than MAX, so that
the array limit is not exceeded.

The sort( ) function sorts the data items in the array in the ascending order. A
variable last is declared to point to the last element of the list in every pass. The first two
adjacent elements are compared and if they are found out of order they are swapped else
next two adjacent elements are compared. This process repeats till the last two elements
in the list. A variable exch is initially set to 0 in each pass and is used as a flag to determine whether an exchange has taken place. If an exchange is done then the exch flag is assigned 1. After every pass the exch flag is checked. If it is 0, then control returns back to the main function; else the last pointer is decremented once and the next pass is continued. In each pass the greatest element in the list is pushed to the last position and the smaller elements bubble up or move up.
A new function display( ) is used to display all elements in the array. In the main(
) function, an object list of class array is declared. The add( ) function is called in the
loop and 10 numbers are got from the user and stored in the array. The list with unsorted
elements is displayed first by calling the display( ) function and then the sorting is done
by calling the sort( ) function. Again the list with sorted elements is displayed to the user
by calling the display( ) function.

Advantages:
1. Simple, and works well for lists with a small number of elements.

Disadvantages:
1. Inefficient when the list has a large number of elements.
2. Requires more exchanges per pass than selection sort.

DIMINISHING INCREMENT SORT

Shellsort, named after its inventor, Donald Shell, was one of the first algorithms to break
the quadratic time barrier, although it was not until several years after its initial discovery
that a subquadratic time bound was proven. As suggested in the previous section, it works
by comparing elements that are distant; the distance between comparisons decreases as
the algorithm runs until the last phase, in which adjacent elements are compared. For this
reason, Shellsort is sometimes referred to as diminishing increment sort.
Shellsort uses a sequence, h1, h2, . . . , ht, called the increment sequence. Any increment sequence will do as long as h1 = 1, but some choices are better than others. After a phase, using some increment hk, for every i we have a[i] ≤ a[i + hk] (where this makes sense); all elements spaced hk apart are sorted. The file is then said to be hk-sorted. For example, Figure shows an array after several phases of Shellsort. An important property of Shellsort (which we state without proof) is that an hk-sorted file that is then hk−1-sorted remains hk-sorted. If this were not the case, the algorithm would likely be of little value, since work done by early phases would be undone by later phases.
The general strategy to hk-sort is, for each position i in hk, hk + 1, . . . , N − 1, place the element in the correct spot among i, i − hk, i − 2hk, and so on. Although this does not affect the implementation, a careful examination shows that the action of an hk-sort is to perform an insertion sort on hk independent subarrays. This observation will be important when we analyze the running time of Shellsort.

ALGORITHM FOR SHELL SORT


A popular (but poor) choice for the increment sequence is to use the sequence suggested by Shell: ht = ⌊N/2⌋, and hk = ⌊hk+1/2⌋. A function that implements Shellsort using this sequence is sketched below. We shall see later that there are increment sequences that give a significant improvement in the algorithm's running time; even a minor change can drastically affect performance. The sketch avoids the explicit use of swaps in the same manner as our implementation of insertion sort.
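Since the figure itself is not reproduced here, the following is a minimal sketch of Shellsort using Shell's increments, in the same plain style as the earlier programs; the function name and parameters are ours, not from the original text. Like the insertion sort program above, it shifts elements rather than swapping them.

void shellsort(int a[], int n)
{
    // Shell's increment sequence: n/2, n/4, ..., 1
    for (int gap = n / 2; gap > 0; gap /= 2)
    {
        // insertion sort on the elements that lie gap positions apart
        for (int i = gap; i < n; i++)
        {
            int tmp = a[i];
            int j = i;
            // shift larger elements right instead of swapping
            for (; j >= gap && tmp < a[j - gap]; j -= gap)
                a[j] = a[j - gap];
            a[j] = tmp;
        }
    }
}

When gap finally becomes 1, the last phase is an ordinary insertion sort, but by then the data is almost in order, which is exactly the situation in which insertion sort is fast.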

QUICK SORT

Quick sort is a very popular sorting method. The name comes from the fact that,
in general, quick sort can sort a list of data elements significantly faster than any of the
common sorting algorithms. This algorithm is based on the fact that it is faster and easier
to sort two small lists than one larger one. The basic strategy of quick sort is to divide
and conquer. Quick sort is also known as partition exchange sort.

The purpose of the quick sort is to move a data item in the correct direction just
enough for it to reach its final place in the array. The method, therefore, reduces
unnecessary swaps, and moves an item a great distance in one move.

Principle: A pivotal item near the middle of the list is chosen, and then items on either
side are moved so that the data items on one side of the pivot element are smaller than
the pivot element, whereas those on the other side are larger. The middle or the pivot
element is now in its correct position. This procedure is then applied recursively to the 2
parts of the list, on either side of the pivot element, until the whole list is sorted.

Algorithm:
Procedure QUICKSORT(A, Lower, Upper)
// A is the array containing the list of data items
// Lower is the lower bound of the array
// Upper is the upper bound of the array

If Lower ≥ Upper
Then
    Return
End If

I ← Lower + 1
J ← Upper

While I ≤ J
    While I ≤ Upper and A[I] ≤ A[Lower]
        I ← I + 1
    End While
    While A[J] > A[Lower]
        J ← J − 1
    End While
    If I < J
    Then
        A[I] ↔ A[J]
    End If
End While

A[J] ↔ A[Lower]

QUICKSORT(A, Lower, J − 1)
QUICKSORT(A, J + 1, Upper)

End QUICKSORT

In the Quick sort algorithm, Lower points to the first element in the list and Upper points to the last element in the list. Now I is made to point to the location after Lower and J is made to point to Upper. A[Lower] is considered as the pivot element, and at the end of the pass the correct position of the pivot element is fixed. Keep incrementing I and stop when A[I] > A[Lower]. When I stops, start decrementing J and stop when A[J] ≤ A[Lower]. Now check if I < J. If so, swap A[I] and A[J] and continue moving I and J in the same way. When I crosses J the control comes out of the loop and A[J] and A[Lower] are swapped. Now the element at position J is at its correct position, and hence the list is split into two partitions: (A[Lower] to A[J-1] and A[J+1] to A[Upper]). Apply the Quick sort algorithm recursively on these individual lists. Finally, a sorted list is obtained.

Example:
N = 10 Number of elements in the list
U Upper
L Lower

i=0 i =1 i=2 i=3 i=4 i=5 i=6 i=7 i=8 i=9

42 23 74 11 65 58 94 36 99 87
L=0 I=1 U, J=9
Initially I=L+1 and J=U; A[L]=42 is the pivot element.

42 23 74 11 65 58 94 36 99 87
L=0 I=2 J=7 U=9
A[2] > A[L] hence I stops at 2. A[7] < A[L] hence J stops at 7
I < J Swap A[I] and A[J]

42 23 36 11 65 58 94 74 99 87
L=0 J=3 I=4 U=9
A[4] > A[L] hence I stops at 4. A[3] < A[L] hence J stops at 3
I > J Swap A[J] and A[L]. Thus 42 goes to its correct position.

The list is partitioned into two lists as shown. The same process is applied to these lists
individually as shown.

List 1 List 2
11 23 36 42 65 58 94 74 99 87
L, J=0 I=1 U=2
(applying quicksort to list 1)

11 23 36 42 65 58 94 74 99 87
L, J=1 U, I=2

11 23 36 42 65 58 94 74 99 87
L=4 J=5 I=6 U=9
(applying quicksort to list 2)

11 23 36 42 58 65 94 74 99 87
L=6 I=8 U, J=9

11 23 36 42 58 65 94 74 87 99
L=6 J=8 U, I=9
11 23 36 42 58 65 87 74 94 99
L=6 U, I, J=7
Sorted List:

11 23 36 42 58 65 74 87 94 99

Program:

void array::sort(int lower, int upper)
{
    if (lower>=upper)
        return;
    int temp, i=lower+1, j=upper;
    while (i<=j)
    {
        // move i right while elements are not greater than the pivot a[lower];
        // the i<=upper guard keeps i inside this partition
        while (i<=upper && a[i]<=a[lower]) i++;
        // move j left while elements are greater than the pivot
        while (a[j]>a[lower]) j--;
        if (i<j)
        {
            temp=a[i];
            a[i]=a[j];
            a[j]=temp;
        }
    }
    temp=a[lower];      // place the pivot at its final position j
    a[lower]=a[j];
    a[j]=temp;
    sort(lower, j-1);
    sort(j+1, upper);
}

In the above program the function sort( ) is invoked by passing the lower bound and the upper bound values of the array. This is a recursive function, and hence it is checked at the start whether lower is greater than or equal to upper. If so, the function terminates. This check comes first to stop the recursion. Initially i is assigned lower+1 and j is assigned upper. i is incremented until a number greater than the pivot element (a[lower]) is found; j is decremented until a number not greater than the pivot element is found. When both i and j stop, it is checked whether i < j. If i < j then a[i] and a[j] are swapped and the process is continued. When i and j cross each other, the loop exits and a[lower] is swapped with a[j]. The position of the element at j is fixed, and hence the list is partitioned into two (a[lower] to a[j-1] and a[j+1] to a[upper]). The quick sort algorithm is applied recursively on these two lists. Finally, a sorted list is obtained.

Advantages:
1. In general, faster than the other commonly used sorting algorithms.
2. It has the best average-case behavior.

Disadvantages:
1. As it uses recursion, stack space consumption is high.

ANALYSIS OF QUICK SORT


Like mergesort, quicksort is recursive; therefore, its analysis requires solving a recurrence formula. We will do the analysis for a quicksort assuming a random pivot (no median-of-three partitioning) and no cutoff for small arrays. We will take T(0) = T(1) = 1, as in mergesort. The running time of quicksort is equal to the running time of the two recursive calls plus the linear time spent in the partition (the pivot selection takes only constant time).
This gives the basic quicksort relation
T(N) = T(i) + T(N − i − 1) + cN (7.1)
where i = |S1| is the number of elements in S1. We will look at three cases.

Worst-Case Analysis
The pivot is the smallest element, all the time. Then i = 0, and if we ignore T(0) = 1, which is insignificant, the recurrence is
T(N) = T(N − 1) + cN, N > 1 (7.2)
We telescope, using Equation (7.2) repeatedly. Thus,
T(N − 1) = T(N − 2) + c(N − 1) (7.3)
T(N − 2) = T(N − 3) + c(N − 2) (7.4)
...
T(2) = T(1) + c(2) (7.5)
Adding up all these equations yields
T(N) = T(1) + c(2 + 3 + ... + N) = O(N²)
as claimed earlier. To see that this is the worst possible case, note that the total cost of all the partitions in recursive calls at depth d must be at most N. Since the recursion depth is at most N, this gives an O(N²) worst-case bound for quicksort.
HEAP SORT

Heap: A heap is a complete binary tree with the property that the value at each node is at least as large as (or as small as) the values at its children (if they exist). If the value at the parent node is larger than the values of its children then it is called a max heap, and if the value at the parent node is smaller than the values of its children then it is called a min heap.

If a given node is in position I, then the positions of the left child and the right child can be calculated using Left (L) = 2I and Right (R) = 2I + 1. For example, the node at position 3 has its left child at position 6 and its right child at position 7.
To check whether the right child exists or not, use the condition R ≤ N. If true, the right child exists, otherwise not.
The last non-leaf node of the tree is at position ⌊N/2⌋; after this position the tree has only leaves. (These formulas assume that positions are numbered from 1.)

Principle: The Max heap has the greatest element in the root. Hence the element in the
root node is pushed to the last position in the array and the remaining elements are
converted into a max heap. The root node of this new max heap will be the second
largest element and hence pushed to the last but one position in the array. This process is
repeated till all the elements get sorted.

Algorithm:

Procedure WALKDOWN(A, I, N)
// A is the list of unsorted elements
// N is the number of elements in the array
// I is the position of the node where the walkdown procedure is to be applied

While I ≤ N/2
    L ← 2I, R ← 2I + 1
    If A[L] > A[I]
    Then
        M ← L
    Else
        M ← I
    End If
    If R ≤ N and A[R] > A[M]
    Then
        M ← R
    End If
    If M ≠ I
    Then
        A[I] ↔ A[M]
        I ← M
    Else
        Return
    End If
End While

End WALKDOWN

Procedure HEAPSORT(A, N)
// A is the list of unsorted elements
// N is the number of elements in the array

Repeat For I = N/2 to 2 Step −1
    WALKDOWN(A, I, N)
End Repeat
Repeat For J = N to 2 Step −1
    WALKDOWN(A, 1, J)
    A[1] ↔ A[J]
End Repeat
End HEAPSORT

The WALKDOWN procedure is used to convert a subtree into a heap. If this algorithm is applied on a node of the tree, then the subtree rooted at that node is converted into a max heap. In the WALKDOWN algorithm given above, the element at the given node is compared with its left and right children and swapped with the larger of the two. During this process the element at the given node may walk down to its correct position in the subtree. The procedure stops when the element reaches a leaf or reaches its correct position.

In the HEAPSORT algorithm there are two phases. In the first phase the
walkdown procedure is applied on each node starting from the last node at N/2 to node at
position 2. The root node is not disturbed. During this first phase, the subtrees below the
root satisfy the max heap property.
In the second phase of the sorting algorithm, the walkdown procedure is applied
on the root node. After this pass the entire tree becomes a heap. The root node element
and the last element are swapped and the last element is now not considered for the next
pass. Thus the tree size reduces by one in the next pass. This process is repeated till we
obtain a sorted list.

Example:

Given a list A with 8 elements:

42 23 74 11 65 58 94 36
The given list is first represented as a complete binary tree. (The figures showing the binary tree, the tree rearranged by Phase 1, and the passes of Phase 2 are not reproduced here.)
Program:

void array::sort()
{
    int size=count, i=count/2-1, temp;
    // Phase 1: apply walkdown to every internal node except the root.
    // The array is 0-based, so the children of node i are at 2i+1 and 2i+2,
    // and the last internal node is at count/2 - 1.
    while (i>0)
    {
        walkdown(i,size);
        i--;
    }
    // Phase 2: heapify from the root, then move the largest element to the end
    while (size>1)
    {
        walkdown(0,size);
        temp=a[0];
        a[0]=a[size-1];
        a[size-1]=temp;
        size--;
    }
}

void array::walkdown(int i, int size)
{
    int l, r, temp, largest;
    while (2*i+1 < size)        // loop while node i has at least a left child
    {
        l=2*i+1;
        r=2*i+2;
        if (a[l]>a[i])
            largest=l;
        else
            largest=i;
        if (r<size && a[r]>a[largest])
            largest=r;
        if (largest!=i)
        {
            temp=a[i];          // walk the element down by swapping it
            a[i]=a[largest];    // with its larger child
            a[largest]=temp;
            i=largest;
        }
        else
            return;
    }
}
HASHING
The implementation of hash tables is frequently called hashing. Hashing is a technique used for performing insertions, deletions, and finds in constant average time. Operations that require any ordering information among the elements are not supported efficiently. Thus, operations such as findMin, findMax, and the printing of the entire table in sorted order in linear time are not supported.

The central data structure in this chapter is the hash table. We will:
1. See several methods of implementing the hash table.
2. Compare these methods analytically.
3. Show numerous applications of hashing.
4. Compare hash tables with binary search trees.

General Idea
The ideal hash table data structure is merely an array of some fixed size containing the
items. Generally a search is performed on some part (that is, data member) of the item.
This is called the key. For instance, an item could consist of a string (that serves as the
key) and additional data members (for instance, a name that is part of a large employee
structure). We will refer to the table size as TableSize, with the understanding that this is
part of a hash data structure and not merely some variable floating around globally. The
common convention is to have the table run from 0 to TableSize − 1; we will see why shortly. Each key is mapped into some number in the range 0 to TableSize − 1 and placed in the appropriate cell. The mapping is called a hash function,
which ideally should be simple to compute and should ensure that any two distinct keys
get different cells. Since there are a finite number of cells and a virtually inexhaustible
supply of keys, this is clearly impossible, and thus we seek a hash function that
distributes the keys evenly among the cells. Figure is typical of a perfect situation. In this
example, john hashes to 3, phil hashes to 4, dave hashes to 6, and mary hashes to 7.
Hash Function
If the input keys are integers, then simply returning Key mod TableSize is generally a
reasonable strategy, unless Key happens to have some undesirable properties. In this case,
the choice of hash function needs to be carefully considered. For instance, if the table size
is 10 and the keys all end in zero, then the standard hash function is a bad choice. For
reasons we shall see later, and to avoid situations like the one above, it is often a good
idea to ensure that the table size is prime. When the input keys are random integers, then
this function is not only very simple to compute but also distributes the keys evenly.
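As a minimal illustration of this strategy (the function name is ours, not from the text):

int hashInt(int key, int tableSize)
{
    // simply Key mod TableSize; assumes a non-negative key
    return key % tableSize;
}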

Usually, the keys are strings; in this case, the hash function needs to be chosen
carefully. One option is to add up the ASCII values of the characters in the string. The
routine in Figure implements this strategy.
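Since the routine itself is not reproduced here, the following is a sketch of that strategy; the name hashAscii is illustrative.

#include <string>

unsigned int hashAscii(const std::string & key, int tableSize)
{
    unsigned int hashVal = 0;
    for (char ch : key)              // add up the ASCII values
        hashVal += ch;
    return hashVal % tableSize;      // bring the sum into 0 .. tableSize - 1
}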
This hash function is simple to implement and computes an answer quickly. However, if the table size is large, the function does not distribute the keys well. For instance, suppose that TableSize = 10,007 (10,007 is a prime
number). Suppose all the keys are eight or fewer characters long. Since an ASCII
character has an integer value that is always at most 127, the hash function typically can
only assume values between 0 and 1,016, which is 127 × 8. This is clearly not an
equitable distribution!
Another hash function is shown in Figure. This hash function assumes that Key
has at least three characters. The value 27 represents the number of letters in the English
alphabet, plus the blank, and 729 is 27². This function examines only the first three
characters, but if these are random and the table size is 10,007, as before, then we would
expect a reasonably equitable distribution. Unfortunately, English is not random.
Although there are 26³ = 17,576 possible combinations of three characters (ignoring
blanks), a check of a reasonably large online dictionary reveals that the number of
different combinations is actually only 2,851. Even if none of these combinations collide,
only 28 percent of the table can actually be hashed to. Thus this function, although easily
computable, is also not appropriate if the hash table is reasonably large.
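A sketch of this second attempt, assuming (as stated) that the key has at least three characters; the name is ours:

#include <string>

unsigned int hashThree(const std::string & key, int tableSize)
{
    // key[0] + 27 * key[1] + 729 * key[2]: only the first three characters matter
    return (key[0] + 27 * key[1] + 729 * key[2]) % tableSize;
}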
A third attempt at a hash function involves all characters in the key and can generally be expected to distribute well: it computes

hash(Key) = Σ (i = 0 to KeySize − 1) Key[KeySize − i − 1] · 37^i

and brings the result into the proper range. The code computes a polynomial function of 37 by use of Horner's rule. For instance, another way of computing hk = k0 + 37 k1 + 37² k2 is by the formula hk = ((k2 · 37) + k1) · 37 + k0. Horner's rule extends this to an nth-degree polynomial.
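A sketch of this third hash function, matching the formula above; the name is ours:

#include <string>

unsigned int hashHorner(const std::string & key, int tableSize)
{
    unsigned int hashVal = 0;          // unsigned, so overflow simply wraps around
    for (char ch : key)
        hashVal = 37 * hashVal + ch;   // Horner's rule for the polynomial in 37
    return hashVal % tableSize;
}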

The hash function takes advantage of the fact that overflow is allowed and uses unsigned
int to avoid introducing a negative number. The hash function described here is not
necessarily the best with respect to table distribution, but it does have the merit of
extreme simplicity and is reasonably fast. If the keys are very long, the hash function will
take too long to compute. A common practice in this case is not to use all the characters.
The length and properties of the keys would then influence the choice. For instance, the
keys could be a complete street address. The hash function might include a couple of
characters from the street address and perhaps a couple of characters from the city name
and ZIP code. Some programmers implement their hash function by using only the
characters in the odd spaces, with the idea that the time saved computing the hash
function will make up for a slightly less evenly distributed function.
The main programming detail left is collision resolution. If, when an element is
inserted, it hashes to the same value as an already inserted element, then we have a
collision and need to resolve it. There are several methods for dealing with this. We will
discuss two of the simplest: separate chaining and open addressing; then we will look at
some more recently discovered alternatives.

Separate Chaining

The first strategy, commonly known as separate chaining, is to keep a list of all
elements that hash to the same value. We can use the Standard Library list
implementation. If space is tight, it might be preferable to avoid their use (since these
lists are doubly linked and waste space). We assume for this section that the keys are the
first 10 perfect squares and that the hashing function is simply hash(x) = x mod 10. (The
table size is not prime but is used here for simplicity.) Following figure shows the
resulting separate chaining hash table.
To perform a search, we use the hash function to determine which list to traverse.
We then search the appropriate list. To perform an insert, we check the appropriate list to
see whether the element is already in place (if duplicates are expected, an extra data
member is usually kept, and this data member would be incremented in the event of a
match). If the element turns out to be new, it can be inserted at the front of the list, since
it is convenient and also because frequently it happens that recently inserted elements are
the most likely to be accessed in the near future.
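Since the book's class interface figure is not reproduced here, the following is a minimal sketch of a separate chaining table for non-negative integer keys, using the Standard Library list as suggested above; the class and member names are ours, and hash(x) = x mod TableSize as in the example.

#include <list>
#include <vector>
#include <algorithm>

class ChainedHashTable
{
  public:
    explicit ChainedHashTable(int size) : theLists(size) { }

    bool contains(int x) const
    {
        // hash to the right list, then search that list only
        const std::list<int> & whichList = theLists[x % theLists.size()];
        return std::find(whichList.begin(), whichList.end(), x) != whichList.end();
    }

    void insert(int x)
    {
        std::list<int> & whichList = theLists[x % theLists.size()];
        if (std::find(whichList.begin(), whichList.end(), x) == whichList.end())
            whichList.push_front(x);   // new elements go at the front of the list
    }

  private:
    std::vector<std::list<int>> theLists;   // one linked list per table cell
};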
The class interface for a separate chaining implementation is shown in Figure 5.6.
The hash table stores an array of linked lists, which are allocated in the constructor. The
class interface illustrates a syntax point: Prior to C++11, in the declaration of the lists, a space was required between the two >s; since >> is also a C++ token (the right-shift operator) and the compiler matches the longest token, >> would otherwise be recognized as a shift rather than as two closing brackets. In C++11, this is no longer the case. Just as the binary search tree works only for objects that are Comparable, the hash tables in this chapter work only for objects that provide a hash function and equality operators (operator== or operator!=, or possibly both).
Instead of requiring hash functions that take both the object and the table size as parameters, we have our hash functions take only the object as the parameter and return an appropriate integral type. The standard mechanism for doing this uses function objects, and the protocol for hash tables was introduced in C++11. Specifically, in C++11, hash functions can be expressed by the function object template shown below.
Rehashing

If the table gets too full, the running time for the operations will start taking too long, and
insertions might fail for open addressing hashing with quadratic resolution. This can
happen if there are too many removals intermixed with insertions. A solution, then, is to
build another table that is about twice as big (with an associated new hash function) and
scan down the entire original hash table, computing the new hash value for each
(nondeleted) element and inserting it in the new table.

As an example, suppose the elements 13, 15, 24, and 6 are inserted into a linear probing
hash table of size 7. The hash function is h(x) = x mod 7. The resulting hash table appears
in Figure 5.19.

If 23 is inserted into the table, the resulting table in Figure 5.20 will be over 70 percent
full. Because the table is so full, a new table is created. The size of this table is 17,
because this is the first prime that is twice as large as the old table size. The new hash
function is then h(x) = x mod 17. The old table is scanned, and elements 6, 15, 23, 24, and
13 are inserted into the new table. The resulting table appears in Figure 5.21.

This entire operation is called rehashing. This is obviously a very expensive operation;
the running time is O(N), since there are N elements to rehash and the table size is
roughly 2N, but it is actually not all that bad, because it happens very infrequently. In
particular, there must have been N/2 insertions prior to the last rehash, so it essentially
adds a constant cost to each insertion. This is why the new table is made twice as large as
the old table. If this data structure is part of the program, the effect is not noticeable. On
the other hand, if the hashing is performed as part of an interactive system, then the
unfortunate user whose insertion caused a rehash could see a slowdown.

Rehashing can be implemented in several ways with quadratic probing. One


alternative is to rehash as soon as the table is half full. The other extreme is to rehash only
when an insertion fails. A third, middle-of-the-road strategy is to rehash when the table
reaches a certain load factor. Since performance does degrade as the load factor increases,
the third strategy, implemented with a good cutoff, could be best. Rehashing for separate
chaining hash tables is similar.
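As an illustration, a rehash for the separate chaining sketch given earlier might look as follows. nextPrime is an assumed helper that returns the first prime at least as large as its argument; it is not part of any standard library.

// an additional member of the illustrative ChainedHashTable class
void ChainedHashTable::rehash()
{
    std::vector<std::list<int>> oldLists = theLists;

    // create a new table about twice as large, keeping the size prime
    theLists.assign(nextPrime(2 * oldLists.size()), std::list<int>());

    // scan the old table and reinsert every element into the new one
    for (const std::list<int> & chain : oldLists)
        for (int x : chain)
            insert(x);
}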

Extendible Hashing
Our last topic in this chapter deals with the case where the amount of data is too large to
fit in main memory. As we saw in Chapter 4, the main consideration then is the number
of disk accesses required to retrieve data.
As before, we assume that at any point we have N records to store; the value of N
changes over time. Furthermore, at most M records fit in one disk block. We will use M =
4 in this section.
If either probing hashing or separate chaining hashing is used, the major problem is that
collisions could cause several blocks to be examined during a search, even for a well-
distributed hash table. Furthermore, when the table gets too full, an extremely expensive
rehashing step must be performed, which requires O(N) disk accesses.
A clever alternative, known as extendible hashing, allows a search to be performed in
two disk accesses. Insertions also require few disk accesses.
We recall from Chapter 4 that a B-tree has depth O(log_{M/2} N). As M increases, the depth
of a B-tree decreases. We could in theory choose M to be so large that the depth of the B-
tree would be 1. Then any search after the first would take one disk access, since,
presumably, the root node could be stored in main memory. The problem with this
strategy is that the branching factor is so high that it would take considerable processing
to determine which leaf the data was in. If the time to perform this step could be reduced,
then we would have a practical scheme. This is exactly the strategy used by extendible
hashing.

Let us suppose, for the moment, that our data consists of several 6-bit integers.
Figure 5.52 shows an extendible hashing scheme for these data. The root of the tree
contains four pointers determined by the leading two bits of the data. Each leaf has up to
M = 4 elements. It happens that in each leaf the first two bits are identical; this is
indicated by the number in parentheses. To be more formal, D will represent the number of bits used by the root, which is sometimes known as the directory. The number of entries in the directory is thus 2^D. dL is the number of leading bits that all the elements of some leaf L have in common. dL will depend on the particular leaf, and dL ≤ D.
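For the 6-bit keys of this example, the directory lookup is just the leading D bits of the key; a toy sketch (the function name is ours):

int directoryIndex(unsigned int key, int D)
{
    // index into the 2^D directory entries using the leading D bits
    return key >> (6 - D);   // keys are 6 bits wide in this example
}

With D = 2, the key 100100 yields index 10 in binary, that is 2, the third directory entry; this matches the insertion described below.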
Suppose that we want to insert the key 100100. This would go into the third leaf, but as
the third leaf is already full, there is no room. We thus split this leaf into two leaves,
which are now determined by the first three bits. This requires increasing the directory
size to 3. These changes are reflected in Figure 5.53.
Notice that all the leaves not involved in the split are now pointed to by two
adjacent directory entries. Thus, although an entire directory is rewritten, none of the
other leaves is actually accessed.
If the key 000000 is now inserted, then the first leaf is split, generating two leaves
with dL = 3. Since D = 3, the only change required in the directory is the updating of the
000 and 001 pointers. See Figure 5.54.
This very simple strategy provides quick access times for insert and search
operations on large databases. There are a few important details we have not considered.
First, it is possible that several directory splits will be required if the elements in a leaf
agree in more than D + 1 leading bits. For instance, starting at the original example, with
D = 2, if 111010, 111011, and finally 111100 are inserted, the directory size must be
increased to 4 to distinguish between the five keys. This is an easy detail to take care of,
but must not be forgotten. Second, there is the possibility of duplicate keys; if there are
more than M duplicates, then this algorithm does not work at all. In this case, some other
arrangements need to be made.
These possibilities suggest that it is important for the bits to be fairly random. This can be
accomplished by hashing the keys into a reasonably long integer; hence the name. We
close by mentioning some of the performance properties of extendible hashing, which are
derived after a very difficult analysis. These results are based on the reasonable
assumption that the bit patterns are uniformly distributed.
The expected number of leaves is (N/M) log₂ e. Thus the average leaf is ln 2 ≈ 0.69 full.
This is the same as for B-trees, which is not entirely surprising, since for both data
structures new nodes are created when the (M + 1)th entry is added.
The more surprising result concerns the expected size of the directory. If M is very small, then the directory can get unduly large. In this case, we can have the leaves contain
pointers to the records instead of the actual records, thus increasing the value of M. This
adds a second disk access to each search operation in order to maintain a smaller
directory. If the directory is too large to fit in main memory, the second disk access would
be needed anyway.
