COMP211slides 11

Hashing
Data organization in main memory or disk

sequential organization, binary trees, ...
the location of a key depends on other keys => unnecessary key comparisons to find a key
Question: can a key be found with a single comparison?
Hashing: the location of a record is computed using its key only
Fast for random accesses, slow for range queries
E.G.M. Petrakis
Hashing
Hash Table
Hash function h(key): transforms keys to array indices
[Figure: hash table with positions 0, 1, 2, ..., n; h(key) yields the index of the position that holds the data]
h(key) = key mod 1000
position  key (record)
0         4967000
2         8421002
...
396       4618396
397       4957397
399       1286399
...
990       0000990
991       0000991
992       1200992
993       0047993
995       9846995
996       4618996
997       4967997
999       0001999
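A minimal sketch of the mapping h(key) = key mod 1000, written in Python for illustration. The function name `h` follows the slide; the list name `keys` is made up here, and the keys are a few of those shown in the figure, with leading zeros dropped.

```python
def h(key, m=1000):
    """Hash function from the slide: h(key) = key mod m."""
    return key % m

# A few of the keys shown in the figure (leading zeros dropped).
keys = [4967000, 8421002, 4618396, 4957397, 1286399, 990, 1200992, 1999]
positions = {key: h(key) for key in keys}
for key, pos in positions.items():
    print(f"{key:>7} -> position {pos}")
```

Each key lands at the position given by its last three digits, so the location is computed from the key alone.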
Good Hash Functions

1. Uniform: distributes keys evenly over the table
2. Perfect: two records cannot occupy the same location, i.e. $\forall k_i, k_j,\ i \neq j:\ h(k_i) \neq h(k_j)$
3. Order preserving: $\forall k_i, k_j,\ i \neq j:\ k_i < k_j \Rightarrow h(k_i) < h(k_j)$
Such hash functions are difficult to find
Property 2 is the most essential
Most functions are no better than h(key) = key mod m
Hash collision: $\exists\, k_i \neq k_j:\ h(k_i) = h(k_j)$
Collision Resolution
1. Open Addressing (rehashing): compute a new position to store the key in the table (no extra space)
i. linear probing
ii. double hashing
2. Separate Chaining: lists of keys mapped to the same position (uses extra space)
Open Addressing
Computes a new address to store the key if its position is occupied (rehashing)
if the new address is occupied too, compute another one, until an empty position is found
primary hash function: i = h(key)
rehash function: rh(i) = rh(h(key))
hash sequence: (h0, h1, h2, ...) = (h(key), rh(h(key)), rh(rh(h(key))), ...)
To find a key, follow the same hash sequence
Example
i=h(key)=key mod 100
rh(i) = (i+1) mod 100
key: 193
i = h(193) = 93
position 93 is occupied, so rh(i) = (93 + 1) mod 100 = 94
Key 193 will occupy position 94
[Figure: hash table in which positions 0 and 1 hold keys 100 and 101, positions 90 to 93 hold keys 990 to 993, and position 94 holds key 193]
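The example can be sketched in Python as follows. This is an illustration under stated assumptions, not the course's implementation: the names `insert` and `search`, the use of `None` for an empty position, and the limit of M probes are choices made here.

```python
M = 100  # table size used in the example

def h(key):
    return key % M           # primary hash function

def rh(i):
    return (i + 1) % M       # rehash function from the example

def insert(table, key):
    """Follow the hash sequence h, rh(h), rh(rh(h)), ... to an empty slot."""
    i = h(key)
    for _ in range(M):       # at most M probes, then give up
        if table[i] is None:
            table[i] = key
            return i
        i = rh(i)
    raise RuntimeError("no empty position found")

def search(table, key):
    """Follow the same hash sequence; stop at the key or at an empty slot."""
    i = h(key)
    for _ in range(M):
        if table[i] is None:
            return None
        if table[i] == key:
            return i
        i = rh(i)
    return None

table = [None] * M
insert(table, 993)           # occupies position 93, as in the figure
pos = insert(table, 193)     # collides at 93, lands at 94
found = search(table, 193)
print(pos, found)
```

Searching follows exactly the positions that insertion visited, which is why the key is found at position 94 although h(193) = 93.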
Problem 1: Locate Empty Positions
i. No empty position can be found: the table is full
keep a count of the number of empty positions
ii. The hash function fails to find an empty position although the table is not full!
i = h(key) = key mod 1000
rh(i) = (i + 200) mod 1000 => checks only 5 positions on a table of 1000 positions
rh(i) = (i + 1) mod 1000 => checks successive positions, so every position is reached
rh(i) = (i + c) mod 1000 where GCD(c, 1000) = 1 => every position is reached
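A short Python sketch that counts how many distinct positions a rehash function rh(i) = (i + c) mod m can reach. The helper name `positions_visited` and the step c = 7 are illustrative choices made here; c = 200 and c = 1 are the cases on the slide.

```python
from math import gcd

def positions_visited(start, c, m):
    """Distinct positions reached by rh(i) = (i + c) mod m from `start`."""
    seen = set()
    i = start
    while i not in seen:
        seen.add(i)
        i = (i + c) % m
    return len(seen)

m = 1000
for c in (200, 1, 7):
    print(c, gcd(c, m), positions_visited(0, c, m))
```

With c = 200 only 5 positions are ever probed, while any step with GCD(c, m) = 1 reaches all 1000.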
Problem 2: Primary Clustering
Different keys that hash into different addresses compete with each other in successive rehashes
i = h(key) = key mod 100
rh(i) = (i+1) mod 100
keys: 1990, 1991, 1992, 1993, 1994 => all compete for position 94
[Figure: hash table in which positions 0 and 1 hold keys 100 and 101 and positions 90 to 93 hold keys 990 to 993]
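To make the competition concrete, the following Python sketch takes the occupied positions from the figure (0, 1 and 90 to 93) and computes where each of the five keys would land if inserted on its own. The names `occupied`, `first_free` and `landing` are hypothetical.

```python
M = 100
occupied = {0, 1, 90, 91, 92, 93}    # positions filled in the figure

def first_free(key):
    """First empty position on the sequence h, rh(h), ... with rh(i) = (i+1) mod M."""
    i = key % M
    while i in occupied:
        i = (i + 1) % M
    return i

landing = {key: first_free(key) for key in (1990, 1991, 1992, 1993, 1994)}
print(landing)
```

The keys hash to five different addresses (90 to 94), yet each of them ends up at position 94.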
Problem 3: Secondary Clustering
Different keys which hash to the same hash value have the same rehash sequence
i = h(key) = key mod 10
rh(i, j) = (i + j) mod 10, where j = 1, 2, 3, ... counts the rehashes
i. key 23: h(23) = 3, rh = 4, 6, 9, 3, ...
ii. key 13: h(13) = 3, rh = 4, 6, 9, 3, ...
[Figure: hash table with positions 0 to 9 holding the keys 10, 53, 14, 15, 46]
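A small Python sketch of the rehash function rh(i, j) = (i + j) mod 10, reading j as the rehash count (an interpretation that reproduces the sequence 4, 6, 9, 3 given on the slide). The function name `rehash_sequence` is made up for the illustration.

```python
def rehash_sequence(key, m=10, count=4):
    """First `count` rehash positions under rh(i, j) = (i + j) mod m."""
    i = key % m                      # i = h(key)
    seq = []
    for j in range(1, count + 1):    # j is the rehash count
        i = (i + j) % m
        seq.append(i)
    return seq

seq23 = rehash_sequence(23)
seq13 = rehash_sequence(13)
print(seq23, seq13)
```

Because the sequence depends only on h(key), the two keys follow identical rehash positions.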
Linear Probing
Store the key into the next free position
h_0 = h(key), usually h_0 = key mod m
h_i = (h_{i-1} + 1) mod m, i >= 1
S = {22, 35, 301, 99, 102, 452}, m = 10
position  key
1         301
2         22
3         102
4         452
5         35
9         99
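Linear probing for the set S can be sketched in Python as below. The function name `linear_probe_insert`, the use of `None` for empty positions and the full-table check are assumptions of this sketch.

```python
def linear_probe_insert(table, key):
    """Insert `key` with h_0 = key mod m and h_i = (h_{i-1} + 1) mod m."""
    m = len(table)
    i = key % m
    probes = 1
    while table[i] is not None:
        if probes == m:
            raise RuntimeError("table is full")
        i = (i + 1) % m
        probes += 1
    table[i] = key
    return probes

table = [None] * 10
for key in [22, 35, 301, 99, 102, 452]:
    linear_probe_insert(table, key)
print(table)
```

Keys 102 and 452 both hash to position 2, which is taken by 22, so they move on to positions 3 and 4.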
Observation 1
Different insertion sequences => different hash sequences
H(key) = key mod 13
S1 = {11, 3, 27, 99, 8, 50, 77, 22, 12, 31, 33, 40, 53} => 28 probes
S2 = {53, 40, 33, 31, 12, 22, 77, 50, 8, 99, 27, 3, 11} => 30 probes
Table after inserting S1:
position  key  number of probes
0         77   2
1         27   1
2         12   4
3         3    1
4         40   4
5         31   1
6         53   6
7         33   1
8         99   1
9         8    2
10        22   2
11        11   1
12        50   2
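The probe counts can be reproduced with a few lines of Python. This is a sketch under one assumption: one probe is counted for every position inspected, including the empty one where the key is stored. The names `insert_all`, `probes1` and `probes2` are illustrative.

```python
def insert_all(keys, m=13):
    """Linear probing insert of `keys`; returns the table and probes per key."""
    table = [None] * m
    probes = {}
    for key in keys:
        i, count = key % m, 1
        while table[i] is not None:   # assumes at most m keys are inserted
            i, count = (i + 1) % m, count + 1
        table[i] = key
        probes[key] = count
    return table, probes

S1 = [11, 3, 27, 99, 8, 50, 77, 22, 12, 31, 33, 40, 53]
S2 = list(reversed(S1))               # the same keys in the order of S2
table1, probes1 = insert_all(S1)
table2, probes2 = insert_all(S2)
print(sum(probes1.values()), probes1[53], probes2[53])
```

For S1 the sketch gives 28 probes and the per-key counts of the figure. For S2 the individual hash sequences change (key 53 needs 1 probe instead of 6, key 11 needs 9 instead of 1), but under this counting the total also comes out as 28, so the figure of 30 quoted for S2 may rest on a different counting convention.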
Observation 2
Deletions are not easy:
rh(i) = (i+1) mod 10
Action: delete(65) and search(5)
Problem: the search will stop at the emptied position and will not find 5
Solution:
mark the position as deleted rather than empty
the marked position can be reused
[Figure: hash table with 10 positions; the recoverable entries are 70, 12, 33, 14, 55, 65, 75, 85]
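A hedged Python sketch of the "mark as deleted" solution. The marker names `EMPTY` and `DELETED`, the hash function key mod 10 and the three keys 55, 65, 5 are assumptions chosen to mirror the delete(65), search(5) scenario; they are not the contents of the figure.

```python
EMPTY, DELETED = None, "<deleted>"   # sentinel markers (names are illustrative)
M = 10

def insert(table, key):
    i = key % M
    for _ in range(M):
        if table[i] is EMPTY or table[i] is DELETED:   # deleted slots are reusable
            table[i] = key
            return i
        i = (i + 1) % M
    raise RuntimeError("table is full")

def search(table, key):
    i = key % M
    for _ in range(M):
        if table[i] is EMPTY:          # only a truly empty slot ends the search
            return None
        if table[i] == key:            # DELETED never equals a key, so it is skipped
            return i
        i = (i + 1) % M
    return None

def delete(table, key):
    i = search(table, key)
    if i is not None:
        table[i] = DELETED             # mark, do not empty

table = [EMPTY] * M
for key in (55, 65, 5):                # all hash to 5: stored at 5, 6, 7
    insert(table, key)
delete(table, 65)
print(search(table, 5))
```

Had position 6 been reset to empty, the search for 5 would have stopped there; with the marker it continues to position 7.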
Observation 3
Linear probing tends to create long sequences of occupied positions
the longer a sequence is, the longer it tends to become
$P = \frac{B+1}{m}$
P: probability to use a position in the cluster (B: length of the cluster, m: table size)
Observation 4
Linear probing suffers from both
primary and secondary clustering
Solution: double hashing
uses two hash functions h1, h2 and a
rehashing function rh
Double Hashing
Two hash functions and a rehashing function
primary hash function: i = h1(key) = key mod m
secondary hash function: h2(key)
rehashing function: rh(i, key) = (i + h2(key)) mod m
h2(m, key) is some function of m and key
it helps rh compute random positions in the hash table
h2 is computed once for each key!
Example of Double Hashing

i. hash functions:
h1(key) = key mod m
$h_2(key) = \begin{cases} m \ \mathrm{div}\ 2 & \text{if } q = 0 \\ q & \text{if } q \neq 0 \end{cases}$
q = (key div m) mod m
ii. rehash function:
rh(i, key) = (i + h2(key)) mod m
Example (continued)
A. m = 10, key = 23
h1(23) = 3, h2(23) = 2
rh(3, 23) = (3 + 2) mod 10 = 5
rehash sequence: 5, 7, 9, 1, ...
B. m = 10, key = 13
h1(13) = 3, h2(13) = 1
rh(3, 13) = (3 + 1) mod 10 = 4
rehash sequence: 4, 5, 6, ...
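The two examples can be checked with a short Python sketch of h1, h2 and rh as defined on the previous slide. The helper `rehash_sequence` and its `count` parameter are introduced here for illustration only.

```python
def h1(key, m):
    return key % m

def h2(key, m):
    q = (key // m) % m
    return q if q != 0 else m // 2

def rehash_sequence(key, m, count):
    """First `count` rehash positions: rh(i, key) = (i + h2(key)) mod m."""
    step = h2(key, m)          # computed once per key
    i = h1(key, m)
    seq = []
    for _ in range(count):
        i = (i + step) % m
        seq.append(i)
    return seq

seq23 = rehash_sequence(23, 10, 4)
seq13 = rehash_sequence(13, 10, 3)
print(seq23, seq13)
```

Keys 23 and 13 share the primary position 3 but follow different rehash sequences, which is what removes secondary clustering.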
Performance of Open Addressing
Distinguish between successful and unsuccessful search
Assume a series of probes to random positions
independent events
load factor: λ = n/m
λ: probability to probe an occupied position
each position has the same probability P = 1/m
Unsuccessful Search
The hash sequence is exhausted
let u be the expected number of probes
u equals the expected length of the hash sequence
P(k): probability to search k positions in the hash sequence
$$u = \sum_{k \ge 1} k\,P(k) = P(1) + [P(2) + P(2)] + [P(3) + P(3) + P(3)] + \cdots$$
Writing each term k P(k) as a row of k copies of P(k) and summing column by column:
$$u = P(\ge 1 \text{ probes}) + P(\ge 2 \text{ probes}) + \cdots$$
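$$u = \sum_{k \ge 1} P(\ge k \text{ probes}) = \sum_{k \ge 1} P(\text{first } k-1 \text{ positions occupied}) = \sum_{k \ge 1} \lambda^{k-1} \quad \text{(independent events)}$$
$$u = \frac{1}{1-\lambda}$$
u increases with the load factor λ => performance drops as λ increases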
Successful Search
The hash sequence is not exhausted
the number of probes to find a key equals the number of probes s at the time the key was inserted plus 1
the load factor λ was smaller at that time: equivalent to considering all values of u (unsuccessful search) for load factors between 0 and λ
$$s = \frac{1}{\lambda}\int_{0}^{\lambda}(u + 1)\,dx = 1 + \frac{1}{\lambda}\ln\frac{1}{1-\lambda} \quad \text{(approximation)}$$
s increases with λ
Performance
The performance drops as the load factor λ increases
the higher the value of λ, the higher the probability of collisions
Unsuccessful search is more expensive than successful search
unsuccessful search exhausts the hash sequence
Experimental Results
LOAD     SUCCESSFUL                    UNSUCCESSFUL
FACTOR   LINEAR   i + bkey   DOUBLE    LINEAR   i + bkey   DOUBLE
25%      1.17     1.16       1.15      1.39     1.37       1.33
50%      1.50     1.44       1.39      2.50     2.19       2.00
75%      2.50     2.01       1.85      8.50     4.64       4.00
90%      5.50     2.85       2.56      50.50    11.40      10.00
95%      10.50    3.52       3.15      200.50   22.04      20.00
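For comparison, the Python sketch below evaluates the standard closed-form estimates from the classical analysis of linear probing and of uniform (double) hashing. These formulas are not stated on the slides, and whether the table was produced from them is an assumption; the function names are made up. Rounded to two decimals, the values coincide with the LINEAR and DOUBLE columns.

```python
from math import log

def linear_successful(lam):
    return 0.5 * (1 + 1 / (1 - lam))

def linear_unsuccessful(lam):
    return 0.5 * (1 + 1 / (1 - lam) ** 2)

def double_successful(lam):
    return (1 / lam) * log(1 / (1 - lam))

def double_unsuccessful(lam):
    return 1 / (1 - lam)

rows = {}
for lam in (0.25, 0.50, 0.75, 0.90, 0.95):
    # column order: linear succ., double succ., linear unsucc., double unsucc.
    rows[lam] = tuple(round(f(lam), 2) for f in
                      (linear_successful, double_successful,
                       linear_unsuccessful, double_unsuccessful))
    print(f"{lam:.0%}", *rows[lam])
```

The gap between the two methods widens quickly above a load factor of about 75%, which is where clustering dominates the cost of linear probing.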
Performance on Full Table

TABLE      SUCCESSFUL                    UNSUCCESSFUL   LOG2 m
SIZE (m)   LINEAR   i + bkey   DOUBLE
100        6.60     4.62       4.12      50.50          6.64
500        14.35    6.22       5.72      250.50         8.97
1000       20.15    6.91       6.41      500.50         9.97
5000       44.64    8.52       8.02      2500.50        12.29
10000      63.00    9.21       8.71      5000.50        13.29
Separate Chaining
Keys hashing to the same hash value
are stored in separate lists
one list per hash position
can store more than m records
easy to implement
the keys in each list can be ordered
h(key) = key mod m, m = 10
position  chain
0         40 -> 130 -> nil
1         91 -> nil
2         42 -> 372 -> 192 -> nil
3         nil
4         nil
5         75 -> nil
6         66 -> 16 -> nil
7         87 -> 67 -> 417 -> 227 -> nil
8         nil
9         49 -> nil
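Separate chaining can be sketched in Python with one list per hash position. The class name `ChainedHashTable` and its methods are hypothetical, and Python lists stand in for the linked lists of the figure; the keys are those of the figure with m = 10.

```python
class ChainedHashTable:
    """Separate chaining: one list of keys per hash position."""

    def __init__(self, m):
        self.m = m
        self.chains = [[] for _ in range(m)]

    def insert(self, key):
        chain = self.chains[key % self.m]
        if key not in chain:
            chain.append(key)

    def search(self, key):
        """Return the number of comparisons made, or None if absent."""
        chain = self.chains[key % self.m]
        for comparisons, stored in enumerate(chain, start=1):
            if stored == key:
                return comparisons
        return None

    def delete(self, key):
        chain = self.chains[key % self.m]
        if key in chain:
            chain.remove(key)

t = ChainedHashTable(10)
for key in (40, 91, 42, 130, 372, 192, 75, 66, 16, 87, 67, 417, 227, 49):
    t.insert(key)
print(t.chains[7], t.search(417))
```

The table holds 14 keys in 10 positions, and deleting a key only shortens one chain, so no deleted markers are needed.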
Performance of Separate
Chaining
Depends on the average chain size
insertions are independent events
let P(c,n,m): probability that a position has been selected c times after n insertions on a table of size m
P(c,n,m): probability that the chain has length c => binomial distribution
$$P(c,n,m) = \binom{n}{c}\,p^{c}\,q^{n-c}$$
p = 1/m: success case
q = 1 - p: failure case
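$$P(c,n,m) = \binom{n}{c}\left(\frac{1}{m}\right)^{c}\left(1-\frac{1}{m}\right)^{n-c} = \frac{1}{c!}\cdot\frac{n}{m}\cdot\frac{n-1}{m}\cdots\frac{n-c+1}{m}\left(1-\frac{1}{m}\right)^{n}\left(1-\frac{1}{m}\right)^{-c}$$
n, m → ∞ with λ = n/m fixed:
$$\frac{n-c+1}{m} \to \lambda, \qquad \left(1-\frac{1}{m}\right)^{n} \to e^{-\lambda}, \qquad \left(1-\frac{1}{m}\right)^{-c} \to 1$$
$$P(c,n,m) \approx \frac{1}{c!}\,\lambda^{c}e^{-\lambda} \quad \text{(Poisson)}$$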
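A quick numerical check of the Poisson approximation, as a Python sketch. The sizes n = 5000 and m = 10000 are arbitrary illustrative values (load factor 0.5), and the function names are made up.

```python
from math import comb, exp, factorial

def binomial(c, n, m):
    """Exact probability that a chain has length c after n insertions."""
    p = 1 / m
    return comb(n, c) * p**c * (1 - p)**(n - c)

def poisson(c, lam):
    """Poisson approximation with lam = n / m."""
    return lam**c * exp(-lam) / factorial(c)

n, m = 5000, 10000          # illustrative sizes, load factor 0.5
lam = n / m
diffs = [abs(binomial(c, n, m) - poisson(c, lam)) for c in range(6)]
print(max(diffs))
```

For tables of this size the two distributions agree closely, which is what justifies using the Poisson form in the analysis that follows.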
Unsuccessful Search
The entire chain is searched
the average number of comparisons equals its average length u
$$u = \sum_{c \ge 0} c\,P(c,\lambda) = \sum_{c \ge 0} c\,\frac{\lambda^{c}}{c!}\,e^{-\lambda} = \lambda, \qquad \lambda = n/m$$
Successful Search
Not the whole chain is searched
the average number of comparisons equals the length s of the chain at the time the key was inserted plus 1
the performance at the time a key was inserted equals that of unsuccessful search!
$$s = \frac{1}{\lambda}\int_{0}^{\lambda}(u + 1)\,dx = \frac{1}{\lambda}\int_{0}^{\lambda}(x + 1)\,dx = 1 + \frac{\lambda}{2}, \qquad \lambda = n/m$$
Performance
The performance drops with the length of the chains
worst case: all keys are stored in a single chain
worst case performance: O(N)
unsuccessful search performs better than successful search!! WHY?
no problem with deletions!!
Coalesced Hashing
The hash sequence is implemented as a linked list within the hash table
no rehash function
the next hash position is the next available position in the linked list
extra space for the list
h(key) = key mod 10
keys: 19, 29, 49, 59
[Figure: hash table with positions 0 to 9; the keys 19, 29, 49, 59 all hash to position 9 and are chained through link fields inside the table]
initialization
position  key     link
0         nilkey  -1
1         nilkey  0
2         nilkey  1
...
9         nilkey  8
avail = 9: head of the list of empty positions
initially: avail = 9
h(key) = key mod 10
keys: 14, 29, 34, 28, 42, 39, 84, 38
position  key     link
0         nilkey  -1
1         nilkey  0
2         42      -1
3         38      -1
4         14      8
5         84      -1
6         39      -1
7         28      3
8         34      5
9         29      6
The link field holds the lists of rehashing positions and the list of empty positions
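A hedged Python sketch of coalesced hashing for the keys above. It makes two simplifying assumptions that differ from the slides: free positions are found with a cursor `avail` that scans downward from the last position instead of a linked list of empty positions, and a colliding key is appended at the end of the chain it runs into. The class and method names are hypothetical.

```python
NIL = -1

class CoalescedHashTable:
    def __init__(self, m):
        self.m = m
        self.keys = [None] * m      # None plays the role of nilkey
        self.link = [NIL] * m       # next position in the same chain
        self.avail = m - 1          # cursor scanning down for free slots

    def _next_free(self):
        while self.avail >= 0 and self.keys[self.avail] is not None:
            self.avail -= 1
        if self.avail < 0:
            raise RuntimeError("table is full")
        return self.avail

    def insert(self, key):
        i = key % self.m
        if self.keys[i] is None:    # home position is free
            self.keys[i] = key
            return i
        while self.link[i] != NIL:  # walk to the end of the chain
            i = self.link[i]
        j = self._next_free()
        self.keys[j] = key
        self.link[i] = j            # append the new slot to the chain
        return j

    def search(self, key):
        i = key % self.m
        while i != NIL and self.keys[i] is not None:
            if self.keys[i] == key:
                return i
            i = self.link[i]
        return None

t = CoalescedHashTable(10)
for key in (14, 29, 34, 28, 42, 39, 84, 38):
    t.insert(key)
print(t.keys)
```

The key positions produced by this sketch agree with the table above. The link values depend on how a colliding key is spliced into an existing chain, so they may differ from the links shown in the figure.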
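Performance of Coalesced Hashing
Unsuccessful search (λ: load factor):
$$\frac{1}{4}e^{2\lambda} - \frac{\lambda}{2} + 0.75 \quad \text{probes/search}$$
Successful search:
$$\frac{e^{2\lambda} - 1}{8\lambda} + \frac{\lambda}{4} + 0.75 \quad \text{probes/search}$$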
