
Hashing

Data organization in main memory or disk
  sequential, binary trees, ...
The location of a key depends on other keys => unnecessary key comparisons to find a key
Question: find key with a single comparison
Hashing: the location of a record is computed using its key only
Fast for random accesses, slow for range queries

E.G.M. Petrakis

Hashing

Hash Table
Hash Function: transforms keys to array indices
[Figure: h(key) maps a key to an index in a table with positions 0, 1, 2, 3, 4, ..., n; each position holds data]

h(key): Hash Function
h(key) = key mod 1000

position   key (stored with its record)
0          4967000
1
2          8421002
...
396        4618396
397        4957397
398
399        1286399
...
990        0000990
991        0000991
992        1200992
993        0047993
994
995        9846995
996        4618996
997        4967997
998
999        0001999
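The mapping in the table above can be sketched in a few lines of Python. This is only an illustration: the table size 1000 and the keys come from the slide, while the names `M`, `h` and `positions` are chosen here for the example.

```python
M = 1000  # table size used on the slide


def h(key: int, m: int = M) -> int:
    """Hash function from the slide: the position is key mod m."""
    return key % m


# Keys taken from the table on the slide.
keys = [4967000, 8421002, 4618396, 4957397, 1286399, 9846995]
positions = {key: h(key) for key in keys}

for key, pos in positions.items():
    print(f"{key:07d} -> position {pos}")
```

Each key lands at the position given by its last three digits, which is what the slide's table shows.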

Good Hash Functions


1. Uniform: distribute keys evenly in space
2. Perfect: two records cannot occupy the same location, or $\forall k_i \neq k_j,\ i \neq j:\ h(k_i) \neq h(k_j)$
3. Order preserving: $\forall k_i \le k_j,\ i \neq j:\ h(k_i) \le h(k_j)$
Difficult to find such hash functions
Property 2 is the most essential
Most functions are no better than h(key) = key mod m
Hash collision: $k_i \neq k_j:\ h(k_i) = h(k_j)$

Collision Resolution
1. Open Addressing (rehashing): compute new position to store the key in the table (no extra space)
   i. linear probing
   ii. double hashing
2. Separate Chaining: lists of keys mapped to the same position (uses extra space)

Open Addressing
Computes a new address to store the key if its position is occupied (rehashing)
  if the new address is occupied too, compute another one, until an empty position is found
  primary hash function: i = h(key)
  rehash function: rh(i) = rh(h(key))
  hash sequence: (h0, h1, h2, ...) = (h(key), rh(h(key)), rh(rh(h(key))), ...)
To find a key, follow the same hash sequence

Example
i = h(key) = key mod 100
rh(i) = (i+1) mod 100
key: 193
i = h(193) = 93
rh(i) = (93+1) mod 100 = 94
Position 93 is already occupied (by key 993), so key 193 will occupy position 94

position   key
0          100
1          101
2
...
90         990
91         991
92         992
93         993
94         193
...
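The open addressing scheme and the example with key 193 can be sketched as follows. This is a minimal illustration, not the lecture's own code: `h` and `rh` are the functions from the example, while `insert`, `search` and the use of `None` for an empty position are assumptions made here.

```python
from typing import List, Optional

M = 100  # table size from the example


def h(key: int) -> int:
    """Primary hash function from the example."""
    return key % M


def rh(i: int) -> int:
    """Rehash function from the example: try the next position."""
    return (i + 1) % M


def insert(table: List[Optional[int]], key: int) -> int:
    """Store key at the first empty position of its hash sequence."""
    i = h(key)
    for _ in range(M):  # at most M probes, so a full table cannot loop forever
        if table[i] is None:
            table[i] = key
            return i
        i = rh(i)
    raise RuntimeError("no empty position found")


def search(table: List[Optional[int]], key: int) -> Optional[int]:
    """Follow the same hash sequence; stop at the key or at an empty position."""
    i = h(key)
    for _ in range(M):
        if table[i] is None:
            return None
        if table[i] == key:
            return i
        i = rh(i)
    return None


table: List[Optional[int]] = [None] * M
insert(table, 993)             # occupies position 93
pos_193 = insert(table, 193)   # 93 is taken, so rh(93) = 94 is used
print(pos_193, search(table, 193), search(table, 293))
```

Searching for 293 (a hypothetical key that was never inserted) probes 93 and 94, reaches the empty position 95 and stops.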

Problem 1: Locate Empty Positions
i. No empty position can be found
     the table is full
     check on number of empty positions
ii. The hash function fails to find an empty position although the table is not full!!
     i = h(key) = key mod 1000
     rh(i) = (i + 200) mod 1000 => checks only 5 positions on a table of 1000 positions
     rh(i) = (i+1) mod 1000 => checks successive positions
     rh(i) = (i+c) mod 1000 where GCD(c,m) = 1 => checks all m positions
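The effect of the step c in rh(i) = (i+c) mod m can be checked with a short Python sketch. The function name `positions_visited` and the step 7 are hypothetical choices for this illustration; the steps 200 and 1 are the ones on the slide.

```python
from math import gcd


def positions_visited(start: int, c: int, m: int) -> int:
    """Count the distinct positions reached by rh(i) = (i + c) mod m."""
    seen = set()
    i = start
    while i not in seen:
        seen.add(i)
        i = (i + c) % m
    return len(seen)


m = 1000
for c in (200, 1, 7):
    # The number of positions reached equals m / GCD(c, m).
    print(c, gcd(c, m), positions_visited(0, c, m))
```

With c = 200 only 5 positions are reached, while any c with GCD(c, m) = 1 reaches all 1000.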

Problem 2: Primary Clustering
Different keys that hash into different addresses compete with each other in successive rehashes
  i = h(key) = key mod 100
  rh(i) = (i+1) mod 100
  keys: 1990, 1991, 1992, 1993, 1994 => all of them end up at position 94 (positions 90-93 are occupied)

position   key
0          100
1          101
2
...
90         990
91         991
92         992
93         993
94
...
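Primary clustering in this example can be reproduced with a small sketch. It assumes, as in the figure for this example, that positions 90 to 93 already hold keys 990 to 993; the helper name `first_free` is hypothetical.

```python
M = 100  # table size from the example


def first_free(occupied: set, key: int) -> int:
    """Position where linear probing would place key, given the occupied slots.

    Assumes at least one position is free, otherwise the loop would not end.
    """
    i = key % M
    while i in occupied:
        i = (i + 1) % M
    return i


# Positions 90-93 hold keys 990-993.
occupied = {90, 91, 92, 93}
targets = {key: first_free(occupied, key) for key in range(1990, 1995)}
print(targets)
```

The five keys hash to five different addresses (90 to 94), yet each of them would be stored at position 94.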

Problem 3: Secondary Clustering
Different keys which hash to the same hash value have the same rehash sequence
  i = h(key) = key mod 10
  rh(i,j) = (i + j) mod 10, where j = 1, 2, 3, ... counts the rehashes
i. key 23: h(23) = 3
     rh = 4, 6, 9, 3, ...
ii. key 13: h(13) = 3
     rh = 4, 6, 9, 3, ...
[Figure: table with positions 0-9 holding keys 53, 14, 15, 46]
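The two rehash sequences can be generated with a short sketch. It assumes that j counts the rehashes (1, 2, 3, ...) and that each rehash starts from the previous position, which is the reading that reproduces 4, 6, 9, 3; the name `rehash_sequence` is chosen here.

```python
M = 10  # table size from the example


def h(key: int) -> int:
    return key % M


def rehash_sequence(key: int, count: int) -> list:
    """First `count` rehash positions using rh(i, j) = (i + j) mod M."""
    i = h(key)
    sequence = []
    for j in range(1, count + 1):  # j is the number of the rehash
        i = (i + j) % M
        sequence.append(i)
    return sequence


seq_23 = rehash_sequence(23, 4)
seq_13 = rehash_sequence(13, 4)
print(seq_23, seq_13)
```

Both keys start at position 3 and therefore follow exactly the same sequence, which is the secondary clustering described above.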

Linear Probing
Store the key into the next free position
  $h_0 = h(key)$, usually $h_0 = key \bmod m$
  $h_i = (h_{i-1} + 1) \bmod m,\ i \ge 1$

S = {22, 35, 301, 99, 102, 452}

position   key
0
1          301
2          22
3          102
4          452
5          35
6
7
8
9          99
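A minimal linear probing insertion, applied to the set S from the slide, might look like this. The table size 10 is read off the figure; the function name and the use of `None` for a free position are assumptions of this sketch.

```python
from typing import List, Optional


def linear_probe_insert(table: List[Optional[int]], key: int) -> int:
    """Insert key at h0 = key mod m or at the next free position after it."""
    m = len(table)
    i = key % m
    for _ in range(m):
        if table[i] is None:
            table[i] = key
            return i
        i = (i + 1) % m  # h_i = (h_{i-1} + 1) mod m
    raise RuntimeError("table is full")


table: List[Optional[int]] = [None] * 10
for key in [22, 35, 301, 99, 102, 452]:
    linear_probe_insert(table, key)
print(table)
```

Keys 22, 102 and 452 all hash to position 2, so they end up in the consecutive positions 2, 3 and 4.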

Observation 1
Different insertion sequences => different hash sequences
  S1 = {11,3,27,99,8,50,77,22,12,31,33,40,53} => 28 probes
  S2 = {53,40,33,31,12,22,77,50,8,99,27,3,11} => 30 probes

H(key) = key mod 13
Table after inserting S1:

position   key   number of probes
0          77    2
1          27    1
2          12    4
3          3     1
4          40    4
5          31    1
6          53    6
7          33    1
8          99    1
9          8     2
10         22    2
11         11    1
12         50    2
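The probe counts in the table for S1 can be recomputed with the sketch below, which also shows that an individual key can need a different number of probes under the other insertion order. It assumes linear probing with rh(i) = (i+1) mod 13; the function name `insert_all` is hypothetical.

```python
from typing import Dict, List, Optional, Tuple

M = 13  # table size, H(key) = key mod 13


def insert_all(keys: List[int]) -> Tuple[List[Optional[int]], Dict[int, int]]:
    """Insert keys with linear probing; return the table and probes per key."""
    table: List[Optional[int]] = [None] * M
    probes: Dict[int, int] = {}
    for key in keys:
        i = key % M
        count = 1
        while table[i] is not None:  # terminates: at most M keys are inserted
            i = (i + 1) % M
            count += 1
        table[i] = key
        probes[key] = count
    return table, probes


S1 = [11, 3, 27, 99, 8, 50, 77, 22, 12, 31, 33, 40, 53]
S2 = [53, 40, 33, 31, 12, 22, 77, 50, 8, 99, 27, 3, 11]

table1, probes1 = insert_all(S1)
table2, probes2 = insert_all(S2)
print(table1, sum(probes1.values()))
print("probes for key 11:", probes1[11], "in S1 order,", probes2[11], "in S2 order")
```

Key 11 is placed with a single probe when it is inserted first, but needs a much longer hash sequence when it is inserted last.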

Observation 2
Deletions are not easy:
  i = h(key) = key mod 10
  rh(i) = (i+1) mod 10
Action: delete(65) and search(5)
Problem: search will stop at the empty position and will not find 5
Solution:
  mark position as deleted rather than empty
  the marked position can be reused
[Figure: table with positions 0-9 holding keys 70, 12, 33, 14, 55, 65, 75, 85]
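The "mark as deleted" solution can be sketched as follows. The keys 55, 65, 75, 85 appear in the figure and key 5 in the slide's search(5); the insertion order, the sentinel `DELETED`, the extra key 95 and all function names are assumptions made for this illustration.

```python
M = 10
EMPTY = None
DELETED = object()  # marker for a position whose key was removed


def probe_sequence(key: int):
    """Positions h(key), rh(h(key)), ... with rh(i) = (i + 1) mod M."""
    i = key % M
    for _ in range(M):
        yield i
        i = (i + 1) % M


def insert(table: list, key: int) -> int:
    for i in probe_sequence(key):
        if table[i] is EMPTY or table[i] is DELETED:  # marked positions are reused
            table[i] = key
            return i
    raise RuntimeError("table is full")


def search(table: list, key: int):
    for i in probe_sequence(key):
        if table[i] is EMPTY:   # only a truly empty position ends the search
            return None
        if table[i] == key:     # a DELETED marker never equals a key
            return i
    return None


def delete(table: list, key: int) -> None:
    i = search(table, key)
    if i is not None:
        table[i] = DELETED


table = [EMPTY] * M
for key in [55, 65, 75, 85, 5]:
    insert(table, key)          # 5 ends up at position 9
delete(table, 65)               # position 6 is marked, not emptied
found_at = search(table, 5)     # the search passes over the marker
reused_at = insert(table, 95)   # hypothetical new key reuses position 6
print(found_at, reused_at)
```

Had position 6 simply been emptied, the search for 5 would have stopped there.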

Observation 3
Linear probing tends to create long sequences of occupied positions
  the longer a sequence is, the longer it tends to become
$$P = \frac{B+1}{m}$$
P: probability to use a position in the cluster
B: length of the sequence of occupied positions (cluster), m: table size

Observation 4
Linear probing suffers from both primary and secondary clustering
Solution: double hashing
  uses two hash functions h1, h2 and a rehashing function rh

Double Hashing
Two hash functions and a rehashing function
  primary hash function: i = h1(key) = key mod m
  secondary hash function: h2(key)
  rehashing function: rh(i, key) = (i + h2(key)) mod m
h2(m, key) is some function of m, key
  helps rh in computing random positions in the hash table
  h2 is computed once for each key!

Example of Double Hashing


i. hash function:
     h1(key) = key mod m
     $h_2(key) = \begin{cases} m \text{ div } 2 & \text{if } q = 0 \\ q & \text{if } q \neq 0 \end{cases}$
     q = (key div m) mod m
ii. rehash function:
     rh(i, key) = (i + h2(key)) mod m

Example (continued)
A. m = 10, key = 23
     h1(23) = 3, h2(23) = 2
     rh(3, 23) = (3+2) mod 10 = 5
     rehash sequence: 5, 7, 9, 1, ...
B. m = 10, key = 13
     h1(13) = 3, h2(13) = 1
     rh(3, 13) = (3+1) mod 10 = 4
     rehash sequence: 4, 5, 6, ...
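The double hashing example can be traced with the sketch below. The functions `h1` and `h2` follow the definitions in the example (with `//` for div); the helper `rehash_sequence` and the key 3 used for the q = 0 case are hypothetical.

```python
def h1(key: int, m: int) -> int:
    """Primary hash function."""
    return key % m


def h2(key: int, m: int) -> int:
    """Secondary hash function: q if q != 0, otherwise m div 2."""
    q = (key // m) % m
    return m // 2 if q == 0 else q


def rehash_sequence(key: int, m: int, count: int) -> list:
    """First `count` positions produced by rh(i, key) = (i + h2(key)) mod m."""
    i = h1(key, m)
    step = h2(key, m)  # computed once for each key
    sequence = []
    for _ in range(count):
        i = (i + step) % m
        sequence.append(i)
    return sequence


print(rehash_sequence(23, 10, 4))
print(rehash_sequence(13, 10, 3))
```

Keys 23 and 13 share the primary position 3 but use different steps (2 and 1), so their rehash sequences differ, unlike in the secondary clustering example.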

Performance of Open Addressing
Distinguish between successful and unsuccessful search
Assume a series of probes to random positions
  independent events
  load factor: $\lambda = n/m$
  $\lambda$: probability to probe an occupied position
  each position has the same probability P = 1/m

Unsuccessful Search
The hash sequence is exhausted
  let u be the expected number of probes
  u equals the expected length of the hash sequence
  P(k): probability to search k positions in the hash sequence

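$$u = \sum_{k=1}^{\infty} k\,P(k)$$
Writing each term $k\,P(k)$ as k copies of $P(k)$ and adding by columns:
$$\begin{aligned}
u = {} & P(1) + {} \\
       & P(2) + P(2) + {} \\
       & P(3) + P(3) + P(3) + {} \\
       & \cdots \\
       & P(k) + P(k) + \cdots + P(k) + \cdots \\
  = {} & P(\ge 1 \text{ probes}) + P(\ge 2 \text{ probes}) + \cdots
\end{aligned}$$
$$u = \sum_{k=1}^{\infty} P(\ge k \text{ probes}) = \sum_{k=1}^{\infty} P(\text{first } k-1 \text{ positions occupied}) = \sum_{k=1}^{\infty} \lambda^{k-1} = \frac{1}{1-\lambda}$$
  (independent events)
u increases with $\lambda$ => performance drops as $\lambda$ increases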
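The closed form for u can be checked numerically against partial sums of the series. This is a small sketch; the load factors are those listed in the Experimental Results table of this deck, and the function names are chosen here.

```python
def unsuccessful_probes(load: float) -> float:
    """Closed form: expected probes of an unsuccessful search, 1 / (1 - load)."""
    return 1.0 / (1.0 - load)


def partial_sum(load: float, terms: int) -> float:
    """Sum of load^(k-1) for k = 1 .. terms."""
    return sum(load ** (k - 1) for k in range(1, terms + 1))


for load in (0.25, 0.50, 0.75, 0.90, 0.95):
    print(f"{load:.2f}  series={partial_sum(load, 2000):.2f}  "
          f"closed form={unsuccessful_probes(load):.2f}")
```

The printed values can be compared with the DOUBLE column for unsuccessful search in the Experimental Results table.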

Successful Search
The hash sequence is not exhausted
  the number of probes to find a key equals the number of probes s at the time the key was inserted plus 1
  $\lambda$ was less at that time
  equivalent to considering u (unsuccessful search) for all values of $\lambda$ up to the current one
$$s = \frac{1}{\lambda}\int_0^{\lambda} (u + 1)\,dx = 1 + \frac{1}{\lambda}\ln\frac{1}{1-\lambda} \quad \text{(approximation)}$$
s increases with $\lambda$

Performance
The performance drops as $\lambda$ increases
  the higher the value of $\lambda$ is, the higher the probability of collisions
Unsuccessful search is more expensive than successful search
  unsuccessful search exhausts the hash sequence

Experimental Results
LOAD      SUCCESSFUL                     UNSUCCESSFUL
FACTOR    LINEAR   i + bkey   DOUBLE     LINEAR   i + bkey   DOUBLE
25%       1.17     1.16       1.15       1.39     1.37       1.33
50%       1.50     1.44       1.39       2.50     2.19       2.00
75%       2.50     2.01       1.85       8.50     4.64       4.00
90%       5.50     2.85       2.56       50.50    11.40      10.00
95%       10.50    3.52       3.15       200.50   22.04      20.00

Performance on Full Table


TABLE      SUCCESSFUL                     UNSUCCESSFUL   LOG2m
SIZE (m)   LINEAR   i + bkey   DOUBLE
100        6.60     4.62       4.12       50.50          6.64
500        14.35    6.22       5.72       250.50         8.97
1000       20.15    6.91       6.41       500.50         9.97
5000       44.64    8.52       8.02       2500.50        12.29
10000      63.00    9.21       8.71       5000.50        13.29

Separate Chaining
Keys hashing to the same hash value are stored in separate lists
  one list per hash position
  can store more than m records
  easy to implement
  the keys in each list can be ordered
[Figure: hash table with separate chaining, h(key) = key mod m. Keys shown in the chains: 40, 91, 42, 130, 372, 192, 75, 66, 16, 87, 67, 417, 227, 49; each chain ends with nil]
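Separate chaining can be sketched with one Python list per hash position. The keys are the ones visible in the figure; the table size m = 10 is an assumption (the figure only states h(key) = key mod m), and the class and method names are hypothetical.

```python
from typing import List


class ChainedHashTable:
    """Hash table with separate chaining: one list of keys per position."""

    def __init__(self, m: int) -> None:
        self.m = m
        self.chains: List[List[int]] = [[] for _ in range(m)]

    def insert(self, key: int) -> None:
        chain = self.chains[key % self.m]
        if key not in chain:        # ignore duplicates
            chain.append(key)

    def search(self, key: int) -> bool:
        return key in self.chains[key % self.m]

    def delete(self, key: int) -> bool:
        chain = self.chains[key % self.m]
        if key in chain:
            chain.remove(key)       # removing from a list needs no special marker
            return True
        return False


table = ChainedHashTable(10)        # m = 10 is assumed
for key in [40, 91, 42, 130, 372, 192, 75, 66, 16, 87, 67, 417, 227, 49]:
    table.insert(key)
print(table.chains)
```

The table holds 14 keys in 10 positions, illustrating that chaining can store more than m records.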

Performance of Separate Chaining
Depends on the average chain size
  insertions are independent events
  let P(c,n,m): probability that a position has been selected c times after n insertions on a table of size m
  P(c,n,m): probability that the chain has length c => binomial distribution
$$P(c,n,m) = \binom{n}{c} p^c q^{n-c}$$
  p = 1/m: success case
  q = 1 - p: failure case
$$P(c,n,m) = \binom{n}{c}\left(\frac{1}{m}\right)^{c}\left(1-\frac{1}{m}\right)^{n-c} = \frac{1}{c!}\,\frac{n}{m}\,\frac{n-1}{m}\cdots\frac{n-c+1}{m}\left(1-\frac{1}{m}\right)^{n-c}$$
For $n, m \to \infty$ with $\lambda = n/m$:
$$\frac{n-c+1}{m} \to \lambda, \qquad \left(1-\frac{1}{m}\right)^{n-c} \to e^{-\lambda}$$
$$P(c,n,m) \to \frac{1}{c!}\,\lambda^{c} e^{-\lambda} \quad \text{(Poisson)}$$
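The Poisson approximation of the binomial chain-length distribution can be checked numerically. The sizes n = 5000 and m = 10000 (load factor 0.5) are hypothetical values picked for this sketch, as are the function names.

```python
from math import comb, exp, factorial


def binomial(c: int, n: int, m: int) -> float:
    """P(c, n, m): a position is selected c times in n insertions, p = 1/m."""
    p = 1.0 / m
    return comb(n, c) * p ** c * (1.0 - p) ** (n - c)


def poisson(c: int, lam: float) -> float:
    """Limit for large n, m with lam = n/m."""
    return lam ** c * exp(-lam) / factorial(c)


n, m = 5000, 10000  # hypothetical sizes
lam = n / m
for c in range(5):
    print(c, round(binomial(c, n, m), 5), round(poisson(c, lam), 5))
```

For tables of this size the two distributions agree to several decimal places.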

Unsuccessful Search
The entire chain is searched
  the average number of comparisons equals its average length u
$$u = \sum_{c \ge 0} c\,P(c,\lambda) = \sum_{c \ge 0} c\,\frac{\lambda^{c}}{c!}\,e^{-\lambda} = \lambda$$

Successful Search
Not the whole chain is searched
  the average number of comparisons equals the length s of the chain at the time the key was inserted plus 1
  the performance at the time a key was inserted equals that of unsuccessful search!
$$s = \frac{1}{\lambda}\int_0^{\lambda} (u + 1)\,dx = \frac{1}{\lambda}\int_0^{\lambda} (x + 1)\,dx = 1 + \frac{\lambda}{2}$$
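Both chaining results, u equal to the load factor and s = 1 + (load factor)/2, can be verified numerically. This is only a sanity check under the Poisson model of the slides; the function names and the load factor 0.75 are chosen for the illustration.

```python
from math import exp, factorial


def poisson(c: int, lam: float) -> float:
    return lam ** c * exp(-lam) / factorial(c)


def mean_chain_length(lam: float, terms: int = 60) -> float:
    """u: sum of c * P(c, lam), truncated after `terms` terms."""
    return sum(c * poisson(c, lam) for c in range(terms))


def successful_comparisons(lam: float, steps: int = 10000) -> float:
    """s: midpoint-rule value of (1/lam) * integral from 0 to lam of (x + 1) dx."""
    width = lam / steps
    total = sum(((k + 0.5) * width + 1.0) * width for k in range(steps))
    return total / lam


lam = 0.75
print(mean_chain_length(lam), successful_comparisons(lam), 1 + lam / 2)
```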

Performance
The performance drops with the length of the chains
  worst case: all keys are stored in a single chain
  worst case performance: O(N)
Unsuccessful search performs better than successful search!! WHY?
No problem with deletions!!

Coalesced Hashing
The hash sequence is implemented as a linked list within the hash table
  no rehash function
  the next hash position is the next available position in the linked list
  extra space for the list

h(key) = key mod 10
keys: 19, 29, 49, 59
[Figure: table with positions 0-9; keys 49, 29, 59, 19 are stored in the table and linked in one list; the value 5 is also shown]

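initialization (avail points to the head of the list of empty positions)

position   key      next
0          nilkey   -1
1          nilkey   0
2          nilkey   1
3          nilkey   2
4          nilkey   3
5          nilkey   4
6          nilkey   5
7          nilkey   6
8          nilkey   7
9          nilkey   8

The next column holds the list of empty positions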

initially: avail = 9
h(key) = key mod 10
keys: 14, 29, 34, 28, 42, 39, 84, 38

position   key      next
0          nilkey   -1
1          nilkey   0
2          42       -1
3          38       -1
4          14       8
5          84       -1
6          39       -1
7          28       3
8          34       5
9          29       6

The next column holds the lists of rehashing positions and the list of empty positions
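A simplified coalesced hashing table can be sketched as follows. It is not the lecture's implementation: instead of threading the empty positions through the next fields, this sketch finds a free position by scanning downwards from `avail`, so the link values it produces are not claimed to match the figures. The class and method names are hypothetical.

```python
from typing import List, Optional

NIL = -1  # end-of-list marker, as in the figures


class CoalescedHashTable:
    def __init__(self, m: int) -> None:
        self.m = m
        self.keys: List[Optional[int]] = [None] * m  # None plays the role of nilkey
        self.next: List[int] = [NIL] * m
        self.avail = m - 1  # candidate for the next free position

    def _take_free_position(self) -> int:
        # Assumption of this sketch: scan downwards for an unused position.
        while self.avail >= 0 and self.keys[self.avail] is not None:
            self.avail -= 1
        if self.avail < 0:
            raise RuntimeError("table is full")
        return self.avail

    def insert(self, key: int) -> int:
        i = key % self.m
        if self.keys[i] is None:          # home position is free
            self.keys[i] = key
            return i
        while self.keys[i] != key and self.next[i] != NIL:
            i = self.next[i]              # walk to the end of the list
        if self.keys[i] == key:
            return i                      # key already stored
        j = self._take_free_position()
        self.keys[j] = key
        self.next[i] = j                  # link the new position to the list
        return j

    def search(self, key: int) -> Optional[int]:
        i = key % self.m
        while i != NIL and self.keys[i] is not None:
            if self.keys[i] == key:
                return i
            i = self.next[i]
        return None


table = CoalescedHashTable(10)
for key in [19, 29, 49, 59]:
    table.insert(key)
print(table.keys, table.next)
```

All four keys hash to position 9; the first occupies it and the others are placed in free positions that are linked from it, so a search follows the links instead of applying a rehash function.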

Performance of Coalesced Hashing
Unsuccessful search:
$$\frac{1}{4}e^{2\lambda} - \frac{\lambda}{2} + 0.75 \quad \text{probes/search}$$
Successful search:
$$\frac{e^{2\lambda} - 1}{8\lambda} + \frac{\lambda}{4} + 0.75 \quad \text{probes/search}$$