
Hashing on the Disk

Keys are stored in disk pages (buckets)
several records fit within one page
Retrieval:
find address of page
bring page into main memory
searching within the page comes for free

E.G.M. Petrakis

Hashing

[Figure: a hash function maps the key space to data pages 0, 1, 2, ..., m-1]

page size b: maximum number of records in page

space utilization u: measure of the use of space
$u = \frac{\#\,\text{stored records}}{\#\,\text{pages} \cdot b}$

Collisions
Keys that hash to the same address are stored within the same page
If the page is full:
i. page splits: allocate a new page and split page content between the old and the new page, or
ii. overflows: list of overflow pages
[Figure: a full page (xxxx) linked to an overflow page (xx)]

Access Time
Goal: find key in one disk access
Access time ~ number of accesses
Large u: good space utilization but many overflows or splits => more disk accesses
Non-uniform key distribution => many keys map to the same addresses => overflows or splits => more accesses

Categories of Methods
Static: require file reorganization
open addressing, separate chaining

Dynamic: dynamic file growth, adapt to file size
dynamic hashing,
extendible hashing,
linear hashing,
spiral storage

Dynamic Hashing Schemes


File size adapts to data size without total reorganization
Typically 1-3 disk accesses to access a key
Access time and u are a typical tradeoff
u between 50-100% (typically 69%)
Complicated implementation

Schemes With Index


Two disk accesses:
one to access the index, one to access the data
with index in main memory => one disk access
Problem: the index may become too large
[Figure: an index pointing to the data pages]
Dynamic hashing (Larson 1978)
Extendible hashing (Fagin et al. 1979)

Schemes Without Index


Ideally, less space and fewer disk accesses (at least one)
[Figure: the address space maps directly onto the data space]
Linear Hashing (Litwin 1980)
Linear Hashing with Partial Expansions (Larson 1980)
Spiral Storage (Martin 1979)

Hash Functions
Support for shrinking or growing file
shrinking or growing address space, the hash function adapts to these changes
hash functions using first (last) bits of key = $b_{n-1} b_{n-2} \ldots b_i b_{i-1} \ldots b_2 b_1 b_0$
$h_i(key) = b_{i-1} \ldots b_2 b_1 b_0$ supports $2^i$ addresses
$h_i$: one more bit than $h_{i-1}$ to address larger files
$$h_i(key) = \begin{cases} h_{i-1}(key) \\ h_{i-1}(key) + 2^{i-1} \end{cases}$$
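
A minimal sketch of the bit-based hash family, assuming integer keys and the "last bits" variant. The function names `h` and `h_from_previous` are illustrative, not from the slides. It shows that h_i equals h_{i-1} when bit b_{i-1} of the key is 0 and h_{i-1} + 2^(i-1) when that bit is 1.

```python
def h(i: int, key: int) -> int:
    """h_i(key): the last i bits of key, i.e. key mod 2^i."""
    return key & ((1 << i) - 1)


def h_from_previous(i: int, key: int) -> int:
    """h_i(key) rebuilt from h_{i-1}(key) and bit b_{i-1} of the key."""
    bit = (key >> (i - 1)) & 1              # b_{i-1}
    return h(i - 1, key) + bit * (1 << (i - 1))


key = 0b101101                               # 45
addresses = [h(i, key) for i in range(1, 5)]
print(addresses)                             # [1, 1, 5, 13]
```

Growing the file by one addressing bit therefore either leaves a key at its old address or moves it exactly 2^(i-1) addresses further.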

Dynamic Hashing (Larson 1978)


Two level index
primary h1(key): accesses a hash table
secondary h2(key): accesses a binary tree
[Figure: 1st level: h1(k) selects an entry (1-4) of the hash table; 2nd level: h2(k) selects a path in the binary tree index, whose leaves point to the data pages]

Index
Fixed (static): h1(key) = key mod m
Dynamic behavior on secondary index
h2(key) uses i bits of key
the bit sequence of $h_2 = b_{i-1} \ldots b_2 b_1 b_0$ denotes which path on the binary tree index to follow in order to access the data page
scan h2 from left to right (bit 1: follow right path, bit 0: follow left path)
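
The two-level lookup can be sketched as follows. This is an illustration under assumptions: the names `Leaf`, `Node` and `find_page` are hypothetical, the h2 bits are passed in already in scanning order, and how those bits are derived from the key is left to the caller.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Union


@dataclass
class Leaf:
    """A data page."""
    records: List[int] = field(default_factory=list)


@dataclass
class Node:
    """Internal node of a 2nd-level binary tree."""
    left: Union["Node", Leaf]     # followed when the scanned bit is 0
    right: Union["Node", Leaf]    # followed when the scanned bit is 1


def find_page(primary: Sequence[Union[Node, Leaf]], key: int, h2_bits: Sequence[int]):
    """Follow h1 into the primary table, then the h2 bits down the tree."""
    node = primary[key % len(primary)]        # h1(key) = key mod m
    for bit in h2_bits:
        if isinstance(node, Leaf):            # reached a data page: stop scanning
            break
        node = node.right if bit else node.left
    return node


# Primary index with m = 6 entries; entry 1 has a tree of depth 2, entry 5 a single page.
page_a, page_b, page_c, page_d = Leaf(), Leaf(), Leaf(), Leaf()
primary = [Leaf(), Node(page_a, Node(page_b, page_c)), Leaf(), Leaf(), Leaf(), page_d]
```

Under these assumptions, a key with h1 = 1 and first scanned bit 0 reaches `page_a` after one step, while a key with h1 = 5 reaches `page_d` whatever its h2 bits are.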

[Figure: example with h1(key) = key mod 6 (1st-level index entries 0-5) and a 2nd-level binary tree of depth 2, e.g. h2(key) = 10. The data pages (capacity b) are reached with h1=1, h2=0; h1=1, h2=01; h1=1, h2=11; h1=5, h2=any]

Insertions
Initially fixed size primary index and no data
insert record in new page under h1 address
if page is full, allocate one extra page
split keys between old and new page
use one extra bit in h2 for addressing
[Figure: primary index with entries 0-3. The first page is reached with h1=1, h2=any; after it splits, the two pages are reached with h1=1, h2=0 and h1=1, h2=1]

[Figure: index and storage over successive insertions (page capacity b). Pages are first reached with h1=0, h2=any and h1=3, h2=any; the page under h1=0 then splits into h2=0 and h2=1; after further splits the pages are reached with h1=0, h2=0; h1=0, h2=01; h1=0, h2=11; h1=3, h2=0; h1=3, h2=1]

Deletions
Find record to be deleted using h1, h2
Delete record
Check sibling page:
fewer than b records in the two pages together?
if yes merge the two pages
delete one empty page
shrink binary tree index by one level and reduce h2 by one bit

[Figure: a record is deleted and two sibling pages are merged; the binary tree index under that primary entry shrinks by one level]

Extendible Hashing (Fagin et.al. 1979)

Dynamic hashing without the primary index


Primary hashing is omitted
Only secondary hashing with all binary trees at the same level
The index shrinks and grows according to file size
Data pages attached to the index

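[Figure: dynamic hashing (primary index entries 0-4 with binary trees of different depths) compared with dynamic hashing with all binary trees at the same level, which gives an index with entries 00, 01, 10, 11; the tree depth is the number of address bits]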

Insertions
Initially 1 index and 1 data page
0 address bits
insert records in data page
global depth d: size of index $2^d$
local depth l: number of address bits
[Figure: the index points to storage consisting of one data page of capacity b]

Page 0 Overflows
[Figure: index with entries 0 and 1 pointing to two pages in storage; d: global depth = 1, l: local depth = 1]

Page 0 Overflows (cont.)


1 more key bit for addressing and 1 extra page => index doubles !!
Split contents of previous page between 2 pages according to next bit of key
Global depth d: number of index bits => $2^d$ index size
Local depth l: number of bits for record addressing

Page 0 Overflows (again)


[Figure: index with entries 00, 01, 10, 11 (global depth d = 2). The two pages with local depth 2 contain records with the same 2 bits of key; the page with l < d contains records with the same 1st bit of key]

Page 01 Overflows
1 more key bit for addressing
$2^{d-l}$: number of pointers to page
[Figure: the index doubles to entries 000, 001, ..., 111; local depths shown: 2, 3, 3]

Page 100 Overflows


no need to double index
page 100 splits into two (1 new page)
local depth l is increased by 1
[Figure: index with entries 000, 001, ..., 111; local depths shown: 2, 3, 3, 2, 2, where the pages produced by the split have their local depth increased by 1]

Insertion Algorithm
If l < d, split overflowed page (1 extra page)
If l = d, double index, split page and d is increased by 1 => 1 more bit for addressing
update pointers (either way):
a) if d prefix bits are used for addressing
d = d + 1;
for (i = (1 << d) - 1; i >= 0; i--) index[i] = index[i / 2];
b) if d suffix bits are used
for (i = 0; i <= (1 << d) - 1; i++) index[i + (1 << d)] = index[i];
d = d + 1;
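
The insertion rules above can be sketched in a few lines. This is a simplified in-memory illustration, not the slides' implementation: it assumes integer keys, suffix bits for addressing (variant b), no overflow pages, and the hypothetical class names `Page` and `ExtendibleHash`.

```python
class Page:
    def __init__(self, local_depth: int):
        self.l = local_depth          # local depth l
        self.keys = []


class ExtendibleHash:
    def __init__(self, b: int = 2):
        self.b = b                    # page capacity
        self.d = 0                    # global depth d
        self.index = [Page(0)]        # 2^d pointers to pages

    def _page(self, key: int) -> Page:
        return self.index[key & ((1 << self.d) - 1)]    # last d bits of the key

    def contains(self, key: int) -> bool:
        return key in self._page(key).keys

    def insert(self, key: int) -> None:
        page = self._page(key)
        if key in page.keys:          # duplicates are ignored in this sketch
            return
        if len(page.keys) < self.b:
            page.keys.append(key)
            return
        if page.l == self.d:          # l = d: double the index first
            self.index = self.index + self.index        # index[i + 2^d] = index[i]
            self.d += 1
        page.l += 1                   # split the page on one more key bit
        bit = 1 << (page.l - 1)
        new_page = Page(page.l)
        for i, ptr in enumerate(self.index):
            if ptr is page and i & bit:                 # redirect half of the pointers
                self.index[i] = new_page
        old_keys, page.keys = page.keys, []
        for k in old_keys:
            (new_page if k & bit else page).keys.append(k)
        self.insert(key)              # retry; may trigger another split


table = ExtendibleHash(b=2)
for k in (0, 4, 8):                   # keys sharing their last two bits
    table.insert(k)
print(table.d, len(table.index))      # 3 8
```

Three keys that agree on their last two bits force the index to double three times with b = 2, which is the behaviour the slides warn about for non-uniform keys.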

Deletion Algorithm
Find and delete record
Check sibling page
If fewer than b records in the two pages together
merge pages and free empty page
decrease local depth l by 1 (records in merged page have 1 less common bit)
if l < d everywhere => reduce index (half size)
update pointers

[Figure: delete with merging. Two sibling pages merge and the local depth of the merged page drops by 1; once l < d for every page, the index is halved from entries 000-111 to entries 00, 01, 10, 11]

Observations
A page splits and there are more than b keys with same next bit
take one more bit for addressing (increase l)
if d=l the index doubles again !!
Hashing might fail for non-uniform distributions of keys (e.g., multiple keys with same value)
if distribution is known, transform it to uniform
Dynamic hashing performs better for non-uniform distributions (affected locally)

Performance
For n records and page size b
expected size of index (Flajolet): $\frac{e}{b \ln 2}\, n^{1 + 1/b} \approx \frac{3.92}{b}\, n^{1 + 1/b}$
1 disk access/retrieval when index in main memory
2 disk accesses when index is on disk
overflows increase number of disk accesses
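
As a quick numeric check of the index-size estimate, the sketch below evaluates the formula. The function name `expected_index_size` is illustrative, and the result is only an asymptotic estimate, not an exact count.

```python
import math


def expected_index_size(n: int, b: int) -> float:
    """Estimate e / (b ln 2) * n^(1 + 1/b) for n records and page size b."""
    return math.e / (b * math.log(2)) * n ** (1 + 1 / b)


constant = math.e / math.log(2)       # the 3.92 factor in the approximation
print(round(constant, 2))             # 3.92
```

Because the exponent is 1 + 1/b, the estimate grows slightly faster than linearly in n, and larger pages bring the growth closer to linear.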

Storage Utilization with Page Splitting
before splitting: one full page with b records
after splitting: $u = \frac{b}{2b} = 50\%$
In general 50% < u < 100%
On the average u ~ ln 2 ~ 69% (no overflows)

Storage Utilization with Overflows
Achieves higher u and avoids index doubling (when d=l)
higher u is achieved for small overflow pages
u = 2b/3b ~ 66% after splitting
small overflow pages (e.g., b/2) => u = (b+b/2)/2b ~ 75%
double index only if the overflow overflows!!
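
The utilization figures right after a split can be reproduced with exact fractions. This is a small worked example with an assumed page size b = 4 and a hypothetical helper `utilization`; any even b gives the same ratios.

```python
from fractions import Fraction


def utilization(records: int, pages: int, b: int) -> Fraction:
    """u = stored records / (pages * b)."""
    return Fraction(records, pages * b)


b = 4
plain_split = utilization(b, 2, b)               # b records spread over 2 pages
full_overflow = utilization(2 * b, 3, b)         # page + full overflow page of size b, then 3 pages
small_overflow = utilization(b + b // 2, 2, b)   # page + full overflow page of size b/2, then 2 pages
print(plain_split, full_overflow, small_overflow)   # 1/2 2/3 3/4
```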

Linear Hashing (Litwin 1980)


Dynamic scheme without index
Indices refer to page addresses
Overflows are allowed
The file grows one page at a time
The page which splits is not always the one which overflowed
The pages split in a predetermined order

Linear Hashing (cont.)


Initially n empty pages
p points to the page that splits
Overflows are allowed
[Figure: the file of pages with pointer p at the page that splits next; pages may have overflow pages attached]

File Growing
A page splits whenever the splitting criterion is satisfied
a page is added at the end of the file
pointer p points to the next page
split contents of old page between old and new page based on key values

[Figure: file of 5 pages addressed with h0 = k mod 5
page 0: 125, 320, 90, 435 (overflow: 215)
page 1: 16, 711
page 2: 402, 27, 737, 712 (overflow: 522)
page 3: 613, 303
page 4: 4, 319
new element: 438; u = 17/22 > 80% => split]
b = b_page = 4, b_overflow = 1
initially n = 5 pages
hash function h0 = k mod 5
splitting criterion u > A%
alternatively split when overflow overflows, etc.

[Figure: the file after the split
page 0 (h1): 320, 90
page 1 (h0): 16, 711
page 2 (h0): 402, 27, 737, 712 (overflow: 522)
page 3 (h0): 613, 303, 438
page 4 (h0): 4, 319
page 5 (h1): 125, 435, 215
u = 18/25 < 80%]
Page 5 is added at end of file
The contents of page 0 are split between pages 0 and 5 based on hash function h1 = key mod 10
p points to the next page

Hash Functions
Initially h0=key mod n
As new pages are added at end of file, h0 alone becomes insufficient
The file will eventually double its size
In that case use h1=key mod 2n
In the meantime
use h0 for pages not yet split
use h1 for pages that have already split
Split contents of page pointed to by p based on h1

Hash Functions (cont.)


When the file has doubled its size, h0 is no longer needed
set h0=h1 and continue (e.g., h0=k mod 10)
The file will eventually double its size again
Deletions cause merging of pages whenever a merging criterion is satisfied (e.g., u < B%)

Hash Functions
Initially n pages and 0 <= h0(k) < n
Series of hash functions
$$h_{i+1}(k) = \begin{cases} h_i(k) \\ h_i(k) + n 2^i \end{cases}$$
Selection of hash function:
if $h_i(k) \geq p$ then use $h_i(k)$
else use $h_{i+1}(k)$
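
The address selection and the split step can be sketched together. This is a simplified in-memory illustration under assumptions: the class name `LinearHashFile` is hypothetical, h_i(k) is taken as k mod (n * 2^i), each page and its overflow chain are kept as one list, and the caller decides when the splitting criterion holds. The keys are the ones from the 5-page example.

```python
class LinearHashFile:
    def __init__(self, n: int):
        self.n = n                               # initial number of pages
        self.i = 0                               # level: h_i(k) = k mod (n * 2^i)
        self.p = 0                               # next page to split
        self.pages = [[] for _ in range(n)]      # page + overflow chain as one list

    def address(self, key: int) -> int:
        size = self.n * 2 ** self.i
        a = key % size                           # h_i(k)
        if a < self.p:                           # page a has already split
            a = key % (2 * size)                 # use h_{i+1}(k)
        return a

    def insert(self, key: int) -> None:
        self.pages[self.address(key)].append(key)

    def split(self) -> None:
        """Split page p using h_{i+1}; the new page is added at the end of the file."""
        size = self.n * 2 ** self.i
        old = self.pages[self.p]
        self.pages.append([k for k in old if k % (2 * size) != self.p])
        self.pages[self.p] = [k for k in old if k % (2 * size) == self.p]
        self.p += 1
        if self.p == size:                       # file has doubled: h_{i+1} becomes h_i
            self.i += 1
            self.p = 0


f = LinearHashFile(5)
for k in [125, 320, 90, 435, 215, 16, 711, 402, 27, 737, 712, 522, 613, 303, 4, 319, 438]:
    f.insert(k)
f.split()
print(f.pages[0], f.pages[5])                    # [320, 90] [125, 435, 215]
```

After the split, key 125 is addressed with k mod 10 because its h0 address (0) lies before p, while key 438 still uses k mod 5.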

Linear Hashing with Partial Expansions (Larson 1980)

Problem with Linear Hashing: pages to the right of p are split late
large chains of overflows on rightmost pages
Solution: do not wait that long to split a page
k partial expansions: take pages in groups of k
all k pages of a group split together
the file grows at lower rates

Two Partial Expansions


Initially 2n pages, n groups, 2 pages/group
groups: (0, n), (1, 1+n), ..., (i, i+n), ..., (n-1, 2n-1)
[Figure: file with pages 0, 1, ..., 2n and 2 pointers to pages of the same group]
Pages in same group split together => some records go to a new page at end of file (position: 2n)

1st Expansion

After n splits, all pages are split
the file has 3n pages (1.5 times larger)
the file grows at lower rate
after 1st expansion take pages in groups of 3 pages: (j, j+n, j+2n), 0 <= j < n
[Figure: the file grows from 2n to 3n pages (positions 0, 2n, 3n marked)]

2nd Expansion

After n splits the file has size 4n
repeat the same process having initially 4n pages in 2n groups
[Figure: file with pages 0, 1, ..., 2n, ..., 4n and 2 pointers to pages of the same group]
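
The grouping used in the two partial expansions can be written out explicitly. A minimal sketch, assuming pages are numbered from 0 and using a hypothetical helper `groups`.

```python
def groups(n: int, pages_per_group: int):
    """Groups (j, j+n, j+2n, ...) for 0 <= j < n."""
    return [tuple(j + t * n for t in range(pages_per_group)) for j in range(n)]


n = 3
first = groups(n, 2)      # 1st expansion: 2n pages grow to 3n
second = groups(n, 3)     # 2nd expansion: 3n pages grow to 4n
print(first)              # [(0, 3), (1, 4), (2, 5)]
print(second)             # [(0, 3, 6), (1, 4, 7), (2, 5, 8)]
```

Each group split adds one page at the end of the file, so n splits take the file from 2n to 3n pages and another n splits take it to 4n.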

[Figure: disk accesses per retrieval (axis ticks 1 to 1.6) against relative file size, for Linear Hashing and Linear Hashing with 2 partial expansions; b = 5]

             Linear     Linear Hashing   Linear Hashing
             Hashing    2 part. exp.     3 part. exp.
retrieval    1.17       1.12             1.09
insertion    3.57       3.21             3.31
deletion     4.04       3.53             3.56
b = 5, u = 0.85

Dynamic Hashing Schemes


Very good performance on membership, insert, delete operations
Suitable for both main memory and disk
b=1-3 records for main memory
b=1-4 Kbytes for disk
Critical parameter: space utilization u
large u => more overflows, bad performance
small u => fewer overflows, better performance
Suitable for direct access queries (random accesses) but not for range queries
