Hashing on the Disk
Keys are stored in disk pages (buckets)
several records fit within one page
Retrieval:
find address of page
bring page into main memory
searching within the page comes for free
E.G.M. Petrakis
Hashing
[Figure: a hash function maps each key of the key space to one of the data pages 0, 1, 2, ..., m-1]
page size b: maximum number of records in page
space utilization u: measure of the use of space

$$u = \frac{\#\,\text{stored records}}{\#\,\text{pages} \times b}$$

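The definition of space utilization can be checked with a small calculation. This is a minimal sketch; the function name `space_utilization` and the sample numbers are illustrative assumptions, not part of the slides.

```python
def space_utilization(stored_records: int, pages: int, b: int) -> float:
    """Return u = stored_records / (pages * b), with b the page size in records."""
    if pages <= 0 or b <= 0:
        raise ValueError("pages and b must be positive")
    return stored_records / (pages * b)


# Hypothetical file: 30 records stored in 10 pages that hold b = 4 records each.
u = space_utilization(stored_records=30, pages=10, b=4)
print(f"u = {u:.0%}")
```

A file whose pages are all full has u = 1, and a file that has just split one full page into two has u = 0.5 for those two pages.
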
Collisions
Keys that hash to the same address are stored within the same page
If the page is full:
i. page splits: allocate a new page and split page content between the old and the new page, or
ii. overflows: list of overflow pages
[Figure: a full page (xxxx) linked to an overflow page (xx)]

Access Time
Goal: find key in one disk access
Access time ~ number of accesses
Large u: good space utilization but many overflows or splits => more disk accesses
Non-uniform key distribution => many keys map to the same addresses => overflows or splits => more accesses

Categories of Methods
Static: require file reorganization
open addressing, separate chaining
Dynamic: dynamic file growth, adapt to file size
dynamic hashing,
extendible hashing,
linear hashing,
spiral storage

Dynamic Hashing Schemes

File size adapts to data size without total reorganization
Typically 1-3 disk accesses to access a key
Access time and space utilization u trade off against each other
u between 50% and 100% (typically 69%)
Complicated implementation

Schemes With Index

Two disk accesses:
one to access the index, one to access the data
with index in main memory => one disk access
Problem: the index may become too large
[Figure: an index whose entries point to the data pages]
Dynamic hashing (Larson 1978)
Extendible hashing (Fagin et al. 1979)

Schemes Without Index

Ideally, less space and fewer disk accesses (at least one)
[Figure: the address space maps directly onto the data space]
Linear Hashing (Litwin 1980)
Linear Hashing with Partial Expansions (Larson 1980)
Spiral Storage (Martin 1979)

Hash Functions
Support for shrinking or growing file
shrinking or growing address space, the hash function adapts to these changes
hash functions using first (last) bits of key $= b_{n-1} b_{n-2} \ldots b_i b_{i-1} \ldots b_2 b_1 b_0$
$h_i(\text{key}) = b_{i-1} \ldots b_2 b_1 b_0$ supports $2^i$ addresses
$h_i$: one more bit than $h_{i-1}$ to address larger files
$$h_i(\text{key}) = \begin{cases} h_{i-1}(\text{key}) \\ h_{i-1}(\text{key}) + 2^{i-1} \end{cases}$$

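The relation between consecutive hash functions can be sketched as follows. The sketch assumes non-negative integer keys and addressing by the last (least significant) bits; the names `h` and `h_from_previous` are illustrative only.

```python
def h(i: int, key: int) -> int:
    """h_i(key): the i least significant bits of key, i.e. key mod 2**i."""
    return key & ((1 << i) - 1)


def h_from_previous(i: int, key: int) -> int:
    """Compute h_i(key) from h_{i-1}(key) using one more key bit."""
    extra_bit = (key >> (i - 1)) & 1          # bit b_{i-1} of the key
    # Either h_{i-1}(key) (extra bit 0) or h_{i-1}(key) + 2**(i-1) (extra bit 1).
    return h(i - 1, key) + extra_bit * (1 << (i - 1))


key = 0b101101
print([h(i, key) for i in range(1, 7)])
```

Growing the address space from 2**(i-1) to 2**i addresses therefore either leaves a key where it was or moves it exactly 2**(i-1) addresses further.
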
Dynamic Hashing (Larson 1978)

Two level index
primary h1(key): accesses a hash table
secondary h2(key): accesses a binary tree
[Figure: two-level index. The 1st level is a hash table addressed by h1(k); the 2nd level consists of binary trees addressed by h2(k) whose leaves point to the data pages]

Index
Fixed (static): h1(key) = key mod m
Dynamic behavior on secondary index
h2(key) uses i bits of key
the bit sequence of h2 $= b_{i-1} \ldots b_2 b_1 b_0$ denotes which path on the binary tree index to follow in order to access the data page
scan h2 from left to right (bit 1: follow right path, bit 0: follow left path)

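The two-level lookup can be sketched as below. This is a minimal illustration, not Larson's implementation: the classes `Page` and `Node`, the function `find_page` and the sample keys are assumptions, and the h2 bit sequence is passed in as a string that is consumed in the order in which it is written.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Page:
    records: List[int] = field(default_factory=list)


@dataclass
class Node:
    left: Union["Node", Page]    # followed when the scanned bit is 0
    right: Union["Node", Page]   # followed when the scanned bit is 1


def find_page(index: list, m: int, key: int, h2_bits: str) -> Optional[Page]:
    """Return the data page for key, or None if no page is reached."""
    node = index[key % m]                 # 1st level: h1(key) = key mod m
    for bit in h2_bits:                   # 2nd level: follow the h2 path
        if not isinstance(node, Node):
            break                         # reached a page or an empty entry
        node = node.right if bit == "1" else node.left
    return node if isinstance(node, Page) else None


# Hypothetical index with m = 6: a depth-2 tree under entry 1, one page under entry 5.
page_a, page_b, page_c, page_d = Page([13]), Page([7]), Page([19]), Page([5, 11])
index = [None, Node(page_a, Node(page_b, page_c)), None, None, None, page_d]
print(find_page(index, 6, 7, "10").records)
```

An entry that points directly to a page ignores h2 entirely, which corresponds to the "h2 = any" case.
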
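[Figure: example index. The 1st level index h1(k) has entries 0-5; the binary tree under entry 1 (2nd level, h2(k)) leads to the data pages h1=1, h2=0; h1=1, h2=01; and h1=1, h2=11; entry 5 leads to a single data page (h1=5, h2=any). Each data page holds up to b records.]
h1(key) = key mod 6
h2(key) = 10: two bits, because the depth of the binary tree is 2
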
Insertions
Initially fixed size primary index and no data
insert record in new page under h1 address
[Figure: primary index with entries 0-3; the first record creates a page under entry 1 (h1=1, h2=any)]
if page is full, allocate one extra page
split keys between old and new page
use one extra bit in h2 for addressing
[Figure: the full page under entry 1 splits into pages h1=1, h2=0 and h1=1, h2=1]
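[Figure: four successive states of an index with entries 0-3 and its storage of pages of size b, with pages numbered 1-5. A page is created under entry 0 (h1=0, h2=any); it splits into h1=0, h2=0 and h1=0, h2=1 while a page appears under entry 3 (h1=3, h2=any); page h1=0, h2=1 then splits into h1=0, h2=01 and h1=0, h2=11, and the page under entry 3 splits into h1=3, h2=0 and h1=3, h2=1]
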
Deletions
Find record to be deleted using h1, h2
Delete record
Check sibling page:
fewer than b records in the two pages together?
if yes, merge the two pages
delete one empty page
shrink binary tree index by one level and reduce h2 by one bit
[Figure: deleting a record and merging two sibling pages under an index with entries 0-3]

Extendible Hashing (Fagin et al. 1979)
Dynamic hashing without the primary index
Primary hashing is omitted
Only secondary hashing with all binary trees at the same level
The index shrinks and grows according to file size
Data pages attached to the index
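[Figure: dynamic hashing (binary trees of different depths under index entries 0-4) compared with dynamic hashing with all binary trees at the same level. In the second case the leaves form an index addressed by a fixed number of address bits: 0, 1 with one bit; 00, 01, 10, 11 with two bits]
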
Insertions
Initially 1 index and 1 data page
0 address bits
insert records in data page
global depth d: size of index $2^d$
local depth l: number of address bits
[Figure: an index with a single entry pointing to one data page of size b in storage]

Page 0 Overflows
d: global depth = 1
l: local depth = 1
[Figure: the index now has entries 0 and 1, each pointing to its own data page in storage]

Page 0 Overflows (cont.)

1 more key bit for addressing and 1 extra page => index doubles !!
Split contents of previous page between 2 pages according to next bit of key
Global depth d: number of index bits => $2^d$ index size
Local depth l: number of bits for record addressing

Page 0 Overflows (again)

[Figure: index with entries 00, 01, 10, 11 (global depth d = 2). The two pages with local depth l = 2 contain records with the same 2 bits of key; the page with l < d contains records with the same 1st bit of key]

Page 01 Overflows
1 more key bit for addressing
$2^{d-l}$: number of pointers to page
[Figure: the index doubles to entries 000-111 (d = 3); the pages shown have local depths 2, 3 and 3]

Page 100 Overflows

no need to double index
page 100 splits into two (1 new page)
local depth l is increased by 1
[Figure: index with entries 000-111 (d = 3) after the split; the pages have local depths 2, 3, 3, 2, 2 (+1 for the split page)]

Insertion Algorithm
If l < d, split overflowed page (1 extra page)
If l = d, double the index, split the page and increase d by 1 => 1 more bit for addressing
update pointers (either way):
a) if d prefix bits are used for addressing
d = d + 1;
for (i = 2^d - 1; i >= 0; i--) index[i] = index[i/2];
b) if d suffix bits are used
for (i = 0; i <= 2^d - 1; i++) index[i + 2^d] = index[i];
d = d + 1;

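The insertion algorithm can be sketched as a small in-memory structure. This is a hedged illustration rather than the method of Fagin et al.: it assumes distinct non-negative integer keys, addressing by the d suffix bits of the key, and the class names `Page` and `ExtendibleHashFile` are hypothetical. With suffix bits, doubling the index amounts to appending a copy of it.

```python
class Page:
    def __init__(self, local_depth: int):
        self.local_depth = local_depth   # l: address bits shared by all keys here
        self.keys = []


class ExtendibleHashFile:
    def __init__(self, b: int):
        self.b = b                # page size: maximum number of keys per page
        self.d = 0                # global depth: the index has 2**d entries
        self.index = [Page(0)]    # initially 1 index entry and 1 data page

    def _slot(self, key: int) -> int:
        return key & ((1 << self.d) - 1)       # the d suffix bits of the key

    def contains(self, key: int) -> bool:
        return key in self.index[self._slot(key)].keys

    def insert(self, key: int) -> None:
        page = self.index[self._slot(key)]
        if len(page.keys) < self.b:
            page.keys.append(key)
            return
        if page.local_depth == self.d:         # l = d: double the index
            self.index = self.index + self.index
            self.d += 1
        # Now l < d: split the page on its next address bit (1 extra page).
        bit = 1 << page.local_depth
        page.local_depth += 1
        new_page = Page(page.local_depth)
        for i, target in enumerate(self.index):
            if target is page and i & bit:     # update pointers
                self.index[i] = new_page
        old_keys, page.keys = page.keys, []
        for k in old_keys:                     # redistribute on the new bit
            self.index[self._slot(k)].keys.append(k)
        self.insert(key)                       # retry; may trigger another split


table = ExtendibleHashFile(b=2)
for key in [0, 1, 2, 3, 4, 5]:
    table.insert(key)
print(table.d, [page.keys for page in table.index])
```

The retry at the end of `insert` never terminates if more than b keys agree on every address bit, so the sketch is only safe for distinct keys.
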
Deletion Algorithm
Find and delete record
Check sibling page
If fewer than b records in the two pages together
merge pages and free empty page
decrease local depth l by 1 (records in merged page have 1 less common bit)
if l < d everywhere => reduce index (half size)
update pointers

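The index reduction step can be sketched as follows. The helper names are illustrative assumptions; the index is modelled as a list of page labels and both the prefix and the suffix addressing variants are shown.

```python
def can_halve(local_depths: list, d: int) -> bool:
    """The index may shrink only when every page has local depth l < d."""
    return d > 0 and all(l < d for l in local_depths)


def halve_index_prefix(index: list) -> list:
    """Prefix addressing: entries 2i and 2i+1 point to the same page, keep one."""
    return index[::2]


def halve_index_suffix(index: list) -> list:
    """Suffix addressing: entries i and i + 2**(d-1) point to the same page."""
    return index[: len(index) // 2]


# Hypothetical state with d = 3 where every page has local depth 2.
index = ["A", "A", "B", "B", "C", "C", "D", "D"]
if can_halve([2, 2, 2, 2], d=3):
    index = halve_index_prefix(index)
print(index)
```

Halving is the exact inverse of the doubling performed during insertion, so d is decreased by 1 afterwards.
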
[Figure: delete with merging. Once the merge leaves l < d for every page, the index with entries 000-111 halves to entries 00, 01, 10, 11]

Observations
A page splits and there are more than b keys with same next bit
take one more bit for addressing (increase l)
if d=l the index doubles again !!
Hashing might fail for non-uniform distributions of keys (e.g., multiple keys with same value)
if distribution is known, transform it to uniform
Dynamic hashing performs better for non-uniform distributions (affected locally)

Performance
For n records and page size b
expected size of index (Flajolet):
$$\frac{e}{b \ln 2}\, n^{1 + \frac{1}{b}} \approx \frac{3.92}{b}\, n^{1 + \frac{1}{b}}$$
1 disk access/retrieval when index in main memory
2 disk accesses when index is on disk
overflows increase number of disk accesses

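The growth of the index can be explored numerically. The sketch assumes the estimate has the form (3.92 / b) n^(1 + 1/b), with the constant 3.92 taken as e / ln 2; the function name and the sample values of n and b are illustrative.

```python
import math


def expected_index_size(n: int, b: int) -> float:
    """Estimated number of index entries for n records and page size b."""
    return (math.e / (b * math.log(2))) * n ** (1 + 1 / b)


print(round(math.e / math.log(2), 2))      # the constant in front of n^(1+1/b)/b
for b in (10, 50, 100):
    print(b, round(expected_index_size(1_000_000, b)))
```

Because the exponent 1 + 1/b exceeds 1, the index grows slightly faster than linearly in n, and larger pages flatten that growth.
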
Storage Utilization with Page Splitting
[Figure: one full page of b records before splitting; two pages sharing the b records after splitting]
After splitting: $u = \frac{b}{2b} = 50\%$
In general 50% < u < 100%
On the average u ~ ln2 ~ 69% (no overflows)

Storage Utilization with Overflows
Achieves higher u and avoids index doubling (d=l)
higher u is achieved for small overflow pages
u = 2b/3b ~ 67% after splitting
small overflow pages (e.g., b/2) => u = (b+b/2)/2b = 75%
double index only if the overflow overflows!!

Linear Hashing (Litwin 1980)

Dynamic scheme without index
Indices refer to page addresses
Overflows are allowed
The file grows one page at a time
The page which splits is not always the one which overflowed
The pages split in a predetermined order

Linear Hashing (cont.)

Initially n empty pages
p points to the page that splits
Overflows are allowed
[Figure: a file of n pages with pointer p marking the page that splits next; overflow pages are chained to full pages]

File Growing
A page splits whenever the splitting criterion is satisfied
a page is added at the end of the file
pointer p points to the next page
split contents of old page between old and new page based on key values

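[Figure: example file before the split; keys legible in the figure, by page]
page 0: 125, 320, 90, 435 (overflow: 215)
page 1: 16, 711
page 2: 402, 27, 737, 712 (overflow: 522)
page 3: 613, 303
page 4: 4, 319
new element: 438
u = 17/22; split when u > 80%
b = b_page = 4, b_overflow = 1
initially n = 5 pages
hash function h0 = k mod 5
splitting criterion u > A%
alternatively split when overflow overflows, etc.
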
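[Figure: the file after the split; keys legible in the figure, by page and hash function in use]
page 0 (h1): 320, 90
page 1 (h0): 16, 711
page 2 (h0): 402, 27, 737, 712 (overflow: 522)
page 3 (h0): 613, 303, 438
page 4 (h0): 4, 319
page 5 (h1): 125, 435, 215
u = 18/25 < 80%
Page 5 is added at end of file
The contents of page 0 are split between pages 0 and 5 based on hash function h1 = key mod 10
p points to the next page
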
Hash Functions
Initially h0 = key mod n
As new pages are added at end of file, h0 alone becomes insufficient
The file will eventually double its size
In that case use h1 = key mod 2n
In the meantime
use h0 for pages not yet split
use h1 for pages that have already split
Split contents of page pointed to by p based on h1

Hash Functions (cont.)

When the file has doubled its size, h0 is no longer needed
set h0=h1 and continue (e.g., h0=k mod 10)
The file will eventually double its size again
Deletions cause merging of pages whenever a merging criterion is satisfied (e.g., u < B%)

Hash Functions
Initially n pages and 0 <= h0(k) < n
Series of hash functions:
$$h_{i+1}(k) = \begin{cases} h_i(k) \\ h_i(k) + n 2^i \end{cases}$$
Selection of hash function:
if h_i(k) >= p then use h_i(k)
else use h_{i+1}(k)

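The address computation and the split step of linear hashing can be sketched together. This is a simplified illustration under stated assumptions: h_i(k) = k mod (n * 2^i), each page is modelled as one unbounded list that stands for the primary page plus its overflow chain, and the splitting criterion is left to the caller. The class name `LinearHashFile` is hypothetical; the keys are those of the example with n = 5.

```python
class LinearHashFile:
    def __init__(self, n: int):
        self.n = n                  # initial number of pages
        self.i = 0                  # current level: h_i(k) = k mod (n * 2**i)
        self.p = 0                  # next page to split
        self.pages = [[] for _ in range(n)]

    def address(self, key: int) -> int:
        a = key % (self.n * 2 ** self.i)              # h_i(k)
        if a < self.p:                                # page a has already split
            a = key % (self.n * 2 ** (self.i + 1))    # use h_{i+1}(k)
        return a

    def insert(self, key: int) -> None:
        self.pages[self.address(key)].append(key)

    def split(self) -> None:
        """Split page p, add one page at the end of the file, advance p."""
        old_keys = self.pages[self.p]
        self.pages[self.p] = []
        self.pages.append([])                         # new page p + n * 2**i
        modulus = self.n * 2 ** (self.i + 1)
        for key in old_keys:
            self.pages[key % modulus].append(key)
        self.p += 1
        if self.p == self.n * 2 ** self.i:            # the file has doubled
            self.i += 1
            self.p = 0


keys = [125, 320, 90, 435, 16, 711, 402, 27, 737, 712,
        613, 303, 4, 319, 215, 522, 438]
f = LinearHashFile(n=5)
for k in keys:
    f.insert(k)
f.split()                            # page 0 splits using k mod 10
print(f.pages[0], f.pages[5], f.p)
```

After the split, keys that hash to page 0 under h0 are looked up with h1, while all other keys still use h0.
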
Linear Hashing with Partial Expansions (Larson 1980)
Problem with Linear Hashing: pages to the right of p are split late
large chains of overflows on rightmost pages
Solution: do not wait that long to split a page
k partial expansions: take pages in groups of k
all k pages of a group split together
the file grows at lower rates

Two Partial Expansions

Initially 2n pages, n groups, 2 pages/group
groups: (0, n), (1, 1+n), ..., (i, i+n), ..., (n-1, 2n-1)
[Figure: a file with pages 0, 1, ..., 2n and 2 pointers to pages of the same group]
Pages in same group split together => some records go to a new page at end of file (position: 2n)

1st Expansion
After n splits, all pages are split
the file has 3n pages (1.5 times larger)
the file grows at lower rate
after 1st expansion take pages in groups of 3 pages: (j, j+n, j+2n), 0 <= j < n
[Figure: the file grows from 2n to 3n pages]

2nd Expansion
After n splits the file has size 4n
repeat the same process having initially 4n pages in 2n groups
[Figure: a file of 4n pages with 2 pointers to pages of the same group]

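The grouping of pages in the two partial expansions can be written out explicitly. This is a small sketch with illustrative function names; it only lists which pages split together and how the file size develops, not how records are redistributed.

```python
def groups_first_expansion(n: int) -> list:
    """Groups of 2 pages (j, j + n) used while the file grows from 2n to 3n."""
    return [(j, j + n) for j in range(n)]


def groups_second_expansion(n: int) -> list:
    """Groups of 3 pages (j, j + n, j + 2n) used while it grows from 3n to 4n."""
    return [(j, j + n, j + 2 * n) for j in range(n)]


n = 3
print(groups_first_expansion(n))     # each split adds one page, starting at 2n
print(groups_second_expansion(n))    # each split adds one page, starting at 3n
sizes = [2 * n, 2 * n + n, 2 * n + 2 * n]   # file size before, between, after
print(sizes)
```

Every group split adds exactly one page, so a full expansion (doubling from 2n to 4n) takes 2n group splits instead of the 2n single-page splits of plain linear hashing, but each split relieves several pages at once.
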
[Figure: disk accesses per retrieval (axis from 1 to 1.6) versus relative file size (axis from 1.2 to 1.8) for Linear Hashing and for Linear Hashing with 2 partial expansions; b = 5]

b = 5, u = 0.85
            Linear Hashing   Linear Hashing 2 part. exp.   Linear Hashing 3 part. exp.
retrieval   1.17             1.12                          1.09
insertion   3.57             3.21                          3.31
deletion    4.04             3.53                          3.56

Dynamic Hashing Schemes

Very good performance on membership, insert, delete operations
Suitable for both main memory and disk
b=1-3 records for main memory
b=1-4 Kbytes for disk
Critical parameter: space utilization u
large u => more overflows, bad performance
small u => fewer overflows, better performance
Suitable for direct access queries (random accesses) but not for range queries
