
Introduction to Indexing

Acknowledgements: Eamonn Keogh and Chotirat Ann Ratanamahatana

File Organization and Indexing


Assume that we have a large amount of data in our database, which lives on one or more hard drives.
What are some of the things we might wish to do with the data?
- Scan: fetch all records from disk
- Equality search
- Range search
- Insert a record
- Delete a record
How expensive are these operations (in terms of execution time)?

Ways to Organize Data


The cost of the operations listed above depends on how we organize the data.
There are three main ways we could organize the data:
- Heap Files
- Sorted Files (Tree-Based Indexing)
- Hash-Based Indexing
For each organization, consider the cost of: scan, equality search, range search, insert a record, delete a record.

Important Points
Data which is organized based on one field may be difficult to search based on a different field.
Consider a phone book. The data is well organized if you want to find Jessica Lin's phone number. On the other hand, finding out who the number 234-2342 belongs to is much harder!
Informally, the attribute we are most interested in searching is called the search key, or just key (we will formalize this notion later).
Note that the search key can be a combination of fields; for example, phone books are organized by <Last_name, First_name>.
Unfortunately, the word "key" is overloaded in databases; the word key in this context has nothing to do with primary key, candidate key, etc.
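The phone-book point can be sketched in code. The names and numbers below are made up; the bisect-based lookup is just one way to exploit the sort order:

```python
import bisect

# A hypothetical phone book, kept sorted on the composite
# search key <Last_name, First_name>.
book = sorted([
    ("Lin", "Jessica", "234-2342"),
    ("Keogh", "Eamonn", "555-0001"),
    ("Simpson", "Homer", "555-7334"),
])

# Equality search on the organizing fields: O(log N) via binary search.
i = bisect.bisect_left(book, ("Lin", "Jessica"))
match = book[i] if i < len(book) and book[i][:2] == ("Lin", "Jessica") else None

# Search on a different field: nothing to exploit, so a full O(N) scan.
owner = next((r for r in book if r[2] == "234-2342"), None)
```

The same data answers both questions, but only the search on the organizing fields benefits from the organization.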

1) Heap Files
The data is unsorted in heap files. We can initially build the
database in sorted order, but if our application is dynamic, our database
will become unsorted very quickly. So we assume that heap files are
unsorted.

[Figure: a heap file. A header page links to the data pages, which are kept on two lists: full pages, and pages with free space.]
2) Sorted File (Tree Based Indexing)


If we are willing to pay the overhead of keeping the
data sorted on some field, we can index the data on
that field.
[Figure: a tree index on that field. The root holds key 17; entries <= 17 are found via the left subtree, entries > 17 via the right. Internal keys such as 5, 13, 14, 16, 18, 30, 35, 43 lead down to the data pages.]
3) Hash Based Indexing


With hash-based indexing, we assume that we have a
function h, which tells us where to place any given
record.
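A sketch of the idea, with a hypothetical h built from Python's built-in hash and an illustrative bucket count:

```python
# Hash-based placement: h maps a record's key to one of B data pages.
B = 4  # number of data pages (buckets); an illustrative choice

def h(key):
    return hash(key) % B

pages = [[] for _ in range(B)]

def insert(record_key):
    # One page access: h tells us exactly where the record goes.
    pages[h(record_key)].append(record_key)

def equality_search(record_key):
    # Only the single page h points at needs to be examined.
    return record_key in pages[h(record_key)]

for k in [8, 3, 17, 5, 1]:
    insert(k)
assert equality_search(17)
assert not equality_search(99)
```

Note what this buys and what it costs: equality search touches one page, but because h scatters the keys, range search gets no help at all.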

[Figure: a hash function h directs each record to one of several data pages.]

Basic Concepts
Indexing mechanisms are used to speed up access to desired
data.
e.g., author catalog in library

Search key: an attribute (or set of attributes) used to look up records in a file.
An index file consists of records (called index entries, or data entries) of the form <search-key, pointer>.
Index files are typically much smaller than the original file.
Two basic kinds of indices:
- Ordered indices: search keys are stored in sorted order (i.e., tree-based)
- Hash indices: search keys are distributed uniformly across buckets using a hash function

How do indices achieve speedup?


[Figure: a chain of <search-key, pointer> index entries pointing into a file of larger records (Maggie, Marge, Homer, Bart, Lisa, Seymour, Apu, Manjula, Lenny, each with a color and a balance).]

1) An index entry is typically much smaller than a record
2) A data entry may point to several records

Chapter 10
Tree-Structured Indexing
First, we look at a simple approach (ISAM)
We will see why it is unsatisfactory
This will motivate the B+ tree

Indexed Sequential Access Method


If our large database is sorted, we can speed up search by
doing binary search on the entire database.
However, this means we must do log(N) disk accesses
The idea of ISAM is to do a faster, approximate binary search
in main memory, and use this information to do fewer disk
accesses (usually only one).

[Figure: the sorted database stored as data pages 1 through N.]

Indexed Sequential Access Method


An index entry is a <key, pointer> pair, where key is the value of the first key on a page, and pointer points to that page.
Example: the entry <Maggie, page 7> points to data page 7, whose first record is Maggie (followed by Manjula, Marge, Monty).

Indexed Sequential Access Method


An index file is a concatenation of index entries, together with one extra pointer at the beginning:

P0 | K1 P1 | K2 P2 | K3 P3

Example: page 0 | Maggie → page 1 | … → page 2 | Waylon → page 3

Indexed Sequential Access Method


Let's look at an index file (this one is the smallest possible example):

P0 | 7 | P1

Every record pointed to by the left pointer (P0) has a value less than 7.
Every record pointed to by the right pointer (P1) has a value greater than or equal to 7.

Indexed Sequential Access Method


Instead of doing binary search on the data file, we can do binary search on the index to find the largest key that is less than or equal to the search key. We then use its pointer to retrieve the relevant page from disk.
Example: we are searching for 8. Binary search on the index finds 5, so we retrieve page 2 and search it for a match (if there is one).
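The two-step lookup (binary search the in-memory index, then one disk read) can be sketched with Python's bisect module; the index keys and page numbers below are illustrative:

```python
import bisect

# In-memory ISAM index: the first key on each data page, paired with
# the page number (all values here are made up for illustration).
index_keys  = [1, 5, 12, 16, 19]   # first key on pages 1..5
index_pages = [1, 2, 3, 4, 5]

def page_for(search_key):
    # The largest index key <= search_key names the page to fetch.
    i = bisect.bisect_right(index_keys, search_key) - 1
    return index_pages[max(i, 0)]

# Searching for 8: binary search finds 5, so we would retrieve page 2.
assert page_for(8) == 2
```

Only the final page retrieval touches the disk; the binary search happens entirely in memory, which is the whole point of ISAM.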

[Figure: the index file (entries such as <12, p>, <16, p>, <19, p>) sitting above data pages 1 … N.]

Indexed Sequential Access Method


How big should the index file be?
How about more pointers per page?
We could have two pointers to each page (on average, or exactly).
This does not help, because we have to retrieve a block at a time.

[Figure: an index file with two <key, pointer> entries per data page (e.g., keys 34 and 77 pointing into the same page) above data pages 1 … N.]

Indexed Sequential Access Method


How big should the index file be?
How about fewer pointers per page?
We could have one pointer for every two pages (on average, or exactly).
This might help, because it makes the index smaller. We can do a little trick of adding sideways pointers.

[Figure: a smaller index file, with one <key, pointer> entry per two data pages; sideways pointers link each data page to the next, so the pages with no index entry remain reachable.]

Indexed Sequential Access Method


We have seen that too small or too large an index (in other words, too few or too many pointers) can be a problem. But suppose the index does not fit in main memory?
The key observation is that the index itself is a sort of database, so let's build an index on the index!

[Figure: a small second-level index (e.g., the entry <21, p>) built on top of the index file; the index file in turn points to data pages 1 … N.]

Tree Based Indexing


An index of indices is a tree!
We can use this structure to do fast equality search (e.g., find 15, find 0).
What about range search?
It looks like we have solved our fast indexing problem, but there is a catch: what happens if we have a deletion, or an insertion?
Define: root, internal node, leaf.
[Figure: a tree index over data pages 1 through 10. The root holds 17; entries <= 17 descend left, entries > 17 descend right; internal keys such as 5, 13, 14, 16, 18, 30, 35, 43 lead to the data pages.]

Tree Based Indexing


What happens if we have a deletion? (not much)
What happens if we have an insertion? (trouble!)
Solution: Overflow Buckets
If we have enough overflow buckets, we might as well have no index at all
Suppose we add a bunch of 15-year-olds to the database.

[Figure: the same tree as before, but the data page that should hold the 15s is full, so an overflow page (Overflow 1) has been chained onto it.]

B+-Tree Index Files


B+-tree indices are an alternative to indexed-sequential files.
Disadvantage of indexed-sequential files: performance degrades as the file grows, since many overflow blocks get created. Periodic reorganization of the entire file is required.
Advantage of B+-tree index files: the tree automatically reorganizes itself with small, local changes in the face of insertions and deletions. Reorganization of the entire file is not required to maintain performance.
Disadvantage of B+-trees: extra insertion and deletion overhead, and space overhead.
The advantages of B+-trees outweigh the disadvantages, and they are used extensively.

B+-Tree Index Files (Cont.)


A B+-tree is a rooted tree satisfying the following properties:
- All paths from root to leaf are of the same length.
- There are two types of nodes: index (internal) nodes and data (leaf) nodes. Each node is one disk page.
- Each node must have a minimum 50% occupancy (except for the root): each node contains m entries/pointers with d <= m <= 2d, where d is the order (branching factor/capacity) of the tree.
- The root must have at least 2 children.
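The occupancy invariant can be stated as a one-line check (d = 2 is an illustrative order):

```python
# Occupancy invariant: with order d, every non-root node holds
# m entries, with d <= m <= 2d. The root is exempt from the minimum.
d = 2  # order of the tree; an illustrative choice

def node_ok(num_entries, is_root=False):
    if is_root:
        return num_entries <= 2 * d
    return d <= num_entries <= 2 * d

assert node_ok(2)
assert not node_ok(1)            # underfull: violates 50% occupancy
assert node_ok(1, is_root=True)  # but fine for the root
```

Keeping every node at least half full is what bounds the tree's height, as the lookup-cost analysis below the example relies on.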

B+-Trees Example
[Figure: root node [17]. Entries <= 17 descend to the left child [5 | 13], whose leaves are 2* 3* | 5* 7* 8* | 14* 16*. Entries > 17 descend to the right child [27 | 30], whose leaves are 22* 24* | 27* 29* | 33* 34* 38* 39*.]

Queries on B+-Trees

Find all records with a search-key value of k.
1. Start with the root node:
   a. Examine the node for the smallest search-key value > k.
   b. If such a value exists, assume it is Ki; then follow Pi to the child node.
   c. Otherwise k >= Km-1, where there are m pointers in the node; then follow Pm to the child node.
2. If the node reached by following the pointer above is not a leaf node, repeat the above procedure on that node, and follow the corresponding pointer.
3. Eventually we reach a leaf node. If for some i, key Ki = k, follow pointer Pi to the desired record. Else no record with search-key value k exists.
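The procedure above can be sketched on the running example tree. The node layout, and the tie-breaking convention that a key equal to a separator descends right, are assumptions of this sketch:

```python
import bisect

# Toy B+-tree node: internal nodes hold (keys, children);
# leaves hold sorted keys standing in for the k* data entries.
class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children  # None for a leaf

def find(node, k):
    while node.children is not None:
        # Index of the smallest key > k; falls through to the
        # last pointer when no such key exists.
        i = bisect.bisect_right(node.keys, k)
        node = node.children[i]
    return k in node.keys

# The tree from the example: root [17], children [5, 13] and [27, 30].
leaves = [Node([2, 3]), Node([5, 7, 8]), Node([14, 16]),
          Node([22, 24]), Node([27, 29]), Node([33, 34, 38, 39])]
left  = Node([5, 13], leaves[0:3])
right = Node([27, 30], leaves[3:6])
root  = Node([17], [left, right])

assert find(root, 29) is True
assert find(root, 28) is False   # 28* is not among the data entries
```

Each iteration of the while loop corresponds to one node (one disk page) on the root-to-leaf path.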

Queries on B+-Trees
Find 28*, Find 0*, Find all records > 25

[Figure: the same B+-tree as above: root [17], children [5 | 13] and [27 | 30], leaves 2* 3* | 5* 7* 8* | 14* 16* | 22* 24* | 27* 29* | 33* 34* 38* 39*.]

Queries on B+-Trees (Cont.)


In processing a query, a path is traversed in the tree from the root to some leaf node.
If there are K search-key values in the file, the path is no longer than log_(n/2)(K).
A node is generally the same size as a disk block, e.g., 4 kilobytes, and n = 2d is typically around 100 (40 bytes per index entry).
With 1 million search-key values and n = 100, at most log_50(1,000,000) = 4 nodes are accessed in a lookup.
Contrast this with a balanced binary tree over 1 million search-key values: around 20 nodes are accessed in a lookup.
This difference is significant, since every node access may need a disk I/O, costing around 20 milliseconds!
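The arithmetic checks out:

```python
import math

# With fanout n = 100 and every node at least half full,
# the effective branching factor is n/2 = 50.
K = 1_000_000
height = math.ceil(math.log(K, 50))   # worst-case nodes accessed
assert height == 4

# A balanced binary tree over the same keys is about 20 levels deep.
binary_depth = math.ceil(math.log2(K))
assert binary_depth == 20
```

At roughly one disk I/O per node, that is 4 reads versus 20 per lookup.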

Updates on B+-Trees: Insertion

Find the leaf node in which the search-key value would appear.
If the search-key value is already there in the leaf node, the record is added to the file, and if necessary a pointer is inserted into the bucket.
If the search-key value is not there, then add the record to the main file and create a bucket if necessary. Then:
- If there is room in the leaf node, insert the (key-value, pointer) pair in the leaf node.
- Otherwise, split the node (along with the new (key-value, pointer) entry), as discussed in the next slide.
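The leaf-level rule can be sketched as follows; capacity 2d = 4 is an illustrative choice, and maintenance of the parent node is omitted:

```python
import bisect

# Insert into a leaf if there is room; otherwise split it and
# report the key that must be copied up to the parent.
CAPACITY = 4  # 2d entries per leaf; illustrative

def insert_into_leaf(leaf, key):
    bisect.insort(leaf, key)
    if len(leaf) <= CAPACITY:
        return None                  # easy case: no split needed
    mid = len(leaf) // 2
    left, right = leaf[:mid], leaf[mid:]
    # The first key of the new right sibling is *copied* up.
    return left, right, right[0]

# Easy case, as in the "Insert 23" example: the leaf has room.
assert insert_into_leaf([19, 20, 22], 23) is None

# Split case, as in the "Insert 8" example: 5 is copied up.
left, right, copied_up = insert_into_leaf([2, 3, 5, 7], 8)
assert (left, right, copied_up) == ([2, 3], [5, 7, 8], 5)
```

The returned key then has to be inserted into the parent, which may itself split, as the following slides show.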

Updates on B+-Trees: Insertion

Insert 23

[Figure: a tree with internal keys 13 | 17 | 24 | 30 over leaves 2* 3* 5* 7* | 14* 16* | 19* 20* 22* | 24* 27* 28* | 40* 41* 45* 77*. The leaf 19* 20* 22* has free space, so 23* simply joins it: 19* 20* 22* 23*.]

This is the easy case!

Updates on B+-Trees: Insertion

Insert 8

[Figure: the leftmost leaf 2* 3* 5* 7* is full, so after the insertion it is split into two leaves, 2* 3* and 5* 7* 8*.]

Because the insertion would overfill the leaf node, we split it into two nodes and distribute the data evenly between them. 5 is special, since it discriminates between the two new siblings, so it is copied up.
We now need to insert 5 into the parent node.

Updates on B+-Trees: Insertion

We now need to insert 5 into the parent node.

[Figure: the parent node 13 | 17 | 24 | 30 is already full. Splitting it yields two internal nodes, 5 | 13 and 24 | 30, with a new root holding 17.]

Because the insertion would overfill the internal node, we split it into two nodes and distribute the entries between them. 17 is special, since it discriminates between the two new siblings, so it is pushed up (not copied, as it would be in a leaf split).

Updates on B+-Trees: Insertion

[Figure: the final tree. Root 17; internal nodes 5 | 13 and 24 | 30; leaves 2* 3* | 5* 7* 8* | 14* 16* | 19* 20* 22* | 24* 27* 28* | 40* 41* 45* 77*.]

The insertion of 8 has increased the height of the tree by one (this is rare).