
File Organisation

Placing File on Disk



A file is a sequence of records. Each record type defines:
- the record fields
- the data type of each field
- the number of bytes in each field (fixed or variable)


Record Characteristics


A logical view:
  SELECT * FROM STUDENTS, or
  (Smith, 17, 1, CS), (Brown, 8, 2, CS), or
  STUDENT(Name, Number, Class, Major)

A physical view:
  a stored byte string, e.g. (20 bytes + 4 bytes + 4 bytes + 3 bytes) = 31 bytes per record

- data types determine record length
- records can be of fixed or variable length

Fixed Versus Variable Length Records




FIXED LENGTH:
- every record has the same fields
- each field can be located relative to the record start

VARIABLE LENGTH - FIELDS:
- some fields have unknown length: use a field separator
- use a record terminator

WHAT IF RECORDS ARE SMALLER THAN A BLOCK? - BLOCKING FACTOR
WHAT IF RECORDS ARE LARGER THAN A BLOCK? - SPANNED RECORDS


Record blocking: allocating records to disk blocks.

Unspanned records:
- each record is fully contained in one block
- many records fit in one block
- blocking factor bfr = number of records that fit in one block
- Example: block size B = 1024 bytes, fixed record size R = 150 bytes, so
  bfr = floor(1024/150) = 6 (floor function)

Spanned organization:
- a record may be continued on the consecutive block
- requires a pointer to the block holding the remainder of the record

If records are of variable length, then bfr represents the average number of records per block (the rounding function does not apply). See the sketch below.
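
As a quick check of the arithmetic, here is a minimal Python sketch of these two calculations (the 10,000-record file size is an assumed figure, not from the slides):

  import math

  def blocking_factor(block_size, record_size):
      # Unspanned organization: only whole records that fit in one block count
      return block_size // record_size          # floor function

  def blocks_needed(num_records, bfr):
      # A file of r records needs ceiling(r / bfr) blocks
      return math.ceil(num_records / bfr)       # ceiling function

  print(blocking_factor(1024, 150))             # 6 records per block
  print(blocks_needed(10000, 6))                # 1667 blocks for 10,000 records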

File structure


A file is a set of pages (disk blocks) storing records.

The file header stores:
- the record format and types of separators
- block address(es)

Blocks can be allocated:
- contiguously
- linked (use of block pointers)
- in linked clusters
- indexed


Searching for a record


To search for a record on disk, one or more file blocks are copied into main-memory buffers. Programs then search for the desired record in the buffers, using the information in the file header. If the address of the block containing the desired record is not known, the search programs must do a linear search through the file blocks: each file block is copied into a buffer and searched, either until the record is located or until all the file blocks have been searched unsuccessfully. The goal of a good file organization is to locate the block that contains a desired record with a minimal number of block transfers.

Operations on Files
Because of the complex path from stored data to user, a DBMS offers a range of I/O operations:

OPEN - access the file and prepare the file pointer
FIND (LOCATE) - find the first record satisfying a condition
FINDNEXT
FINDALL - returns a set
READ
INSERT
DELETE
MODIFY
CLOSE
REORGANISE
READ-ORDERED (FIND-ORDERED) - returns a set

File organization and access method




Difference between the terms file organization and access method:

A file organization is the organization of the data of a file into records, blocks, and access structures, i.e. the way of placing records and blocks on the storage medium.

An access method provides a group of operations that can be applied to a file resulting in retrieval, modification and reorganisation.


One file organization can support many different access methods. Some access methods, though, can be applied only to files with a specific file organization; for example, one cannot apply an indexed access method to a file without an index.

Why do Access Methods matter


- The unit of transfer between disk and main memory is a block
- Data must be in memory for the DBMS to use it
- DBMS memory is handled in units of a page, e.g. 4K or 8K; pages in memory represent one or more hardware blocks from the disk
- If a single item is needed, the whole block is transferred
- The time taken for an I/O depends on the location of the data on the disk, and is lower if the seek times and rotational delays are small; recall that:

  access time = seek time + rotational delay + transfer time


The reason many DBMSs do not rely on the OS file system is:
- higher-level DB operations, e.g. JOIN, have a known pattern of page accesses and can be translated into known sets of I/O operations
- the buffer manager can PRE-FETCH pages by anticipating the next request; this is especially efficient when the required data are stored CONTIGUOUSLY on disk

Simple File Organisations


Unordered files of records: Heap or Pile file
- new records are inserted at EOF, or anywhere
- locating a record requires a linear search
- insertion is easy
- retrieval of an individual record, or retrieval in any particular order, is difficult (time consuming)
- Question: how many blocks, on average, does one need to read to find a single record?

Fast: Select * from Course
Slow: Select count(*) from Course group by Course_Number

Operations on Unordered File


Inserting a new record is very efficient:
- the address of the last file block is kept in the file header
- the last disk block of the file is copied into a buffer page
- the new record is added (or a new page is opened); the page is then rewritten back to the disk block

Searching for a record using any search condition in a file stored in b blocks:
- linear search through the file, block by block
- Cost = b/2 block transfers on average, if exactly one record satisfies the search condition
- Cost = b block transfers if no record or several records satisfy the search condition: the program must read and search all b blocks in the file
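
A minimal Python sketch of this linear search, assuming each block is held as a list of (key, record) pairs:

  def linear_search(blocks, key):
      # Returns (record, number of block transfers); (None, b) if not found
      blocks_read = 0
      for block in blocks:                  # one block transfer per iteration
          blocks_read += 1
          for k, record in block:
              if k == key:
                  return record, blocks_read    # found: b/2 transfers on average
      return None, blocks_read                  # not found: all b blocks read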

To delete a record:
- find its block and copy the block into a buffer page
- delete the record from the buffer
- rewrite the updated page back to the disk block

Note: unused space in the block can be reused later for a new record, if suitable (some bookkeeping on unused space in file blocks is necessary).


Special Deletion Procedures


Technique used for record deletion:
- each record has an extra byte or bit, called a deletion marker, set to 1 at insertion *)
- do not physically remove a deleted record; instead, reset its deletion marker to 0 when it is deleted
- a record with deletion marker set to 0 is not used by application programs
- from time to time, reorganise the file: physically remove the deleted records and reclaim the unused space

*) Just for simplicity we assume that the values of deletion markers are 0 or 1. A system can actually choose other characters or combinations of bits as values of deletion markers.
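
A minimal Python sketch of the deletion-marker technique over an in-memory list of records (the dictionary field "key" is an assumed record layout, for illustration only):

  VALID, DELETED = 1, 0

  def insert(records, record):
      records.append([VALID, record])       # marker set to 1 at insertion

  def delete(records, key):
      for entry in records:
          if entry[0] == VALID and entry[1]["key"] == key:
              entry[0] = DELETED            # logical delete: reset marker to 0
              return True
      return False

  def reorganise(records):
      # Periodic reorganisation: physically remove logically deleted records
      return [e for e in records if e[0] == VALID]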

Simple File Organisations


Ordered files of records - sequential files
- still extremely useful in a DBMS (auditing, recovery, security)
- a record field is nominated and records are ordered based on that field: the ordering key
- insertion is expensive
- retrieval is easy (efficient) if it exploits the sort order
- binary search reduces search time significantly

Fast: Select * from Course order by <order>
Slow: Select * from Course where <any other attribute> = c

Retrieval & Update in Sorted Files




Binary search on the ordering field to find the block containing key = k:

B = # of blocks; Low := 1; High := B
Do while not (Found or NotThere)
    If Low > High Then NotThere
    Else
        Mid := (Low + High) / 2
        Read block Mid into the buffer
        If k < key field of the first record in the block Then High := Mid - 1
        Else If k > key field of the last record in the block Then Low := Mid + 1
        Else If a record with key k is in the buffer Then Found
        Else NotThere
End
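
A runnable Python version of the same search, assuming blocks are lists of keys sorted on the ordering field (the sample keys match the figures later in these notes):

  def binary_search_blocks(blocks, k):
      # blocks: list of blocks, each a non-empty sorted list of keys,
      # with the blocks themselves in ascending key order
      low, high = 0, len(blocks) - 1
      while low <= high:                    # about log2(b) iterations
          mid = (low + high) // 2
          block = blocks[mid]               # one block transfer
          if k < block[0]:                  # smaller than the first record
              high = mid - 1
          elif k > block[-1]:               # larger than the last record
              low = mid + 1
          else:
              return mid if k in block else None   # search inside the buffer
      return None                           # NotThere

  blocks = [[1, 2, 3, 4], [6, 7, 10, 12], [14, 16, 23, 24], [27, 35]]
  print(binary_search_blocks(blocks, 16))   # 2 (third block)
  print(binary_search_blocks(blocks, 5))    # None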

Operations on Ordered File


Searching for records when the criteria are specified in terms of the ordering field:
- reading the records in order of the ordering key values is extremely efficient
- finding the next record from the current one, in order of the ordering key, usually requires no additional block accesses: the next record is in the same block or in the next block
- a search condition based on the value of the ordering key field gives faster access when the binary search technique is used
- a binary search can be done on the blocks rather than on the records; a binary search usually accesses log2(b) blocks, whether the record is found or not
- there is no advantage if the search criterion is specified in terms of non-ordering fields

Operations on Ordered File (contd.)


Inserting records is expensive. To insert a record:
- find its correct position in the file, based on its ordering field value - cost log2(b)
- make space in the file to insert the record in that position
- on average, half the records of the file must be moved to make space for the new record
- these file blocks must be read and rewritten to keep the order

The cost of insertion is therefore about b/2 block reads plus b/2 block writes.


Operations on Ordered File (contd.)


Deleting a record:
- find the record using binary search on the ordering field value - cost log2(b)
- delete the record
- reorganise part of the file (all records after the deleted one - b/2 blocks on average)

Modifying a record:
- find the record using binary search and update as required

Operations on Ordered File


Alternative ways for more efficient insertion:
- keep some unused space in each block for new records (not good - the problem returns when that space is filled up)
- create and maintain a temporary unordered file called an overflow file:
  - new records are inserted at the end of the overflow file
  - periodically, the overflow file is sorted and merged with the main file during file reorganization
  - searching for a record must involve both files, main and overflow; the cost of searching is thus higher, but for a large main file it is still close to log2(b) (see the sketch below)
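
A minimal Python sketch of this scheme, reusing binary_search_blocks from the earlier sketch (bfr = 4 is an assumed blocking factor):

  def search_with_overflow(main_blocks, overflow, k):
      # Binary search the sorted main file, then scan the unordered overflow
      if binary_search_blocks(main_blocks, k) is not None:   # ~log2(b) transfers
          return True
      return k in overflow                  # small linear scan

  def merge_overflow(main_blocks, overflow, bfr=4):
      # Periodic reorganisation: sort everything and rebuild the main file
      records = sorted([r for blk in main_blocks for r in blk] + overflow)
      main = [records[i:i + bfr] for i in range(0, len(records), bfr)]
      return main, []                       # the overflow file is now empty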


Alternative way for more efficient deletion




Use the technique based on deletion marker, as described earlier

Access Properties of Simple Files



Heap (sequential unordered) file:
  Block 1: [ R4  R2  R3  R16 ]
  Block 2: [ R1  R7  R35 R10 ]
  Block 3: [ R14 R12 R23 R6  ]
  Block 4: [ R24 R27 --  --  ]

Ordered (sequential) file:
  Block 1: [ R1  R2  R3  R4  ]
  Block 2: [ R6  R7  R10 R12 ]
  Block 3: [ R14 R16 R23 R24 ]
  Block 4: [ R27 R35 --  --  ]

Access Properties of Simple Files


Insert record R15 into the heap file:

Before:
  Block 1: [ R4  R2  R3  R16 ]
  Block 2: [ R1  R7  R35 R10 ]
  Block 3: [ R14 R12 R23 R6  ]
  Block 4: [ R24 R27 --  --  ]

After insertion (R15 simply takes the free slot at the end):
  Block 4: [ R24 R27 R15 --  ]

Access Properties of Simple Files


Insert record R15 into the ordered file:

Before:
  Block 1: [ R1  R2  R3  R4  ]
  Block 2: [ R6  R7  R10 R12 ]
  Block 3: [ R14 R16 R23 R24 ]
  Block 4: [ R27 R35 --  --  ]

After insertion:
  Block 1: [ R1  R2  R3  R4  ]
  Block 2: [ R6  R7  R10 R12 ]
  Block 3: [ R14 R15 R16 R23 ]
  Block 4: [ R24 R27 R35 --  ]

Notice that all records after R15 have changed their page location or their position on the page.

Access Properties of Simple Files


Insert records R15, R9, R17 into the ordered file using an overflow file:

Before:
  Main file:
    Block 1: [ R1  R2  R3  R4  ]
    Block 2: [ R6  R7  R10 R12 ]
    Block 3: [ R14 R16 R23 R24 ]
    Block 4: [ R27 R35 --  --  ]
  Overflow file: empty

After insertion the main file is unchanged, and:
  Overflow file: [ R15 R9 R17 -- ]

Periodically the overflow file is sorted and merged with the main file.

Access Properties of Simple Files



Deletions from a heap - R10, R3, R7 - simple delete:

Before:
  Block 1: [ R4  R2  R3  R16 ]
  Block 2: [ R1  R7  R35 R10 ]
  Block 3: [ R14 R12 R23 R6  ]
  Block 4: [ R24 R27 --  --  ]

After the delete operations (holes are left where records were removed):
  Block 1: [ R4  R2  --  R16 ]
  Block 2: [ R1  --  R35 --  ]
  Block 3: [ R14 R12 R23 R6  ]
  Block 4: [ R24 R27 --  --  ]

Access Properties of Simple Files



Deletions from a heap - R10, R3, R7 - using the deletion-marker technique (marker 1 = valid, 0 = deleted):

Before:
  Block 1: [ R4/1  R2/1  R3/1  R16/1 ]
  Block 2: [ R1/1  R7/1  R35/1 R10/1 ]
  Block 3: [ R14/1 R12/1 R23/1 R6/1  ]
  Block 4: [ R24/1 R27/1 ]

After the delete operations:
  Block 1: [ R4/1  R2/1  R3/0  R16/1 ]
  Block 2: [ R1/1  R7/0  R35/1 R10/0 ]
  Block 3: [ R14/1 R12/1 R23/1 R6/1  ]
  Block 4: [ R24/1 R27/1 ]

Deletion markers are set to 0; later these records will be physically removed when the file is reorganised.

Access Properties of Simple Files



Deletions from the ordered file - R10, R3, R7 - simple delete:

Before:
  Block 1: [ R1  R2  R3  R4  ]
  Block 2: [ R6  R7  R10 R12 ]
  Block 3: [ R14 R16 R23 R24 ]
  Block 4: [ R27 R35 --  --  ]

After the delete operations (the remaining records are shifted to keep the order):
  Block 1: [ R1  R2  R4  R6  ]
  Block 2: [ R12 R14 R16 R23 ]
  Block 3: [ R24 R27 R35 --  ]

Access Properties of Simple Files




Deletions from the ordered file - R10, R3, R7 - using the deletion-marker technique (marker 1 = valid, 0 = deleted):

Before:
  Block 1: [ R1/1  R2/1  R3/1  R4/1  ]
  Block 2: [ R6/1  R7/1  R10/1 R12/1 ]
  Block 3: [ R14/1 R16/1 R23/1 R24/1 ]
  Block 4: [ R27/1 R35/1 ]

After the delete operations:
  Block 1: [ R1/1  R2/1  R3/0  R4/1  ]
  Block 2: [ R6/1  R7/0  R10/0 R12/1 ]
  Block 3: [ R14/1 R16/1 R23/1 R24/1 ]
  Block 4: [ R27/1 R35/1 ]

Deletion markers are set to 0; later these records will be physically removed when the file is reorganised.

Retrieval and Update In Heaps


Quick summary:
- can only use linear search
- insertion is fast
- deletion and update are slow
- parameter search (e.g. SELECT ... WHERE) is slow
- unconditional search can be fast if:
  - records are of fixed length
  - records do not span blocks: the j-th record is then located by position, in block ceiling(j / bfr)
- average time to find a single record = b/2 block transfers (b = number of blocks)

Retrieval & Update in Sorted Files

Quick summary:
- retrieval on the key field is fast - the next record is nearby
- any other retrieval either requires a sort, or an index, or is as slow as a heap
- update, delete and insert are slow (find block, update block, rewrite block)

FAST ACCESS FOR DATABASE: HASHING

- Types of hashing: static or dynamic
- What is the point of hashing?
  - reduce a large address space
  - provide close to direct access
  - provide reasonable performance for all U, I, D, S operations (update, insert, delete, select)
- What is a hash function? its properties and behaviour
- Collisions and collision resolution
- Open addressing
- Summary

What Is Hashing, and What Is It For?

- direct access to the block containing the desired record
- reduce the number of blocks read or written
- allow for file expansion and contraction with minimal file reorganising
- permit retrieval on hashed fields without re-sorting the file
- no need to allocate contiguous disk areas
- if the file is small, internal hashing is used; otherwise, external hashing
- no direct access other than by the hashed field

A basic example of hashing:




- There are 25 rows of seats, with 3 seats per row (75 seats total)
- We have to allocate each person to a row in advance, at random
- We will hash on their family name so as to find the person's row number directly, knowing only the name
- The database is logically a single table ROOM (Name, Age, Attention), implemented as a blocked, hashed file

The hashing process


Letters are numbered by their alphabetic position: A = 1, B = 2, C = 3, ..., Z = 26.

The hash process is:
  Loc = 0
  Until no more characters in YourName:
      add the alphabetic position of the character to Loc
  Calculate RowNum = Loc mod 25
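
A runnable Python version of this hash process:

  def hash_name(name, rows=25):
      # Sum the alphabetic positions (A=1 .. Z=26) of the letters, mod rows
      loc = sum(ord(c) - ord('A') + 1 for c in name.upper())
      return loc % rows

  for n in ["REYE", "ANDERSON", "MCWILLIAM", "TANG", "LEE"]:
      print(n, hash_name(n))                # rows 3, 15, 20, 17, 22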

Examples - Hashed Names


NAME        Loc   RowNum (mod 25)
REYE         53    3
ANDERSON     90   15
MCWILLIAM    95   20
TANG         42   17
LEE          22   22

Where is MCWILLIAM?  Hash(MCWILLIAM) = Row 20

Name Hashing Example continued


[Table of the 52 arriving guests' names with their hashed row numbers, e.g. Lee -> 22, West -> 17, Anna -> 5, Axel -> 17, Nora -> 23.]

The results after 52 arrivals


[Table: number of people seated in each of the 25 rows after 52 arrivals, with a mark against the rows that need extra allocation. Several rows are full (3 seated) and some are empty; the arrivals that hash to an already-full row are the collisions.]

The Room as a Hashed File

- Each person has a hash key - the name
- Each person is a record
- Each row is a hardware block (bucket)
- Each row number is the address of a bucket
- Records here are fixed-length (and there are 3 records per block)
- The leftover people are collisions (key collisions)
- They will have to be found a seat by collision resolution

Collision Resolution


- Leave an empty seat in each row
  - under-population: blocks only 66% full
- A notice at the end of the row: "an extra seat for row N can be found at the rear exit"
  - the bucket's overflow chain points to an overflow page containing the record
- Everyone stand up while we reallocate the seats
  - file reorganisation

Collision Resolution
Strategy 1 (open addressing):
- Nora is the 4th arrival for row 23
- place the new arrival in the next higher-numbered block with a vacancy

Retrieval - search for Nora:
- retrieve block 23 and search it consecutively
- if Nora is not found, try block 24, and so on

Disadvantages (see the sketch below):
- may need to read the whole file consecutively on some keys
- blocks gradually fill up with out-of-place records
- deletions cause either immediate or periodic reorganisation
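
A minimal Python sketch of open addressing for the room example, reusing hash_name from the earlier sketch (wrapping around to bucket 0 after bucket 24 is an assumption; the slides simply say "next higher block"):

  ROWS, CAPACITY = 25, 3
  buckets = [[] for _ in range(ROWS)]       # one bucket (row) per hash address

  def insert_open(name):
      home = hash_name(name)
      for i in range(ROWS):                 # probe consecutive buckets
          b = (home + i) % ROWS
          if len(buckets[b]) < CAPACITY:
              buckets[b].append(name)
              return b
      raise RuntimeError("file is full")

  def find_open(name):
      home = hash_name(name)
      for i in range(ROWS):
          b = (home + i) % ROWS
          if name in buckets[b]:
              return b
          if len(buckets[b]) < CAPACITY:    # a bucket with a vacancy ends the probe
              return None
      return None

Note that after deletions the vacancy test in find_open is no longer a safe stopping rule, which is why deletions force immediate or periodic reorganisation.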


Collision Resolution
Strategy 2:
- reserve some rows (buckets) for overflow: blocks 25, 26 and 27; or recalculate the hash function with a smaller modulus, say mod 20 instead of mod 25
- Julia is then the 4th arrival for block 3
- place her in the overflow block with the smallest label that has available space (26?), optionally placing a pointer in bucket 3 pointing to block 26

Retrieval - search for Julia:
- retrieve block 3 and search it consecutively
- if Julia is not found, either:
  - search the overflow blocks consecutively, or
  - follow the pointer to block 26 (chaining)

Disadvantages:
- the overflow area gradually fills up, giving longer retrieval times
- deletions/additions cause periodic reorganisation

Collision Resolution

More formally:
- Open addressing: if the location specified by the hash address is occupied, then the subsequent positions are checked in order until an unused (empty) position is found.
- Chaining: various overflow locations are kept, and a pointer field is added to each record location. A collision is resolved by placing the new record in an unused overflow location and setting the pointer of the occupied hash-address location to the address of that overflow location.
- Multiple hashing: a second hash function is applied if the first results in a collision.
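
A minimal Python sketch of chaining, again reusing hash_name (the overflow area is modelled as a dictionary keyed by home-bucket number; a real file would store block pointers instead):

  def insert_chained(buckets, overflow, name, capacity=3):
      # buckets: list of home buckets; overflow: dict bucket number -> records
      b = hash_name(name)
      if len(buckets[b]) < capacity:
          buckets[b].append(name)
      else:
          overflow.setdefault(b, []).append(name)   # place in an overflow location

  def find_chained(buckets, overflow, name):
      b = hash_name(name)
      # search the home bucket first, then follow its overflow chain
      return name in buckets[b] or name in overflow.get(b, [])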

Performance on Hashed Files




Retrieve (SELECT): very fast if the name is known, otherwise hopeless:
- SELECT * FROM ROOM WHERE NAME = 'McWilliam'    (uses hash - fast)
- SELECT * FROM ROOM WHERE AGE > 30              (cannot use hash - slow)

Update: same:
- UPDATE ROOM SET ATTENTION = 'low' WHERE NAME = 'McWilliam'
- UPDATE ROOM SET ATTENTION = 'high' WHERE AGE > 50 OR AGE < 10


Performance on Hashed Files




Delete: same as SELECT, UPDATE:
- DELETE FROM ROOM WHERE NAME = 'Nora'     (uses hash - fast)
- DELETE FROM ROOM WHERE NAME LIKE 'No%'   (cannot use hash - slow)

Insert: unpredictable:
- INSERT INTO ROOM VALUES ('Smyth', 'high')


Internal Hashing

- Internal hashing is used as an internal search structure within a program, whenever a group of records is accessed exclusively by using the value of one field
- applicable to smaller files
- hashed in main memory: fast lookup in store
- to store R records, use an R-length array
- the hash function transforms the key field into an array subscript in the range 0 to R - 1: hash(KeyValue) = KeyValue mod R
- the subscript is the record address in store
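
A minimal Python sketch of internal hashing with an R-length array (R = 101 is an assumed size; collision resolution is omitted here for brevity):

  R = 101                                   # array length (number of records)
  store = [None] * R

  def address(key):
      return key % R                        # hash(KeyValue) = KeyValue mod R

  def put(key, record):
      store[address(key)] = (key, record)   # subscript = record address in store

  def get(key):
      slot = store[address(key)]
      return slot[1] if slot is not None and slot[0] == key else None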

External Hashing

- Hashing for disk files is called external hashing
- the address space is made of buckets, each of which holds multiple records
- a bucket is either one disk block or a cluster of contiguous blocks
- the hash function maps a key into a relative bucket number
- a table maintained in the file header converts the bucket number into the corresponding disk block address

External Hashing (static)




- The hashing scheme is called static hashing if a fixed number of buckets M is allocated
- To retrieve a record with a search condition on the key: determine the bucket number by applying the hash function to the key, then examine that bucket for the desired record
- If the record is not in that bucket, the search continues in the overflow buckets
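
A minimal Python sketch of the lookup path (M = 8 buckets and the block addresses are assumed toy values; the table lives in the file header):

  M = 8                                     # fixed number of buckets
  bucket_to_block = {i: 100 + i for i in range(M)}   # file-header table

  def bucket_number(key):
      return key % M                        # hash function -> relative bucket number

  def block_address(key):
      # the header table converts the bucket number into a disk block address
      return bucket_to_block[bucket_number(key)]

  print(block_address(42))                  # bucket 2 -> block address 102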

External Hashing (static)


Construction of a hashed file:
- identify the size of the file, choose the hash function (according to the anticipated number of buckets), and select the collision resolution procedure - these choices hold for the life of the file
- apply the hash function to each inserted record to get its bucket number, and place the record in the bucket with that number

External Hashing (static)




- If a bucket is full, apply the selected collision resolution procedure
- If the number of records in overflow buckets grows large, and/or the distribution of records among the buckets is highly non-uniform, then reorganise the file using a changed hash function (tuning)

External Hashing (static)


[Diagram: a key is hashed with H(key) mod N to one of the N primary buckets (0 .. N-1); a full bucket points to an overflow page.]

Problems of Static Hashing

- the number of buckets is fixed
- shrinkage causes wasted space
- growth causes long overflow chains

Solutions:
- reorganise and re-hash
- use dynamic hashing...


Extendible Hashing


Previously, to insert a new record into a full bucket:
- add an overflow page, or
- reorganise by doubling the bucket allocation and redistributing the records

This is a poor solution:
- the entire file is read
- twice as many pages have to be written

Solution: Extendible Hashing
- add a directory of pointers to buckets
- double the number of buckets by doubling the directory
- split only the bucket that has overflowed
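
A simplified Python sketch of these two ideas - a directory that doubles, and splitting only the overflowed bucket. This toy version indexes the directory by the low-order bits of integer keys and abbreviates many details of a real implementation:

  class ExtendibleHash:
      def __init__(self, capacity=2):
          self.capacity = capacity          # records per bucket
          self.global_depth = 1
          b0 = {"depth": 1, "items": []}
          b1 = {"depth": 1, "items": []}
          self.directory = [b0, b1]         # directory of pointers to buckets

      def _bucket(self, key):
          # low-order global_depth bits of the key index the directory
          return self.directory[key & ((1 << self.global_depth) - 1)]

      def insert(self, key):
          b = self._bucket(key)
          while len(b["items"]) >= self.capacity:
              if b["depth"] == self.global_depth:
                  self.directory = self.directory * 2   # double the directory only
                  self.global_depth += 1
              self._split(b)                # split just the overflowed bucket
              b = self._bucket(key)
          b["items"].append(key)

      def _split(self, b):
          b["depth"] += 1
          new = {"depth": b["depth"], "items": []}
          mask = 1 << (b["depth"] - 1)
          for i, ptr in enumerate(self.directory):      # re-point half the entries
              if ptr is b and (i & mask):
                  self.directory[i] = new
          old, b["items"] = b["items"], []
          for k in old:                     # redistribute the records
              self._bucket(k)["items"].append(k)

  h = ExtendibleHash()
  for k in [1, 2, 3, 4, 5, 6]:
      h.insert(k)
  print(h.global_depth)                     # 2: the directory doubled once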

LINEAR HASHING

- It does not have a directory at all
- Instead, a family of hash functions manages dynamic expansion and contraction of the file
- Start with a set number M of buckets 0..M-1 and hash function mod M
- Split the buckets in linear order when more space is needed; the next hash function is mod 2M, then mod 4M, and so on, as required
- Example: block capacity is 2 records. Records with key values 72, 62 and 32 all collide under the hash function mod 10, but after applying the next hash function (mod 20) they do not: one bucket contains 72 and 32, and another contains 62 (a tiny check follows below)
- Combines controlled overflow with the acquisition of new space
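
A tiny Python check of the example above:

  M = 10                                    # initial number of buckets
  for k in (72, 62, 32):
      print(k, k % M, k % (2 * M))
  # mod 10: all three hash to bucket 2 (a collision)
  # mod 20: 72 and 32 go to bucket 12, while 62 stays in bucket 2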

Linear Hashing - Advantages


Another type of dynamic hashing  does not require a directory  manages collisions well  accommodates insertions and deletions well  allows overflow chain length to be traded against average space utilisation  uses several hash functions

HASHING SUMMARY

Comparison of simple and hashed files:
- pages in a hashed file are grouped into buckets (one block, or a cluster of contiguous blocks)
- hashing reduces a large address space and provides close to direct access
- static hashing has some disadvantages, which are addressed by the dynamic hashing solutions (extendible, linear)
- static hashed files are kept at about 80% occupancy and then reorganised; new pages are added when each existing page is about 80% full
- hence the time to read the entire file is about 1.25 times that of a non-hashed file (1/0.8 = 1.25)
- dynamic hashing provides flexibility in the usage of file storage space (expansion and contraction)
