INTRODUCTION
A file structure is a combination of representations for data in files and of operations for
accessing the data. A file structure application allows us to read, write, and modify data. It might
also support finding the data that matches some search criteria or reading through the data in
some particular order. An improvement in file structure design may make an application
hundreds of times faster. The details of the representation of the data and the implementation of the
operations determine the efficiency of the file structure for a particular application.
The fundamental operations of file systems are: open, create, close, read, write, and seek.
Each of these operations involves the creation or use of a link between a physical file stored on a
secondary device and a logical file that represents a program’s more abstract view of the same
file. When the program describes an operation using the logical file name, the equivalent
physical operation gets performed on the corresponding physical file.
Disks are very slow compared to memory. On the other hand, disks provide enormous
capacity at much less cost than memory. They also keep the information stored on them when
they are turned off. The tension between a disk’s relatively slow access time and its enormous,
nonvolatile capacity is the driving force behind file structure design. Good file structure design
will give us access to all the capacity without making our applications spend a lot of time waiting
for the disk. A tremendous variety in the types of data and in the needs of applications makes file
structure design very important.
The problems that researchers struggle with reflect the same issues that one confronts in
addressing any substantial file design problem. Working through the approaches to major file
design issues teaches one a lot about how to approach new design problems. The goals of research
and development in file structures are:
1. Get the information with a single access to the disk.
2. Build structures that allow us to find the target information with as few accesses as possible.
3. Group information in file structures so that we get everything we need in only one trip
to the disk.
SECTION 1
REQUIREMENTS SPECIFICATIONS
In part 1, we are required to create a student record file. The record consists of the
following fields:
1. University Serial Number
2. Name
3. Address
4. Semester
5. Branch
There should be methods to initialize and assign a record. Also, we should be able to add
a new record, delete a record, and modify a record. The number of fields is fixed, but the lengths
of the fields are variable.
In the second part, we need to develop a hashed index for the student record file
developed in Part 1. The key for the index is the student USN (University Serial Number). We
need to hash the keys and then store the key-reference pairs for further access. Once we develop
a hashed index, this index is used for the retrieval of records.
We need to provide the following functionalities:
1. Add a record.
2. Delete a record.
3. Modify a record.
Also, we need to demonstrate the doubling of the directory size and the space utilization
of the buckets.
Hardware Requirements:
Software Requirements:
External Interfaces
SECTION 2
INTRODUCTION TO FILE STRUCTURES
Indexing:
Indexing is a way of structuring a file so that records can be found by key. It is an
alternative to sorting. Unlike sorting, indexing permits us to perform binary searches for keys in
variable-length record files. If the index can be held in memory, record addition, deletion, and
retrieval can be done much more quickly with an indexed, entry-sequenced file than with a
sorted file. Indexes can do much more than merely improve on access time: they can provide us
with new capabilities that are inconceivable with access methods based on sorted data records.
The most exciting new capability involves the use of multiple secondary indexes.
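The in-memory index described above can be sketched as a sorted array of key-address pairs searched with binary search. The class and member names below are illustrative, not taken from the project code:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Minimal in-memory index: sorted (key, byte-offset) pairs for an
// entry-sequenced data file.
struct IndexEntry {
    std::string key;
    long offset;   // byte address of the record in the data file
};

class SimpleIndex {
    std::vector<IndexEntry> entries;  // kept sorted by key
public:
    void Insert(const std::string& key, long offset) {
        IndexEntry e{key, offset};
        auto pos = std::lower_bound(entries.begin(), entries.end(), e,
            [](const IndexEntry& a, const IndexEntry& b){ return a.key < b.key; });
        entries.insert(pos, e);
    }
    // Binary search; returns the record offset, or -1 if the key is absent.
    long Search(const std::string& key) const {
        auto pos = std::lower_bound(entries.begin(), entries.end(), key,
            [](const IndexEntry& a, const std::string& k){ return a.key < k; });
        if (pos != entries.end() && pos->key == key) return pos->offset;
        return -1;
    }
};
```

Because the index entries are small and uniform, they can be binary-searched even though the underlying records are variable length.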
Sometimes the operation of interest is a matching of items from two lists; sometimes it is a
merging of lists; and other times the operation is a combination of matching and merging. These kinds of
operations on sequential lists are the basis of a great deal of file processing.
AVL trees:
An AVL tree is a self-adjusting binary tree structure. An AVL tree is height-balanced: the allowed
difference between the heights of any two subtrees sharing the same root is one.
The important feature of an AVL tree is that, by setting a maximum allowable difference in the
height of any two subtrees, the worst-case search performance is kept close to that of a
completely balanced tree.
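The height-balance property can be checked with a short recursive sketch; the node layout here is an assumption for illustration, not part of the project:

```cpp
#include <cstdlib>

// A binary tree node (illustrative layout).
struct Node {
    int key;
    Node* left;
    Node* right;
};

// Height of a subtree; an empty subtree has height 0.
int Height(const Node* n) {
    if (!n) return 0;
    int hl = Height(n->left), hr = Height(n->right);
    return 1 + (hl > hr ? hl : hr);
}

// True when every node's two subtrees differ in height by at most one,
// which is the AVL balance condition described above.
bool IsAVL(const Node* n) {
    if (!n) return true;
    if (std::abs(Height(n->left) - Height(n->right)) > 1) return false;
    return IsAVL(n->left) && IsAVL(n->right);
}
```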
B-trees:
B-trees are multilevel indexes that solve the problem of the linear cost of insertion and
deletion. This is what makes B-trees so good, and why they are now the standard way to
represent indexes. The solution is twofold. First, don't require that the index records be full.
Second, when a record overflows, split it into two records, each half full. Deletion takes a similar
strategy, merging two records into a single record when necessary.
B+ trees:
The disadvantage of the B-tree is that the file cannot be accessed sequentially with efficiency.
Adding a linked list structure at the bottom level of the B-tree solves this problem. The combination
of a B-tree and a sequential linked list gives rise to the B+ tree.
Hashing:
Hashing is a good way to retrieve records in one access for files that do not change greatly
over time, but it does not work well with volatile, dynamic files. A hash function is like a black
box that produces an address every time a key is dropped in. Hashing is like indexing in that it
involves associating a key with a relative record address.
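One such black-box function can be sketched as a fold-and-add scheme: characters of the key are folded two at a time and summed modulo a prime. The folding factor 100 and the prime 19937 are illustrative choices, not values mandated by the project:

```cpp
#include <cstring>

// Fold-and-add hash: pairs of characters are combined and summed
// modulo a prime, keeping the result in a bounded address space.
int Hash(const char* key) {
    int sum = 0;
    int len = (int)std::strlen(key);
    for (int j = 0; j < len; j += 2) {
        int pair = 100 * key[j];           // fold two characters into one int
        if (j + 1 < len) pair += key[j + 1];
        sum = (sum + pair) % 19937;        // keep the sum bounded by a prime
    }
    return sum;
}
```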
SECTION 3
WHY C++?
Object-oriented toolkit:
Making file structures usable in application development requires turning this conceptual
toolkit into application programming interfaces: collections of data types and operations that can
be used in applications. We have chosen to employ an object-oriented approach in which data types
and operations are presented in a unified fashion as class definitions.
Class Definition
Constructors
Public and private sections
Operator overloading
The above features enhance the programmer's ability to control the behavior of objects.
SECTION 4
PROJECT PART I
Problem Definition:
Design a class called student. Each object of this class represents information about a
single student. Members should be included for student USN (University Serial Number), Name,
Address, Semester, Branch, etc. Methods should be included for initialization, assignment, and
modification of values. Provide methods to write the member values to the output stream suitably
formatted. Add methods to store objects as records in a file and to load the objects from the file
using buffering; design a suitable IOBuffer class hierarchy. Add Pack and Unpack methods to
class Student. For all the mini projects, assume a fixed-field, variable-length record structure
with delimiters for the data file.
Part 1 of the project deals with creating a student record file. The record consists of
the following fields as data members.
1. University Serial Number.---->USN
2. Name ---->name
3. Address ---->addr
4. Branch ---->brch
5. Semester. ---->sem
We have provided the following member functions for the operations on the file.
1. Creating a record ---->insert()
2. Assigning a record. ---->assign()
3. Searching a record ---->search()
4. Deleting a record. ---->delet()
5. Modifying a record. ---->modify()
6. Displaying a record ---->display()
The assign() function is used to assign default values to the data members. Here we
assign the NULL value to all data members as the default.
The search() function is used to search for a record based on the key value (USN).
delet() function is used to delete a student’s record based on the key value.
modify() function is used to modify the record based on key field entered.
• If the key doesn't match, check the next record; repeat until end of file, then display an error
message.
[Diagram: Pack() moves fields from program variables in RAM into the buffer, and write() moves
the buffer to the storage device; read() and Unpack() perform the reverse path.]
The read and write file operations need a buffer, which is developed using a hierarchy of
classes. The highest class in the hierarchy is the class IOBuffer. Since we know the number of
fields and since the lengths of the fields are variable, we use the Delimited Text Buffer class.
Here, we write the length of the record first and then the record itself. The fields are separated
using a delimiter. There are methods that pack the fields into the buffer and there are methods
that unpack the fields from the buffer. The access to the records of the file is sequential. We also
provide for addition of records and deletion of records. The fields of records can be assigned a
specific value and records can also be modified. In general we have the following hierarchy:
• IO BUFFER
• VARIABLE LENGTH BUFFER and FIXED LENGTH BUFFER
• DELIMITED FIELD BUFFER, LENGTH FIELD BUFFER and
FIXED FIELD BUFFER
[Diagram: class IOBuffer contains a character array that serves as the buffer.]
The field packing and unpacking operations, in their various forms, can be encapsulated
into C++ classes. The three field representation strategies (delimited, length-based, and
fixed-length) are implemented in different classes. Class IOBuffer does not include any
implementation of its methods. It is an abstract class, and hence no object of it can be declared. All the
necessary read, write, pack, and unpack operations are provided in classes farther down the hierarchy.
Inheritance allows related classes to share members. We use this powerful mechanism
provided by C++ for buffering. The object-oriented design of the classes guarantees that operations on
objects are performed correctly.
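The hierarchy just described might be sketched as follows. This is a reduced illustration of the idea (an abstract IOBuffer whose Pack and Unpack are implemented in a delimited-text subclass), not the project's actual class definitions:

```cpp
#include <cstring>
#include <string>

// Abstract base of the buffer hierarchy: cannot be instantiated.
class IOBuffer {
public:
    virtual ~IOBuffer() {}
    virtual int Pack(const char* field) = 0;   // move a field into the buffer
    virtual int Unpack(char* field)     = 0;   // extract the next field
};

// Fields packed as text, separated by a delimiter character.
class DelimTextBuffer : public IOBuffer {
    std::string buffer;   // packed record image
    size_t next = 0;      // unpack cursor
    char delim;
public:
    explicit DelimTextBuffer(char d = '|') : delim(d) {}
    int Pack(const char* field) override {     // append field + delimiter
        buffer += field;
        buffer += delim;
        return (int)std::strlen(field);
    }
    int Unpack(char* field) override {         // copy out the next field
        size_t end = buffer.find(delim, next);
        if (end == std::string::npos) return -1;
        size_t len = end - next;
        std::memcpy(field, buffer.data() + next, len);
        field[len] = '\0';
        next = end + 1;
        return (int)len;
    }
};
```

Read and write operations on the data file would move the packed buffer image to and from disk; they are omitted here to keep the sketch short.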
SECTION 5
PROJECT PART II
Problem Definition:
Develop a hashed index of the student record file with the USN as the key. Write a driver
program to create a hashed file from an existing student record file. Demonstrate the recursive
collapse of directory over more than one level.
The second part of the project deals with providing O(1) access to the records of the file.
For this, we need to develop an index to the file. The USN is used as the key. To provide O(1)
access we need to hash the index. There are two approaches to hashing:
1. Static hashing
2. Dynamic hashing.
Static hashing works well for files that do not change frequently. But real files do change
frequently, and the performance of static hashing then deteriorates.
Dynamic hashing copes with this problem. In this approach, we hash the key and use
only a part of the hashed address. This is called the "use more as we need more"
approach.
We also use what are called buckets. A bucket is a container of key-reference pairs; all the
keys in a bucket share the same leading bits of the hashed address. Once a bucket is full, we
split it into two and distribute the keys between the two buckets. To keep track of the buckets,
we develop another structure, a directory. The directory maintains an array of the bucket
locations.
Thus, we hash a key and take a part of the hashed address, the amount depending on the
population of records. We then use this part of the hashed address as an index into the directory
to find the bucket's location. We then seek directly to that location and get the record.
The main design issue here is whether we provide a static hashing that uses a prespecified
size of address space or a dynamic hashing. The dynamic hashing is very useful for files that
change frequently.
We have decided to implement extendible hashing, which uses a part of the hashed
address depending on the size of the file. This is called the use-more-as-you-need-more approach.
We do not hash the data file itself. Instead, we only hash the index. The index consists of key-
record address pairs.
Buckets are used to resolve the collision problem: here one address can hold more than one
record or index entry. We also use directories to keep track of the buckets. A bucket consists
of key-reference pairs, which means that the buffer class to be used is the fixed-length
buffer. We keep the addresses of the buckets in memory using arrays.
Buckets are filled with key-reference pairs as and when the data records are inserted.
When a bucket gets filled, the bucket is split into two and the records are redistributed. This
means that we are using more of the hashed address as the file size increases. Also, we
keep track of deletions. A deletion may trigger the collapse of the directory, as fewer
buckets will be needed. Thus the hashing technique becomes truly dynamic.
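The splitting step described above can be sketched on plain integer addresses. Real buckets hold key-reference pairs, and the bit ordering here (consulting the next bit from the low end) is simplified for illustration:

```cpp
#include <vector>

// A bucket that uses `depth` bits of the hashed address.
struct Bucket {
    int depth;                 // number of address bits this bucket uses
    std::vector<int> keys;     // already-hashed integer addresses
};

// When a bucket overflows, its depth grows by one bit: the old bucket
// keeps keys whose newly consulted bit is 0, and the new "buddy"
// bucket receives those whose bit is 1.
void Split(Bucket& b, Bucket& buddy) {
    b.depth++;
    buddy.depth = b.depth;
    std::vector<int> keep;
    int bit = 1 << (b.depth - 1);      // the newly consulted address bit
    for (int k : b.keys) {
        if (k & bit) buddy.keys.push_back(k);
        else keep.push_back(k);
    }
    b.keys = keep;
}
```

If the buddy bucket needs an address pattern the current directory cannot distinguish, the directory doubles in size first, exactly as described above.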
A hashed key selects a directory entry. The directory consists of the addresses of buckets. The
bucket in turn contains the address of the record in the STUDENT.DAT file.
The below diagram shows what our project does. The general steps are:
• A given key is hashed to a directory address.
[Diagram: a key is passed through the hash function to produce a directory address; the
directory cell points to one of the buckets; the key-reference pair in the bucket leads to the
record in the student file.]
The MakeAddress function extracts a portion of the full hashed address. This function also
reverses the order of the bits in the hashed address, making the lowest-order bit of the
hash address the highest-order bit of the value used in extendible hashing, because the least
significant bits tend to have more variation than the high-order bits.
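A sketch of such a bit-reversing extraction, assuming the hashed value is a plain int and `depth` is the number of directory bits in use:

```cpp
// Extract `depth` bits of the hashed value, reversing their order so
// that the least significant bit of the hash becomes the most
// significant bit of the resulting directory address.
int MakeAddress(int hashVal, int depth) {
    int retval = 0;
    for (int j = 0; j < depth; j++) {
        retval = retval << 1;          // make room at the low end
        retval = retval | (hashVal & 1); // append the current lowest bit
        hashVal = hashVal >> 1;
    }
    return retval;
}
```

When the directory doubles, `depth` grows by one and one more low-order bit of the hash is consulted.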
Hash function: returns an integer hash value for the key, for a 15-bit address space.
Splitting in Buckets:
Method SPLIT of class Bucket divides keys between an existing bucket and a new
bucket. If necessary, it doubles the size of the directory to accommodate the new bucket.
The INSERT method first searches for the key. SEARCH arranges for the CurrentBucket
member to contain the proper bucket for the key. The FIND method determines where the key
would be if it were in the structure.
The Insert method manages record addition. If the key already exists, Insert returns
immediately. If the key does not exist, Insert calls Bucket::Insert for the bucket into which the
key is to be added. If the bucket is full, Bucket::Insert calls Split to handle the task of splitting
the bucket. If the directory needs to be larger, Split calls method Directory::DoubleSize to double
the directory size.
The method works by checking to see whether it is possible for there to be a buddy
bucket. The next test compares the number of bits used by the bucket with the number of bits
used in the directory address space. A pair of buddy buckets is a pair of buckets that are
immediate descendants of the same node in the trie. This method returns a buddy bucket, or
-1 if none is found.
Method Directory::Collapse begins by making sure that we are not at the lower limit of
directory size. By treating the special case of a directory with a single cell here, at the start of the
function, we simplify subsequent processing: with the exception of this case, all directory sizes
are evenly divisible by 2. The test to see whether the directory can be collapsed consists of examining
each pair of directory cells to see if they point to different buckets. As soon as we find such a
pair, we know we cannot collapse the directory, and the method returns 0.
Deletion operations:
We first find the key to be deleted. If we cannot find it, we return failure; if it is found, we call
Bucket::Remove to remove the key from the bucket and return the value reported back from that
method.
Space utilization:
It is defined as the ratio of the actual number of records to the total number of records that
could be stored in the allocated space. The expected average utilization is about 69%. Space utilization
can be calculated using the formula:
Utilization = r / (b * N)
where r is the number of records,
b is the bucket capacity (records per bucket), and
N is the number of buckets.
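As a worked example of the formula, 11 records stored in 4 buckets of capacity 4 each give 11/16 = 68.75%, close to the expected average of about 69%:

```cpp
// Utilization = r / (b * N), expressed as a percentage.
// r = number of records, b = bucket capacity, N = number of buckets.
double Utilization(int r, int b, int N) {
    return (double)r / (b * N) * 100.0;  // cast avoids integer division
}
```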
Source Code
BucketAddr = newBucketAddr;
Depth++;
NumCells = newSize;
return 1;
}
int Bucket::FindBuddy ()
{
if (Dir.Depth == 0) return -1; // a one-cell directory has no buddies
if (Depth < Dir.Depth) return -1;
int sharedAddress = MakeAddress(Keys[0], Depth);
return sharedAddress ^ 1; // flip the low bit to get the buddy's address
}
int Directory::Collapse()
{
if (Depth == 0) return 0;
for (int i = 0; i < NumCells; i += 2)
if (BucketAddr[i] != BucketAddr[i+1])
return 0;
int newSize = NumCells / 2;
int * newAddrs = new int [newSize];
for (int j = 0; j < newSize; j++)
newAddrs[j] = BucketAddr[j*2];
delete [] BucketAddr; // array delete to match new []
BucketAddr = newAddrs;
Depth--;
collapsetrue = 1;
NumCells = newSize;
return 1;
}
int Bucket::TryCombine ()
{
int result;
// (listing truncated in the source; the loop below belongs to the
// space-utilization routine, which counts records by scanning for
// the '#' end-of-record marker)
while (1)
{
file>>ch;
if (file.fail())
break;
else if (ch=='#')
numrecs++;
}
file.close();
int cnt=1;
for(int i=0;i<NumCells-1;i++)//counts number of buckets
{
if(BucketAddr[i+1]==BucketAddr[i])
continue;
cnt++;
}
util=((float)numrecs/(cnt*4))*100;//utilization = r/(b*N); cast avoids integer division
cout<<"\nRECORDS IN THE FILE = "<<numrecs<<"\n";
cout<<"\n\nBUCKETS USED BY THE RECORDS = "<<cnt;
cout<<"\n\n\nDIRECTORY SIZE IS = "<<NumCells;
cout<<"\n\n\nUTILIZATION OF SPACE = "<<util<<"%\n\n";
//for directory
float x;
x=pow(numrecs,1.25);
x=x*0.98;
cout<<"\nUTILIZATION OF SPACE BY THE DIRECTORY = "<<x<<" bytes";
}
void Insert(char *myfile)
{
Student s;
char str[30];
setcolor(BLACK);
settextstyle(2,0,5);
{
outtextxy(400,400,"Enter a Valid Key!!!\a");
getch();
return;
}
NAME: // label targeted by the goto below when the name must be re-entered
outtextxy(230,120,"ENTER NAME :");
strget(420,120,s.Name,20);
strupr(s.Name);
int re = Dir.Search(s.Name);
if(re!=-1)
{
outtextxy(400,220,"Name Duplication..!!!");
getch();
}
if(strlen(s.Name)==0)//strcmp with NULL is undefined; test for an empty string instead
{
outtextxy(400,400,"Enter a Valid NAME!!!\a");
getch();
}
if(!isalpha(s.Name[0]))//isalpha takes a single character, not a char array
{
outtextxy(400,220,"Name Contains other than alpha charector!!");
outtextxy(400,240,"Re-enter NAME");//
getch();
goto NAME;
}
outtextxy(230,140,"ENTER ADDRESS :");
strget(420,140,s.Address,30);
strupr(s.Address);
outtextxy(230,160,"ENTER SEMESTER :");
strget(420,160,s.Semester,2);strupr(s.Semester);
if(atoi(s.Semester)>8)
{
outtextxy(400,400,"Invalid Semester!!!\a");
getch();
return;
}
outtextxy(230,180,"ENTER BRANCH :");
strget(420,180,s.Branch,5);strupr(s.Branch);
int flag=0;
for(int i=0;i<16;i++)
if(strcmp(s.Branch,s.Brlist[i])==0)
{
flag=1;
break;
}
if(flag==0)
{
outtextxy(400,400,"InValid Branch!!!\a");
getch();
return;
}
outtextxy(230,200,"ENTER COLLEGE :");
strget(420,200,s.College,10);
strupr(s.College);
int recaddr=s.Append(myfile);
Dir.Insert(s.Usn,recaddr);
outtextxy(400,400,"Record Successfully Appended.");
getch();
if(doublesizetrue)
{
closegraph();
clrscr();
cprintf("The Directory Has Doubled");
doublesizetrue=0;
Dir.Print(cout);
}
}
strget(200,50,s.Usn,10);
strupr(s.Usn);
int addr=Dir.Search(s.Usn);
if(addr==-1)
{
outtextxy(300,300,"THE RECORD DOES NOT EXIST");
getch();
return;
}
fstream ofile(myfile,ios::in|ios::out);
ofile.seekp(addr,ios::beg);
ofile.write("*",1);
ofile.close();
Dir.Remove(s.Usn);
outtextxy(200,400,"THE RECORD IS DELETED SUCCESSFULLY");
compaction();
getch();
}
void display(char *myfile)
{
Student s;
setcolor(BLACK);
settextstyle(2,0,5);
outtextxy(50,50,"ENTER USN NUMBER : ");
strget(200,50,s.Usn,10);
strupr(s.Usn);
int addr;
if((addr = Dir.Search(s.Usn))==-1)
{
outtextxy(300,300,"Record not found!");
outtextxy(300,320,"Press Any Key..");
getch();
return;
}
DelimFieldBuffer :: SetDefaultDelim('|');
DelimFieldBuffer Buff;
fstream file(myfile,ios::in);
Buff.DRead(file,addr);
s.Unpack(Buff);
char str[100];
sprintf(str,"USN NO : %s",s.Usn);
outtextxy(100,100,str);
sprintf(str,"NAME : %s",s.Name);
outtextxy(100,120,str);
sprintf(str,"ADDRESS : %s",s.Address);
outtextxy(100,140,str);
sprintf(str,"SEMESTER : %s",s.Semester);
outtextxy(100,160,str);
sprintf(str,"BRANCH : %s",s.Branch);
outtextxy(100,180,str);
sprintf(str,"COLLEGE : %s",s.College);
outtextxy(100,200,str);
file.close();
}
SECTION 6
Dept of ISE 29 2007-08
Extendible Hashing Vinayak Hegde Nandikal
GUI DESIGN
SECTION 7
SNAPSHOTS
MAIN MENU
RECORD INSERTION
RECORD MODIFICATION
DISPLAYING A RECORD
SPACE UTILIZATION
DIRECTORY DISPLAY
SECTION 8
CONCLUSION AND FUTURE ENHANCEMENTS
Conclusion:
Hashing is a way of structuring a file so that records can be found by applying a hash
function that transforms a key into an address. This address is then used as the basis for the insertion
and retrieval of records. More than one record can hash to the same address; this
phenomenon is called a collision. Extendible hashing provides O(1) performance since there is
no overflow. These access times are truly independent of the size of the file.
Future Enhancement:
Instead of the given STUDENT class, the project can be made to handle a generic class
that accepts a class name as a parameter and can be used for different applications. Another class,
BUFFERFILE, can be included, given that it contains a handle to the base class of the buffer class
hierarchy (IOBUFFER) and a handle to the file, for simultaneous manipulation of buffer and file
to support a purer form of OBJECT ORIENTATION.
Some of the possible improvements and new features that can be included are:
Improved User Interface with commercial level enhancements.
Support for remote administration of the system.
Support for simultaneous access and modification of the student file from different
systems.
Improved free space management for data files.
Implementation of other addressing techniques in addition to the present hashing
technique to analyze performance issues.