
Linked file allocation

I Node
Hash table
Hash map
Binary search algorithm

LINKED ALLOCATION
The problems in contiguous allocation can be traced directly to the requirement that
the spaces be allocated contiguously and that the files that need these spaces are of
different sizes. These requirements can be avoided by using linked allocation.
In linked allocation, each file is a linked list of disk blocks. The directory contains a
pointer to the first (and optionally the last) block of the file. For example, a file of 5
blocks that starts at block 4 might continue at block 7, then block 16, block 10, and
finally block 27. Each block contains a pointer to the next block, and the last block
contains a NIL pointer. The value -1 may be used for NIL to differentiate it from block
0.
With linked allocation, each directory entry has a pointer to the first disk block of the
file. This pointer is initialized to NIL (the end-of-list pointer value) to signify an empty
file. A write to a file takes the first free block off the free-space list and writes to that
block. This new block is then linked to the end of the file. To read a file, the pointers
are simply followed from block to block.
There is no external fragmentation with linked allocation. Any free block can be used
to satisfy a request. Notice also that there is no need to declare the size of a file
when that file is created. A file can continue to grow as long as there are free blocks.
Linked allocation does have disadvantages, however. The major problem is that it
supports direct access inefficiently; it is effective only for sequential-access files. To
find the ith block of a file, the system must start at the beginning of that file and follow
the pointers until the ith block is reached. Note that each access to a pointer requires a
disk read.
Another severe problem is reliability. A bug in the OS or a disk hardware failure might
result in pointers being lost or damaged. A damaged pointer could then be followed into
the free-space list or into another file.
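The traversal cost described above can be made concrete with a small in-memory sketch. It is only an illustration: the array next_block stands in for the pointers stored inside disk blocks, and the names NUM_BLOCKS, build_example_file and ith_block are made up for this example. It rebuilds the chain 4, 7, 16, 10, 27 from the text and counts how many pointer reads are needed to reach the ith block.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_BLOCKS 32   /* hypothetical tiny disk */
#define NIL (-1)        /* end-of-chain marker, distinct from block 0 */

/* next_block[b] plays the role of the pointer stored inside disk block b. */
static int next_block[NUM_BLOCKS];

/* Build the example chain 4 -> 7 -> 16 -> 10 -> 27; return its first block. */
int build_example_file(void)
{
    static const int chain[] = {4, 7, 16, 10, 27};
    size_t n = sizeof chain / sizeof chain[0];

    for (int b = 0; b < NUM_BLOCKS; b++)
        next_block[b] = NIL;
    for (size_t k = 0; k + 1 < n; k++)
        next_block[chain[k]] = chain[k + 1];
    return chain[0];
}

/* Return the disk block holding the i-th block of the file (i counts from 0),
   or NIL if the file is shorter. *reads counts the simulated pointer reads. */
int ith_block(int first, int i, int *reads)
{
    assert(i >= 0 && reads != NULL);
    int b = first;
    *reads = 0;
    while (b != NIL && i > 0) {
        b = next_block[b];   /* on a real disk this is one read of block b */
        (*reads)++;
        i--;
    }
    return b;
}
```

Asking for block 3 of the example file walks 4, 7, 16 and returns 10 after three pointer reads, which is why direct access costs time proportional to the position in the file.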
What is an INODE?

An inode is a data structure that keeps track of all the information about a file. You keep your
information in a file, and the OS stores the information about the file in an inode. Information about
files is sometimes called metadata, so we can say that an inode holds the metadata of the data.
Whenever you need to access a file, the OS first looks up its unique inode in a table
called the inode table. In fact, the application or the user who accesses a file reaches the file with
the help of the inode number found in the inode table. To get to a particular file by its name,
the OS needs the inode number corresponding to that name. The reverse is not required: to use an
inode number you don't need to know the file name. Knowing the inode number of a file is enough
to reach the data stored in that file.
An interesting fact is that the total number of inodes is fixed when a file system is set up. This
means there is a limit to the number of inodes you can have in your file system. After that limit
is reached, you won't be able to create any more files, even if you have space left on the
partition!
What does the structure of an inode look like?

A directory with its corresponding inodes looks like this:

Inodes     Directory/file names
4204851    ./
4203424    ../
4195429    dir1/
4205752    dir2/
4205722    file1
4205723    file2
4194941    .hidden_dir/
4205589    .hidden_file1

The inode structure of a directory consists of a name-to-inode mapping of the files and directories in that
directory.
In the above example, you probably noticed the first two entries, ./ (dot) and ../ (dot dot).
You might have seen them whenever you list the contents of a directory. You might also know
that executing the command

cd .

will change the directory to the current directory itself, and the command

cd ..

will take you to the parent directory of the current directory. Why does that
happen?
Let's use an example. Let's have a look at our test directory:

user@machine ~/tmp/test % ls -i
4204851 ./      4205723 file2
4203424 ../     4195429 folder1/
4205722 file1   4205752 folder2/

Let's have a look at the inode numbers of . (dot) and .. (dot dot):
. (dot) = 4204851
.. (dot dot) = 4203424

We'll do the same for the ~/tmp/ directory and note the inodes there:

user@machine ~/tmp % ls -i
4203424 ./
3816836 ../
4204851 test/

The inode numbers of the ~/tmp/test/ directory and of . (dot), taken from the directory listing of ~/tmp:
. (dot) = 4203424
~/tmp/test/ = 4204851

You can see that the inode number of . (dot) inside the ~/tmp/test/ directory is equal to the inode of the
test directory, 4204851. And the inode of .. (dot dot) inside ~/tmp/test/ is equal to the inode of
. (dot) inside the ~/tmp/ directory, 4203424.
The . (dot) entry always means the current directory, because its inode is the directory's own inode.
Similarly, .. (dot dot) means the parent directory, because its inode is the inode of the
parent directory.
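The same comparison can be made programmatically. The sketch below is a minimal illustration using the POSIX stat() call; the helper name same_inode is made up for this example. It treats two paths as the same file when both the inode number and the device match, since inode numbers are only meaningful within one file system.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/* Return 1 if both paths resolve to the same inode on the same device,
   0 if they differ, and -1 if either path cannot be examined. */
int same_inode(const char *a, const char *b)
{
    struct stat sa, sb;

    assert(a != NULL && b != NULL);
    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    return sa.st_ino == sb.st_ino && sa.st_dev == sb.st_dev;
}
```

For the directories in the example, same_inode("/home/user/tmp/test/..", "/home/user/tmp") would be expected to return 1 (the absolute paths here are assumed, the text only shows ~/tmp).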
Inode Structure of a File

When we use the stat command on a file, we get some interesting information about the file:

user@machine ~/tmp/test % stat file1
  File: `file1'
  Size: 8944        Blocks: 24        IO Block: 4096   regular file
Device: fe00h/65024d    Inode: 4205722    Links: 1
Access: (0600/-rw-------)  Uid: ( 1000/ user)   Gid: ( 1000/ users)
Access: 2013-11-26 12:29:08.019301643 +0000
Modify: 2013-11-26 12:29:10.707321276 +0000
Change: 2013-11-26 12:29:41.123544208 +0000
 Birth: -

File: The name of the file.

Size: The size of the file in bytes.

Blocks: The number of blocks occupied by the file. A block is the smallest
unit that can be written to disk, so even a file smaller than one block
occupies a full block. The block size depends on the file system (1024 and
4096 bytes are both common on Linux). On ext2/ext3/ext4 you can check it by
executing the command: dumpe2fs /dev/sda1 | grep "^Block size:", where sda1
is the device you want to check. In the classic inode layout, direct block
pointers in the inode address the first data blocks of the file, and indirect
blocks hold lists of pointers to the remaining data blocks of larger files.

IO Block: This is a hint as to the best unit size for I/O operations; it is
usually the unit of allocation on the physical disk. Don't confuse the IO
block with the blocks that stat uses to indicate physical size; the blocks
for physical size are always 512 bytes. In the example, 24 blocks of 512
bytes are 12288 bytes, that is, three 4096-byte units for a file of 8944 bytes.

regular file: The type of the file. It can be a regular file (as in our case),
a regular empty file, a directory, a symbolic link, and so on.

Device: The device the file is located on, in hex and decimal numbers.

Inode: The inode number of the file.

Links: The number of hard links to the file. Symbolic links are not counted.

Owner Info: Access rights, owner of the file, group of the file, etc.

Time Stamps: The time of the last access (Access), of the last modification
of the file contents (Modify), and of the last change to the inode itself,
such as a change of permissions or owner (Change). Change is not the
creation time.

Birth: The birth time of the file, the moment when it was created
on the file system; - or 0 if unknown.
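Most of these fields come straight from the inode and can be read with the POSIX stat() call. The following is a rough sketch rather than a reimplementation of the stat command; the function name print_inode_info is made up, and the output layout is only loosely modelled on the listing above.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Print a few inode fields for path. Returns 0 on success,
   -1 if the file cannot be examined. */
int print_inode_info(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0)
        return -1;

    printf("  File: %s\n", path);
    /* st_blocks is counted in 512-byte units on Linux; st_blksize is the I/O hint */
    printf("  Size: %lld  Blocks: %lld  IO Block: %ld\n",
           (long long)st.st_size, (long long)st.st_blocks, (long)st.st_blksize);
    printf(" Inode: %llu  Links: %lu\n",
           (unsigned long long)st.st_ino, (unsigned long)st.st_nlink);
    printf("Access: %04o  Uid: %u  Gid: %u\n",
           (unsigned)(st.st_mode & 07777), (unsigned)st.st_uid, (unsigned)st.st_gid);
    return 0;
}
```

Calling print_inode_info("file1") in the test directory would be expected to report the same size, inode number and link count as the stat listing.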

How to check Inode Utilization?

Inode utilization can be checked by using the df -i command:

user@machine ~ % df -i
Filesystem      Inodes   IUsed    IFree IUse% Mounted on
rootfs         5840896  659354  5181542   12% /
udev            217137     406   216731    1% /dev
tmpfs           220131     450   219681    1% /run
tmpfs           220131       6   220125    1% /dev/shm
tmpfs           220131      11   220120    1% /sys/fs/cgroup
tmpfs           220131       6   220125    1% /run/lock
tmpfs           220131      11   220120    1% /run/user
/dev/sda1       124496     265   124231    1% /boot

As you can see, the maximum number of inodes on our rootfs is 5840896.
Once all of them are in use, no new files can be created there.
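The inode totals that df -i reports are also available to programs through the POSIX statvfs() call. The sketch below is an illustration under assumptions: the helper names ceil_percent and inode_use_percent are made up, and the percentage is rounded up, which reproduces the 12% shown for rootfs above (659354 of 5840896 is about 11.3%).

```c
#include <assert.h>
#include <stddef.h>
#include <sys/statvfs.h>

/* Percentage used/total, rounded up to the next whole percent. */
int ceil_percent(unsigned long long used, unsigned long long total)
{
    assert(total > 0 && used <= total);
    return (int)((used * 100 + total - 1) / total);
}

/* Percentage of inodes in use on the file system containing path,
   or -1 if it cannot be determined. */
int inode_use_percent(const char *path)
{
    struct statvfs vfs;

    if (path == NULL || statvfs(path, &vfs) != 0)
        return -1;
    /* some file systems report no inode totals at all */
    if (vfs.f_files == 0 || vfs.f_ffree > vfs.f_files)
        return -1;
    return ceil_percent(vfs.f_files - vfs.f_ffree, vfs.f_files);
}
```

A monitoring script could call inode_use_percent("/") and warn when the value approaches 100, since at that point file creation fails even if free space remains.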
How to access a file using inode?

Apart from performing actions on a file using its name, you can also work with your filesystem
using inode numbers. This is especially useful if you don't have access to a GUI (or simply don't want to
waste time starting a visual interface) and come across a file with special characters in the
filename that are not present on the keyboard. To get a listing of all the files in a directory and
their inode numbers we can use the ls -i command:

user@machine ~/tmp/test % ls -i
4195429 dir1/  4205752 dir2/  4205722 file1  4205723 file2

or, as mentioned earlier, ls -ia (this will list the hidden files as well):

user@machine ~/tmp/test % ls -ai
4204851 ./     4205752 dir2/  4205723 file2
4203424 ../    4205678 file   4194941 .hidden_dir/
4195429 dir1/  4205722 file1  4205589 .hidden_file1

As you can see, our test directory contains a file with a special (Unicode) character in its name, the micro sign.
You wouldn't be able to type the filename directly, as your keyboard most likely doesn't
have the micro character. You could access the file in several ways. The harder one
would be to look up the Unicode hex code for the micro sign in a table of Unicode
characters. The easier way would be to use the file's inode. The most obvious and
easiest way to deal with such characters is tab-completion, but a file cannot always
be addressed by its name: sometimes the filename contains broken characters that are
not recognized by the system. This is where the inode comes in handy. You can access and modify (delete,
edit, move, etc.) the file using its inode number:
# edit the file with vim
user@machine ~/tmp/test % find ./ -inum 4195034 -exec vim {} \;

# remove the file
user@machine ~/tmp/test % find ./ -inum 4195034 -exec rm {} \;

# change the file name
user@machine ~/tmp/test % find ./ -inum 4195034 -exec mv {} repaired_file \;
...

Just put the file's inode number after the -inum option and the action to be performed after the -exec
option. The braces {} are replaced by the path of each file that find matched. Remember to
include \; (backslash semicolon) at the end of your command.
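What find -inum does inside a single directory can be sketched in a few lines of C. This is a simplified illustration, not how find itself is implemented: it does not recurse into subdirectories, and the function name name_for_inode is made up. It reads every directory entry, looks at the inode of the entry itself (lstat, so symbolic links are not followed), and reports the first name whose inode number matches.

```c
#define _POSIX_C_SOURCE 200809L  /* expose lstat() under strict -std= modes */
#include <assert.h>
#include <dirent.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Look in directory dir (not recursively) for an entry whose inode number
   is ino. On success copy its name into name (capacity cap) and return 0;
   return -1 if the directory cannot be read or no entry matches. */
int name_for_inode(const char *dir, ino_t ino, char *name, size_t cap)
{
    assert(dir != NULL && name != NULL && cap > 0);

    DIR *d = opendir(dir);
    if (d == NULL)
        return -1;

    int found = -1;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        char path[4096];
        struct stat st;

        snprintf(path, sizeof path, "%s/%s", dir, e->d_name);
        if (lstat(path, &st) == 0 && st.st_ino == ino) {
            snprintf(name, cap, "%s", e->d_name);
            found = 0;
            break;
        }
    }
    closedir(d);
    return found;
}
```

Once the name is known, the file can be opened, renamed or removed through the usual calls, even if the name cannot be typed on the keyboard.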
You can also change directory using an inode number.
To get the inode numbers of the directories, you can use the command:

user@machine ~ % tree -a -L 1 --inodes /
/
├── [3276801] bin
├── [      2] boot
├── [   1026] dev
├── [1179649] etc
├── [3538945] home
├── [5505025] lib
├── [     11] lost+found
├── [2752513] media
├── [3670017] mnt
├── [1703937] opt
├── [      1] proc
├── [3145729] root
├── [   1143] run
├── [5373953] sbin
├── [ 131073] selinux
├── [1310721] srv
├── [      1] sys
├── [5111809] tmp
├── [3014657] usr
└── [1572865] var

20 directories

The inode number of each directory is shown in brackets. To access a directory, use the
command:

user@machine ~ % cd "$(find / -inum 1572865)"

Note that the variant find -inum 1572865 -exec cd {} \; does not work: cd is a shell builtin
rather than a program that find can execute, and a child process could not change the working
directory of your shell anyway.
The find command might sometimes take a while to finish and show some errors, especially if the
directory you are looking for is outside your working directory, so don't
panic if nothing happens at first. Just as you can remove files by inode number, you can also delete
a directory by its inode number.
Inode numbers are unique within a file system, but you may have noticed that some listings show
the same inode number next to different file names. The duplication is caused by hard links. A hard
link is an additional directory entry that refers to an existing inode; it is created with the ln
command, not by copying the file. The same file then appears in several directories on the
same storage unit. The directory listing shows two files with the same inode, which links them to
the same physical data on the storage unit. Hard links allow the same file to exist in
multiple directories while only one physical file exists on disk. Thanks to this, some space is
saved on the drive.
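The effect of a hard link on the inode can be observed with the POSIX link() and stat() calls. The sketch below is illustrative only; the helper names create_empty and link_and_count are made up. After linking, both names report the same inode number and the link count of that inode rises to 2.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create an empty regular file; returns 0 on success, -1 on failure. */
int create_empty(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    return close(fd);
}

/* Create newname as a hard link to existing. Return the link count of the
   shared inode afterwards, or -1 if anything fails. */
long link_and_count(const char *existing, const char *newname)
{
    struct stat a, b;

    assert(existing != NULL && newname != NULL);
    if (link(existing, newname) != 0)
        return -1;
    if (stat(existing, &a) != 0 || stat(newname, &b) != 0)
        return -1;
    if (a.st_ino != b.st_ino)      /* both names must refer to one inode */
        return -1;
    return (long)a.st_nlink;
}
```

Removing one of the two names with unlink() only lowers the link count; the data blocks are released when the last name is gone and no process has the file open.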
Deleting a file zeroes the size and the direct/indirect block entries in its inode and marks the
physical space on the storage unit as unused. To undelete the file, the metadata has to be
restored from the journal, if the file system uses one.
Hash Tables
Linked lists are handy ways of tying data structures together, but navigating linked lists can be
inefficient. If you were searching for a particular element, you might easily have to look at the
whole list before you find the one that you need. Linux uses another technique, hashing, to get
around this restriction. A hash table is an array or vector of pointers. An array, or vector, is
simply a set of things coming one after another in memory. A bookshelf could be said to be an
array of books. Arrays are accessed by an index, which is an offset into the array's associated
area in memory. Taking the bookshelf analogy a little further, you could describe each book by
its position on the shelf; you might ask for the 5th book.
A hash table is an array of pointers to data structures and its index is derived from information in
those data structures. If you had data structures describing the population of a village then you
could use a person's age as an index. To find a particular person's data you could use their age as
an index into the population hash table and then follow the pointer to the data structure
containing the person's details. Unfortunately many people in the village are likely to have the
same age and so the hash table pointer becomes a pointer to a chain or list of data structures each
describing people of the same age. However, searching these shorter chains is still faster than
searching all of the data structures.
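The village example can be written down as a short sketch. Everything here is illustrative: the structure person, the table size MAX_AGE and the function names are assumptions made for the example, not kernel code. The age is used directly as the index, and people of the same age are chained through a next pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_AGE 128   /* hypothetical table size: one slot per age 0..127 */

struct person {
    const char *name;
    int age;
    struct person *next;   /* next person of the same age */
};

/* The hash table: an array of pointers, all NULL to begin with. */
static struct person *by_age[MAX_AGE];

/* Link p at the head of the chain for its age. */
void add_person(struct person *p)
{
    assert(p != NULL && p->age >= 0 && p->age < MAX_AGE);
    p->next = by_age[p->age];
    by_age[p->age] = p;
}

/* Find a person by age and name; only the chain for that age is searched. */
struct person *find_person(int age, const char *name)
{
    if (age < 0 || age >= MAX_AGE || name == NULL)
        return NULL;
    for (struct person *p = by_age[age]; p != NULL; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p;
    return NULL;
}
```

A lookup for a 34-year-old only walks the chain of 34-year-olds, however many other people the table holds.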


As a hash table speeds up access to commonly used data structures, Linux often uses hash tables
to implement caches. A cache holds information that needs to be accessed quickly and is
usually a subset of the full set of information available. Data structures are put into a cache and
kept there because the kernel often accesses them. The drawback to caches is that they are more
complex to use and maintain than simple linked lists or hash tables. If the data structure can be
found in the cache (this is known as a cache hit), then all well and good. If it cannot then all of
the relevant data structures must be searched and, if the data structure exists at all, it must be
added into the cache. In adding new data structures into the cache an old cache entry may need
discarding. Linux must decide which one to discard, the danger being that the discarded data
structure may be the next one that Linux needs.
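The hit, miss and discard steps can be illustrated with a deliberately tiny cache. This is a sketch under assumptions and not how any Linux cache is actually built: the names cache_entry, cached_lookup and slow_lookup are hypothetical, the table has only 8 slots, and the replacement policy is the simplest possible one (a new entry overwrites whatever shares its slot).

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SLOTS 8   /* hypothetical, deliberately tiny */

struct cache_entry {
    int valid;
    unsigned key;
    long value;
};

static struct cache_entry cache[CACHE_SLOTS];
unsigned long cache_hits, cache_misses;

/* Stand-in for the expensive search of all the relevant data structures. */
static long slow_lookup(unsigned key)
{
    return (long)key * (long)key;
}

/* Return the value for key, consulting the cache first. */
long cached_lookup(unsigned key)
{
    struct cache_entry *e = &cache[key % CACHE_SLOTS];  /* hash: key mod table size */

    if (e->valid && e->key == key) {
        cache_hits++;
        assert(e->value == slow_lookup(key));  /* a cached copy must agree with the source */
        return e->value;
    }
    cache_misses++;
    e->valid = 1;            /* whatever occupied this slot is discarded */
    e->key = key;
    e->value = slow_lookup(key);
    return e->value;
}
```

Keys 3 and 11 share slot 3 here, so looking up 3, then 11, then 3 again produces three misses: the entry that was discarded was exactly the one needed next, which is the danger described above.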

What is a Hash Table?

As we saw with binary search, certain data structures such as a binary search tree can help
improve the efficiency of searches. Going from linear search to binary search, we improved our search
efficiency from O(n) to O(log n). We now present a new data structure, called a hash table, that
will improve our efficiency to O(1), or constant time, in the average case.
A hash table is made up of two parts: an array (the actual table where the data to be searched is
stored) and a mapping function, known as a hash function. The hash function is a mapping from
the input space to the integer space that defines the indices of the array. In other words, the hash
function provides a way for assigning numbers to the input data such that the data can then be
stored at the array index corresponding to the assigned number.
Let's take a simple example. First, we start with a hash table array of strings (we'll use strings as
the data being stored and searched in this example). Let's say the hash table size is 12:

Figure %: The empty hash table of strings


Next we need a hash function. There are many possible ways to construct a hash function. We'll
discuss these possibilities more in the next section. For now, let's assume a simple hash function
that takes a string as input. The returned hash value will be the sum of the ASCII codes of the
characters that make up the string, mod the size of the table:

int hash(char *str, int table_size)
{
    int sum = 0;   /* must start at zero before accumulating */

    /* Make sure a valid string passed in */
    if (str == NULL) return -1;

    /* Sum up all the characters in the string */
    for ( ; *str; str++) sum += *str;

    /* Return the sum mod the table size */
    return sum % table_size;
}

Now that we have a framework in place, let's try using it. First, let's store a string into the table:
"Steve". We run "Steve" through the hash function, and find that hash("Steve",12) yields 3:

Figure %: The hash table after inserting "Steve"


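Let's try another: "Notes". For this illustration, suppose that hash("Notes",12) also comes out
as 3. (The sum-of-characters function above actually gives 5 for "Notes"; any string whose
character sum leaves remainder 3 when divided by 12 collides with "Steve" in the same way.)
We insert it into the hash table: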

Figure %: A hash table collision

What happened? A hash function doesn't guarantee that every input will map to a different output
(in fact, as we'll see in the next section, it shouldn't do this). There is always the chance that two
inputs will hash to the same output. This indicates that both elements should be inserted at the
same place in the array, and this is impossible. This phenomenon is known as a collision.
There are many algorithms for dealing with collisions, such as linear probing and separate
chaining. While each of the methods has its advantages, we will only discuss separate chaining
here.
Separate chaining requires a slight modification to the data structure. Instead of storing the data
elements right into the array, they are stored in linked lists. Each slot in the array then points to
one of these linked lists. When an element hashes to a value, it is added to the linked list at that
index in the array. Because a linked list has no limit on length, collisions are no longer a
problem. If more than one element hashes to the same value, then both are stored in that linked
list.
Let's look at the above example again, this time with our modified data structure:

Figure %: Modified table for separate chaining


Again, let's try adding "Steve" which hashes to 3:

Figure %: After adding "Steve" to the table


And "Spark", which we place in slot 6 for this illustration (the function above gives 9):

Figure %: After adding "Spark" to the table


Now we add "Notes" which hashes to 3, just like "Steve":

Figure %: Collision solved - "Notes" added to table

Once we have our hash table populated, a search follows the same steps as doing an insertion. We
hash the data we're searching for, go to that place in the array, look down the list originating from
that location, and see if what we're looking for is in the list. The number of steps is O(1) .

Separate chaining allows us to solve the problem of collision in a simple yet powerful manner.
Of course, there are some drawbacks. Imagine the worst case scenario where through some fluke
of bad luck and bad programming, every data element hashed to the same value. In that case, to
do a lookup, we'd really be doing a straight linear search on a linked list, which means that our
search operation is back to being O(n) . The worst case search time for a hash table is O(n) .
However, the probability of that happening is so small that, while the worst case search time is
O(n) , both the best and average cases are O(1) .
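Insertion and search with separate chaining can be put together in a short sketch. It is a minimal illustration, assuming a fixed table of 12 slots and the sum-of-characters hash described earlier; the names node, insert and contains are made up, and stored strings are not copied or freed. With this hash, "Steve" lands in slot 3, and so does its reversal "evetS", since both have the same character sum.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 12

struct node {
    const char *key;
    struct node *next;
};

/* Each slot points to a linked list of keys that hash to it. */
static struct node *table[TABLE_SIZE];

/* Sum of the character codes modulo the table size. */
static int hash(const char *str, int table_size)
{
    int sum = 0;

    assert(str != NULL && table_size > 0);
    for (; *str; str++)
        sum += (unsigned char)*str;
    return sum % table_size;
}

/* Insert key at the head of its chain. Returns the slot used, or -1 if
   memory runs out. The string is not copied; the caller keeps it alive. */
int insert(const char *key)
{
    int slot = hash(key, TABLE_SIZE);
    struct node *n = malloc(sizeof *n);

    if (n == NULL)
        return -1;
    n->key = key;
    n->next = table[slot];
    table[slot] = n;
    return slot;
}

/* Return 1 if key is in the table, 0 otherwise. */
int contains(const char *key)
{
    for (struct node *n = table[hash(key, TABLE_SIZE)]; n != NULL; n = n->next)
        if (strcmp(n->key, key) == 0)
            return 1;
    return 0;
}
```

A search only walks the chain of one slot, so it stays fast while chains are short; if every key landed in the same slot, contains would degrade to the linear scan described above.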
