
GROUP ID: 18 SMART SHEET BASED ATTENDANCE SYSTEM 1

Date: 9th November, 2017

UNIVERSITY OF SINDH, JAMSHORO

Title: HUFFMAN CODING ALGORITHM

Presented To:
Miss Syeda Hira Fatima Naqvi.
Presented By:
Sadaf Rasheed (2K15-CSE-72)

Department of Computer Science


INTRODUCTION

Huffman codes are an effective technique for lossless data compression, which means no
information is lost.
Huffman coding achieves compression by reducing the amount of redundancy in the
coding of symbols.
Huffman coding is a method for the compression of standard text documents.
It makes use of a binary tree to develop codes of varying lengths for the letters used in the
original message.
The algorithm was introduced by David Huffman in 1952 as part of a course assignment at
MIT.

HUFFMAN CODING SCHEME 1


CONTD:

Huffman codes can be used to compress information

Like WinZip (although WinZip doesn't use the Huffman algorithm)
JPEGs do use Huffman coding as part of their compression process
The basic idea is that instead of storing each character in a file as an 8-bit
ASCII value, we instead store the more frequently occurring
characters using fewer bits and the less frequently occurring characters using
more bits
On average this should decrease the file size (usually)

EXAMPLE

Consider a file of 100,000 characters from a–f, with these frequencies:


o a = 45,000
o b = 13,000
o c = 12,000
o d = 16,000
o e = 9,000
o f = 5,000
CONTD (FIXED-LENGTH CODE):

Typically each character in a file is stored as a single byte (8 bits)

If we know we only have six characters, we can use a 3-bit code for them instead:
a = 000, b = 001, c = 010, d = 011, e = 100, f = 101
This is called a fixed-length code (if every word in the code has the same length, the code is called a
fixed-length code, or a block code)
With this scheme, we can encode the whole file with 300,000 bits
(45,000*3 + 13,000*3 + 12,000*3 + 16,000*3 + 9,000*3 + 5,000*3)
We can do better:
Better compression
More flexibility



Variable-length codes (if code words may have different lengths, the code is called a
variable-length code) can perform significantly better
Frequent characters are given short code words, while infrequent characters get longer code words
o Consider this scheme:
a = 0; b = 101; c = 100; d = 111; e = 1101; f = 1100
How many bits are now required to encode our file?
45,000*1 + 13,000*3 + 12,000*3 + 16,000*3 + 9,000*4 + 5,000*4 = 224,000 bits
This is in fact an optimal character code for this file
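The two totals can be checked with a short script (the frequencies and codes are the ones from the example above):

```python
# Frequencies of the six characters in the 100,000-character file.
freqs = {"a": 45_000, "b": 13_000, "c": 12_000, "d": 16_000, "e": 9_000, "f": 5_000}

# Fixed-length code: 3 bits for each of the 6 symbols.
fixed_bits = sum(freqs.values()) * 3

# Variable-length code from the slides.
code = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}
variable_bits = sum(freqs[ch] * len(code[ch]) for ch in freqs)

print(fixed_bits)     # 300000
print(variable_bits)  # 224000
```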



PROBLEMS:
Suppose that we want to encode a message constructed from the symbols A, B, C,
D, and E using a fixed-length code
How many bits are required to encode each symbol?
at least 3 bits are required
2 bits are not enough (they can only encode four symbols)
How many bits are required to encode the message DEAACAAAAABA?
there are twelve symbols, each requiring 3 bits
12*3 = 36 bits are required

DRAWBACKS OF FIXED-LENGTH CODES:

Wasted space
Unicode uses twice as much space as ASCII
inefficient for plain-text messages containing only ASCII characters
Same number of bits used to represent all characters
a and e occur more frequently than q and z
Potential solution: use variable-length codes
variable number of bits to represent characters when frequency of occurrence is known
short codes for characters that occur frequently

ADVANTAGES OF VARIABLE-LENGTH CODES:
The advantage of variable-length codes over fixed-length codes is that short codes can be given to
characters that occur frequently
on average, the length of the encoded message is less than with a fixed-length encoding
Potential problem: how do we know where one character ends and another begins?
(this is not a problem when the number of bits is fixed!)

PREFIX PROPERTY:
Prefix codes
Huffman codes are constructed in such a way that they can be unambiguously
translated back to the original data, yet still be an optimal character code
Huffman codes are in fact prefix codes
A code has the prefix property if no character code is the prefix (start of the
code) of another character code
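A small helper, not from the slides, can test the prefix property directly. It relies on the fact that in lexicographically sorted order, a prefix of a word appears immediately before words it prefixes:

```python
def has_prefix_property(codes):
    """True if no code word in the list is a prefix of another."""
    codes = sorted(codes)  # any prefix of w sorts immediately before w
    return all(not b.startswith(a) for a, b in zip(codes, codes[1:]))

print(has_prefix_property(["000", "11", "01", "001", "10"]))  # True
print(has_prefix_property(["0", "01", "11"]))                 # False: 0 prefixes 01
```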

EXAMPLE (PREFIX):

000 is not a prefix of 11, 01, 001, or 10


11 is not a prefix of 000, 01, 001, or 10

CODE WITHOUT PREFIX PROPERTY:

The following code does not have the prefix property:
P = 0, Q = 1, S = 10, T = 11 (Q is a prefix of S and of T)

The pattern 1110 can be decoded as QQQP, QTP, QQS, or TS
CONTD:

A prefix code is a type of code system (typically a variable-length code) distinguished
by its possession of the "prefix property", which requires that there is no code word in
the system that is a prefix (initial segment) of any other code word in the system.

A prefix code is a uniquely decodable code: a receiver can identify each word
without requiring a special marker between words.
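A sketch of why prefix codes need no markers: a greedy decoder can emit a symbol the moment a full code word is matched, because the prefix property guarantees the match is unambiguous. The code table here is illustrative, not from the slides:

```python
# A prefix-free code (no word is a prefix of another).
code = {"P": "0", "Q": "11", "S": "100", "T": "101"}
decode_table = {bits: ch for ch, bits in code.items()}

def decode(bits):
    out, word = [], ""
    for b in bits:
        word += b
        if word in decode_table:   # prefix property => match is unambiguous
            out.append(decode_table[word])
            word = ""
    return "".join(out)

print(decode("0111000"))  # "PQSP"
```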

CONTD:

Suppose we have two binary code words a and b, where a is k bits long,
b is n bits long, and k < n. If the first k bits of b are identical to a, then a is
called a prefix of b. The last n − k bits of b are called the dangling suffix.

For example, if
a = 010 and b = 01011,
then a is a prefix of b and the dangling suffix is 11.
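The definition can be written directly as a small helper (the function name is my own, for illustration):

```python
def dangling_suffix(a, b):
    """If a is a prefix of b, return the remaining n - k bits of b."""
    if not b.startswith(a):
        raise ValueError("a is not a prefix of b")
    return b[len(a):]

print(dangling_suffix("010", "01011"))  # "11", as in the example above
```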
PURPOSE OF HUFFMAN CODING:

Proposed by Dr. David A. Huffman in 1952


"A Method for the Construction of Minimum-Redundancy Codes"
Applicable to many forms of data transmission
Our example: text files

THE BASIC ALGORITHM:

Huffman coding is a form of statistical coding


Not all characters occur with the same frequency!
Yet all characters are allocated the same amount of space
1 char = 1 byte, be it e or x

THE BASIC ALGORITHM:

Are there any savings in tailoring codes to character frequency?


Code word lengths are no longer fixed like ASCII.
Code word lengths vary and will be shorter for the more
frequently used characters.

THE (REAL) BASIC ALGORITHM:
1. Scan text to be compressed and tally occurrence of all characters.

2. Sort or prioritize characters based on number of occurrences in text.

3. Build Huffman code tree based on prioritized list.

4. Perform a traversal of tree to determine all code words.

5. Scan text again and create new file using the Huffman codes.

ALGORITHM:
n <- |C|
Q <- C
for i <- 1 to n-1
    do allocate a new node z
       left[z] <- x <- EXTRACT-MIN(Q)
       right[z] <- y <- EXTRACT-MIN(Q)
       f[z] <- f[x] + f[y]
       INSERT(Q, z)
return EXTRACT-MIN(Q)
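A minimal runnable sketch of the pseudocode above, using Python's heapq module as the min-priority queue Q. The node representation (tuples, with a tiebreaker counter) is my own choice, not part of the original algorithm statement:

```python
import heapq
from itertools import count

def huffman_tree(freqs):
    """freqs: dict mapping symbol -> frequency. Returns the root (f, tie, tree)."""
    tie = count()  # avoids comparing trees when frequencies are equal
    q = [(f, next(tie), sym) for sym, f in freqs.items()]
    heapq.heapify(q)
    for _ in range(len(freqs) - 1):              # n - 1 merges
        fx, _, x = heapq.heappop(q)              # x <- EXTRACT-MIN(Q)
        fy, _, y = heapq.heappop(q)              # y <- EXTRACT-MIN(Q)
        heapq.heappush(q, (fx + fy, next(tie), (x, y)))  # f[z] = f[x] + f[y]
    return heapq.heappop(q)                      # return EXTRACT-MIN(Q)

root = huffman_tree({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5})
print(root[0])  # 100 — the root frequency equals the total count
```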

ANALYSIS :

Time Complexity
The time complexity of the Huffman algorithm is O(n log n): there are
O(n) iterations, and each requires O(log n) time to extract the
minimum-weight nodes from the priority queue.

BUILDING A TREE
SCAN THE ORIGINAL TEXT

Consider the following short text:

Eerie eyes seen near lake.

Count up the occurrences of all characters in the text

BUILDING A TREE
SCAN THE ORIGINAL TEXT

Eerie eyes seen near lake.


Q. What characters are present?

E, e, r, i, space, y, s, n, a, l, k, .

BUILDING A TREE
SCAN THE ORIGINAL TEXT

Eerie eyes seen near lake.


What is the frequency of each character in the text?

BUILDING A TREE
PRIORITIZE CHARACTERS

Create a binary tree node with the character and frequency of each character
Place the nodes in a priority queue
The lower the occurrence, the higher the priority in the queue

BUILDING A TREE

The queue after inserting all nodes

E i y l k . r s n a sp e
1 1 1 1 1 1 2 2 2 2 4 8

BUILDING A TREE
While the priority queue contains two or more nodes:
Create a new node
Dequeue a node and make it the left subtree
Dequeue the next node and make it the right subtree
The frequency of the new node equals the sum of the frequencies of the left and right
children
Enqueue the new node back into the queue

BUILDING A TREE

(Slides 26–39: figures showing the successive merge steps — the two lowest-frequency
nodes are repeatedly dequeued, joined under a new parent node, and enqueued.)

Q. What is happening to the characters with a low number of occurrences?
BUILDING A TREE

(Slides 41–47: figures continuing the merge steps.)
BUILDING A TREE

After enqueueing this node there is only one node left in the priority queue.
BUILDING A TREE
Dequeue the single node left in the queue.
This tree contains the new code words for each character.
The frequency of the root node should equal the number of characters in the text.
Eerie eyes seen near lake. has 26 characters.

ENCODING THE FILE
TRAVERSE TREE FOR CODES

Perform a traversal of the tree to obtain the new code words
Going left is a 0, going right is a 1
A code word is only completed when a leaf node is reached

(Figure: the final tree with root frequency 26; leaves E, i, y, l, k, . have
frequency 1, r, s, n, a have frequency 2, sp has 4, and e has 8.)

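The traversal can be sketched as follows; the nested-tuple tree literal and function name are illustrative (a small hand-built tree, not the slides' full tree):

```python
def code_words(node, prefix="", table=None):
    """Walk a tree of (left, right) tuples; leaves are symbols.
    Append '0' going left and '1' going right; record a code at each leaf."""
    table = {} if table is None else table
    if isinstance(node, tuple):                  # internal node: (left, right)
        code_words(node[0], prefix + "0", table)
        code_words(node[1], prefix + "1", table)
    else:                                        # leaf: a symbol
        table[node] = prefix
    return table

tree = (("a", "b"), "c")   # small hand-built example tree
print(code_words(tree))    # {'a': '00', 'b': '01', 'c': '1'}
```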
ENCODING THE FILE
TRAVERSE TREE FOR CODES
Char   Code
E      0000
i      0001
y      0010
l      0011
k      0100
.      0101
space  011
e      10
r      1100
s      1101
n      1110
a      1111

ENCODING THE FILE
Rescan the text and encode the file using the new code words

Eerie eyes seen near lake. →
000010110000011001110001010110101111011010111001111101011111100011001111110100100101

Q. Why is there no need for a separator character?
ENCODING THE FILE
RESULTS

Have we made things any better?

With the Huffman codes, 84 bits encode the text

ASCII would take 8 * 26 = 208 bits

If a modified fixed-length code used 4 bits per character, the total
would be 4 * 26 = 104 bits; the savings would not be as great.
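As a check (a minimal sketch, not part of the original slides), encoding the message with the code table derived from the tree reproduces the total bit count:

```python
# Code table read off the Huffman tree for "Eerie eyes seen near lake."
code = {"E": "0000", "i": "0001", "y": "0010", "l": "0011", "k": "0100",
        ".": "0101", " ": "011", "e": "10", "r": "1100", "s": "1101",
        "n": "1110", "a": "1111"}

text = "Eerie eyes seen near lake."
bits = "".join(code[ch] for ch in text)
print(len(bits))          # 84 bits, versus 8 * 26 = 208 bits for ASCII
```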

APPLICATIONS OF HUFFMAN CODING:

Supports various file types, such as:

ZIP (multichannel compression including text and other data types)
JPEG
MPEG (only up to 2 layers)
Also used in steganography for JPEG carrier compression.

CONCLUSION:

Like many other useful algorithms, we require the Huffman
algorithm for compressing data so that it can be transmitted
over the internet and other transmission channels efficiently.
The Huffman algorithm works on binary trees.

