
GROUP ID: 18 SMART SHEET BASED ATTENDANCE SYSTEM 1

Date: 9th November, 2017

UNIVERSITY OF SINDH, JAMSHORO

Title: HUFFMAN CODING ALGORITHM

Presented To:
Miss Syeda Hira Fatima Naqvi.
Presented By:
Sadaf Rasheed (2K15-CSE-72)

Department of Computer Science


INTRODUCTION

Huffman codes are an effective technique for lossless data compression, which means no
information is lost.
Huffman coding achieves compression by reducing the amount of redundancy in the
coding of symbols.
Huffman coding is a method for the compression of standard text documents.
It makes use of a binary tree to develop codes of varying lengths for the letters used in the
original message.
The algorithm was introduced by David Huffman in 1952 as part of a course assignment at
MIT.

HUFFMAN CODING SCHEME 1


CONTD:

Huffman codes can be used to compress information

Like WinZip (although WinZip doesn't use the Huffman algorithm)
JPEGs do use Huffman coding as part of their compression process
The basic idea is that instead of storing each character in a file as an 8-bit
ASCII value, we instead store the more frequently occurring
characters using fewer bits and the less frequently occurring characters using
more bits
On average this should decrease the file size (usually)

EXAMPLE

Consider a file of 100,000 characters from a–f, with these frequencies:


o a = 45,000
o b = 13,000
o c = 12,000
o d = 16,000
o e = 9,000
o f = 5,000
CONTD (FIXED-LENGTH CODE):

Typically each character in a file is stored as a single byte (8 bits)

If we know we only have six characters, we can use a 3-bit code for them instead:
a = 000, b = 001, c = 010, d = 011, e = 100, f = 101
This is called a fixed-length code (if every word in the code has the same length, the code is called a
fixed-length code, or a block code)
With this scheme, we can encode the whole file with 300,000 bits
(45,000*3 + 13,000*3 + 12,000*3 + 16,000*3 + 9,000*3 + 5,000*3)
We can do better:
Better compression
More flexibility



Variable-length codes (if code words may have different lengths, the code is called a
variable-length code) can perform significantly better
Frequent characters are given short code words, while infrequent characters get longer code words
o Consider this scheme:
a = 0; b = 101; c = 100; d = 111; e = 1101; f = 1100
How many bits are now required to encode our file?
45,000*1 + 13,000*3 + 12,000*3 + 16,000*3 + 9,000*4 + 5,000*4 = 224,000 bits
This is in fact an optimal character code for this file
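The two totals can be checked with a short script (the frequencies and codes are the ones from the example above):

```python
# Frequencies of the six characters in the 100,000-character file.
freqs = {"a": 45_000, "b": 13_000, "c": 12_000, "d": 16_000, "e": 9_000, "f": 5_000}

# Fixed-length code: 3 bits for each of the 6 symbols.
fixed_bits = sum(freqs.values()) * 3

# Variable-length code from the slides.
code = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}
variable_bits = sum(freqs[ch] * len(code[ch]) for ch in freqs)

print(fixed_bits)     # 300000
print(variable_bits)  # 224000
```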



PROBLEMS:
Suppose that we want to encode a message constructed from the symbols A, B, C,
D, and E using a fixed-length code
How many bits are required to encode each symbol?
at least 3 bits are required
2 bits are not enough (they can only encode four symbols)
How many bits are required to encode the message DEAACAAAAABA?
there are twelve symbols, each requiring 3 bits
12*3 = 36 bits are required

DRAWBACKS OF FIXED-LENGTH CODES:

Wasted space
Unicode uses twice as much space as ASCII
inefficient for plain-text messages containing only ASCII characters
Same number of bits used to represent all characters
a and e occur more frequently than q and z
Potential solution: use variable-length codes
variable number of bits to represent characters when frequency of occurrence is known
short codes for characters that occur frequently

ADVANTAGES OF VARIABLE-LENGTH CODES:
The advantage of variable-length codes over fixed-length codes is that short codes can be given to
characters that occur frequently
on average, the length of the encoded message is less than with a fixed-length encoding
Potential problem: how do we know where one character ends and another begins?
(this is not a problem when the number of bits is fixed!)

PREFIX PROPERTY:
Prefix codes
Huffman codes are constructed in such a way that they can be unambiguously
translated back to the original data, yet still be an optimal character code
Huffman codes are in fact prefix codes
A code has the prefix property if no character code is the prefix (start of the
code) of another character code
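A small helper, not from the slides, can test the prefix property directly. It relies on the fact that in lexicographically sorted order, a prefix of a word appears immediately before words it prefixes:

```python
def has_prefix_property(codes):
    """True if no code word in the list is a prefix of another."""
    codes = sorted(codes)  # any prefix of w sorts immediately before w
    return all(not b.startswith(a) for a, b in zip(codes, codes[1:]))

print(has_prefix_property(["000", "11", "01", "001", "10"]))  # True
print(has_prefix_property(["0", "01", "11"]))                 # False: 0 prefixes 01
```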

EXAMPLE (PREFIX):

000 is not a prefix of 11, 01, 001, or 10


11 is not a prefix of 000, 01, 001, or 10

CODE WITHOUT PREFIX PROPERTY:

The following code does not have the prefix property:
P = 0, Q = 1, S = 10, T = 11 (Q is a prefix of S and of T)

The pattern 1110 can be decoded as QQQP, QTP, QQS, or TS
CONTD:

A prefix code is a type of code system (typically a variable-length code) distinguished
by its possession of the "prefix property", which requires that there is no code word in
the system that is a prefix (initial segment) of any other code word in the system.

A prefix code is a uniquely decodable code: a receiver can identify each word
without requiring a special marker between words.
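A sketch of why prefix codes need no markers: a greedy decoder can emit a symbol the moment a full code word is matched, because the prefix property guarantees the match is unambiguous. The code table here is illustrative, not from the slides:

```python
# A prefix-free code (no word is a prefix of another).
code = {"P": "0", "Q": "11", "S": "100", "T": "101"}
decode_table = {bits: ch for ch, bits in code.items()}

def decode(bits):
    out, word = [], ""
    for b in bits:
        word += b
        if word in decode_table:   # prefix property => match is unambiguous
            out.append(decode_table[word])
            word = ""
    return "".join(out)

print(decode("0111000"))  # "PQSP"
```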

CONTD:

Suppose we have two binary code words a and b, where a is k bits long,
b is n bits long, and k < n. If the first k bits of b are identical to a, then a is
called a prefix of b. The last n − k bits of b are called the dangling suffix.

For example, if
a = 010 and b = 01011,
then a is a prefix of b and the dangling suffix is 11.
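The definition can be written directly as a small helper (the function name is my own, for illustration):

```python
def dangling_suffix(a, b):
    """If a is a prefix of b, return the remaining n - k bits of b."""
    if not b.startswith(a):
        raise ValueError("a is not a prefix of b")
    return b[len(a):]

print(dangling_suffix("010", "01011"))  # "11", as in the example above
```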
PURPOSE OF HUFFMAN CODING:

Proposed by Dr. David A. Huffman in 1952


"A Method for the Construction of Minimum-Redundancy Codes"
Applicable to many forms of data transmission
Our example: text files

THE BASIC ALGORITHM:

Huffman coding is a form of statistical coding


Not all characters occur with the same frequency!
Yet all characters are allocated the same amount of space
1 char = 1 byte, be it e or x

THE BASIC ALGORITHM:

Are there any savings in tailoring codes to character frequency?


Code word lengths are no longer fixed like ASCII.
Code word lengths vary and will be shorter for the more
frequently used characters.

THE (REAL) BASIC ALGORITHM:
1. Scan text to be compressed and tally occurrence of all characters.

2. Sort or prioritize characters based on number of occurrences in text.

3. Build Huffman code tree based on prioritized list.

4. Perform a traversal of tree to determine all code words.

5. Scan text again and create new file using the Huffman codes.

ALGORITHM:
n <- |C|
Q <- C
for i <- 1 to n-1
    do allocate a new node z
       left[z] <- x <- EXTRACT-MIN(Q)
       right[z] <- y <- EXTRACT-MIN(Q)
       f[z] <- f[x] + f[y]
       INSERT(Q, z)
return EXTRACT-MIN(Q)
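A minimal runnable sketch of the pseudocode above, using Python's heapq module as the min-priority queue Q. The node representation (tuples, with a tiebreaker counter) is my own choice, not part of the original algorithm statement:

```python
import heapq
from itertools import count

def huffman_tree(freqs):
    """freqs: dict mapping symbol -> frequency. Returns the root (f, tie, tree)."""
    tie = count()  # avoids comparing trees when frequencies are equal
    q = [(f, next(tie), sym) for sym, f in freqs.items()]
    heapq.heapify(q)
    for _ in range(len(freqs) - 1):              # n - 1 merges
        fx, _, x = heapq.heappop(q)              # x <- EXTRACT-MIN(Q)
        fy, _, y = heapq.heappop(q)              # y <- EXTRACT-MIN(Q)
        heapq.heappush(q, (fx + fy, next(tie), (x, y)))  # f[z] = f[x] + f[y]
    return heapq.heappop(q)                      # return EXTRACT-MIN(Q)

root = huffman_tree({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5})
print(root[0])  # 100 — the root frequency equals the total count
```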

ANALYSIS :

Time Complexity
The time complexity of the Huffman algorithm is O(n log n): there are
O(n) iterations, and each requires O(log n) time to extract the
minimum-weight nodes from the priority queue.

BUILDING A TREE
SCAN THE ORIGINAL TEXT

Consider the following short text:

Eerie eyes seen near lake.

Count up the occurrences of all characters in the text

BUILDING A TREE
SCAN THE ORIGINAL TEXT

Eerie eyes seen near lake.


Q. What characters are present?

E, e, r, i, space, y, s, n, a, l, k, .

BUILDING A TREE
SCAN THE ORIGINAL TEXT

Eerie eyes seen near lake.


What is the frequency of each character in the text?

BUILDING A TREE
PRIORITIZE CHARACTERS

Create a binary tree node with the character and frequency of each character
Place the nodes in a priority queue
The lower the occurrence, the higher the priority in the queue

BUILDING A TREE

The queue after inserting all nodes

E i y l k . r s n a sp e
1 1 1 1 1 1 2 2 2 2 4 8

BUILDING A TREE
While the priority queue contains two or more nodes:
Create a new node
Dequeue a node and make it the left subtree
Dequeue the next node and make it the right subtree
The frequency of the new node equals the sum of the frequencies of the left and right
children
Enqueue the new node back into the queue

BUILDING A TREE

(Slides 26–39: figures showing the successive merge steps — the two lowest-frequency
nodes are repeatedly dequeued, joined under a new parent node, and enqueued.)

Q. What is happening to the characters with a low number of occurrences?
BUILDING A TREE

(Slides 41–47: figures continuing the merge steps.)
BUILDING A TREE

After enqueueing this node there is only one node left in the priority queue.
BUILDING A TREE
Dequeue the single node left in the queue.
This tree contains the new code words for each character.
The frequency of the root node should equal the number of characters in the text.
Eerie eyes seen near lake. has 26 characters.

ENCODING THE FILE
TRAVERSE TREE FOR CODES

Perform a traversal of the tree to obtain the new code words
Going left is a 0, going right is a 1
A code word is only completed when a leaf node is reached

(Figure: the final tree with root frequency 26; leaves E, i, y, l, k, . have
frequency 1, r, s, n, a have frequency 2, sp has 4, and e has 8.)

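The traversal can be sketched as follows; the nested-tuple tree literal and function name are illustrative (a small hand-built tree, not the slides' full tree):

```python
def code_words(node, prefix="", table=None):
    """Walk a tree of (left, right) tuples; leaves are symbols.
    Append '0' going left and '1' going right; record a code at each leaf."""
    table = {} if table is None else table
    if isinstance(node, tuple):                  # internal node: (left, right)
        code_words(node[0], prefix + "0", table)
        code_words(node[1], prefix + "1", table)
    else:                                        # leaf: a symbol
        table[node] = prefix
    return table

tree = (("a", "b"), "c")   # small hand-built example tree
print(code_words(tree))    # {'a': '00', 'b': '01', 'c': '1'}
```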
ENCODING THE FILE
TRAVERSE TREE FOR CODES
Char   Code
E      0000
i      0001
y      0010
l      0011
k      0100
.      0101
space  011
e      10
r      1100
s      1101
n      1110
a      1111

ENCODING THE FILE
Rescan the text and encode the file using the new code words

Eerie eyes seen near lake. →
000010110000011001110001010110101111011010111001111101011111100011001111110100100101

Q. Why is there no need for a separator character?
ENCODING THE FILE
RESULTS

Have we made things any better?

With the Huffman codes, 84 bits encode the text

ASCII would take 8 * 26 = 208 bits

If a modified fixed-length code used 4 bits per character, the total
would be 4 * 26 = 104 bits; the savings would not be as great.
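As a check (a minimal sketch, not part of the original slides), encoding the message with the code table derived from the tree reproduces the total bit count:

```python
# Code table read off the Huffman tree for "Eerie eyes seen near lake."
code = {"E": "0000", "i": "0001", "y": "0010", "l": "0011", "k": "0100",
        ".": "0101", " ": "011", "e": "10", "r": "1100", "s": "1101",
        "n": "1110", "a": "1111"}

text = "Eerie eyes seen near lake."
bits = "".join(code[ch] for ch in text)
print(len(bits))          # 84 bits, versus 8 * 26 = 208 bits for ASCII
```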

APPLICATIONS OF HUFFMAN CODING:

Supports various file types, such as:

ZIP (multichannel compression including text and other data types)
JPEG
MPEG (only up to 2 layers)
Also used in steganography for JPEG carrier compression.

CONCLUSION:

Like many other useful algorithms, we require the Huffman
algorithm for compressing data so that it can be transmitted
over the internet and other transmission channels efficiently.
The Huffman algorithm works on binary trees.

