HUFFMAN CODING

In computer science and information theory, a Huffman code is a particular type of
optimal prefix code that is commonly used for lossless data compression. The process of
finding and/or using such a code proceeds by means of Huffman coding, an algorithm
developed by David A. Huffman while he was a Sc.D. student at MIT, and published in the
1952 paper "A Method for the Construction of Minimum-Redundancy Codes".
The output from Huffman's algorithm can be viewed as a variable-length code table for
encoding a source symbol (such as a character in a file). The algorithm derives this table
from the estimated probability or frequency of occurrence (weight) for each possible value
of the source symbol. As in other entropy encoding methods, more common symbols are
generally represented using fewer bits than less common symbols. Huffman's method can
be efficiently implemented, finding a code in time linear in the number of input weights if
these weights are sorted.
Huffman coding is a lossless data compression algorithm. The idea is to assign variable-
length codes to input characters; the lengths of the assigned codes are based on the
frequencies of the corresponding characters. The most frequent character gets the shortest
code and the least frequent character gets the longest code.
The variable-length codes assigned to input characters are prefix codes, meaning the
codes (bit sequences) are assigned in such a way that the code assigned to one character
is not a prefix of the code assigned to any other character. This is how Huffman coding makes
sure that there is no ambiguity when decoding the generated bit stream.
Let us understand prefix codes with a counterexample. Let there be four characters a, b,
c and d, and let their corresponding variable-length codes be 00, 01, 0 and 1. This coding
leads to ambiguity because the code assigned to c is a prefix of the codes assigned to a and b. If
the compressed bit stream is 0001, the decompressed output may be “cccd”, “ccb”, “cad”,
“acd” or “ab”.
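This ambiguity can be checked mechanically. The sketch below is a hypothetical helper (not part of any standard library) that enumerates every way a bit stream can be split into codewords of a given table:

```python
# Enumerate every valid parse of a bit stream under a given code table.
# With the non-prefix code {a: 00, b: 01, c: 0, d: 1}, the stream "0001"
# has several parses; a prefix code would yield exactly one.

def all_decodings(bits, code):
    """Return every way to split `bits` into codewords of `code`."""
    if not bits:
        return [[]]
    results = []
    for symbol, word in code.items():
        if bits.startswith(word):
            for rest in all_decodings(bits[len(word):], code):
                results.append([symbol] + rest)
    return results

ambiguous = {"a": "00", "b": "01", "c": "0", "d": "1"}
print(sorted("".join(p) for p in all_decodings("0001", ambiguous)))
# ['ab', 'acd', 'cad', 'ccb', 'cccd']
```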



There are two major parts in Huffman coding:
1) Build a Huffman Tree from input characters.
2) Traverse the Huffman Tree and assign codes to characters.

Steps to build Huffman Tree

Input is an array of unique characters along with their frequencies of occurrence; output is the
Huffman tree.
1. Create a leaf node for each unique character and build a min-heap of all leaf nodes. (The min-
heap is used as a priority queue; the value of the frequency field is used to compare two
nodes. Initially, the least frequent character is at the root.)
2. Extract the two nodes with the minimum frequency from the min-heap.
3. Create a new internal node with frequency equal to the sum of the two nodes'
frequencies. Make the first extracted node its left child and the other extracted node
its right child. Add this node back to the min-heap.
4. Repeat steps #2 and #3 until the heap contains only one node. The remaining node is
the root node and the tree is complete.
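The four steps above can be sketched in Python, using the standard library's heapq as the min-heap. This is an illustrative implementation, not the canonical one: the nested-tuple node layout and the insertion counter used to break frequency ties are my own choices.

```python
import heapq
from collections import Counter

def build_huffman_tree(text):
    """Return the root of a Huffman tree; internal nodes are
    (left, right) tuples and leaves are the characters themselves."""
    freq = Counter(text)
    # Step 1: one leaf per unique character, in a min-heap keyed on
    # frequency (the counter breaks ties so tuples never compare nodes).
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Steps 2 and 3: pop the two least frequent nodes and merge them
        # into a new internal node whose frequency is their sum.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (left, right)))
        counter += 1
    # Step 4: the single remaining node is the root.
    return heap[0][2]

root = build_huffman_tree("mississippi river")
```

Ties among equal frequencies can be broken differently by different implementations, so the exact tree shape may vary; the total encoded length, however, is the same for every valid Huffman tree.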
Example: MISSISSIPPI RIVER

Assume we have to send a string of data, say "mississippi river". If we send it
in the usual way, each letter takes 8 bits. There are 17 letters (including the space)
in this string, so it takes a total of 136 bits (17 × 8).

Now, coming to Huffman's method: here we initially find the relative frequency of each
letter. Note that in "mississippi river" the space between the two words is also counted
as a letter; "_" is used below to represent the white space:

m ----> 1
i ----> 5
s ----> 4
p ----> 2
r ----> 2
v ----> 1
e ----> 1
_ ----> 1
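This tally can be reproduced in a few lines of Python; collections.Counter is simply a convenient stand-in for counting by hand:

```python
from collections import Counter

# Count how often each character occurs; the space is counted too.
freq = Counter("mississippi river")
for ch, f in freq.most_common():
    print(repr(ch), "---->", f)
```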

Now we have to assign codes. As a first step, we sort these letters by frequency of
occurrence: "i" comes first with 5 occurrences, then s (4), and so on. The sorted
order is:

i5 s4 p2 r2 m1 v1 e1 _1

The next step towards assigning codes is to merge nodes, adding their frequencies. We start
from the letters with the lowest frequencies and build the tree with the data above as
leaf nodes.

First we merge e (1) and the space "_" (1); their summed frequency is 2, giving the
node e_2.

Similarly we merge (m1, v1), (p2, r2) and (i5, s4) to get mv2, pr4 and is9.

We keep merging toward the top in the same way to construct the tree, as shown in the figure
below. After constructing the tree, we label each left branch with a "0" and each right branch with a "1":

isprmve_ 17
├─0─ is 9
│    ├─0─ i 5
│    └─1─ s 4
└─1─ prmve_ 8
     ├─0─ pr 4
     │    ├─0─ p 2
     │    └─1─ r 2
     └─1─ mve_ 4
          ├─0─ mv 2
          │    ├─0─ m 1
          │    └─1─ v 1
          └─1─ e_ 2
               ├─0─ e 1
               └─1─ _ 1

Now the tree structure is complete. From this tree we get the code for each letter. For
that, we trace the path from the root down to the leaf for the respective letter. For example,
take the letter i: starting from the root "isprmve_", the path goes to "is9" and then reaches
the leaf i5 along the branches 0, 0. Thus the code assigned to the letter "i" is 00.

Similarly, we get the codes for all the other letters by traversing the tree.
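That traversal can be sketched as follows. The nested-tuple tree is hand-built here to mirror the figure above (left child first), and the function name is an illustrative choice:

```python
# Tree from the example: internal nodes are (left, right) pairs,
# leaves are the letters; "_" is written as a real space here.
tree = (("i", "s"), (("p", "r"), (("m", "v"), ("e", " "))))

def assign_codes(node, prefix="", table=None):
    """Depth-first walk: a left edge appends '0', a right edge '1'."""
    if table is None:
        table = {}
    if isinstance(node, tuple):
        assign_codes(node[0], prefix + "0", table)
        assign_codes(node[1], prefix + "1", table)
    else:
        table[node] = prefix
    return table

codes = assign_codes(tree)
print(codes["i"], codes["m"])  # 00 1100
```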

m ----> 1 ----> 1100
i ----> 5 ----> 00
s ----> 4 ----> 01
p ----> 2 ----> 100
r ----> 2 ----> 101
v ----> 1 ----> 1101
e ----> 1 ----> 1110
_ ----> 1 ----> 1111
Checking the obtained result, we can see that the letters with the lowest frequencies, like "v"
and "e", need more bits to represent (4 each) than letters with high frequency, like "i" and "s"
(2 bits each).
So now, when we send the string "mississippi river", it needs only 46 bits instead of 136:

m    i  s  s  i  s  s  i  p   p   i  _    r   i  v    e    r
1100 00 01 01 00 01 01 00 100 100 00 1111 101 00 1101 1110 101 = 46 bits
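Encoding and decoding with this table can be verified directly. The code table is the one derived above; the surrounding logic is a minimal sketch:

```python
codes = {"i": "00", "s": "01", "p": "100", "r": "101",
         "m": "1100", "v": "1101", "e": "1110", " ": "1111"}

text = "mississippi river"
encoded = "".join(codes[ch] for ch in text)
print(len(encoded))  # 46 bits, versus 17 * 8 = 136 uncompressed

# Because the code is prefix-free, reading bits left to right and
# emitting a letter as soon as the buffer matches a codeword is
# unambiguous.
inverse = {bits: ch for ch, bits in codes.items()}
decoded, buf = [], ""
for bit in encoded:
    buf += bit
    if buf in inverse:
        decoded.append(inverse[buf])
        buf = ""
print("".join(decoded))  # mississippi river
```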

Concluding, we can say the Huffman algorithm compresses the bits required to represent
text data in an efficient way.

The following post will explain the steps to implement Huffman coding in Python
and JavaScript based on this algorithm.

Practical usage:
Huffman coding is widely used in all the mainstream compression formats you might
encounter, from GZIP, PKZIP (WinZip etc.) and BZIP2 to image formats such as JPEG
and PNG.

All compression schemes have pathological data sets that cannot be meaningfully
compressed; the archive formats listed above simply 'store' such files uncompressed
when they are encountered.

Newer arithmetic and range coding schemes are often avoided because of patent
issues, meaning Huffman remains the workhorse of the compression industry.
