
Huffman Codes

Huffman coding is a technique for compressing data. Huffman's greedy
algorithm looks at the frequency of each character and represents each
character as a binary string in an optimal way.

Example
Suppose we have a file consisting of 100,000 characters that we want to
compress. The characters in the file occur with the following frequencies.

                a       b       c       d       e      f
Frequency     45,000  13,000  12,000  16,000  9,000  5,000

Consider the problem of designing a "binary character code" in which each
character is represented by a unique binary string.

Fixed Length Code


A fixed-length code needs 3 bits to represent each of the six (6) characters.

                    a       b       c       d       e      f
Frequency         45,000  13,000  12,000  16,000  9,000  5,000
Fixed-length code  000     001     010     011     100    101

This method requires 300,000 bits to code the entire file.

How do we get 300,000?

• The total number of characters is 45,000 + 13,000 + 12,000 + 16,000 +
  9,000 + 5,000 = 100,000.
• Each character is assigned a 3-bit codeword, so the total is
  3 * 100,000 = 300,000 bits.

Conclusion
The fixed-length code requires 300,000 bits while the variable-length code
requires 224,000 bits.

=> A saving of approximately 25%.
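The arithmetic above can be checked with a short sketch; the frequencies and the variable-length codeword lengths are the ones used in the tables of this section:

```python
# Verify the bit counts for the six-character example.
freq = {'a': 45_000, 'b': 13_000, 'c': 12_000, 'd': 16_000, 'e': 9_000, 'f': 5_000}

fixed_bits = 3 * sum(freq.values())  # every codeword is 3 bits

# codeword lengths of the variable-length code from this section
var_len = {'a': 1, 'b': 3, 'c': 3, 'd': 3, 'e': 4, 'f': 4}
variable_bits = sum(freq[c] * var_len[c] for c in freq)

print(fixed_bits)                       # 300000
print(variable_bits)                    # 224000
print(1 - variable_bits / fixed_bits)   # ~0.253, i.e. roughly 25% saving
```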

Prefix Codes
A prefix code is one in which no codeword is a prefix of any other codeword.
Prefix codes are desirable because they simplify both encoding (compression)
and decoding.

Can we do better?

A variable-length code can do better by giving frequent characters short
codewords and infrequent characters long codewords.

                       a       b       c       d       e      f
Frequency            45,000  13,000  12,000  16,000  9,000  5,000
Variable-length code   0      101     100     111     1101   1100

The character 'a' occurs 45,000 times and is assigned a 1-bit codeword:
1 * 45,000 = 45,000 bits.

The characters b, c, d occur 13,000 + 12,000 + 16,000 = 41,000 times; each
is assigned a 3-bit codeword:
3 * 41,000 = 123,000 bits.

The characters e, f occur 9,000 + 5,000 = 14,000 times; each is assigned a
4-bit codeword:
4 * 14,000 = 56,000 bits.

Thus the total number of bits is 45,000 + 123,000 + 56,000 = 224,000.

Encoding: Concatenate the codewords representing each character of the file.

String   Encoding
TEA      10 00 010
SEA      011 00 010
TEN      10 00 110

Example: Using the variable-length code table, we encode the 3-character
file abc as:

a . b . c  =>  0.101.100  =>  0101100
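A minimal encoder is just a table lookup and concatenation; the codeword table below is the variable-length code from this section:

```python
# Variable-length codewords from the table in this section.
code = {'a': '0', 'b': '101', 'c': '100', 'd': '111', 'e': '1101', 'f': '1100'}

def encode(text, code):
    """Concatenate the codeword of each character of the file."""
    return ''.join(code[ch] for ch in text)

print(encode('abc', code))  # 0101100
```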

Decoding
Since no codeword is a prefix of any other, the codeword that begins an
encoded file is unambiguous. To decode (translate back to the original
characters), identify the initial codeword, translate it back to the original
character, remove it from the encoded file, and repeat. For example, using
the variable-length code table, the string 001011101 parses uniquely as
0.0.101.1101, which decodes to aabe.
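The decoding procedure just described can be sketched as a left-to-right scan that emits a character as soon as the accumulated bits match a codeword; this is unambiguous precisely because of the prefix property:

```python
# Variable-length codewords from the table in this section.
code = {'a': '0', 'b': '101', 'c': '100', 'd': '111', 'e': '1101', 'f': '1100'}
decode_table = {v: k for k, v in code.items()}  # codeword -> character

def decode(bits, table):
    """Scan left to right; the prefix property makes each match unambiguous."""
    out, buf = [], ''
    for b in bits:
        buf += b
        if buf in table:          # a complete codeword has been read
            out.append(table[buf])
            buf = ''
    assert buf == '', 'input ended in the middle of a codeword'
    return ''.join(out)

print(decode('001011101', decode_table))  # aabe
```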

A convenient representation of the decoding process is a binary tree whose
leaves are the characters. We interpret the binary codeword for a character
as the path from the root to that character, where 0 means "go to the left
child" and 1 means "go to the right child". Note that an optimal code for a
file is always represented by a full binary tree, in which every non-leaf
node has two children.

Theorem  A binary tree that is not full cannot correspond to an optimal
prefix code.

Proof  Let T be a binary tree corresponding to a prefix code such that T is
not full. Then there must exist an internal node, say x, that has only one
child, y. To obtain T`, merge x and y into a single node z: z becomes a
child of the parent of x (if a parent exists) and a parent of any children
of y. Then T` has the same leaves as T, and T` corresponds to a code on the
same alphabet. Every leaf in the subtree rooted at y in T has depth in T`
strictly less (by one) than its depth in T, while all other leaves keep
their depth. Hence B(T`) < B(T), which implies that T cannot correspond to
an optimal prefix code.
This completes the proof. □

                       a       b       c       d       e      f
Frequency            45,000  13,000  12,000  16,000  9,000  5,000
Fixed-length code     000     001     010     011     100    101
Variable-length code   0      101     100     111     1101   1100

The fixed-length code is not optimal since its binary tree is not full.

Figure

The variable-length code is an optimal prefix code because its tree is a full binary tree.

Figure

From now on, we consider only full binary trees.

If C is the alphabet from which the characters are drawn, then the tree for
an optimal prefix code has exactly |C| leaves (one for each letter) and
exactly |C| - 1 internal nodes. Given a tree T corresponding to a prefix
code, we can compute the number of bits required to encode a file. For each
character c in C, let f(c) be the frequency of c and let dT(c) denote the
depth of c's leaf. Note that dT(c) is also the length of c's codeword. The
number of bits required to encode the file is

B(T) = Σ_{c in C} f(c) dT(c)

which we define as the cost of the tree T.


For example, the cost of the above tree is

B(T) = Σ_{c in C} f(c) dT(c)
     = 45*1 + 13*3 + 12*3 + 16*3 + 9*4 + 5*4
     = 224

Therefore, the cost of the tree corresponding to the optimal prefix code is
224 (in thousands of characters: 224 * 1,000 = 224,000 bits).
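The cost formula can be evaluated directly, using the fact that dT(c) equals the length of c's codeword (frequencies in thousands, as in the table):

```python
# B(T) = Σ f(c)·dT(c), where dT(c) = length of c's codeword.
freq = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}   # in thousands
code = {'a': '0', 'b': '101', 'c': '100', 'd': '111', 'e': '1101', 'f': '1100'}

cost = sum(freq[c] * len(code[c]) for c in freq)
print(cost)  # 224
```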

Constructing a Huffman code

A greedy algorithm that constructs an optimal prefix code is called a
Huffman code. The algorithm builds the tree T corresponding to the optimal
code in a bottom-up manner. It begins with a set of |C| leaves and performs
|C| - 1 "merging" operations to create the final tree.

Data structure used: a min-priority queue Q.

HUFFMAN(C)
  n = |C|
  Q = C
  for i = 1 to n - 1
      z = ALLOCATE-NODE()
      x = left[z] = EXTRACT-MIN(Q)
      y = right[z] = EXTRACT-MIN(Q)
      f[z] = f[x] + f[y]
      INSERT(Q, z)
  return EXTRACT-MIN(Q)
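The pseudocode above can be sketched in Python with the standard-library `heapq` module as the min-priority queue (the tie-breaking counter is an implementation detail, not part of the algorithm):

```python
import heapq

def huffman(freq):
    """freq: dict character -> frequency. Returns dict character -> codeword."""
    # Nodes are tuples (frequency, tiebreak, char, left, right); the tiebreak
    # counter keeps comparisons on integers even when frequencies are equal.
    counter = 0
    heap = []
    for ch, f in freq.items():
        heap.append((f, counter, ch, None, None))
        counter += 1
    heapq.heapify(heap)                    # Q = C
    for _ in range(len(freq) - 1):         # |C| - 1 merge steps
        x = heapq.heappop(heap)            # x = EXTRACT-MIN(Q)
        y = heapq.heappop(heap)            # y = EXTRACT-MIN(Q)
        counter += 1
        # f[z] = f[x] + f[y]; z is an internal node with children x and y
        heapq.heappush(heap, (x[0] + y[0], counter, None, x, y))
    root = heap[0]

    codes = {}
    def walk(node, path):
        f, _, ch, left, right = node
        if ch is not None:                 # leaf: path from root is the codeword
            codes[ch] = path or '0'        # degenerate single-character alphabet
        else:
            walk(left, path + '0')
            walk(right, path + '1')
    walk(root, '')
    return codes

freq = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}
codes = huffman(freq)
print(sum(freq[c] * len(codes[c]) for c in freq))  # 224 — the optimal cost
```

The exact 0/1 labels can differ from the table in this section depending on tie-breaking, but the codeword lengths, and hence the cost, are the same.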

Analysis
• Q is implemented as a binary min-heap.
• Line 2 can be performed using BUILD-HEAP (p. 145, CLR) in O(n) time.
• The for loop executes n - 1 times, and each heap operation requires
  O(lg n) time,
  => the for loop contributes (n - 1) O(lg n)
  => O(n lg n).
• Thus the total running time of HUFFMAN on a set of n characters is
  O(n lg n).

Operation of the Algorithm

Consider an optimal Huffman code for the following set of frequencies:

a:1  b:1  c:2  d:3  e:5  f:8  g:13  h:21

Note that the frequencies are Fibonacci numbers.

Since there are 8 letters in the alphabet, the initial queue size is n = 8,
and 7 merge steps are required to build the tree. The final tree represents
the optimal prefix code.

Figure

The codeword for a letter is the sequence of the edge labels on the path
from the root to the letter. Thus, the optimal Huffman code is as follows:

h: 1
g: 10
f: 110
e: 1110
d: 11110
c: 111110
b: 1111110
a: 1111111
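The chain shape can be observed without drawing the tree. The sketch below tracks only (subtree frequency, leaf depths) through the merge steps, so the returned depth multiset is exactly the multiset of codeword lengths:

```python
import heapq

def codeword_lengths(freqs):
    """Return the sorted multiset of codeword lengths Huffman assigns."""
    heap = [(f, [0]) for f in freqs]   # each leaf starts at depth 0
    heapq.heapify(heap)
    while len(heap) > 1:
        fx, dx = heapq.heappop(heap)
        fy, dy = heapq.heappop(heap)
        # merging pushes every leaf in both subtrees one level deeper
        heapq.heappush(heap, (fx + fy, [d + 1 for d in dx + dy]))
    return sorted(heap[0][1])

# Fibonacci frequencies give one long limb: depths 1, 2, ..., 7, 7.
print(codeword_lengths([1, 1, 2, 3, 5, 8, 13, 21]))  # [1, 2, 3, 4, 5, 6, 7, 7]
```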

As we can see, the tree is one long limb with the leaves hanging off it.
This is true for Fibonacci weights in general, because the Fibonacci
recurrence

F_{i+1} = F_i + F_{i-1}

implies that Σ_{j=0}^{i} F_j = F_{i+2} - 1.

To prove this, write F_j as F_{j+2} - F_{j+1}; summing from j = 0 to i, the
sum telescopes to F_{i+2} - F_1 = F_{i+2} - 1.
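The identity is easy to check numerically for small i (using the convention F_0 = 0, F_1 = 1):

```python
def fib(n):
    """Return F_n with F_0 = 0, F_1 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

# Σ_{j=0}^{i} F_j == F_{i+2} - 1 for the first several i
for i in range(10):
    assert sum(fib(j) for j in range(i + 1)) == fib(i + 2) - 1
print('identity holds for i = 0..9')
```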

Correctness of Huffman's Code Algorithm

Proof Idea

Step 1: Show that this problem satisfies the greedy choice property, that is,
if a greedy choice is made by Huffman's algorithm, an optimal solution
remains possible.

Step 2: Show that this problem has the optimal-substructure property, that
is, an optimal solution to Huffman's problem contains optimal solutions to
subproblems.

Step 3: Conclude the correctness of Huffman's algorithm using steps 1 and 2.

Lemma (Greedy-Choice Property)  Let C be an alphabet in which each
character c has frequency f[c]. Let x and y be two characters in C having
the lowest frequencies. Then there exists an optimal prefix code for C in
which the codewords for x and y have the same length and differ only in the
last bit.

Proof Idea
Take the tree T representing an optimal prefix code and transform T into a
tree T` representing another optimal prefix code in which the characters x
and y appear as sibling leaves of maximum depth. If we can do this, then
their codewords will have the same length and differ only in the last bit.

Figures

Proof
Let b and c be sibling leaves of maximum depth in tree T. Without loss of
generality, assume that f[b] ≤ f[c] and f[x] ≤ f[y]. Since f[x] and f[y]
are the two lowest leaf frequencies and f[b] and f[c] are arbitrary
frequencies, we have f[x] ≤ f[b] and f[y] ≤ f[c]. As shown in the figure,
exchange the positions of x and b to get T`, and then exchange y and c to
get T``. By the formula B(T) = Σ_{c in C} f(c) dT(c), and noting that the
exchange gives dT`(x) = dT(b) and dT`(b) = dT(x), the difference in cost
between T and T` is

B(T) - B(T`) = [f[x]dT(x) + f[b]dT(b)] - [f[x]dT`(x) + f[b]dT`(b)]
             = [f[x]dT(x) + f[b]dT(b)] - [f[x]dT(b) + f[b]dT(x)]
             = (f[b] - f[x]) (dT(b) - dT(x))
             = (non-negative)(non-negative)
             ≥ 0

Two Important Points

The quantity f[b] - f[x] is non-negative because x is a minimum-frequency
leaf in tree T, and dT(b) - dT(x) is non-negative because b is a leaf of
maximum depth in T.
Similarly, exchanging y and c does not increase the cost, which implies
that B(T`) - B(T``) ≥ 0. These facts imply that B(T``) ≤ B(T), and since T
is optimal by supposition, B(T) ≤ B(T``); hence B(T``) = B(T). Therefore,
T`` is an optimal tree in which x and y are sibling leaves of maximum
depth, from which the greedy-choice property follows. This completes the
proof. □

Lemma (Optimal-Substructure Property)  Let T be a full binary tree
representing an optimal prefix code over an alphabet C, where frequency
f[c] is defined for each character c in C. Consider any two characters x
and y that appear as sibling leaves in T, and let z be their parent. Then,
treating z as a character with frequency f[z] = f[x] + f[y], the tree
T` = T - {x, y} represents an optimal prefix code for the alphabet
C` = (C - {x, y}) ∪ {z}.

Proof Idea

Figure

Proof
We show that the cost B(T) of tree T can be expressed in terms of the cost
B(T`) of tree T`. For each c in C - {x, y}, we have dT(c) = dT`(c), so
f[c]dT(c) = f[c]dT`(c). Since dT(x) = dT(y) = dT`(z) + 1, we have

f[x]dT(x) + f[y]dT(y) = f[x](dT`(z) + 1) + f[y](dT`(z) + 1)
                      = (f[x] + f[y]) dT`(z) + f[x] + f[y]
                      = f[z]dT`(z) + f[x] + f[y]

and therefore B(T) = B(T`) + f[x] + f[y].

If T` is not an optimal prefix code for C`, then there exists a tree T``
whose leaves are the characters of C` such that B(T``) < B(T`). Now, if x
and y are added back to T`` as children of z, we get a prefix code for the
alphabet C with cost B(T``) + f[x] + f[y] < B(T), contradicting the
optimality of T. This implies that tree T` must be optimal for the
alphabet C`. □

Theorem  The procedure HUFFMAN produces an optimal prefix code.

Proof
Let S be the set of integers n ≥ 2 for which the HUFFMAN procedure produces
a tree representing an optimal prefix code for any frequency function f and
alphabet C with |C| = n.
If C = {x, y}, then HUFFMAN produces one of the following optimal trees.

figure

This clearly shows that 2 is a member of S. Next, assume that n belongs to
S and show that n + 1 also belongs to S.
Let C be an alphabet with |C| = n + 1. By the greedy-choice lemma, there
exists an optimal code tree T for the alphabet C in which, without loss of
generality, the minimal-frequency characters x and y (a) are at maximal
depth in T and (b) have a common parent z.
Let T` = T - {x, y} and C` = (C - {x, y}) ∪ {z}; then by the
optimal-substructure lemma (step 2), T` is an optimal code tree for C`.
Since |C`| = n and n belongs to S, the HUFFMAN procedure produces an
optimal code tree T* for C`. Now let T** be the tree obtained from T* by
attaching x and y as children of z.
Without loss of generality, T** is the tree constructed for C by the
HUFFMAN procedure. (If HUFFMAN selects some other characters a and b in its
first step, then f[a] = f[x] and f[b] = f[y], and the tree it constructs
can be transformed, as in the proof of the greedy-choice lemma, into an
equivalent tree with x and y as siblings of maximum depth.) Since T` and T*
are both optimal for C`, we have B(T`) = B(T*), and also B(T**) = B(T).
Why? Because

B(T**) = B(T*) - f[z]dT*(z) + (f[x] + f[y])(dT*(z) + 1)
       = B(T*) + f[x] + f[y]
       = B(T`) + f[x] + f[y]
       = B(T)

Since tree T is optimal for alphabet C, so is T**, and T** is the tree
constructed by the HUFFMAN procedure.
And this completes the proof. □

Theorem  The total cost of a tree for a code can be computed as the sum,
over all internal nodes, of the combined frequencies of the two children of
the node.

Proof
Let T be a full binary tree with n leaves. We apply induction on the number
of leaves in T. When n = 2 (the case n = 1 is trivially true), there are
two leaves x and y with the same parent z, and the cost of T is

B(T) = f[x]dT(x) + f[y]dT(y)
     = f[x] + f[y]                              since dT(x) = dT(y) = 1
     = f[child1 of z] + f[child2 of z]

Thus, the statement of the theorem is true. Now suppose n > 2 and that the
theorem is true for trees with n - 1 leaves.
Let c1 and c2 be two sibling leaves in T with the same parent p. Let T` be
the tree obtained by deleting c1 and c2, where p becomes a leaf with
frequency f[c1] + f[c2]. By induction we know that

B(T`) = Σ_{leaves l` in T`} f[l`]dT`(l`)
      = Σ_{internal nodes i` in T`} (f[child1 of i`] + f[child2 of i`])

Using this information, we calculate the cost of T:

B(T) = Σ_{leaves l in T} f[l]dT(l)
     = Σ_{l ≠ c1, c2} f[l]dT(l) + f[c1](dT(c1) - 1) + f[c2](dT(c2) - 1)
       + f[c1] + f[c2]
     = Σ_{leaves l` in T`} f[l`]dT`(l`) + f[c1] + f[c2]
     = Σ_{internal nodes i` in T`} (f[child1 of i`] + f[child2 of i`])
       + f[c1] + f[c2]
     = Σ_{internal nodes i in T} (f[child1 of i] + f[child2 of i])

Thus the statement is true. And this completes the proof.
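The theorem can be checked numerically. In Huffman's algorithm each merge creates one internal node whose children's frequencies sum to the merged frequency, so the sum over internal nodes is simply the running total of merged frequencies:

```python
import heapq

def both_costs(freqs):
    """Return (leaf-depth cost, sum over internal nodes of children's freqs)."""
    heap = [(f, [(f, 0)]) for f in freqs]   # (subtree freq, [(leaf freq, depth)])
    heapq.heapify(heap)
    internal_sum = 0
    while len(heap) > 1:
        fx, lx = heapq.heappop(heap)
        fy, ly = heapq.heappop(heap)
        internal_sum += fx + fy             # children of the new internal node
        heapq.heappush(heap, (fx + fy, [(f, d + 1) for f, d in lx + ly]))
    depth_cost = sum(f * d for f, d in heap[0][1])   # Σ f(l)·dT(l)
    return depth_cost, internal_sum

print(both_costs([45, 13, 12, 16, 9, 5]))  # (224, 224) — the two costs agree
```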

The question is whether Huffman's algorithm can be generalized to handle
ternary codewords, that is, codewords using the symbols 0, 1, and 2. To
restate the question: does some generalized version of Huffman's algorithm
yield optimal ternary codes? Basically, the algorithm is similar to the
binary case given in the CLR textbook. That is, pick the three nodes (not
two) with the least frequencies and form a new node whose frequency equals
the sum of those three frequencies; then repeat the procedure. However,
each merge reduces the number of nodes by two, so when the number of nodes
is even a full ternary tree is not possible. We take care of this by
inserting a null node with zero frequency.
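The ternary variant just described can be sketched as follows; the zero-frequency padding rule is the one stated above, applied when the leaf count is even:

```python
import heapq

def ternary_lengths(freqs):
    """Sorted codeword lengths of the ternary Huffman variant (a sketch)."""
    freqs = list(freqs)
    if len(freqs) % 2 == 0:
        freqs.append(0)                    # null node with zero frequency
    heap = [(f, [0]) for f in freqs]       # (subtree freq, leaf depths)
    heapq.heapify(heap)
    while len(heap) > 1:
        merged_f, merged_d = 0, []
        for _ in range(3):                 # merge the THREE least-frequent nodes
            f, d = heapq.heappop(heap)
            merged_f += f
            merged_d += [x + 1 for x in d]
        heapq.heappush(heap, (merged_f, merged_d))
    return sorted(heap[0][1])

# Six characters -> pad to seven leaves; 7 leaves need (7-1)/2 = 3 merges.
print(ternary_lengths([45, 13, 12, 16, 9, 5]))
```

The returned list includes a length for the null node; its codeword is simply never used when encoding.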

Correctness
The proof follows immediately from the greedy-choice property and the
optimal-substructure property; in other words, it is similar to the
correctness proof of Huffman's algorithm in CLR.
