
Chapter 3 HUFFMAN CODING

Yeuan-Kuen Lee [ MCU, CSIE ]

Outline

3.1 Overview
3.2 The Huffman Coding Algorithm
    3.2.1 Minimum Variance Huffman Codes
    3.2.2 Optimality of Huffman Codes (*)
    3.2.3 Length of Huffman Codes (*)
    3.2.4 Extended Huffman Codes (*)
3.3 Nonbinary Huffman Codes (*)
3.4 Adaptive Huffman Coding
    3.4.1 Update Procedure
    3.4.2 Encoding Procedure
    3.4.3 Decoding Procedure
3.5 Golomb Codes
3.6 Rice Codes
    3.6.1 CCSDS Recommendation for Lossless Compression
3.7 Tunstall Codes
3.8 Applications of Huffman Coding
    3.8.1 Lossless Image Compression
    3.8.2 Text Compression
    3.8.3 Audio Compression
3.9 Summary
3.10 Projects and Problems

3.1 Overview

In this chapter, we describe a very popular coding algorithm called the Huffman coding algorithm. We will:
- present a procedure for building Huffman codes when the probability model for the source is known;
- present a procedure for building codes when the source statistics are unknown;
- describe techniques for code design that are in some sense similar to the Huffman coding approach;
- look at some applications.

3.2 The Huffman Coding Algorithm

This technique was developed by David Huffman as part of a class assignment; the class was the first ever in the area of information theory and was taught by Robert Fano at MIT. The codes generated using this technique are called Huffman codes. These codes are
- prefix codes, and
- optimum for a given model (set of probabilities).
The technique is based on two observations regarding optimum prefix codes:
1. In an optimum code, symbols that occur more frequently (have a higher probability of occurrence) will have shorter codewords than symbols that occur less frequently.
2. In an optimum code, the two symbols that occur least frequently will have codewords of the same length.

3.2 The Huffman Coding Algorithm

In an optimum code, the two symbols that occur least frequently will have codewords of the same length. To see this, suppose an optimum code C exists in which the two codewords corresponding to the two least probable symbols do not have the same length, and suppose the longer codeword is k bits longer than the shorter one. Because C is a prefix code, the shorter codeword cannot be a prefix of the longer one, so even after we drop the last k bits of the longer codeword the two codewords remain distinct. As these codewords correspond to the least probable symbols in the alphabet, no other codeword can be longer than these codewords; therefore there is no danger that the shortened codeword would become the prefix of some other codeword.


3.2 The Huffman Coding Algorithm

Furthermore, by dropping these k bits we obtain a new code that has a shorter average length than C. But this violates our initial contention that C is an optimal code. Therefore, for an optimal code the second observation also holds true.

A simple requirement
The Huffman procedure adds a simple requirement to these two observations: the codewords corresponding to the two lowest probability symbols differ only in the last bit. That is, if γ and δ are the two least probable symbols in an alphabet, and the codeword for γ is m*0, then the codeword for δ is m*1. Here m is a string of 1s and 0s, and * denotes concatenation.

3.2 The Huffman Coding Algorithm

Example 3.2.1 Design of a Huffman code
An alphabet A = { a1, a2, a3, a4, a5 } with
P( a1 ) = P( a3 ) = 0.2,  P( a2 ) = 0.4,  P( a4 ) = P( a5 ) = 0.1
The entropy is
H = -2 * 0.2 log2(0.2) - 0.4 log2(0.4) - 2 * 0.1 log2(0.1) = 2.122 bits/symbol

Table 3.1 The initial five-letter alphabet
Letter    Probability    Codeword
a2        0.4            c(a2)
a1        0.2            c(a1)
a3        0.2            c(a3)
a4        0.1            c(a4)
a5        0.1            c(a5)

The two symbols with the lowest probability are a4 and a5. We assign their codewords as
c(a4) = α1 * 0
c(a5) = α1 * 1
where α1 is a binary string and * denotes concatenation.
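As a quick arithmetic check, the entropy stated above can be reproduced in a couple of lines; a minimal sketch of our own (variable names are illustrative):

import math

probs = [0.4, 0.2, 0.2, 0.1, 0.1]          # P(a2), P(a1), P(a3), P(a4), P(a5)
entropy = -sum(p * math.log2(p) for p in probs)
print(round(entropy, 3))                    # 2.122 bits/symbol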


3.2 The Huffman Coding Algorithm

Define a new alphabet A' = { a1, a2, a3, a4' } where a4' is composed of a4 and a5, and P( a4' ) = P( a4 ) + P( a5 ) = 0.2.

Table 3.2 The reduced four-letter alphabet
Letter    Probability    Codeword
a2        0.4            c(a2)
a1        0.2            c(a1)
a3        0.2            c(a3)
a4'       0.2            α1

In this alphabet A', a3 and a4' are the two letters at the bottom of the sorted list. We assign their codewords as
c(a3)  = α2 * 0
c(a4') = α2 * 1
but c(a4') = α1. Therefore α1 = α2 * 1, which means that
c(a4) = α1 * 0 = α2 * 10
c(a5) = α1 * 1 = α2 * 11

3.2 The Huffman Coding Algorithm

We again define a new alphabet A'' = { a1, a2, a3' } where a3' is composed of a3 and a4', and P( a3' ) = P( a3 ) + P( a4' ) = 0.4.

Table 3.3 The reduced three-letter alphabet
Letter    Probability    Codeword
a2        0.4            c(a2)
a3'       0.4            α2
a1        0.2            c(a1)

In this case, the least probable symbols are a3' and a1. Therefore,
c(a3') = α3 * 0
c(a1)  = α3 * 1
but c(a3') = α2. Therefore α2 = α3 * 0, which means that
c(a3) = α2 * 0  = α3 * 00
c(a4) = α2 * 10 = α3 * 010
c(a5) = α2 * 11 = α3 * 011


3.2 The Huffman Coding Algorithm

We again define a new alphabet A''' = { a3'', a2 } where a3'' is composed of a3' and a1, and P( a3'' ) = P( a3' ) + P( a1 ) = 0.6.

Table 3.4 The reduced two-letter alphabet
Letter    Probability    Codeword
a3''      0.6            α3
a2        0.4            c(a2)

We now have only two letters, so the codeword assignment is straightforward:
c(a3'') = 0
c(a2)   = 1
but c(a3'') = α3. Therefore α3 = 0, which means that
c(a1) = α3 * 1   = 01
c(a3) = α3 * 00  = 000
c(a4) = α3 * 010 = 0010
c(a5) = α3 * 011 = 0011

3.2 The Huffman Coding Algorithm

Table 3.5 Huffman code for the original five-letter alphabet
Letter    Probability    Codeword
a2        0.4            1
a1        0.2            01
a3        0.2            000
a4        0.1            0010
a5        0.1            0011

The average length for this code is
l = 0.4*1 + 0.2*2 + 0.2*3 + 0.1*4 + 0.1*4 = 2.2 bits/symbol
A measure of the efficiency of this code is its redundancy, the difference between the average length and the entropy. In this case, the redundancy = 2.2 - 2.122 = 0.078 bits/symbol.
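The average length and redundancy can likewise be verified with a few lines; again a small sketch of our own (the dictionary layout is illustrative):

import math

code = {"a2": ("1", 0.4), "a1": ("01", 0.2), "a3": ("000", 0.2),
        "a4": ("0010", 0.1), "a5": ("0011", 0.1)}        # Table 3.5

avg_len = sum(p * len(cw) for cw, p in code.values())
entropy = -sum(p * math.log2(p) for _, p in code.values())
print(round(avg_len, 1), round(avg_len - entropy, 3))     # 2.2  0.078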


3.2 The Huffman Coding Algorithm

[Figure 3.1 The Huffman encoding procedure. The symbol probabilities are listed in parentheses; at each stage the letters a2 (0.4), a1 (0.2), a3 (0.2), a4 (0.1), a5 (0.1) are sorted by probability and the two least probable letters are combined, with 0/1 labels on the branches.]

3.2 The Huffman Coding Algorithm

We build the binary tree starting at the leaf nodes.

[Figure 3.2 Building the binary Huffman tree: the leaves a2 (0.4), a1 (0.2), a3 (0.2), a4 (0.1), a5 (0.1) are combined pairwise into internal nodes of weight 0.2, 0.4, 0.6, and finally the root (1.0), with 0/1 labels on the branches.]

Notice the similarity between Figures 3.1 and 3.2. This is not surprising, as they are a result of viewing the same procedure in two different ways.
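For readers who want to experiment, the tree-building view can be sketched with a priority queue. This is our own minimal illustration, not the book's code; ties may be broken differently than in Figure 3.2, so the exact bit patterns can differ while the codeword lengths stay optimal:

import heapq
from itertools import count

def huffman_code(probabilities):
    """Build a binary Huffman code by repeatedly merging the two least probable subtrees."""
    tick = count()                                    # tie-breaker so heap entries always compare
    heap = [(p, next(tick), {sym: ""}) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)            # least probable subtree
        p1, _, code1 = heapq.heappop(heap)            # second least probable subtree
        merged = {s: "0" + c for s, c in code0.items()}
        merged.update({s: "1" + c for s, c in code1.items()})
        heapq.heappush(heap, (p0 + p1, next(tick), merged))
    return heap[0][2]

print(huffman_code({"a1": 0.2, "a2": 0.4, "a3": 0.2, "a4": 0.1, "a5": 0.1}))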

3.2.1 Minimum Variance Huffman Codes

The Huffman procedure is not unique when there are ties in the probabilities. Starting again from the reduced four-letter alphabet of Table 3.2, we can instead place the combined letter a4' as high as possible in the sorted list (Table 3.6), and do the same at each later step.

Table 3.2 Reduced four-letter alphabet
Letter    Probability    Codeword
a2        0.4            c(a2)
a1        0.2            c(a1)
a3        0.2            c(a3)
a4'       0.2            α1

Table 3.6 Reduced four-letter alphabet (combined letter placed as high as possible)
Letter    Probability    Codeword
a2        0.4            c(a2)
a4'       0.2            α1
a1        0.2            c(a1)
a3        0.2            c(a3)

3.2.1 Minimum Variance Huffman Codes

Table 3.7 Reduced three-letter alphabet (a1' combines a1 and a3)
Letter    Probability    Codeword
a1'       0.4            α2
a2        0.4            c(a2)
a4'       0.2            α1

Table 3.8 Reduced two-letter alphabet (a2' combines a2 and a4')
Letter    Probability    Codeword
a2'       0.6            α3
a1'       0.4            α2

3.2.1 Minimum Variance Huffman Codes

Table 3.9 Minimum variance Huffman code
Letter    Probability    Codeword
a1        0.2            10
a2        0.4            00
a3        0.2            11
a4        0.1            010
a5        0.1            011

3.2.1 Minimum Variance Huffman Codes

[Figure 3.3 The minimum variance Huffman encoding procedure: the same combining steps as in Figure 3.1, with the combined letter placed as high as possible in the sorted list at each stage.]

The average length for this code is
l = 0.4*2 + 0.2*2 + 0.2*2 + 0.1*3 + 0.1*3 = 2.2 bits/symbol
These two codes are identical in terms of their redundancy. However, the variance of the length of the codewords is significantly different.
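The last remark can be made concrete with a short arithmetic sketch of our own, using the codeword lengths from Tables 3.5 and 3.9:

first_code   = [(0.4, 1), (0.2, 2), (0.2, 3), (0.1, 4), (0.1, 4)]   # lengths from Table 3.5
min_var_code = [(0.4, 2), (0.2, 2), (0.2, 2), (0.1, 3), (0.1, 3)]   # lengths from Table 3.9

def length_variance(code):
    mean = sum(p * l for p, l in code)               # 2.2 bits/symbol for both codes
    return sum(p * (l - mean) ** 2 for p, l in code)

print(round(length_variance(first_code), 2),
      round(length_variance(min_var_code), 2))       # 1.36 vs 0.16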

3.2.1 Minimum Variance Huffman Codes

[Figure 3.4 Two Huffman trees corresponding to the same probabilities: the tree of Figure 3.2 and the minimum variance tree.]

3.4 Adaptive Huffman Coding

Two parameters are added to the binary tree:
1. Weight: for an external node (leaf), the number of times the symbol has been encountered; for an internal node, the sum of the weights of its offspring.
2. Node number: a unique number assigned to each node.
For an alphabet of size n there are 2n-1 nodes (internal + external), with node numbers y1, y2, y3, ..., y(2n-1) and weights x1, x2, x3, ..., x(2n-1) such that x1 <= x2 <= x3 <= ... <= x(2n-1).

Sibling property: nodes y(2j-1) and y(2j) are siblings for 1 <= j < n, and the node number of their parent is greater than y(2j-1) and y(2j).

3.4 Adaptive Huffman Coding

[Diagram: symbols enter the transmitter, which produces a code stream such as 01101 for the receiver.]

Before the beginning of transmission, a fixed code for each symbol is agreed upon between transmitter and receiver, and both start from the same initial tree, which consists of a single NYT (not yet transmitted) node.

If the source has an alphabet ( a1, a2, ..., am ) of size m, then pick e and r such that m = 2^e + r and 0 <= r < 2^e.
Example: m = 26, 26 = 2^4 + 10, so e = 4 and r = 10.

The letter ak is encoded as
- the (e+1)-bit binary representation of k-1, if 1 <= k <= 2r;
- the e-bit binary representation of k-r-1, otherwise.

For m = 26:
a1:  1 <= 2*10,  send the binary code for 1-1 = 0:      00000 (5 bits)
a2:  2 <= 2*10,  send the binary code for 2-1 = 1:      00001 (5 bits)
a22: 22 > 2*10,  send the binary code for 22-10-1 = 11: 1011 (4 bits)

As transmission progresses, nodes corresponding to symbols transmitted will be added to the tree, and the tree is reconfigured using an update procedure.
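This fixed code is easy to implement; a minimal sketch of our own (the function name is illustrative):

def fixed_code(k, e, r):
    """Fixed binary code for the k-th letter (1-based) of an alphabet of size 2**e + r."""
    if 1 <= k <= 2 * r:
        return format(k - 1, "0{}b".format(e + 1))     # (e+1)-bit representation of k-1
    return format(k - r - 1, "0{}b".format(e))         # e-bit representation of k-r-1

# m = 26 = 2**4 + 10, so e = 4 and r = 10
print(fixed_code(1, 4, 10), fixed_code(2, 4, 10), fixed_code(22, 4, 10))
# 00000 00001 1011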

3.4 Adaptive Huffman Coding

When a symbol is encountered for the first time:
1. the code for the NYT node is transmitted,
2. followed by the fixed code for the symbol,
3. a node for the symbol is created,
4. and the symbol is taken out of the NYT list.

Both transmitter and receiver start with the same tree structure, and the update procedure is identical on both sides. Therefore, the encoding and decoding processes remain synchronized.

3.4.1 Update Procedure

The update procedure requires that the nodes be in a fixed order. This ordering is preserved by numbering the nodes. The largest node number is given to the root of the tree, and the smallest number is assigned to the NYT node. The numbers from the NYT node to the root are assigned in increasing order from left to right, and from lower to upper levels. The set of nodes with the same weight makes up a block. The function of the update procedure is to preserve the sibling property.

3.4.1 Update Procedure

Figure 3.6 Update procedure (flowchart). In outline:
1. START. Is this the first appearance of the symbol?
2. If yes: the NYT node gives birth to a new NYT node and an external node for the symbol; the weights of the new external node and of the old NYT node are incremented; go to the old NYT node and continue at step 5.
3. If no: go to the symbol's external node.
4. If the node number is not the maximum in its block, switch the node with the highest numbered node in the block. Then increment the node weight.
5. If this is the root node, STOP. Otherwise go to the parent node and repeat from step 4.
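The flowchart translates into code fairly directly. The following is a compact, illustrative Python sketch of the update procedure; class and method names such as Node, AdaptiveTree, and update are ours, and the NYT node is taken as the left child and the new external node as the right child, as in the example that follows:

class Node:
    def __init__(self, number, weight=0, symbol=None, parent=None):
        self.number = number          # unique node number, kept in sibling-property order
        self.weight = weight          # symbol count for leaves, sum of children otherwise
        self.symbol = symbol          # None for internal nodes and for the NYT node
        self.parent = parent
        self.left = self.right = None

class AdaptiveTree:
    def __init__(self, alphabet_size):
        self.root = self.nyt = Node(2 * alphabet_size - 1)   # initial tree: NYT node only
        self.nodes = [self.root]                             # every node, for block searches
        self.leaf = {}                                       # symbol -> external node

    def _swap(self, a, b):
        """Exchange the tree positions of two nodes; node numbers stay with the positions."""
        a.number, b.number = b.number, a.number
        pa, pb = a.parent, b.parent
        if pa is pb:
            pa.left, pa.right = pa.right, pa.left
        else:
            if pa.left is a: pa.left = b
            else:            pa.right = b
            if pb.left is b: pb.left = a
            else:            pb.right = a
            a.parent, b.parent = pb, pa

    def update(self, symbol):
        node = self.leaf.get(symbol)
        if node is None:
            # First appearance: the NYT node gives birth to a new NYT node and a leaf.
            old = self.nyt
            old.left = self.nyt = Node(old.number - 2, parent=old)
            old.right = leaf = Node(old.number - 1, weight=1, symbol=symbol, parent=old)
            old.weight += 1
            self.nodes += [leaf, self.nyt]
            self.leaf[symbol] = leaf
            node = old.parent             # weights below were already incremented
        while node is not None:
            # Swap with the highest-numbered node of the same weight, unless it is the parent.
            block = [n for n in self.nodes
                     if n.weight == node.weight and n.number > node.number]
            if block:
                top = max(block, key=lambda n: n.number)
                if top is not node.parent:
                    self._swap(node, top)
            node.weight += 1              # increment node weight
            node = node.parent            # go to parent node; stop after the root

tree = AdaptiveTree(26)
for s in "aardv":                         # the message prefix used in Example 3.4.1
    tree.update(s)
print(tree.root.weight)                   # 5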

3.4.1 Update Procedure

Example 3.4.1 Update procedure
Message [ a a r d v a r k ], where the alphabet consists of the 26 lowercase letters of the English alphabet. Total number of nodes = 2 * 26 - 1 = 51.

(a) The encoder starts with the initial tree, a single NYT node numbered 51. For the first a, send the fixed binary code 00000, since the index of a is 1. The NYT node gives birth to a new NYT node (49, weight 0) and an external node for a (50, weight 1); the root (51) now has weight 1. For the second a, send 1, the path from the root to a's node; the weights of a's node and of the root become 2, giving the tree for ( aa ).

3.4.1 Update Procedure

( aa ) → r: Send 0 for the NYT node, then send the fixed code for r. Since the index of r is 18, the fixed code is the 5-bit representation of 17, i.e. 10001. Updating the tree for r: the old NYT node (49) gives birth to a new NYT node (47, weight 0) and an external node for r (48, weight 1); node 49 gets weight 1 and the root weight becomes 3, giving the tree for ( aar ).

3.4.1 Update Procedure

( aar ) → d: Send 00 for the NYT node, then send the fixed code for d. Since the index of d is 4, the fixed code is the 5-bit representation of 3, i.e. 00011. Updating the tree for d: the old NYT node (47) gives birth to a new NYT node (45, weight 0) and an external node for d (46, weight 1); the weights along the path to the root are incremented, giving node 47 weight 1, node 49 weight 2, and the root (51) weight 4. No swap is needed, because each node on this path already has the highest number in its block. This is the tree for ( aard ).

3.4.1 Update Procedure

( aard ) → v: Send 000 for the NYT node, then send the fixed code for v. Since the index of v is 22 and 22 > 2r = 20, the fixed code is the 4-bit representation of 22 - 10 - 1 = 11, i.e. 1011. Updating the tree for v: the old NYT node (45) gives birth to a new NYT node (43, weight 0) and an external node for v (44, weight 1), and node 45 gets weight 1. Its parent, node 47, is now no longer the highest-numbered node in its block (node 48, r, has the same weight), so nodes 47 and 48 are swapped ("Swap nodes"); the subtree containing the NYT node, v, and d now carries node number 48, and its weight is incremented to 2.

3.4.1 Update Procedure

Continuing the update for v: node 49 (weight 2) is in turn no longer the highest-numbered node in its block (node 50, a, has the same weight), so nodes 49 and 50 are also swapped ("Swap nodes"); the swapped subtree's weight becomes 3, and the root weight becomes 5. The final tree for ( aardv ) is: root (51, weight 5) with children a (49, weight 2) and an internal node (50, weight 3); under node 50 are r (47, weight 1) and an internal node (48, weight 2); under node 48 are an internal node (45, weight 1) and d (46, weight 1); under node 45 are the NYT node (43, weight 0) and v (44, weight 1).

3.4.2 Encoding Procedure

Figure 3.8 (a) flowchart of the encoding procedure:
1. START: read in a symbol.
2. If this is the first appearance of the symbol, send the code for the NYT node followed by the index in the NYT list.
3. Otherwise, the code is the path from the root node to the corresponding node.
4. Call the update procedure and continue at A.
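The "path from the root node" box can be implemented with a short helper once the tree stores parent pointers, as in the Node sketch given earlier; this is an illustrative helper, not the book's code:

def path_code(node):
    """Code for a node: the path from the root, 0 for a left branch and 1 for a right branch."""
    bits = []
    while node.parent is not None:
        bits.append("0" if node.parent.left is node else "1")
        node = node.parent
    return "".join(reversed(bits))

# e.g. path_code(tree.leaf["a"]) or path_code(tree.nyt) on the AdaptiveTree sketch above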

3.4.2 Encoding Procedure

Figure 3.8 (b) flowchart of the encoding procedure (continued):
A: Is this the last symbol? If no, go back and read in the next symbol; if yes, STOP.

3.4.2 Encoding Procedure

Example 3.4.2 Encoding procedure
Encoding the message of Example 3.4.1 with the procedure above produces the bit stream shown on the slides:

000001
010001000001100010110

(00000 for the first a, 1 for the second a, 0 followed by 10001 for r, 00 followed by 00011 for d, 000 followed by 1011 for v, and so on.)

3.4.3 Decoding Procedure

Figure 3.9 (a) flowchart of the decoding procedure:
START (B): go to the root of the tree. While the current node is not an external node, read a bit and go to the corresponding child node. When an external node is reached: if it is not the NYT node, decode the element corresponding to the node and continue at C; if it is the NYT node, continue at A.

3.4.3 Decoding Procedure

Figure 3.9 (b) flowchart of the decoding procedure (continued):
A: read e bits and let p be the e-bit number. If p is less than r, read one more bit (p is then the (e+1)-bit number); otherwise add r to p. Continue at D.
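The NYT branch of Figure 3.9 (b) and (c) can be written as a small helper; read_bit is an assumed callable returning the next received bit, and the function name is ours (an illustrative sketch):

def decode_nyt_index(read_bit, e, r):
    """Return the 1-based index k of a newly seen symbol, inverting the fixed code."""
    p = 0
    for _ in range(e):                 # read e bits
        p = (p << 1) | read_bit()
    if p < r:                          # the fixed code had e+1 bits: read one more
        p = (p << 1) | read_bit()
    else:                              # the fixed code had e bits: add r to p
        p = p + r
    return p + 1                       # decode the (p+1)-th element in the NYT list

bits = iter([1, 0, 0, 0, 1])           # the fixed code 10001 sent for r in Example 3.4.1
print(decode_nyt_index(lambda: next(bits), 4, 10))   # 18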

3.4.3 Decoding Procedure

Figure 3.9 (c) flowchart of the decoding procedure (continued):
D: decode the (p+1)-th element in the NYT list.
C, D: call the update procedure. If this is the last bit, STOP; otherwise go back to B (the root of the tree).

3.4.3 Decoding Procedure

Example 3.4.3 Decoding procedure
The decoder receives the bit stream produced in Example 3.4.2:

000001
010001000001100010110

Starting from the same initial tree (a single NYT node) and applying the same update procedure as the encoder, it recovers the original message symbol by symbol.

3.8 Applications of Huffman Coding

3.8.1 Lossless Image Compression

Table 3.23 Compression using Huffman codes on pixel values
Image Name    Bits/Pixel    Total Size (bytes)    Compression Ratio
Sena          7.01          57,504                1.14
Sensin        7.49          61,430                1.07
Earth         4.94          40,534                1.62
Omaha         7.12          58,374                1.12

Table 3.24 Compression using Huffman codes on pixel difference values
Image Name    Bits/Pixel    Total Size (bytes)    Compression Ratio
Sena          4.02          32,968                1.99
Sensin        4.70          38,541                1.70
Earth         4.13          33,880                1.93
Omaha         6.42          52,643                1.24

Figure 3.10 Test images (Sena, Sensin, Earth, Omaha): 256*256 gray-scale raw images.
ftp://ftp.mkp.com/pub/Sayood/uncompressed_software/datasets/images/
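A rough way to reproduce the gap between Tables 3.23 and 3.24 is to compare the first-order entropy of the pixel values with that of the pixel differences; a minimal NumPy sketch of our own (the file name is a placeholder for one of the raw test images):

import numpy as np

def entropy_bits(values):
    """Empirical first-order entropy, in bits per sample."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

img  = np.fromfile("sena.raw", dtype=np.uint8).reshape(256, 256)   # placeholder path
diff = np.diff(img.astype(np.int16), axis=1)                       # horizontal differences
print(entropy_bits(img), entropy_bits(diff))   # differences usually need fewer bits/pixel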

3.8.1 Lossless Image Compression

Table 3.25 Compression using adaptive Huffman codes on pixel difference values
Image Name    Bits/Pixel    Total Size (bytes)    Compression Ratio
Sena          3.93          32,261                2.03
Sensin        4.63          37,896                1.73
Earth         4.82          39,504                1.66
Omaha         6.39          52,321                1.25

Adaptive Huffman coder
- Advantage: can be used as an on-line or real-time coder.
- Disadvantages: more vulnerable to errors; more difficult to implement.

3.8.2 Text Compression

Table 3.26 Probabilities of occurrence of the letters in the English alphabet in the U.S. Constitution
Letter  Probability    Letter  Probability    Letter  Probability
A       0.057305       J       0.002031       S       0.060289
B       0.014876       K       0.001016       T       0.078085
C       0.025775       L       0.031403       U       0.018474
D       0.026811       M       0.015892       V       0.009882
E       0.112578       N       0.056035       W       0.007576
F       0.022875       O       0.058215       X       0.002264
G       0.009523       P       0.021034       Y       0.011702
H       0.042915       Q       0.000973       Z       0.001502
I       0.053475       R       0.048819

3.8.2 Text Compression

Table 3.27 Probabilities of occurrence of the letters in the English alphabet in this chapter
Letter  Probability    Letter  Probability    Letter  Probability
A       0.049885       J       0.000394       S       0.042657
B       0.016110       K       0.002450       T       0.061142
C       0.025835       L       0.025835       U       0.015794
D       0.030232       M       0.016494       V       0.004988
E       0.097434       N       0.048039       W       0.012207
F       0.019745       O       0.050642       X       0.003413
G       0.012053       P       0.015007       Y       0.008466
H       0.035723       Q       0.001509       Z       0.001050
I       0.048783       R       0.040492

3.8.2 Text Compression

[Bar charts of the letter probabilities (0 to 0.12, letters A through Z) for the U.S. Constitution and for this chapter.]
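A table like 3.26 or 3.27 can be generated for any text with a few lines of code; the following sketch of our own also prints the first-order entropy of the letter distribution:

import math
from collections import Counter

def letter_probabilities(text):
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    counts = Counter(letters)
    total = sum(counts.values())
    return {c: counts[c] / total for c in sorted(counts)}

probs = letter_probabilities("this chapter describes the huffman coding algorithm")
entropy = -sum(p * math.log2(p) for p in probs.values())
print(probs, round(entropy, 2))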

3.8.3 Audio Compression

CD-quality audio data:
- Each stereo channel is sampled at 44.1 kHz.
- Each sample is represented by 16 bits.
(So the amount of data stored on one CD is enormous.)

3.8.3 Audio Compression

With 16 bits per sample there are 65,536 distinct values, so a Huffman coder would require 65,536 distinct (variable-length) codewords. In most applications, a codebook of this size would not be practical. Techniques for handling such large alphabets include recursive indexing (Chapter 8) and other methods [180].

Table 3.28 Huffman codes of 16-bit CD-quality audio
File Name    Original File Size (bytes)    Entropy (bits)    Estimated Compressed File Size (bytes)    Compression Ratio
Mozart       939,862                       12.8              725,420                                   1.30
Cohn         402,442                       13.8              349,300                                   1.15
Mir          884,020                       13.7              759,540                                   1.16

Table 3.29 Huffman codes of differences of 16-bit CD-quality audio
File Name    Original File Size (bytes)    Entropy (bits)    Estimated Compressed File Size (bytes)    Compression Ratio
Mozart       939,862                       9.7               569,792                                   1.65
Cohn         402,442                       10.4              261,590                                   1.54
Mir          884,020                       10.9              602,240                                   1.47
