
Assignment-2, EITN45

Sakib Bin Redhwan, wir14sre@student.lu.se, 900128T311

Objective:
The objective of this assignment is to estimate the probability distribution of the letters of a text
and to construct an optimal source code based on that estimate. The aim also includes
compressing the text and comparing the average codeword length with the entropy of the
estimated distribution.
Method:
Theory shows that the Huffman code is an optimal source code for a given source [1]. So, in
this assignment the text provided in the file LifeOnMars.txt was encoded with a Huffman
code.
The problem was entirely solved in MATLAB; the code is attached in the appendix section.
The outline of the methodology is given below:
1. The probabilities of all characters in the text were found.
2. The coding was done following the Huffman algorithm [1].
3. The average code length was determined using the Path Length Lemma [1].
4. The entropy was found using the custom function Entropy.m (also appended).
5. The total length of the encoded text was found by multiplying the occurrences of each
letter by its code length.
6. The total length of the un-coded text was found by multiplying the total number of
letters in the text by 8.
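Steps 1 and 6 above can be sketched as follows. This is an illustrative Python port, not the submitted MATLAB code (the file name LifeOnMars.txt is the one given in the assignment; the sample string here is a stand-in):

```python
from collections import Counter
import math

def estimate_distribution(text):
    """Step 1: count each character and normalize to a probability distribution."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}, counts

def entropy_bits(prob):
    """Step 4: entropy H(X) in bits of an estimated distribution."""
    return -sum(p * math.log2(p) for p in prob.values() if p > 0)

text = "life on mars"            # stand-in for the contents of LifeOnMars.txt
prob, counts = estimate_distribution(text)
H = entropy_bits(prob)
uncoded_bits = 8 * len(text)     # step 6: 8 bits per un-coded character
```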


Results:
The distribution of the letters, along with the assigned codes, is given below.

Letter         Probability  Occurrence  Code       Code length  #Zeros  #Ones
[New line]     0.0343       43          01100      5            3       2
[Space]        0.1733       217         000        3            3       0
[Apostrophe]   0.0104       13          111011     6            1       5
'a'            0.0575       72          1000       4            3       1
'b'            0.0144       18          011011     6            2       4
'c'            0.0136       17          111010     6            2       4
'd'            0.0208       26          001010     6            4       2
'e'            0.0942       118         110        3            1       2
'f'            0.0192       24          001101     6            3       3
'g'            0.0200       25          001011     6            3       3
'h'            0.0527       66          1010       4            2       2
'i'            0.0543       68          1001       4            2       2
'k'            0.0152       19          011010     6            3       3
'l'            0.0351       44          00111      5            2       3
'm'            0.0256       32          11100      5            2       3
'n'            0.0519       65          1011       4            1       3
'o'            0.0727       91          0100       4            3       1
'p'            0.0032       4           001100010  9            6       3
'r'            0.0447       56          1111       4            0       4
's'            0.0599       75          0111       4            1       3
't'            0.0671       84          0101       4            2       2
'u'            0.0216       27          001000     6            5       1
'v'            0.0064       8           00110000   8            6       2
'w'            0.0216       27          001001     6            4       2
'y'            0.0096       12          0011001    7            4       3
'z'            0.0008       1           001100011  9            5       4
Entropy of the estimated probability distribution = 4.1770 (Variable = Entropy_coded)
The average number of code bits per source symbol = 4.2149 (Variable = avg_code_length)
Total length of the encoded text = 5277 (Variable = Length_coded_text)
Total length of the un-coded text = 10016 (Variable = Length_uncoded_text)
Compression ratio = 1.8980 (Variable = compression_ratio)
Distribution of zeros and ones (Variables = Dist_zeros, Dist_ones):
Probability of zeros = 0.5446
Probability of ones = 0.4554
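The reported figures can be cross-checked directly from the occurrence counts and code lengths in the table. The following is an illustrative Python check, not part of the submitted MATLAB code:

```python
import math

# Occurrence counts and Huffman code lengths in the table's row order
# (newline, space, apostrophe, then a..z without j, q, x).
occ = [43, 217, 13, 72, 18, 17, 26, 118, 24, 25, 66, 68, 19,
       44, 32, 65, 91, 4, 56, 75, 84, 27, 8, 27, 12, 1]
length = [5, 3, 6, 4, 6, 6, 6, 3, 6, 6, 4, 4, 6,
          5, 5, 4, 4, 9, 4, 4, 4, 6, 8, 6, 7, 9]

n = sum(occ)                                      # 1252 source symbols
coded = sum(o * l for o, l in zip(occ, length))   # total encoded bits
uncoded = 8 * n                                   # 8 bits per un-coded symbol
avg_len = coded / n                               # average codeword length
H = -sum(o / n * math.log2(o / n) for o in occ)   # entropy of the estimate
kraft = sum(2.0 ** -l for l in length)            # Kraft sum; 1.0 = complete code
ratio = uncoded / coded                           # compression ratio
```

The check reproduces the reported values: 5277 encoded bits, 10016 un-coded bits, an average length of about 4.2149 bits/symbol against an entropy of about 4.1770 bits, and a ratio of about 1.898. The Kraft sum is exactly 1, confirming the code tree is complete.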


Conclusion:
From the results section we can see that the average code length is very close to the entropy of
the distribution. It satisfies the optimality condition H(X) <= L < H(X) + 1, since
4.1770 <= 4.2149 < 5.1770.
The compression ratio is satisfactory, and the distribution of zeros and ones is close to
uniform (0.5446 vs. 0.4554).
References:
[1] Information Theory Engineering, Stefan Höst, Lund University.


APPENDIX
The main code to find the probabilities and the Huffman codes
clc;
clear;
fid = fopen('LifeOnMars.txt');
Txt = fscanf(fid,'%c');
fclose(fid);
j=1;
for i='a':'z'
    result(j)=length(find(Txt==i));
    j=j+1;
end
data='a':'z';
result(27)=length(find(Txt==char(39)));
data(27)=char(39);
result(28)=length(find(Txt==char(32)));
data(28)=char(32);
result(29)=length(find(Txt==char(10)));
data(29)=char(10);
data(29)=char(64);  %Newline character was creating an unexpected problem in the coding part, so it is changed to '@'
data(28)=char(255); %Space character was creating the same problem, so it is changed to char(255)

prob=result./sum(result);
%---------------------------------------------------------------------
%Needed to get the number of characters
input=struct('data',[],'prob',[],'flag',[]);
for i=1:length(data)
input(i).data =data(i);
input(i).prob =prob(i);
input(i).flag=0;
end
input([input.prob] == 0) = [];
[B,order] = sort([input(:).prob],'descend');
input = input(order);
huff=input;
temp=struct('data',[],'prob',[],'flag',[]);
tree=struct('data',[],'code',[]);
avg_code_length=0;
while length(huff)>1
    dump=0;
    [B,order] = sort([huff(:).prob],'descend');
    huff = huff(order);
    %extracting the two least probable symbols from the huffman tree to add
    %them together
    temp.data=strcat(huff(length(huff)).data,huff(length(huff)-1).data);
    temp.prob=huff(length(huff)).prob+huff(length(huff)-1).prob;
    temp.flag=1;
    %flag 1 denotes an internal node
    avg_code_length=avg_code_length+temp.prob; %path length lemma: average
    %code length = sum of the internal node probabilities
    %(so if no node is internal yet, the average code length is zero)


if (huff(length(huff)-1).flag==0) && (huff(length(huff)).flag == 0)
    dump=1;
    %Special case for the 1st iteration
    if length(tree)==1
        tree(length(tree)).data=huff(length(huff)-1).data;
        tree(length(tree)).code=0;
        tree(length(tree)+1).data=huff(length(huff)).data;
        tree(length(tree)).code=1;
    else
        tree(length(tree)+1).data=huff(length(huff)-1).data;
        tree(length(tree)).code=0;
        tree(length(tree)+1).data=huff(length(huff)).data;
        tree(length(tree)).code=1;
    end
end
%if the 1st node is not internal but the 2nd node is internal
if (huff(length(huff)-1).flag==0)&&(dump==0)
    tree(length(tree)+1).data=huff(length(huff)-1).data;
    tree(length(tree)).code=0;
%if the 1st node is an internal node
elseif huff(length(huff)-1).flag==1
    for j=1:length(huff(length(huff)-1).data)
        a=huff(length(huff)-1).data(j);
        for i=1:length(tree)
            if a==tree(i).data
                tree(i).code(length(tree(i).code)+1)=0;
            end
        end
    end
end

%if the 1st node is internal but the 2nd node is not internal
if (huff(length(huff)).flag==0)&&(dump==0)
    tree(length(tree)+1).data=huff(length(huff)).data;
    tree(length(tree)).code=1;
%if the 2nd node is an internal node
elseif huff(length(huff)).flag==1
    for j=1:length(huff(length(huff)).data)
        a=huff(length(huff)).data(j);
        for i=1:length(tree)
            if a==tree(i).data
                tree(i).code(length(tree(i).code)+1)=1;
            end
        end
    end
end
%removing the already considered nodes from the original tree and adding
%the new node
huff(length(huff)) = [];
huff(length(huff)) = [];
huff(length(huff)+1) = temp;
end
for i=1:length(tree)


dict(i).data=tree(i).data;
dict(i).code=fliplr(tree(i).code);
end
%summarizing data in a structured way
summary=struct('data',[],'probability',[],'occurrence',[],'code',[],...
'codelength',[],'zeros',[],'ones',[]);
for i=1:length(dict)
summary(i).data=dict(i).data;
summary(i).code=dict(i).code;
summary(i).codelength=length(dict(i).code);
for j=1: length(input)
if summary(i).data==input(j).data
summary(i).probability=input(j).prob;
end
end
summary(i).occurrence=summary(i).probability*sum(result);
if summary(i).data=='@'
summary(i).data= char(10);
end
if summary(i).data==char(255)
summary(i).data= char(32);
end
summary(i).zeros= sum(summary(i).code(:)==0);
summary(i).ones= sum(summary(i).code(:)==1);
end
[B,order] = sort([summary(:).data],'ascend');
summary = summary(order);
%Entropy of the coded data
Entropy_coded=Entropy(reshape(prob,[],1));
%Length of the encoded text &
%Distribution of Zero's and One's in the coded sequence
Length_coded_text=0;
Number_zeros=0;
Number_ones=0;
for i=1:length(summary)
Length_coded_text=Length_coded_text+summary(i).occurrence*summary(i).codelength;
Number_zeros=Number_zeros+summary(i).occurrence*summary(i).zeros;
Number_ones=Number_ones+summary(i).occurrence*summary(i).ones;
end
Length_uncoded_text=sum(result)*8;
compression_ratio=Length_uncoded_text/Length_coded_text;
Dist_zeros=Number_zeros/(Number_zeros+Number_ones);
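As a cross-check of the struct-based MATLAB construction above, the codeword lengths can be reproduced with a standard heap-based Huffman build. This is an independent Python sketch, not part of the submitted code:

```python
import heapq

def huffman_lengths(counts):
    """Return Huffman codeword lengths for a dict of symbol -> count.

    Standard greedy construction: repeatedly merge the two least
    probable subtrees; every merge deepens each contained leaf by 1.
    """
    # (weight, tie-breaker id, list of leaf symbols in this subtree)
    heap = [(w, i, [s]) for i, (s, w) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    lengths = dict.fromkeys(counts, 0)
    uid = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1          # each merged leaf moves one level down
        heapq.heappush(heap, (w1 + w2, uid, s1 + s2))
        uid += 1
    return lengths
```

Every Huffman tree achieves the same minimum expected length, so even if tie-breaking assigns different individual codeword lengths than the MATLAB run, the total of occurrence x length over the 26 symbols must again come to 5277 bits.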


Function for calculating the Entropy


function H=Entropy(P)
% The Entropy function H=Entropy(P)
%
%P column vector: The vector is the probability distribution
%P matrix: Each column vector is interpreted as a probability distribution
%P scalar: The binary entropy function of [P; 1-P]
%P row vector: Each position gives the binary entropy function.
%This block checks for any negative or greater-than-1 values and raises
%errors
%---------------------------------------------------------------------
if any(P(:)>1)
error('The input probabilities cannot be greater than 1')
elseif any(P(:)<0)
error('The input probabilities cannot be less than 0')
end
%---------------------------------------------------------------------
[r,c]=size(P);
%This block checks if any probability distribution has sum ~=1.
%---------------------------------------------------------------------
dist_check=sum(P,1); %Checking the sum of the distributions
%This subblock deals with any decimal precision problem introduced by
%MATLAB
%-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.
Precision=sum(dist_check,2);
if Precision>0.000000009 && Precision<1.000000001
Precision=1;
end
%-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.
if r~=1 && c~=Precision
%Excluding the row vector condition
error('The sum of an input probability distribution cannot be less or greater than 1')
end
%---------------------------------------------------------------------
%This block deals with the matrix and column vector input
%---------------------------------------------------------------------
if r~=1
log_p=log2(P);
ent=-1.*P.*log_p;
ent(isnan(ent))=0;
H=sum(ent,1); %second input is the dimension: 1 means column-wise sum,
%2 means row-wise sum
%---------------------------------------------------------------------
%This block deals with the row vector input
%---------------------------------------------------------------------
elseif r~=c
for i=1:c
a=P(i)*log2(P(i));
a(isnan(a))=0;
b=(1-P(i))*log2(1-P(i));


b(isnan(b))=0;
H(i)=-1*(a+b);
end
%---------------------------------------------------------------------
%This block deals with the scalar input
%---------------------------------------------------------------------
else
a=P*log2(P);
a(isnan(a))=0;
b=(1-P)*log2(1-P);
b(isnan(b))=0;
H=-1*(a+b);
end
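For reference, the column-vector and scalar cases of Entropy.m can be expressed compactly as follows. This is an illustrative Python equivalent, not the submitted function; the 0*log2(0) = 0 convention mirrors the isnan(...) = 0 handling above:

```python
import math

def entropy(p):
    """Entropy in bits.

    A list is treated as one probability distribution (the column-vector
    case of Entropy.m); a bare number p is treated as the binary entropy
    of [p, 1-p] (the scalar case).
    """
    if isinstance(p, (int, float)):
        p = [p, 1 - p]
    if any(x < 0 or x > 1 for x in p):
        raise ValueError('probabilities must lie in [0, 1]')
    if abs(sum(p) - 1.0) > 1e-9:
        raise ValueError('probabilities must sum to 1')
    # terms with x == 0 are skipped, i.e. 0*log2(0) is taken as 0
    return -sum(x * math.log2(x) for x in p if x > 0)
```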
