AM 50: Homework 4 (due April 6th at 12 PM)

Program files: A number of program files for this homework can be downloaded as a single ZIP file from the
course website.
1.

(a) Let a be a real number where $-1 < a < 1$. By considering $(1 - a)\sum_{n=0}^{N} a^n$, show that

$$\sum_{n=0}^{N} a^n = \frac{1 - a^{N+1}}{1 - a} \qquad (1)$$

and hence that

$$\sum_{n=0}^{\infty} a^n = \frac{1}{1 - a}. \qquad (2)$$
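A quick numerical sanity check can help confirm the algebra in part (a) before moving on. The sketch below compares a directly added partial sum against the closed form (1 - a^(N+1)) / (1 - a); the function names and the test values of a and N are illustrative choices, not part of the assignment.

```python
def geometric_partial_sum(a, N):
    """Add up a**n for n = 0, ..., N term by term."""
    return sum(a**n for n in range(N + 1))


def geometric_closed_form(a, N):
    """Closed form of the finite sum: (1 - a**(N + 1)) / (1 - a)."""
    return (1 - a ** (N + 1)) / (1 - a)


# Illustrative values only; any a with -1 < a < 1 should behave the same way.
a, N = 0.5, 10
print(geometric_partial_sum(a, N), geometric_closed_form(a, N))

# For large N the partial sum should approach 1 / (1 - a).
print(geometric_partial_sum(a, 200), 1 / (1 - a))
```

This is a check, not a proof: the derivation requested in the problem still has to be done by hand.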

(b) By differentiating both sides of Eq. 2 or otherwise, find an expression for $\sum_{n=0}^{\infty} n a^n$.

(c) Consider a coin with probability p of obtaining a head, and probability $q = 1 - p$ of
obtaining a tail. Let N be a random variable that is the number of coin tosses required
until first obtaining a tail. Calculate the Shannon entropy of N.
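One way to test an analytical answer to part (c) is to sum the entropy series numerically. The sketch below assumes that the first tail on toss n has probability p^(n-1) q, and it truncates the infinite sum at a hypothetical cutoff `n_max`; both the function name and the cutoff are illustrative.

```python
import math


def truncated_entropy(p, n_max=10_000):
    """Approximate the entropy (in bits) of N, the toss on which the first tail appears.

    Assumes P(N = n) = p**(n - 1) * q with q = 1 - p, and stops early once
    the terms underflow to zero.
    """
    q = 1.0 - p
    h = 0.0
    for n in range(1, n_max + 1):
        prob = p ** (n - 1) * q
        if prob == 0.0:
            break
        h -= prob * math.log2(prob)
    return h


# Evaluate for a few values of p and compare with your closed-form expression.
for p in (0.25, 0.5, 0.75):
    print(p, truncated_entropy(p))
```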
2. Channel entropy. Consider a channel where information is sent as three-digit decimal
numbers of the form 000, 001, 002, . . . , 148, 149. Assume that each of the 150 numbers is
equally probable. The entropy of a number is therefore $H_N = -150\left(\frac{1}{150}\log_2\frac{1}{150}\right) = \log_2 150$.
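The value of H_N quoted above can be reproduced with a generic entropy helper, which may also be handy for the digit distributions in the parts that follow. This is a minimal sketch; the name `shannon_entropy` is an illustrative choice.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits, -sum(p * log2(p)), skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# 150 equally probable numbers: the result should match log2(150).
uniform = [1 / 150] * 150
print(shannon_entropy(uniform), math.log2(150))
```

To use it for a single digit, pass in that digit's probability distribution as a list of ten values.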
(a) Calculate the Shannon entropy of the first digit, second digit, and third digit.
(b) Calculate the sum of the entropies of the three digits. Why is the sum not equal to $H_N$?
(c) Suppose that the second digit of a number is 3. What are the entropies of the first and third
digits? Are the entropies the same as in part (a)?
3. Digram analysis of different languages. In his paper, Shannon considers the digram structure of English, where he looks at the probabilities of two-letter combinations. It turns out that
digrams represent a simple but surprisingly powerful way to differentiate between languages.
To illustrate this, we have downloaded out-of-copyright books in five different European
languages from Project Gutenberg:
0. English: The Voyage Out by Virginia Woolf (1915)
1. French: L'Atlantide by Pierre Benoît (1920)
2. German: Siddhartha by Hermann Hesse (1922)
3. Italian: Dal Cellulare a Finalborgo by Paolo Valera (1899)
4. Spanish: La Voz de la Conseja edited by Emilio Carrère (1920)
These books are included in the ZIP file for the homework. They have been pre-processed to
remove any special characters and accented characters. A program digram.py is supplied,
which scans each book and calculates the probability distribution of the digrams. It removes
punctuation, converts everything to lower case, and then considers each word. In the analysis,
it considers 27 characters, with 0 corresponding to a space, 1 = a, . . . , 25 = y, and 26 = z. For
a word such as toast, the program considers it with a space at both ends as
_toast_    (3)

and then counts digrams _t, to, oa, as, st, and t_. After the program has run, it will have
assembled probabilities $p^k_{i,j}$ of the occurrence of digram (i, j) in language k. To simplify
things, the program creates a tiny fictitious probability in any digram that was not seen when
scanning the books (such as qx). Hence $p^k_{i,j} > 0$ for all combinations i, j, and k and you do not
have to worry about taking the logarithm of zero in subsequent computations.
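The padding and indexing scheme described above can be sketched as follows. This is not the code from digram.py; the helper names `char_index` and `word_digrams` are hypothetical, and the sketch assumes the word is already lower case and contains only the letters a to z.

```python
def char_index(c):
    """Map a space to 0 and the letters 'a'..'z' to 1..26."""
    return 0 if c == " " else ord(c) - ord("a") + 1


def word_digrams(word):
    """Return the (i, j) index pairs of a word padded with a space at both ends."""
    padded = " " + word + " "
    return [(char_index(x), char_index(y)) for x, y in zip(padded, padded[1:])]


# "toast" yields six digrams: _t, to, oa, as, st, t_
print(word_digrams("toast"))
```

A word of n letters always produces n + 1 digrams, since the leading and trailing spaces each contribute one.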
(a) Run the program digram.py. It will build the digram probabilities and then output a 2D
matrix for the English digrams. What do you notice about the row for q? Comment on a
few other features of the matrix.
(b) There are a few lines at the end of the program that can be modified to plot the difference
between two languages. Modify the program as directed to plot the difference between
Italian and Spanish. Comment on some key differences between the two languages.
(c) Add lines to the program to calculate the entropy of selecting a random digram for
each of the five languages. What language has the most entropy? What language has
the least? Compare the entropies against a hypothetical language in which all 27 × 27
digrams are equally probable.
(d) Add lines to the program to calculate a measure of the difference between two languages,
$$D(k, l) = \sqrt{\sum_{i=0}^{26} \sum_{j=0}^{26} \left(p^k_{i,j} - p^l_{i,j}\right)^2}.$$

What pair of languages are the most similar? What pair of languages are the most
different?
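The two quantities in parts (c) and (d) can be sketched on stand-in data before wiring them into digram.py. The example below uses plain nested lists and randomly generated tables, because the data structure used inside digram.py is not described here; `random_table`, `digram_entropy`, and `distance` are hypothetical names.

```python
import math
import random

SIZE = 27  # space plus 26 letters


def random_table(seed):
    """Build a stand-in 27 x 27 table of strictly positive probabilities summing to 1."""
    rng = random.Random(seed)
    raw = [[rng.random() + 1e-9 for _ in range(SIZE)] for _ in range(SIZE)]
    total = sum(map(sum, raw))
    return [[x / total for x in row] for row in raw]


def digram_entropy(p):
    """Entropy in bits of choosing one digram at random from table p."""
    return -sum(x * math.log2(x) for row in p for x in row)


def distance(p, q):
    """Square root of the summed squared differences between two tables."""
    return math.sqrt(sum((x - y) ** 2
                         for row_p, row_q in zip(p, q)
                         for x, y in zip(row_p, row_q)))


uniform = [[1 / SIZE**2] * SIZE for _ in range(SIZE)]
table_a, table_b = random_table(1), random_table(2)
print(digram_entropy(table_a), digram_entropy(uniform))
print(distance(table_a, table_b))
```

With the real tables, the same two functions would be called once per language and once per pair of languages, respectively.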
4. Automatic language detection. Digrams can be used to differentiate between languages.
Suppose that a given message consists of a sequence of digrams $(i_\alpha, j_\alpha)$ for $\alpha = 0, \ldots, N - 1$.
Then for language k, the likelihood of observing the message is

$$L(k) = \prod_{\alpha=0}^{N-1} p^k_{i_\alpha, j_\alpha}.$$

The above formula is unwieldy to use computationally, since it involves multiplying many
tiny numbers together. Taking the logarithm of both sides yields the log-likelihood formula
$$\log_2 L(k) = \sum_{\alpha=0}^{N-1} \log_2\left(p^k_{i_\alpha, j_\alpha}\right), \qquad (4)$$

which is easier to deal with. Write a Python program to take a string of text and evaluate the
log-likelihood as in Eq. 4. For each word in the string, consider it as in (3) above. It may be
useful to look at how the create_table function in digram.py works, since each word can be
processed in a very similar way.
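As a starting point, the log-likelihood sum can be sketched against a stand-in probability table. This is a sketch under stated assumptions, not the create_table logic from digram.py: the cleaning step (lower-casing, dropping every character that is not a letter or whitespace) is an assumption about the pre-processing, and `log_likelihood` and `char_index` are hypothetical names.

```python
import math
import string


def char_index(c):
    """Map a space to 0 and the letters 'a'..'z' to 1..26."""
    return 0 if c == " " else ord(c) - ord("a") + 1


def log_likelihood(text, table):
    """Sum log2 of the digram probabilities over every space-padded word in text.

    `table` is a 27 x 27 nested list of strictly positive probabilities.
    """
    # Assumed cleaning: lower-case, keep letters and whitespace, drop the rest.
    cleaned = "".join(c for c in text.lower()
                      if c in string.ascii_lowercase or c.isspace())
    total = 0.0
    for word in cleaned.split():
        padded = " " + word + " "
        for x, y in zip(padded, padded[1:]):
            total += math.log2(table[char_index(x)][char_index(y)])
    return total


# Stand-in table in which all 27 x 27 digrams are equally probable.
uniform = [[1 / 27**2] * 27 for _ in range(27)]
print(log_likelihood("Tom Hanks,", uniform))
```

With the five real tables, the most likely language is the one whose table gives the largest (least negative) log-likelihood.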
For each of the following strings, use your program to calculate the log-likelihood for each of
the five languages, and determine which language the string is most likely to be written in.
(a) Tom Hanks, Penelope Cruz, Juliette Binoche, and three other names.
(b) New York, Colorado, Vermont, Alberta, and three other states/provinces.
(c) Los Angeles, Anaheim, Cincinnati, Portland, and three other cities.
(d) hello, bonjour, hola, hi, and three other greetings.
(e) oui, auf, uno, and three other small words.
(f) Words are, in my not-so-humble opinion, our most inexhaustible source of magic.
(g) En art comme en amour, l'instinct suffit.
Finally, find an example that does not match your expectation, and briefly discuss a strategy
that might make the program more accurate to handle this case.
5. Decoding error-correcting codes. Extra credit.
(a) The file code1.txt is an ASCII message that has been converted to binary, encoded
using a three-bit repetition code, and has had some noise artificially introduced. As an
example, the character q is 113 in ASCII, which is 01110001 in binary. This would be
encoded as
000111111111000000000111
and after noise has been artificially introduced it could be
010011110111001001100101,
although due to the redundancy, the character can still be decoded. Write a Python
program to decode the message. The Python command chr will be useful to convert an
integer into an ASCII character.
(b) Decode the file code2.txt, which is encoded using a (7, 4) Hamming code.
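The majority-vote idea behind the three-bit repetition code in part (a) can be sketched on the worked example for the character q. The sketch assumes the message is available as a single string of '0' and '1' characters; the actual layout of code1.txt is not described here, so reading and cleaning the file is left out. The function name `decode_repetition` is hypothetical.

```python
def decode_repetition(bits, repeat=3):
    """Majority-vote each group of `repeat` bits, then convert each 8 bits to a character."""
    groups = [bits[i:i + repeat] for i in range(0, len(bits), repeat)]
    # A group decodes to 1 when more than half of its bits are 1.
    decoded = "".join("1" if g.count("1") * 2 > len(g) else "0" for g in groups)
    # Any incomplete trailing byte is ignored.
    return "".join(chr(int(decoded[i:i + 8], 2))
                   for i in range(0, len(decoded) - 7, 8))


# The noisy example from part (a) should still decode to the original character.
print(decode_repetition("010011110111001001100101"))
```

The (7, 4) Hamming code in part (b) needs a different decoder, which depends on the bit ordering used to produce code2.txt and is not sketched here.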
