Program files: A number of program files for this homework can be downloaded as a single ZIP file from the
course website.
1.
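(a) Let $a$ be a real number where $-1 < a < 1$. By considering $(1 - a) \sum_{n=0}^{N} a^n$, show that
$$\sum_{n=0}^{N} a^n = \frac{1 - a^{N+1}}{1 - a} \tag{1}$$
and
$$\sum_{n=0}^{\infty} a^n = \frac{1}{1 - a}. \tag{2}$$
(b) By differentiating both sides of Eq. 2 or otherwise, find an expression for
$$\sum_{n=0}^{\infty} n a^n. \tag{3}$$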
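Before attempting the algebra, it can help to check Eqs. 1 and 2 numerically. The following is a minimal sketch; the function names and the test values a = 0.5, N = 10 are illustrative choices and not part of the assignment.

```python
def geometric_partial_sum(a, N):
    """Add up a**n for n = 0, ..., N term by term (left-hand side of Eq. 1)."""
    return sum(a**n for n in range(N + 1))


def closed_form(a, N):
    """Evaluate (1 - a**(N + 1)) / (1 - a) (right-hand side of Eq. 1)."""
    return (1 - a**(N + 1)) / (1 - a)


a, N = 0.5, 10  # illustrative values with -1 < a < 1
print(geometric_partial_sum(a, N), closed_form(a, N))

# For |a| < 1 the partial sums approach 1 / (1 - a) as N grows (Eq. 2).
print(geometric_partial_sum(a, 200), 1 / (1 - a))
```

The same approach can be used to sanity-check whatever expression you derive in part (b).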
and then counts digrams _t, to, oa, as, st, and t_. After the program has run, it will have
assembled probabilities $p^k_{i,j}$ of the occurrence of digram $(i, j)$ in language $k$. To simplify
things, the program creates a tiny fictitious probability in any digram that was not seen when
scanning the books (such as qx). Hence $p^k_{i,j} > 0$ for all combinations $i$, $j$, and $k$ and you do not
have to worry about taking the logarithm of zero in subsequent computations.
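The counting and flooring steps described above might look roughly like the sketch below. This is not the code in digram.py: the symbol ordering in ALPHABET, the function names, and the floor value of 1e-6 are all assumptions made for illustration.

```python
# Assumed ordering of the 27 symbols: index 0 is the word boundary "_".
ALPHABET = "_abcdefghijklmnopqrstuvwxyz"
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def word_digrams(word):
    """Return the digrams of a word after padding it with '_' at both ends."""
    letters = "".join(c for c in word.lower() if "a" <= c <= "z")
    if not letters:
        return []
    padded = "_" + letters + "_"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def digram_probabilities(words, floor=1e-6):
    """Build a 27 x 27 table of digram probabilities from a non-empty word list."""
    counts = [[0.0] * 27 for _ in range(27)]
    for word in words:
        for first, second in word_digrams(word):
            counts[INDEX[first]][INDEX[second]] += 1
    total = sum(map(sum, counts))
    # Unseen digrams get a tiny fictitious probability, then the table is renormalised.
    probs = [[c / total if c > 0 else floor for c in row] for row in counts]
    norm = sum(map(sum, probs))
    return [[p / norm for p in row] for row in probs]


print(word_digrams("toast"))
```

For the word "toast" this produces the six digrams listed in the text, and every entry of the resulting table is strictly positive.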
(a) Run the program digram.py. It will build the digram probabilities and then output a 2D
matrix for the English digrams. What do you notice about the row for q? Comment on a
few other features of the matrix.
(b) There are a few lines at the end of the program that can be modified to plot the difference
between two languages. Modify the program as directed to plot the difference between
Italian and Spanish. Comment on some key differences between the two languages.
(c) Add lines to the program to calculate the entropy of selecting a random digram for
each of the five languages. What language has the most entropy? What language has
the least? Compare the entropies against a hypothetical language in which all $27 \times 27$
digrams are equally probable.
(d) Add lines to the program to calculate a measure of the difference between two languages,
$$D(k, l) = \sqrt{\sum_{i=0}^{26} \sum_{j=0}^{26} \left( p^k_{i,j} - p^l_{i,j} \right)^2}.$$
Which pair of languages is the most similar? Which pair of languages is the most
different?
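The quantities in parts (c) and (d) can be sketched as follows. This assumes that entropy means the Shannon entropy in bits and that each language's probabilities are stored as a 27 x 27 nested list; the actual data structure in digram.py may differ, and the function names are hypothetical.

```python
import math


def digram_entropy(p):
    """Shannon entropy, in bits, of a table of digram probabilities."""
    return -sum(x * math.log2(x) for row in p for x in row if x > 0)


def language_distance(p, q):
    """Square root of the summed squared differences between two tables (D)."""
    return math.sqrt(sum((x - y) ** 2
                         for row_p, row_q in zip(p, q)
                         for x, y in zip(row_p, row_q)))


# Hypothetical language in which all 27 x 27 digrams are equally probable.
uniform = [[1 / 27**2] * 27 for _ in range(27)]
print(digram_entropy(uniform))  # log2(729), about 9.51 bits
```

The uniform table gives the largest entropy any 27 x 27 digram distribution can have, so it serves as an upper bound when comparing the five languages.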
4. Automatic language detection. Digrams can be used to differentiate between languages.
Suppose that a given message consists of a sequence of digrams $(i_\alpha, j_\alpha)$ for $\alpha = 0, \ldots, N - 1$.
Then for language $k$, the likelihood of observing the message is
$$L(k) = \prod_{\alpha=0}^{N-1} p^k_{i_\alpha, j_\alpha}.$$
The above formula is unwieldy to use computationally, since it involves multiplying many
tiny numbers together. Taking the logarithm of both sides yields the log-likelihood formula
$$\log_2 L(k) = \sum_{\alpha=0}^{N-1} \log_2 \left( p^k_{i_\alpha, j_\alpha} \right), \tag{4}$$
which is easier to deal with. Write a Python program to take a string of text and evaluate the
log-likelihood as in Eq. 4. For each word in the string, consider it as in (3) above. It may be
useful to look at how the create_table function in digram.py works, since each word can be
processed in a very similar way.
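One possible shape for such a program is sketched below. It does not reuse create_table; the symbol ordering in ALPHABET, the function names, and the `tables` dictionary (language name mapped to a 27 x 27 nested list of probabilities) are assumptions made for illustration.

```python
import math

# Assumed ordering of the 27 symbols: index 0 is the word boundary "_".
ALPHABET = "_abcdefghijklmnopqrstuvwxyz"
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def text_digrams(text):
    """Yield index pairs (i, j) for every digram of every word in the text."""
    for word in text.lower().split():
        # Keep only the letters a-z; punctuation and digits are dropped.
        letters = "".join(c for c in word if "a" <= c <= "z")
        if not letters:
            continue
        padded = "_" + letters + "_"
        for first, second in zip(padded, padded[1:]):
            yield INDEX[first], INDEX[second]


def log_likelihood(text, p):
    """Sum log2 of the digram probabilities over the whole text (Eq. 4)."""
    return sum(math.log2(p[i][j]) for i, j in text_digrams(text))


def most_likely_language(text, tables):
    """Return the key of `tables` whose table gives the largest log-likelihood."""
    return max(tables, key=lambda name: log_likelihood(text, tables[name]))
```

Because every probability is strictly positive, log2 is always defined, and the language with the largest (least negative) log-likelihood is the best match.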
For each of the following strings, use your program to calculate the log-likelihood for each of
the five languages, and determine which language the string is most likely to be written in.
(a) Tom Hanks, Penelope Cruz, Juliette Binoche, and three other names.
(b) New York, Colorado, Vermont, Alberta, and three other states/provinces.
(c) Los Angeles, Anaheim, Cincinnati, Portland, and three other cities.
(d) hello, bonjour, hola, hi, and three other greetings.
(e) oui, auf, uno, and three other small words.
(f) Words are, in my not-so-humble opinion, our most inexhaustible source of magic.
(g) En art comme en amour, l'instinct suffit.
Finally, find an example that does not match your expectation, and briefly discuss a strategy
that might make the program more accurate to handle this case.
5. Decoding error-correcting codes. Extra credit.
(a) The file code1.txt is an ASCII message that has been converted to binary, encoded
using a three-bit repetition code, and has had some noise artificially introduced. As an
example, the character q is 113 in ASCII, which is 01110001 in binary. This would be
encoded as
000111111111000000000111
and after noise has been artificially introduced it could be
010011110111001001100101,
although due to the redundancy, the character can still be decoded. Write a Python
program to decode the message. The Python command chr will be useful to convert an
integer into an ASCII character.
(b) Decode the file code2.txt, which is encoded using a (7, 4) Hamming code.
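The two decoding steps can be sketched as below, operating on a string of '0' and '1' characters. The repetition decoder uses a majority vote over each group of three bits. The Hamming decoder assumes the common layout in which parity bits sit at positions 1, 2, and 4 of each 7-bit block and data bits at positions 3, 5, 6, and 7; the layout actually used in code2.txt is not stated here and may differ. Reading the files and the function names are likewise left as assumptions.

```python
def decode_repetition(bits):
    """Majority-vote decode a three-bit repetition code into ASCII text."""
    # Each group of three received bits becomes the value that occurs at least twice.
    decoded = "".join("1" if bits[i:i + 3].count("1") >= 2 else "0"
                      for i in range(0, len(bits) - 2, 3))
    # Each group of eight decoded bits is one ASCII character.
    return "".join(chr(int(decoded[i:i + 8], 2))
                   for i in range(0, len(decoded) - 7, 8))


def decode_hamming74(block):
    """Correct up to one flipped bit in a 7-bit block and return its 4 data bits.

    Assumed layout (positions 1..7): p1 p2 d1 p3 d2 d3 d4.
    """
    bits = [int(b) for b in block]
    # With this layout, XOR-ing the positions of all 1 bits gives the
    # position of a single flipped bit, or 0 if the block is a valid codeword.
    syndrome = 0
    for position, bit in enumerate(bits, start=1):
        if bit:
            syndrome ^= position
    if syndrome:
        bits[syndrome - 1] ^= 1
    return "".join(str(bits[i]) for i in (2, 4, 5, 6))


print(decode_repetition("010011110111001001100101"))  # the noisy example for q
```

For the Hamming case, two consecutive decoded 4-bit groups would then be joined into one 8-bit ASCII character, again assuming that is how the message was split before encoding.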