

Shannon's Information Theory


I never read the original papers of the greatest scientists, but I got so intrigued by information theory that I gave Claude Shannon's seminal paper a read. This paper is mind-blowing! In this single paper, Shannon introduced this new fundamental theory. He raised the right questions, which no one else had even thought of asking. This alone would have been enough to make the contribution earthshaking. But amazingly enough, Shannon also provided most of the right answers, with class and elegance. In comparison, it took decades for a dozen top physicists to define the basics of quantum theory. Meanwhile, Shannon constructed something equivalent, all by himself, in a single paper.
Shannon's theory has since transformed the world like no other ever had, from information technologies to telecommunications, from theoretical physics to economic globalization, from everyday life to philosophy. But instead of taking my word for it, listen to Jim Al-Khalili on BBC Horizon:

Documentary - Order And Disorder 2of2 Information

I don't think Shannon has had the credit he deserves. He should be right up there with Darwin and Einstein, among the few greatest scientists mankind has ever had the chance to have. Yet he's hardly known by the public... More than an explanation of Shannon's ideas, this article is a tribute to him.

Bits
Now, to understand how important his ideas are, let's go back in time and consider telecommunications in the 1940s. Back then, the telephone network was quickly developing, both in North America and Europe. The two networks then got connected. But when a message was sent across the Atlantic Ocean, it couldn't be read at the other end.
Why? What happened?

As the message travelled across the Atlantic Ocean, it got weaker and weaker. Eventually, it was so weak that it was unreadable. Imagine the message was the logo of Science4All. The following figure displays what happened:

Why not amplify the message along the way?


This is what engineers proposed. However, it led them to face the real problem of communication.
Which is?
The unpredictable perturbation of the message! This perturbation is called noise. This noise is precisely what prevents a message from
getting through.
I don't see the link between amplification and noise...
When you're amplifying the message, you're also amplifying the noise. Thus, even though the noise is small, as you amplify the message over and over, the noise eventually gets bigger than the message. And if the noise is bigger than the message, then the message cannot be read. This is displayed below:

At that time, it seemed impossible to get rid of the noise. There really seemed to be a fundamental limit to communication over long distances. No matter when or how you amplify the message, the noise will still be much bigger than the message once it arrives in Europe. But then came Claude Shannon...
What did Shannon do?
Wonders! Among these wonders was an amazingly simple solution to the communication problem. The idea comes from the observation that all messages can be converted into binary digits, better known as bits. For instance, using the PNG format, the logo of Science4All can be digitized into bits as follows:

Bits are not to be confused with bytes. A byte equals 8 bits. Thus, 1,000 bytes equal 8,000 bits.

This digitization of messages has revolutionized our world in a way that we too often forget to be fascinated by.
What do bits change about the communication problem?


Now, instead of simply amplifying the message, we can read it first. Because the digitized message is a sequence of 0s and 1s, it can be read and repeated exactly. By replacing simple amplifiers with readers-and-amplifiers (known as regenerative repeaters), we can now easily get messages across the Atlantic Ocean, and all over the world, as displayed below:

This figure is just a representation. The noise actually occurs on the bits: it sort of makes the bits take values around 0 and 1. The reader then considers that a value like 0.1 equals 0, and repeats and amplifies a clean 0 instead of 0.1.
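To see why regeneration works so much better than plain amplification, here is a minimal simulation, a sketch in Python with made-up attenuation and noise figures (none of these numbers come from the article):

```python
import random

def transmit(bits, n_segments, regenerate, noise_std=0.05, attenuation=0.5):
    """Send bits (as 0/1 voltages) through n_segments of cable.

    Each segment attenuates the signal and adds noise. The repeater at the end
    of a segment either simply re-amplifies everything (noise included), or
    first reads the bits and re-sends clean 0/1 values (regenerative repeater).
    """
    values = [float(b) for b in bits]
    for _ in range(n_segments):
        values = [v * attenuation + random.gauss(0, noise_std) for v in values]
        values = [v / attenuation for v in values]  # amplification boosts the noise too
        if regenerate:
            values = [1.0 if v > 0.5 else 0.0 for v in values]  # read, then resend clean bits
    return [1 if v > 0.5 else 0 for v in values]

random.seed(0)
message = [random.randint(0, 1) for _ in range(10_000)]
for regenerate in (False, True):
    received = transmit(message, n_segments=50, regenerate=regenerate)
    errors = sum(a != b for a, b in zip(message, received))
    print("regenerative repeaters:", regenerate, "-> bit errors:", errors, "out of 10,000")
```

With these arbitrary numbers, plain amplification garbles thousands of bits, while regeneration delivers the message essentially intact.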

Now, in the first page of his article, Shannon clearly says that the idea of bits is J. W. Tukey's. But, in a sense, this digitization is just an approximation of Shannon's more fundamental concept of bits. This more fundamental concept is the quantification of information, and is sometimes referred to as Shannon's bits.

Shannon's Bits
Obviously, the most important concept of Shannon's information theory is information. Although we all seem to have an idea of what information is, it's nearly impossible to define it clearly. And, surely enough, the definition given by Shannon seems to come out of nowhere. But it works fantastically.
What's the definition?
According to Shannon's brilliant theory, the concept of information strongly depends on the context. For instance, my full first name is Lê Nguyên. But in western countries, people simply call me Lê. Meanwhile, in Vietnam, people rather use my full first name. Somehow, the word Lê is not enough to identify me in Vietnam, as it's a common name over there. In other words, the word Lê has less information in Vietnam than in western countries. Similarly, if you talk about "the man with hair", you are not giving away a lot of information, unless you are surrounded by soldiers who nearly all have their hair cut.
But what is a context in mathematical terms?
A context corresponds to what messages you expect. More precisely, the context is defined by the probability of the messages. In our example, calling someone Lê is much less likely in western countries than in Vietnam. Thus, the context of messages in Vietnam strongly differs from the context of western countries.
OK... So now, what's information?
Well, we said that the information of Lê is greater in western countries...
So the rarer the message, the more information it has?
Yes! If p is the probability of the message, then its information is related to 1/p. But this is not how Shannon quantified it, as this quantification would not have nice properties. Shannon's great idea was to define information rather as the number of bits required to write the number 1/p. This number is its logarithm in base 2, which we denote log₂(1/p).
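To make this concrete, here is a minimal sketch in Python. The probabilities of the name Lê in each context are made up for illustration; only the formula log₂(1/p) comes from the article:

```python
from math import log2

def information_bits(p):
    """Shannon's information of a message of probability p, in bits."""
    return log2(1 / p)

# Hypothetical probabilities, for illustration only.
p_le_in_vietnam = 1 / 10                 # "Lê" is a common name there
p_le_in_western_countries = 1 / 10_000   # much rarer elsewhere

print(information_bits(p_le_in_vietnam))            # ≈ 3.3 bits
print(information_bits(p_le_in_western_countries))  # ≈ 13.3 bits
```

The rarer message carries more bits of information, exactly as the dialogue above suggests.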

If you're uncomfortable with logarithms, read my article on these mathematical operators. You don't need a full understanding of logarithms to read through the rest of the article, though. If you do know about logarithms, you have certainly noticed that, more often than not, Shannon's number of bits is not a whole number.

Now, this means that it would require more bits to digitize the word Lê in western countries than in Vietnam, as displayed below:

Why did Shannon use the logarithm?


Because of its nice properties. First, the logarithm brings enormous numbers 1/p down to more reasonable ones. But mainly, if you consider half of a text, it is common to say that it has half the information of the whole text. This sentence can only be true if we quantify information as the logarithm of 1/p. This is due to the property of logarithms to transform multiplication (which appears in probabilistic reasoning) into addition (which is what we actually use). Now, this logarithm doesn't need to be in base 2, but for digitization and interpretation, it is very useful to do so.
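As a quick sanity check, here is a sketch of that additivity, reusing the hypothetical information_bits helper from above with made-up probabilities: since the probabilities of independent parts multiply, their informations add.

```python
from math import log2

def information_bits(p):
    return log2(1 / p)

# Two independent halves of a text, with made-up probabilities.
p_first_half, p_second_half = 1 / 300, 1 / 500
whole = information_bits(p_first_half * p_second_half)
parts = information_bits(p_first_half) + information_bits(p_second_half)
print(whole, parts)  # both ≈ 17.2 bits: the logarithm turns multiplication into addition
```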
Well, if I read only half of a text, it may contain most of the information of the text rather than half of it...
This is an awesome remark! Indeed, if the fraction of the text you read is its abstract, then you already kind of know the information the whole text contains. Similarly, Lê Nguyên, even in Vietnam, doesn't have twice the information that Lê has.
Does Shannon's quantification account for that?
It does! And the reason is that the first fraction of the message modifies the context of the rest of the message. In other words, the conditional probability of the rest of the message is sensitive to the first fraction of the message. This updating process leads to counter-intuitive results, but it is an extremely powerful one. Find out more with my article on conditional probabilities.
Are there applications of this quantification of information?
The whole industry of new technologies and telecommunications! But let me first present a more surprising application to the understanding of time perception, explained in this TedED video by Matt Danzico. Now that you know about Shannon's information theory, you should have a new insight into what the video talks about!


An exercise in time perception - Matt Danzico

Also check out this other TedED video on impressions of people.

Wow! Indeed! But can you explain how Shannon's theory is applied to telecommunications?
Yes! As Shannon put it in his seminal paper, telecommunication cannot be thought of in terms of the information of a particular message. Indeed, a communication device has to be able to work with any information of the context. This led Shannon to (re)define the fundamental concept of entropy, which talks about the information of a context.
There's a funny story about the coining of the term entropy, which Shannon first wanted to call the uncertainty function. But John von Neumann gave him the following advice:

You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical
mechanics under that name, so it already has a name. In the second place, and more important, no one really knows
what entropy really is, so in a debate you will always have the advantage.

Shannon's Entropy
In 1877, Ludwig Boltzmann shook the world of physics by defining the entropy of gases, which greatly confirmed the atomic theory. He defined entropy more or less as the logarithm of the number of microstates which correspond to a macrostate. For instance, a macrostate would say that a set of particles has a certain volume, pressure, mass and temperature. Meanwhile, a microstate defines the position and velocity of every particle.


Find out more about entropy in thermodynamics with my article on the second law.

The brilliance of Shannon was to focus on the essence of Boltzmann's idea and to provide the broader framework in which to define entropy.
What's Shannon's definition of entropy?
Shannon's entropy is defined for a context and equals the average amount of information provided by messages of the context. Since each message is given with probability p and has information log₂(1/p), the average amount of information is the sum over all messages of p log₂(1/p). This is explained in the following figure, where each color stands for a possible message of the context:

In the case of a continuous probability with a density function f, the entropy can be defined as the integral of f log₂(1/f). Although it loses a bit of its meaning, it still provides a powerful understanding of information.
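Back in the discrete case, the definition fits in a couple of lines of Python. This is a sketch with a made-up context of four messages (the probabilities are not from the article):

```python
from math import log2

def entropy(probabilities):
    """Shannon entropy of a context, in bits: the sum of p * log2(1/p)."""
    return sum(p * log2(1 / p) for p in probabilities if p > 0)

# A hypothetical context with four possible messages.
context = [0.5, 0.25, 0.125, 0.125]
print(entropy(context))  # 1.75 bits of information per message, on average
```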

I don't see the link with Boltzmann's entropy...


In Boltzmann's setting, all microstates are assumed equally likely. Thus, for every microstate, 1/p equals the number of microstates. The average amount of information is therefore the logarithm of the number of microstates. This shows that Boltzmann's entropy is nothing more than Shannon's entropy applied to equiprobable microstates.
Shannon also proved that, given a certain number of states, the entropy of the distribution of states is maximized when all states are equally likely. As a result, when playing The Price Is Right, if you know that the price is somewhere between $1,000 and $2,000, then guessing $1,500 is what will provide you the most information on average.
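Both claims can be checked numerically. The sketch below, with an arbitrary number of microstates, verifies that a uniform distribution over N states has entropy log₂(N), and that skewing the distribution can only decrease the entropy:

```python
from math import log2

def entropy(probabilities):
    """Shannon entropy, in bits."""
    return sum(p * log2(1 / p) for p in probabilities if p > 0)

n_microstates = 1024  # arbitrary choice
uniform = [1 / n_microstates] * n_microstates
print(entropy(uniform), log2(n_microstates))  # both equal 10.0 bits: Boltzmann's entropy

# Any non-uniform distribution over the same states has less entropy.
skewed = [0.5] + [0.5 / (n_microstates - 1)] * (n_microstates - 1)
print(entropy(skewed))  # ≈ 6.0 bits, less than 10.0 bits
```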

But I've heard entropy has to do with disorder...


This is another important interpretation of entropy. For the average information to be high, the context must allow for a large number of unlikely events. Another way of phrasing this is to say that there is a lot of uncertainty in the context. In other words, entropy is a measure of the spreading of a probability distribution. This spreading is what's often referred to as disorder. In some sense, the second law of thermodynamics, which states that entropy cannot decrease, can be reinterpreted as the increasing impossibility of defining precise contexts on a macroscopic level.

I guess entropy is useful in physics... But in communication?


It is essential! The most important application probably concerns data compression. Indeed, the entropy provides the theoretical limit to the average number of bits needed to code a message of a context. It also gives an insight into how to do so. Data compression has been applied to image, audio and file compression, and is now essential on the Web. YouTube videos can now be compressed enough to be streamed all over the Internet! But before talking about communication, let's dig into a major variant of entropy.
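Before moving on, here is a hedged illustration of that compression limit. The code below is not Shannon's own construction but the classic Huffman code, which comes close to it; it builds a prefix code for a made-up context and compares its average codeword length to the entropy:

```python
import heapq
from math import log2

def entropy(probs):
    return sum(p * log2(1 / p) for p in probs.values() if p > 0)

def huffman_code(probs):
    """Build a binary prefix code for {symbol: probability}; returns {symbol: codeword}."""
    # Heap items: (probability, tie-breaker, {symbol: partial codeword}).
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, code1 = heapq.heappop(heap)
        p2, _, code2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in code1.items()}
        merged.update({s: "1" + c for s, c in code2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

# A hypothetical context of four messages with skewed probabilities.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman_code(probs)
average_length = sum(probs[s] * len(code[s]) for s in probs)
print(code)                            # e.g. {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
print(average_length, entropy(probs))  # 1.75 bits per message vs 1.75 bits of entropy
```

For this particular context the code reaches the entropy exactly; in general, Shannon's theorem says we can always get arbitrarily close to it, but never below.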

Shannon's Equivocation
By considering a conditional probability, Shannon defined conditional entropy, also known as Shannon's equivocation. Let's consider the entropy of a message conditional on its introduction. For any given introduction, the message can be described with a conditional probability. This defines an entropy conditional on the given introduction. Now, the conditional entropy is the average of this entropy conditional on the given introduction, when the given introduction follows the probability distribution of introductions. Roughly speaking, the conditional entropy is the average added information of the message given its introduction.
It's getting complicated...
I know! But if you manage to get your head around that, you'll understand many of Shannon's greatest ideas.
Does this definition even match common sense?
Yes! Common sense says that the added information of a message to its introduction should not be larger than the information of the
message. This translates into saying that the conditional entropy should be lower than the non-conditional entropy. This is a theorem
proven by Shannon! In fact, he went further and quantified this sentence: the entropy of a message is the sum of the entropy of its introduction and the entropy of the message conditional on its introduction!
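In modern notation (which the article itself does not use), writing X for the introduction and Y for the message that contains it, this quantified sentence and the mutual information drawn below can be written as follows:

```latex
\begin{align*}
  H(X, Y) &= H(X) + H(Y \mid X) && \text{(chain rule, always true)} \\
  H(Y)    &= H(X) + H(Y \mid X) && \text{(since the introduction $X$ is part of the message $Y$)} \\
  I(X; Y) &= H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) && \text{(mutual information)}
\end{align*}
```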
I'm lost!
Fortunately, everything can be more easily understood with a figure. The amounts of information of the introduction and the message can be drawn as circles. Because they are not independent, they have some mutual information, which is the intersection of the circles. The conditional entropies correspond to what's missing from the mutual information to retrieve the entire entropies:

I'm not sure I get it...


Let's see examples. On the left of the following figure are the entropies of two coins thrown independently. On the right is the case where only one coin is thrown, and where the blue circle corresponds to a sensor which says which face the coin fell on. The sensor has two positions (heads or tails), but now all the information is mutual:


As you can see, in the second case, the conditional entropies are nil. Indeed, once we know the result of the sensor, the coin no longer provides any information. Thus, on average, the conditional information of the coin is zero. In other words, the conditional entropy is nil.
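These two situations can be computed explicitly. Here is a sketch that derives the entropies, the mutual information and the conditional entropy from the joint distributions (fair coins and a perfect sensor, matching the figure's description):

```python
from math import log2

def entropy(probs):
    """Shannon entropy (in bits) of a distribution given as {outcome: probability}."""
    return sum(p * log2(1 / p) for p in probs.values() if p > 0)

def marginal(joint, index):
    margin = {}
    for outcome, p in joint.items():
        margin[outcome[index]] = margin.get(outcome[index], 0) + p
    return margin

def analyse(joint, name):
    h_xy = entropy(joint)                    # joint entropy H(X, Y)
    h_x = entropy(marginal(joint, 0))
    h_y = entropy(marginal(joint, 1))
    mutual = h_x + h_y - h_xy                # mutual information I(X; Y)
    h_x_given_y = h_x - mutual               # conditional entropy H(X | Y)
    print(name, "H(X) =", h_x, " I(X;Y) =", mutual, " H(X|Y) =", h_x_given_y)

# Two independent fair coins: no mutual information at all.
two_coins = {("H", "H"): 0.25, ("H", "T"): 0.25, ("T", "H"): 0.25, ("T", "T"): 0.25}
analyse(two_coins, "independent coins:")

# One coin plus a perfect sensor: all the information is mutual, H(X|Y) = 0.
coin_and_sensor = {("H", "H"): 0.5, ("T", "T"): 0.5}
analyse(coin_and_sensor, "coin and sensor: ")
```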
Wow... This formalism really is powerful for talking about information!
It surely is! In fact, it's so powerful that some of the weirdest phenomena of quantum mechanics, like the mysterious entanglement, might be explainable with a generalization of information theory known as quantum information theory.
I don't know much about quantum information theory, but I'd love to know more. If you can, please write an article on that topic!

Why are these concepts so important?


They're essential to understanding sequences of symbols. Indeed, if you try to encode a message by encoding each character individually, you will be consuming space to repeat mutual information. In fact, as Shannon studied the English language, he noticed that the conditional entropy of a letter knowing the previous one is greatly decreased compared to its non-conditional entropy. Indeed, if a word starts with an "r", then it's very likely (certain?) that the next letter will be a vowel.
So?
So to optimally compress the information of a text, it's not enough to encode each character separately like Morse code does. The structure of information also lies in the concatenation into longer texts. In fact, Shannon defined the entropy of each character as the limit of the entropy of messages of great size divided by their size.
To study this structure, it's necessary to use the formalism of Markov chains. To keep it simple here, I won't. But the role of Markov chains is so essential in plenty of fields that, if you can, you should write about them!
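Still, a rough sketch gives the flavour of Shannon's observation. The snippet below estimates the entropy of a letter and the conditional entropy of a letter given the previous one from bigram counts; the sample text is tiny and arbitrary, so the numbers are only illustrative:

```python
from collections import Counter
from math import log2

def entropy_from_counts(counts):
    total = sum(counts.values())
    return sum((c / total) * log2(total / c) for c in counts.values())

# Any English text will do; this short sample is only for illustration.
text = (
    "shannon defined the entropy of a language from the statistics of its letters "
    "and showed that knowing the previous letter makes the next one more predictable"
)
letters = Counter(text)
pairs = Counter(zip(text, text[1:]))

h_letter = entropy_from_counts(letters)   # H(X): entropy of a single letter
h_pair = entropy_from_counts(pairs)       # H(X, Y): entropy of two consecutive letters
h_conditional = h_pair - h_letter         # H(Y | X), approximately

print("H(letter)            =", round(h_letter, 2), "bits")
print("H(letter | previous) =", round(h_conditional, 2), "bits")
```

Even on such a small sample, the conditional entropy comes out noticeably lower than the plain letter entropy.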

As it turns out, the decrease of entropy when we consider concatenations of letters and words is a common feature of all human languages... and of dolphin languages too! This has led extraterrestrial intelligence seekers to search for electromagnetic signals from outer space which share this common feature, as explained in this brilliant video by Art of the Problem:


The History of SETI (Search for Extraterrestrial Intelligence)

In some sense, researchers equate intelligence with the mere ability to decrease entropy. What an interesting thing to ponder!

Shannon's Capacity
Let's now talk about communication! A communication consists in sending symbols through a channel to some other end. Now, we usually consider that this channel can carry a limited amount of information every second. Shannon calls this limit the capacity of the channel. It is measured in bits per second, although nowadays we rather use units like megabits per second (Mbit/s) or megabytes per second (MB/s).
Why would channels have capacities?
The channel usually uses a physically measurable quantity to send a message. This can be the pressure of air in the case of oral communication. For longer telecommunications, we use the electromagnetic field. The message is then encoded by mixing it into a high-frequency signal. The frequency of the signal is the limit, as using messages with higher frequencies would profoundly modify the fundamental frequency of the signal. But don't bother too much with these details. What's of concern to us here is that a channel has a capacity.
Can you provide an example?
Sure. Imagine there was a gigantic telecommunication network spread all over the world to exchange data, like texts and images. Let's call it the Internet. How fast can we download images from the servers of the Internet to our computers? Using the basic format called Bitmap or BMP, we can encode images pixel by pixel. The encoded images are then decomposed into a certain number of bits. The average rate of transfer is then deduced from the average size of encoded images and the channel's capacity:


In the example, using bitmap encoding, the images can be transferred at the rate of 5 images per second. In the webpage you are currently looking at, there are about a dozen images. This means that more than 2 seconds would be required for the webpage to be downloaded on your computer. That's not very fast...
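The arithmetic behind such a figure is simple. The numbers below are made up (the original figure's values are not preserved here), but they show how the transfer rate is deduced from the image size and the channel capacity:

```python
# All numbers are hypothetical, chosen only to reproduce a rate of 5 images per second.
width, height = 500, 400        # pixels of a bitmap image
bits_per_pixel = 24             # uncompressed BMP: 8 bits per colour channel
capacity = 24_000_000           # channel capacity, in bits per second

bits_per_image = width * height * bits_per_pixel   # 4,800,000 bits per image
images_per_second = capacity / bits_per_image      # 5.0 images per second
seconds_for_webpage = 12 / images_per_second       # 2.4 seconds for a dozen images

print(images_per_second, seconds_for_webpage)
```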
Can't we transfer images faster?
Yes, we can. The capacity cannot be exceeded, but the encoding of images can be improved. Now, what Shannon proved is that we can come up with encodings such that the average size of the images nearly matches Shannon's entropy! With these nearly optimal encodings, an optimal rate of image file transfer can be reached, as displayed below:

This result is called Shannon's fundamental theorem for noiseless channels. It is basically a direct application of the concept of entropy.
Noiseless channels? What do you mean?
I mean that we have here assumed that the received data is identical to what's sent! This is not the case in actual communication. As opposed to what we have discussed in the first section of this article, even bits can be badly communicated.

Shannon's Redundancy
In actual communication, it's possible that 10% of the bits get corrupted.
Does this mean that only 90% of the information gets through?

No! The problem is that we don't know which bits got corrupted. In that case, the information that gets through is thus less than 90%.
So how did Shannon cope with noise?
His amazing insight was to consider that the received, deformed message is still described by a probability distribution, conditional on the sent message. This is where the language of equivocation, or conditional entropy, is essential. In the noiseless case, given a sent message, the received message is certain. In other words, the conditional probability is reduced to a probability 1 that the received message is the sent message. In Shannon's powerful language, this all beautifully boils down to saying that the conditional entropy of the received message is nil. Or, even more precisely, the mutual information equals both the entropy of the received and of the sent message. Just like the sensor detecting the coin in the example above.
What about the general case?
The relevant information received at the other end is the mutual information. This mutual information is precisely the entropy communicated by the channel. Shannon's revolutionary theorem says that we can provide the missing information by sending a correction message whose entropy is the conditional entropy of the sent message given the received message. This correction message is known as Shannon's redundancy.
This fundamental theorem is described in the following figure, where the word entropy can be replaced by average information:

I'm skipping over a few technical details here, as I just want to show you the main idea of redundancy. To be accurate, I should talk in terms of entropies per second with an optimal encoding.
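Still, the 10% example can be made quantitative. For a channel that flips each bit independently with probability 0.1 (the classic binary symmetric channel, an assumption on my part rather than a case treated explicitly in the article), the equivocation per bit is H₂(0.1) ≈ 0.47 bits, so only about 0.53 bits of information actually get through per bit sent:

```python
from math import log2

def binary_entropy(p):
    """Entropy H2(p) of a biased coin, in bits."""
    if p in (0, 1):
        return 0.0
    return p * log2(1 / p) + (1 - p) * log2(1 / (1 - p))

error_rate = 0.1                                  # 10% of the bits get flipped
entropy_sent = 1.0                                # one fair bit per use of the channel
equivocation = binary_entropy(error_rate)         # H(sent | received) ≈ 0.469 bits
mutual_information = entropy_sent - equivocation  # ≈ 0.531 bits actually communicated

print("equivocation per bit:      ", round(equivocation, 3))
print("mutual information per bit:", round(mutual_information, 3))
```

This equivocation is exactly the amount of redundancy per bit that the correction message must provide.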

Shannon proved that by adding redundancy with enough entropy, we can reconstruct the information perfectly almost surely (with a probability as close to 1 as we want). This idea is another of Shannon's earthshaking ideas. Quite often, the redundant message is sent along with the message, and guarantees that, almost surely, the message will be readable once received. It's like having to read an article again and again to finally retrieve its information.
So redundancy is basically repeating the message?
There are smarter ways to do so, as my students sometimes remind me by asking me to re-explain reasonings differently. Shannon worked on that later, and achieved other remarkable breakthroughs. Similarly to the theorems I have mentioned above, Shannon's theorem

for noisy channels provides a limit to the minimum quantity of redundancy required to almost surely retrieve the message. In practice, this limit is hard to reach, though, as it depends on the probabilistic structure of the information.
Does Shannon's theorem explain why the English language is so redundant?
Yes! Redundancy is essential in common languages, as we don't actually catch most of what's said. But, because English is so redundant, we can guess what's missing from what we've heard. For instance, whenever you hear "I l*v* cake", you can easily fill in the blanks. What's particularly surprising is that we actually do most of this reconstitution without even being aware of it! You don't believe me? Check out the McGurk effect, explained here by Myles Power and Alex Dainis:

FWS - The McGurk Effect - Hearing lips and seeing voices

It wouldn't surprise me to find out that languages are nearly optimized for oral communication in Shannon's sense. Although there definitely are other factors coming into play, which would have to explain, for instance, why the French language is so much more redundant than English...

Let's Conclude
What I've presented here are just a few of Shannon's fundamental ideas for messages with discrete probabilities. Claude Shannon then moved on to generalizing these ideas to discuss communication using actual electromagnetic signals, whose probabilities now have to be described using probability density functions. Although this doesn't affect the profound fundamental ideas of information and communication, it does lead to a much more complex mathematical study. Once again, Shannon's work is fantastic. But instead of trusting me, you should probably listen to his colleagues who have inherited his theory, in this documentary by UCTV:


Claude Shannon - Father of the Information...

The documentary is awesome! You should watch it in full!

Shannon did not only write the 1948 paper. In fact, his first major breakthrough came back when he was a Master's student at MIT. His thesis is by far the most influential Master's thesis of all time, as it shows how exploiting Boolean algebra could enable the production of machines that would compute anything. In other words, in his Master's thesis, Shannon drew the blueprints of computers! Shannon also made crucial progress in cryptography and artificial intelligence.
I can only invite you to go further and learn more. This is what's commonly called opening your mind. I'm going to conclude with this, but in Shannon's language: increase the entropy of your thoughts!


More on Science4All
Entropy and the Second Law of Thermodynamics
Conditional Probabilities: Know what you Learn

More Elsewhere
Episode 2: Language of Coins (Information Theory) on Art of the Problem.
What is Information? Part 2a: Information Theory on Cracking the Nutshell.
Without Shannon's information theory there would have been no internet on The Guardian.
The Legacy of Entropy on Everything about Data Analytics.

