Entropy (information theory)

Information entropy is defined as the average amount of information produced by a stochastic source of
data.

The measure of information entropy associated with each possible data value is the negative logarithm of
the probability mass function for the value. Thus, when the data source has a lower-probability value (i.e.,
when a low-probability event occurs), the event carries more "information" ("surprisal") than when the
source data has a higher-probability value. The amount of information conveyed by each event defined in
this way becomes a random variable whose expected value is the information entropy.

Generally, entropy refers to disorder or uncertainty, and the definition of entropy used in information
theory is directly analogous to the definition used in statistical thermodynamics. The concept of
information entropy was introduced by Claude Shannon in his 1948 paper "A Mathematical Theory of
Communication".[1]

The basic model of a data communication system is composed of three elements: a source of data,
a communication channel, and a receiver. As expressed by Shannon, the "fundamental problem of
communication" is for the receiver to be able to identify what data was generated by the source, based on
the signal it receives through the channel.[2] The entropy provides an absolute limit on the shortest
possible average length of a lossless compression encoding of the data produced by a source, and if the
entropy of the source is less than the channel capacity of the communication channel, the data generated
by the source can be reliably communicated to the receiver (at least in theory, possibly neglecting some
practical considerations such as the complexity of the system needed to convey the data and the amount
of time it may take for the data to be conveyed).

Information entropy is typically measured in bits (alternatively called "shannons") or sometimes in "natural
units" (nats) or decimal digits (called "dits", "bans", or "hartleys"). The unit of the measurement depends
on the base of the logarithm that is used to define the entropy.
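
Changing the base of the logarithm only rescales the entropy by a constant: one bit equals ln(2) ≈ 0.693 nats and log10(2) ≈ 0.301 hartleys. A short illustrative sketch of the conversion:

import math

h_bits = 1.0                         # entropy of a fair coin toss in bits (shannons)
h_nats = h_bits * math.log(2)        # ~0.693 nats, since 1 bit = ln(2) nats
h_hartleys = h_bits * math.log10(2)  # ~0.301 hartleys, since 1 bit = log10(2) hartleys
print(h_bits, h_nats, h_hartleys)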

The logarithm of the probability distribution is useful as a measure of entropy because it is additive for
independent sources. For instance, the entropy of a fair coin toss is 1 bit, and the entropy of m tosses
is m bits. In a straightforward representation, log2(n) bits are needed to represent a variable that can take
one of n values if n is a power of 2. If these values are equally probable, the entropy (in bits) is equal to
this number. If one of the values is more probable than the others, observing that value is less
informative than if some less common outcome had occurred. Conversely, rarer events
provide more information when observed. Since observation of less probable events occurs more rarely,
the net effect is that the entropy (thought of as average information) received from non-uniformly
distributed data is always less than or equal to log2(n). Entropy is zero when one outcome is certain to
occur. The entropy quantifies these considerations when a probability distribution of the source data is
known. The meaning of the events observed (the meaning of messages) does not matter in the definition
of entropy. Entropy only takes into account the probability of observing a specific event, so the information
it encapsulates is information about the underlying probability distribution, not the meaning of the events
themselves.
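
These claims can be checked numerically. The sketch below (illustrative only; the helper H is ours) compares a uniform and a non-uniform distribution over n = 4 values against the log2(n) bound, and shows additivity for independent fair coin tosses:

import math

def H(probs):
    # Shannon entropy in bits of a discrete probability distribution.
    return -sum(p * math.log2(p) for p in probs if p > 0)

n = 4
print(H([1 / n] * n), math.log2(n))   # 2.0 and 2.0: a uniform source attains the log2(n) bound
print(H([0.7, 0.1, 0.1, 0.1]))        # ~1.357: a non-uniform source stays below log2(n)
print(3 * H([0.5, 0.5]))              # 3.0: additivity, three independent fair tosses give 3 bits
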
Introduction
Entropy is a measure of unpredictability of the state, or equivalently, of its average information content. To
get an intuitive understanding of these terms, consider the example of a political poll. Usually, such polls
happen because the outcome of the poll is not already known. In other words, the outcome of the poll is
relatively unpredictable, and actually performing the poll and learning the results gives some
new information; these are just different ways of saying that the a priori entropy of the poll results is large.
Now, consider the case that the same poll is performed a second time shortly after the first poll. Since the
result of the first poll is already known, the outcome of the second poll can be predicted well and the
results should not contain much new information; in this case the a priori entropy of the second poll result
is small relative to that of the first.

Now consider the example of a coin toss. Assuming the probability of heads is the same as the probability
of tails, then the entropy of the coin toss is as high as it could be. This is because there is no way to
predict the outcome of the coin toss ahead of time: if we have to choose, the best we can do is predict
that the coin will come up heads, and this prediction will be correct with probability 1/2. Such a coin toss
has one bit of entropy since there are two possible outcomes that occur with equal probability, and
learning the actual outcome contains one bit of information. In contrast, a coin toss using a coin that has
two heads and no tails has zero entropy since the coin will always come up heads, and the outcome can
be predicted perfectly. Analogously, a single binary outcome with equiprobable values has a Shannon
entropy of log2(2) = 1 bit. Similarly, one trit with equiprobable values contains log2(3) (about 1.58496) bits
of information because it can have one of three values.
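
The coin examples correspond to the binary entropy function H(p) = −p log2 p − (1 − p) log2(1 − p). A brief sketch of the values mentioned above (the helper name binary_entropy is ours):

import math

def binary_entropy(p):
    # Entropy in bits of a coin that comes up heads with probability p.
    if p in (0.0, 1.0):
        return 0.0   # the outcome is certain, so no information is gained
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(binary_entropy(0.5))              # 1.0: fair coin
print(binary_entropy(1.0))              # 0.0: two-headed coin
print(-3 * (1 / 3) * math.log2(1 / 3))  # ~1.585: one trit with three equiprobable values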

English text, treated as a string of characters, has fairly low entropy, i.e., is fairly predictable. Even if we
do not know exactly what is going to come next, we can be fairly certain that, for example, 'e' will be far
more common than 'z', that the combination 'qu' will be much more common than any other combination
with a 'q' in it, and that the combination 'th' will be more common than 'z', 'q', or 'qu'. After the first few
letters one can often guess the rest of the word. English text has between 0.6 and 1.3 bits of entropy per
character of the message.[3]
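
One rough way to see where such per-character figures come from is to compute the entropy of the single-letter frequency distribution of a sample text. This is only a simplified sketch, not the method behind the 0.6–1.3 range: because it ignores dependencies between letters (such as 'qu' and 'th'), a unigram estimate comes out much higher, roughly 4 bits per character for typical English letter frequencies. The helper name unigram_entropy and the sample string are ours:

from collections import Counter
import math

def unigram_entropy(text):
    # Entropy in bits per character of the empirical single-character distribution.
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

sample = "the quick brown fox jumps over the lazy dog " * 100
print(unigram_entropy(sample))   # per-character entropy of this sample, ignoring all context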

If a compression scheme is lossless—that is, you can always recover the entire original message by
decompressing—then a compressed message has the same quantity of information as the original, but
communicated in fewer characters. That is, it has more information, or a higher entropy, per character.
This means a compressed message has less redundancy. Roughly speaking, Shannon's source coding
theorem says that a lossless compression scheme cannot compress messages, on average, to
have more than one bit of information per bit of message, but that any value less than one bit of
information per bit of message can be attained by employing a suitable coding scheme. The entropy of a
message per bit multiplied by the length of that message is a measure of how much total information the
message contains.

Intuitively, imagine that we wish to transmit sequences comprising the 4 characters 'A', 'B', 'C', and 'D'.
Thus, a message to be transmitted might be 'ABADDCAB'. Information theory gives a way of calculating
the smallest possible amount of information that will convey this. If all 4 letters are equally likely (25%), we
can do no better (over a binary channel) than to have 2 bits encode (in binary) each letter: 'A' might code
as '00', 'B' as '01', 'C' as '10', and 'D' as '11'. Now suppose 'A' occurs with 70% probability, 'B' with 26%,
and 'C' and 'D' with 2% each. We could assign variable length codes, so that receiving a '1' tells us to look
at another bit unless we have already received 2 bits of sequential 1s. In this case, 'A' would be coded as
'0' (one bit), 'B' as '10', and 'C' and 'D' as '110' and '111'. It is easy to see that 70% of the time only one bit
needs to be sent, 26% of the time two bits, and only 4% of the time 3 bits. On average, then, fewer than 2
bits are required since the entropy is lower (owing to the high prevalence of 'A' followed by 'B' – together
96% of characters). The entropy calculation, a probability-weighted sum of log probabilities, measures and
captures this effect.
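
The figures in this example can be checked directly: the expected code length is 0.70·1 + 0.26·2 + 0.04·3 = 1.34 bits per character, while the entropy of the distribution is about 1.09 bits per character, so the variable-length code already comes close to the theoretical minimum. A small illustrative sketch of the calculation:

import math

probs   = {'A': 0.70, 'B': 0.26, 'C': 0.02, 'D': 0.02}
lengths = {'A': 1, 'B': 2, 'C': 3, 'D': 3}   # code words: 0, 10, 110, 111

avg_length = sum(probs[s] * lengths[s] for s in probs)        # 0.70*1 + 0.26*2 + 0.04*3 = 1.34 bits
entropy    = -sum(p * math.log2(p) for p in probs.values())   # ~1.09 bits
print(avg_length, entropy)   # the code's average length stays above the entropy, as it must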

Shannon's theorem also implies that no lossless compression scheme can shorten all messages. If some
messages come out shorter, at least one must come out longer due to the pigeonhole principle. In
practical use, this is generally not a problem, because we are usually only interested in compressing
certain types of messages, for example English documents as opposed to gibberish text, or digital
photographs rather than noise, and it is unimportant if a compression algorithm makes some unlikely or
uninteresting sequences larger. However, the problem can still arise even in everyday use when applying
a compression algorithm to already compressed data: for example, making a ZIP file of music, pictures or
videos that are already in a compressed format such as FLAC, MP3, WebM, AAC, PNG or JPEG will
generally result in a ZIP file that is slightly larger than the source.
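
This effect is easy to observe with any general-purpose compressor. For instance, an illustrative sketch using Python's zlib (not the ZIP format itself): compressing highly redundant data shrinks it dramatically, while compressing data that is already near-incompressible, such as random bytes or the output of a previous compression pass, yields no gain and a small amount of overhead:

import os
import zlib

redundant = b"abab" * 10_000        # 40,000 bytes of highly redundant (low-entropy) data
random_bytes = os.urandom(40_000)   # 40,000 bytes that are essentially incompressible

once = zlib.compress(redundant)
twice = zlib.compress(once)         # compressing already-compressed data

print(len(redundant), len(once), len(twice))                  # huge shrink, then a slight growth
print(len(random_bytes), len(zlib.compress(random_bytes)))    # typically a few bytes larger than the input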
