
UNIT : I

INFORMATION THEORY, CODING & CRYPTOGRAPHY (MCSE 202)


PREPARED BY ARUN PRATAP SINGH 5/26/14 MTECH 2nd SEMESTER





INTRODUCTION TO INFORMATION THEORY :
Information theory is a branch of science that deals with the analysis of communication systems.
We will study digital communications using a file (or network protocol) as the channel.

Claude Shannon published a landmark paper in 1948 that marked the beginning of information theory as a field.
We are interested in communicating information from a source to a destination.

In our case, the messages will be a sequence of binary digits.
A binary digit is called a bit.
One detail that makes communication difficult is noise: noise introduces uncertainty.
Suppose we wish to transmit one bit of information. What are all of the possibilities?
tx 0, rx 0 - good
tx 0, rx 1 - error
tx 1, rx 0 - error
tx 1, rx 1 - good
Two of the cases above have errors; this is where probability fits into the picture.
In the case of steganography, the noise may be due to attacks on the hiding algorithm.



INFORMATION MEASURES :
Any information source, analog or digital, produces an output that is random in nature. If it were not random, i.e., if the output were known exactly, there would be no need to transmit it. We live in an analog world and most sources are analog sources, for example, speech and temperature fluctuations. Discrete sources are man-made sources, for example, a source (say, a person typing an email) that generates a sequence of letters from a finite alphabet.
Before we go on to develop a mathematical measure of information, let us develop an intuitive feel for it. Read the following sentences:

(A) Tomorrow, the sun will rise in the East.
(B) The phone will ring in the next one hour.
(C) It will snow in Delhi this winter.
The three sentences carry different amounts of information. In fact, the first sentence hardly carries any information. Everybody knows that the sun rises in the East, and the probability of this happening again is almost unity.
Sentence (B) appears to carry more information than sentence (A). The phone may ring, or it may not. There is a finite probability that the phone will ring in the next one hour.
The last sentence probably made you read it over twice. This is because it has never snowed in Delhi, and the probability of snowfall is very low. It is interesting to note that the amount of information carried by the sentences listed above has something to do with the probability of occurrence of the events stated in the sentences, and we observe an inverse relationship. Sentence (A), which talks about an event whose probability of occurrence is very close to 1, carries almost no information. Sentence (C), which has a very low probability of occurrence, appears to carry a lot of information. The other interesting thing to note is that the length of the sentence has nothing to do with the amount of information it conveys. In fact, sentence (A) is the longest but carries the minimum information.
We will now develop a mathematical measure of information.
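A standard way to quantify this intuition (stated here as a sketch using the usual definition, not reproduced from the course figures) assigns an outcome of probability P an information content of -log2(P) bits:

```python
import math

def self_information(p):
    """Information content (in bits) of an event with probability p: I = -log2(p)."""
    return -math.log2(p)

print(self_information(1.0))    # 0.0 bits   -- a certain event, like sentence (A), carries no information
print(self_information(0.5))    # 1.0 bit    -- a fair coin flip
print(self_information(1e-6))   # ~19.93 bits -- a very unlikely event, like sentence (C)
```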






REVIEW PROBABILITY THEORY :
We could choose one of several technical definitions for probability, but for our purposes it refers to an
assessment of the likelihood of the various possible outcomes in an experiment or some other situation
with a random outcome.
Why Probability Theory?
Information is exchanged in a computer network in a random way, and events that modify the behavior of links and nodes in the network are also random.
We need a way to reason quantitatively about the likelihood of events in a network and to predict the behavior of network components.
Example 1:
Measure the time between two packet arrivals into the cable of a local area network.
Determine how likely it is that the interarrival time between any two packets is less than T sec.
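A small numerical sketch of Example 1, assuming (purely for illustration) that packet arrivals form a Poisson process, so that interarrival times are exponentially distributed; the rate value below is made up:

```python
import math

def p_interarrival_less_than(T, rate):
    """P(interarrival time < T) for a Poisson arrival process with the given rate
    (packets per second); the interarrival time is then Exponential(rate)."""
    return 1.0 - math.exp(-rate * T)

# Hypothetical numbers: 100 packets/s on average, T = 5 ms
print(p_interarrival_less_than(0.005, 100.0))   # ~0.393
```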

A probability model is a mathematical model used to quantify the likelihood of events taking place in an experiment in which events are random.
It consists of:
A sample space: the set of all possible outcomes of a random experiment.
The set of events: subsets of the sample space.
The probability measure: defined according to a probability law for all the events of the sample space.




RANDOM VARIABLES :
In probability and statistics, a random variable, aleatory variable or stochastic variable is
a variable whose value is subject to variations due to chance (i.e. randomness, in a mathematical
sense).
A random variable's possible values might represent the possible outcomes of a yet-to-be-performed
experiment, or the possible outcomes of a past experiment whose already-existing value is uncertain
(for example, as a result of incomplete information or imprecise measurements). They may also
conceptually represent either the results of an "objectively" random process (such as rolling a die), or
the "subjective" randomness that results from incomplete knowledge of a quantity. The meaning of the

probabilities assigned to the potential values of a random variable is not part of probability theory itself,
but instead related to philosophical arguments over the interpretation of probability.
The mathematical function describing the possible values of a random variable and their associated
probabilities is known as a probability distribution. Random variables can be discrete, that is, taking
any of a specified finite or countable list of values, endowed with a probability mass function,
characteristic of a probability distribution; or continuous, taking any numerical value in an interval or
collection of intervals, via a probability density function that is characteristic of a probability distribution;
or a mixture of both types. The realizations of a random variable, that is, the results of randomly
choosing values according to the variable's probability distribution function, are called random variates.
Example :
The possible outcomes for one coin toss can be described by the sample space Ω = {heads, tails}. We can introduce a real-valued random variable Y that models a $1 payoff for a successful bet on heads as follows:

Y(ω) = 1 if ω = heads, and Y(ω) = 0 if ω = tails.

If the coin is equally likely to land on either side, then Y has a probability mass function given by:

f_Y(y) = 1/2 if y = 1, f_Y(y) = 1/2 if y = 0, and f_Y(y) = 0 otherwise.
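A minimal simulation of this random variable (a sketch; the fairness assumption and payoff values come from the example above):

```python
import random

def Y():
    """Payoff random variable for a $1 bet on heads with a fair coin."""
    return 1 if random.random() < 0.5 else 0

samples = [Y() for _ in range(100_000)]
print(sum(samples) / len(samples))   # empirical P(Y = 1), close to 0.5
```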








RANDOM PROCESS :
In probability theory, a stochastic process (or, as it is widely called, a random process) is a
collection of random variables; this is often used to represent the evolution of some random value, or
system, over time. This is the probabilistic counterpart to a deterministic process (or deterministic
system). Instead of describing a process which can only evolve in one way (as in the case, for example,
of solutions of an ordinary differential equation), in a stochastic or random process there is some
indeterminacy: even if the initial condition (or starting point) is known, there are several (often infinitely
many) directions in which the process may evolve.
In the simple case of discrete time, as opposed to continuous time, a stochastic process involves
a sequence of random variables and the time series associated with these random variables (for
example, see Markov chain, also known as discrete-time Markov chain). Another basic type of a
stochastic process is a random field, whose domain is a region of space, in other words, a random
function whose arguments are drawn from a range of continuously changing values. One approach to
stochastic processes treats them as functions of one or several deterministic arguments (inputs, in
most cases regarded as time) whose values (outputs) are random variables: non-deterministic (single)
quantities which have certain probability distributions. Random variables corresponding to various
times (or points, in the case of random fields) may be completely different. The main requirement is
that these different random quantities all have the same type. Type refers to the codomain of the
function. Although the random values of a stochastic process at different times may be independent
random variables, in most commonly considered situations they exhibit complicated statistical
correlations.



MUTUAL INFORMATION :
In probability theory and information theory, the mutual information or (formerly) transinformation of
two random variables is a measure of the variables' mutual dependence. The most common unit of
measurement of mutual information is the bit.

Formally, the mutual information of two discrete random variables X and Y can be defined as:

I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}

where p(x,y) is the joint probability distribution function of X and Y, and p(x) and p(y) are the marginal probability distribution functions of X and Y respectively.
In the case of continuous random variables, the summation is replaced by a definite double integral:

I(X;Y) = \int_{Y} \int_{X} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} \, dx \, dy

where p(x,y) is now the joint probability density function of X and Y, and p(x) and p(y) are the marginal probability density functions of X and Y respectively.
If the log base 2 is used, the units of mutual information are the bit.
Intuitively, mutual information measures the information that X and Y share: it measures how
much knowing one of these variables reduces uncertainty about the other. For example,
if X and Y are independent, then knowing X does not give any information about Y and vice
versa, so their mutual information is zero. At the other extreme, if X is a deterministic function
of Y and Y is a deterministic function of X then all information conveyed by X is shared with Y:
knowing X determines the value of Y and vice versa. As a result, in this case the mutual
information is the same as the uncertainty contained in Y (or X) alone, namely
the entropy of Y (or X). Moreover, this mutual information is the same as the entropy of X and
as the entropy of Y. (A very special case of this is when X and Y are the same random
variable.)
Mutual information is a measure of the inherent dependence expressed in the joint
distribution of X and Y relative to the joint distribution of X and Y under the assumption of
independence. Mutual information therefore measures dependence in the following
sense: I(X; Y) = 0 if and only if X and Y are independent random variables. This is easy to see
in one direction: if X and Y are independent, then p(x,y) = p(x) p(y), and therefore:

\log \frac{p(x,y)}{p(x)\,p(y)} = \log 1 = 0
Moreover, mutual information is nonnegative (i.e., I(X;Y) ≥ 0) and symmetric (i.e., I(X;Y) = I(Y;X)).
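The discrete definition above can be checked numerically. A minimal sketch (the joint distribution is made up purely for illustration):

```python
import math

# Hypothetical joint distribution p(x, y) over X in {0, 1}, Y in {0, 1}
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginals p(x) and p(y)
p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

# I(X;Y) = sum over (x, y) of p(x,y) * log2( p(x,y) / (p(x) p(y)) )
I = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items() if p > 0)
print(I)   # ~0.278 bits; it would be 0 if p(x,y) = p(x)p(y) everywhere
```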




ENTROPY :
In information theory, entropy is a measure of the uncertainty in a random variable. In this context, the term usually refers to the Shannon entropy, which quantifies the expected value of the information contained in a message. Entropy is typically measured in bits, nats, or bans.
Shannon entropy is the average unpredictability in a random variable, which is equivalent to its information content. Shannon entropy provides an absolute limit on the best possible lossless encoding or compression of any communication, assuming that the communication may be represented as a sequence of independent and identically distributed random variables.
A single toss of a fair coin has an entropy of one bit. A series of two fair coin tosses has an entropy of
two bits; in general, a series of n fair coin tosses has an entropy of n bits. This random selection between two
outcomes in a sequence over time, whether the outcomes are equally probable or not, is often referred
to as a Bernoulli process. The entropy of such a process is given by the binary entropy function. The
entropy rate for a fair coin toss is one bit per toss. However, if the coin is not fair, then the uncertainty,
and hence the entropy rate, is lower. This is because, if asked to predict the next outcome, we could
choose the most frequent result and be right more often than wrong. The difference between what we
know, or predict, and the information that the unfair coin toss reveals to us is less than one heads-or-
tails "message", or bit, per toss.





SHANNON'S THEOREM :
Shannon's theorem, proved by Claude Shannon in 1948, describes the maximum possible efficiency of
error correcting methods versus levels of noise interference and data corruption.
The theorem does not describe how to construct the error-correcting method; it only tells us how good the
best possible method can be. Shannon's theorem has wide-ranging applications in both communications
and data storage.
Considering all possible multi-level and multi-phase encoding techniques, the Shannon–Hartley theorem states that the channel capacity C, meaning the theoretical tightest upper bound on the information rate (excluding error correcting codes) of clean (or arbitrarily low bit error rate) data that can be sent with a given average signal power S through an analog communication channel subject to additive white Gaussian noise of power N, is:

C = B \log_2 \left( 1 + \frac{S}{N} \right)

where
C is the channel capacity in bits per second;

B is the bandwidth of the channel in hertz (passband bandwidth in case of a modulated
signal);
S is the average received signal power over the bandwidth (in case of a modulated signal,
often denoted C, i.e. modulated carrier), measured in watts (or volts squared);
N is the average noise or interference power over the bandwidth, measured in watts (or volts
squared); and
S/N is the signal-to-noise ratio (SNR) or the carrier-to-noise ratio (CNR) of the
communication signal to the Gaussian noise interference expressed as a linear power ratio
(not as logarithmic decibels).

Example :
If the SNR is 20 dB and the bandwidth available is 4 kHz, which is appropriate for telephone communications, then C = 4000 · log2(1 + 100) = 4000 · log2(101) ≈ 26.63 kbit/s. Note that the value S/N = 100 corresponds to an SNR of 20 dB.
If it is required to transmit at 50 kbit/s and a bandwidth of 1 MHz is used, then the minimum S/N required is given by 50,000 = 1,000,000 · log2(1 + S/N), so S/N = 2^(C/B) − 1 = 2^0.05 − 1 ≈ 0.035, corresponding to an SNR of −14.5 dB. This shows that it is possible to transmit using signals which are actually much weaker than the background noise level.
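The two examples can be reproduced directly from C = B log2(1 + S/N). A minimal sketch:

```python
import math

def capacity(bandwidth_hz, snr_linear):
    """Shannon-Hartley capacity C = B log2(1 + S/N), in bits per second."""
    return bandwidth_hz * math.log2(1.0 + snr_linear)

# Telephone-channel example: B = 4 kHz, SNR = 20 dB (linear ratio 100)
print(capacity(4_000, 100))            # ~26,633 bit/s, i.e. ~26.6 kbit/s

# Second example: SNR needed for 50 kbit/s over B = 1 MHz
required_snr = 2 ** (50_000 / 1_000_000) - 1
print(required_snr)                    # ~0.035 (linear)
print(10 * math.log10(required_snr))   # ~ -14.5 dB
```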

Shannon's law is any statement defining the theoretical maximum rate at which error-free digits
can be transmitted over a bandwidth-limited channel in the presence of noise.
Shannon's theorem puts a limit on the transmission data rate, not on the error probability:
It is theoretically possible to transmit information at any rate Rb, where Rb ≤ C, with
an arbitrarily small error probability by using a sufficiently complicated coding
scheme.
For an information rate Rb > C, it is not possible to find a code that can achieve
an arbitrarily small error probability.



Noisy channel coding theorem and capacity :
Claude Shannon's development of information theory during World War II provided the next big step in understanding how much information could be reliably communicated through noisy channels. Building on Hartley's foundation, Shannon's noisy channel coding theorem (1948) describes the maximum possible efficiency of error-correcting methods versus levels of noise interference and data corruption. The proof of the theorem shows that a randomly constructed error-correcting code is essentially as good as the best possible code; the theorem is proved through the statistics of such random codes.
Shannon's theorem shows how to compute a channel capacity from a statistical description of a channel, and establishes that given a noisy channel with capacity C and information transmitted at a line rate R, then if

R < C

there exists a coding technique which allows the probability of error at the receiver to be made arbitrarily small. This means that, theoretically, it is possible to transmit information nearly without error at any rate up to a limit of C bits per second.
The converse is also important. If

R > C

the probability of error at the receiver increases without bound as the rate is increased, so no useful information can be transmitted beyond the channel capacity. The theorem does not address the rare situation in which rate and capacity are equal.

The Shannon–Hartley theorem establishes what that channel capacity is for a finite-
bandwidth continuous-time channel subject to Gaussian noise. It connects Hartley's result with
Shannon's channel capacity theorem in a form that is equivalent to specifying the M in Hartley's
line rate formula in terms of a signal-to-noise ratio, but achieving reliability through error-correction
coding rather than through reliably distinguishable pulse levels.
If there were such a thing as a noise-free analog channel, one could transmit unlimited amounts
of error-free data over it per unit of time (note: an infinite-bandwidth analog channel cannot transmit
unlimited amounts of error-free data without infinite signal power). Real channels, however, are
subject to limitations imposed by both finite bandwidth and nonzero noise.
So how do bandwidth and noise affect the rate at which information can be transmitted over an
analog channel?
Surprisingly, bandwidth limitations alone do not impose a cap on maximum information rate. This
is because it is still possible for the signal to take on an indefinitely large number of different
voltage levels on each symbol pulse, with each slightly different level being assigned a different
meaning or bit sequence. If we combine both noise and bandwidth limitations, however, we do
find there is a limit to the amount of information that can be transferred by a signal of a bounded
power, even when clever multi-level encoding techniques are used.
In the channel considered by the Shannon–Hartley theorem, noise and signal are combined by
addition. That is, the receiver measures a signal that is equal to the sum of the signal encoding
the desired information and a continuous random variable that represents the noise. This addition
creates uncertainty as to the original signal's value. If the receiver has some information about
the random process that generates the noise, one can in principle recover the information in the
original signal by considering all possible states of the noise process. In the case of the Shannon–Hartley
theorem, the noise is assumed to be generated by a Gaussian process with a known
variance. Since the variance of a Gaussian process is equivalent to its power, it is conventional
to call this variance the noise power.
Such a channel is called the Additive White Gaussian Noise channel, because Gaussian noise is
added to the signal; "white" means equal amounts of noise at all frequencies within the channel
bandwidth. Such noise can arise both from random sources of energy and also from coding and
measurement error at the sender and receiver respectively. Since sums of independent Gaussian
random variables are themselves Gaussian random variables, this conveniently simplifies
analysis, if one assumes that such error sources are also Gaussian and independent.
Implications of the theorem
Comparison of Shannon's capacity to Hartley's law -
Comparing the channel capacity to the information rate from Hartley's law, we can find the
effective number of distinguishable levels M:

2B \log_2 M = B \log_2 \left( 1 + \frac{S}{N} \right)

M = \sqrt{1 + \frac{S}{N}}
The square root effectively converts the power ratio back to a voltage ratio, so the number of
levels is approximately proportional to the ratio of rms signal amplitude to noise standard
deviation.
This similarity in form between Shannon's capacity and Hartley's law should not be interpreted to
mean that M pulse levels can be literally sent without any confusion; more levels are needed, to
allow for redundant coding and error correction, but the net data rate that can be approached with
coding is equivalent to using that M in Hartley's law.
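A quick numerical check of the relation M = sqrt(1 + S/N) between the two formulas (the values are chosen for illustration):

```python
import math

def effective_levels(snr_linear):
    """Effective number of distinguishable levels M = sqrt(1 + S/N)."""
    return math.sqrt(1.0 + snr_linear)

print(effective_levels(100))   # ~10.05 levels at 20 dB SNR
# Sanity check: Hartley's line rate 2B log2(M) matches Shannon's B log2(1 + S/N)
B = 4_000
print(2 * B * math.log2(effective_levels(100)))   # equals B*log2(101) ~= 26,633 bit/s
```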

Alternative forms
Frequency-dependent (colored noise) case-
In the simple version above, the signal and noise are fully uncorrelated, in which case S + N is
the total power of the received signal and noise together. A generalization of the above equation
for the case where the additive noise is not white (or that the S/N is not constant with frequency
over the bandwidth) is obtained by treating the channel as many narrow, independent Gaussian
channels in parallel:

C = \int_0^{B} \log_2 \left( 1 + \frac{S(f)}{N(f)} \right) df
where
C is the channel capacity in bits per second;
B is the bandwidth of the channel in Hz;
S(f) is the signal power spectrum
N(f) is the noise power spectrum
f is frequency in Hz.
Note: the theorem only applies to Gaussian stationary process noise. This formula's way of
introducing frequency-dependent noise cannot describe all continuous-time noise processes. For
example, consider a noise process consisting of adding a random wave whose amplitude is 1 or
-1 at any point in time, and a channel that adds such a wave to the source signal. Such a wave's
frequency components are highly dependent. Though such a noise may have a high power, it is
fairly easy to transmit a continuous signal with much less power than one would need if the
underlying noise were a sum of independent noises in each frequency band.
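A minimal numerical sketch of the frequency-dependent form, approximating the integral of log2(1 + S(f)/N(f)) over the band with a Riemann sum; the spectra below are made up for illustration:

```python
import math

def colored_capacity(S, N, B, steps=10_000):
    """Approximate C = integral from 0 to B of log2(1 + S(f)/N(f)) df by a Riemann sum."""
    df = B / steps
    return sum(math.log2(1.0 + S(i * df) / N(i * df)) * df for i in range(steps))

# Hypothetical spectra (W/Hz): flat signal, noise that rises with frequency
def S(f):
    return 1e-6

def N(f):
    return 1e-8 * (1.0 + f / 1_000)

print(colored_capacity(S, N, B=4_000))   # capacity in bit/s for this made-up channel
```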

Approximations -
For large or small and constant signal-to-noise ratios, the capacity formula can be approximated:
If S/N >> 1, then

C \approx B \log_2 \frac{S}{N} \approx 0.332 \cdot B \cdot \mathrm{SNR\ (in\ dB)}

where

\mathrm{SNR\ (in\ dB)} = 10 \log_{10} \frac{S}{N}

Similarly, if S/N << 1, then

C \approx \frac{B}{\ln 2} \cdot \frac{S}{N} \approx 1.44 \cdot B \cdot \frac{S}{N}

In this low-SNR approximation, capacity is independent of bandwidth if the noise is white, of spectral density N_0 watts per hertz, in which case the total noise power is N = B N_0 and the capacity becomes C \approx 1.44 \cdot S / N_0.


REDUNDANCY :
Redundancy in information theory is the number of bits used to transmit a message minus the number
of bits of actual information in the message. Informally, it is the amount of wasted "space" used to
transmit certain data. Data compression is a way to reduce or eliminate unwanted redundancy,
while checksums are a way of adding desired redundancy for purposes of error detection when
communicating over a noisy channel of limited capacity.
In describing the redundancy of raw data, the rate of a source of information is the average entropy per symbol. For memoryless sources, this is merely the entropy of each symbol, while, in the most general case of a stochastic process, it is

r = \lim_{n \to \infty} \frac{1}{n} H(M_1, M_2, \dots, M_n)

the limit, as n goes to infinity, of the joint entropy of the first n symbols divided by n. It is common in information theory to speak of the "rate" or "entropy" of a language. This is appropriate, for example, when the source of information is English prose. The rate of a memoryless source is simply H(M), since by definition there is no interdependence of the successive messages of a memoryless source.
The absolute rate of a language or source is simply

R = \log |M|

the logarithm of the cardinality of the message space, or alphabet. (This formula is sometimes called the Hartley function.) This is the maximum possible rate of information that can be transmitted with that alphabet. (The logarithm should be taken to a base appropriate for the unit of measurement in use.) The absolute rate is equal to the actual rate if the source is memoryless and has a uniform distribution.
The absolute redundancy can then be defined as

D = R - r

the difference between the absolute rate and the rate.
The quantity D/R is called the relative redundancy and gives the maximum possible data compression ratio, when expressed as the percentage by which a file size can be decreased. (When expressed as a ratio of original file size to compressed file size, the quantity R/r gives the maximum compression ratio that can be achieved.)
Complementary to the concept of relative redundancy is efficiency, defined as r/R, so that r/R + D/R = 1. A memoryless source with a uniform distribution has zero redundancy (and thus 100% efficiency), and cannot be compressed.
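A small sketch of these definitions for a memoryless source (the alphabet and probabilities are chosen arbitrarily):

```python
import math

symbols = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}   # hypothetical source

rate = -sum(p * math.log2(p) for p in symbols.values())    # r = entropy per symbol
absolute_rate = math.log2(len(symbols))                    # R = log2 |alphabet|
absolute_redundancy = absolute_rate - rate                 # D = R - r
relative_redundancy = absolute_redundancy / absolute_rate  # D / R
efficiency = rate / absolute_rate                          # r / R

print(rate, absolute_rate)               # 1.75 bits/symbol vs. 2.0 bits/symbol
print(relative_redundancy, efficiency)   # 0.125 and 0.875
```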

A measure of redundancy between two variables is the mutual information or a normalized
variant. A measure of redundancy among many variables is given by the total correlation.
Redundancy of compressed data refers to the difference between the expected compressed data length of n messages, L(M^n) (or expected data rate L(M^n)/n), and the entropy nr (or entropy rate r). (Here we assume the data is ergodic and stationary, e.g., a memoryless source.) Although the rate difference L(M^n)/n − r can be made arbitrarily small as n is increased, the actual difference L(M^n) − nr cannot, although it can be theoretically upper-bounded by 1 in the case of finite-entropy memoryless sources.

HUFFMAN CODING :
In computer science and information theory, Huffman coding is an entropy encoding algorithm used
for lossless data compression. The term refers to the use of a variable-length code table for encoding
a source symbol (such as a character in a file) where the variable-length code table has been derived
in a particular way based on the estimated probability of occurrence for each possible value of the
source symbol. It was developed by David A. Huffman while he was a Ph.D. student at MIT, and
published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes".

Huffman coding uses a specific method for choosing the representation for each symbol, resulting in
a prefix code (sometimes called "prefix-free codes", that is, the bit string representing some particular
symbol is never a prefix of the bit string representing any other symbol) that expresses the most
common source symbols using shorter strings of bits than are used for less common source symbols.
Huffman was able to design the most efficient compression method of this type: no other mapping of
individual source symbols to unique strings of bits will produce a smaller average output size when
the actual symbol frequencies agree with those used to create the code. The running time of Huffman's
method is fairly efficient: it takes O(n log n) operations to construct a code for n source symbols. A method was later found to design a Huffman code in linear time if the input probabilities (also known as weights) are sorted.


We will now study an algorithm for constructing efficient source codes for a DMS (discrete memoryless source) with source symbols that are not equally probable. A variable-length encoding algorithm was suggested by Huffman in 1952, based on the source symbol probabilities P(x_i), i = 1, 2, ..., L. The algorithm is optimal in the sense that the average number of bits it requires to represent the source symbols is a minimum, and it also meets the prefix condition. The steps of the Huffman coding algorithm are given below :
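In addition to the worked steps, a compact sketch of the merging procedure (repeatedly combine the two least probable nodes and prepend 0/1 to the codewords on each side) might look like this; the symbol names and probabilities are hypothetical:

```python
import heapq

def huffman_code(freqs):
    """Build a Huffman code for {symbol: probability or count}.
    Repeatedly merges the two least probable nodes, extending codewords as it goes."""
    heap = [[w, i, {s: ''}] for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                      # degenerate single-symbol source
        return {s: '0' for s in freqs}
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)     # least probable node
        w2, _, c2 = heapq.heappop(heap)     # second least probable node
        merged = {s: '0' + code for s, code in c1.items()}
        merged.update({s: '1' + code for s, code in c2.items()})
        heapq.heappush(heap, [w1 + w2, id(merged), merged])
    return heap[0][2]

# Example source with unequal probabilities
probs = {'x1': 0.5, 'x2': 0.25, 'x3': 0.15, 'x4': 0.10}
code = huffman_code(probs)
print(code)                                  # e.g. {'x1': '0', 'x2': '10', ...}
avg_len = sum(probs[s] * len(code[s]) for s in probs)
print(avg_len)                               # average codeword length in bits/symbol
```

For this made-up source the average codeword length works out to 1.75 bits/symbol, slightly above the source entropy of about 1.74 bits/symbol, as expected for an optimal prefix code.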



RANDOM VARIABLES :
A random variable, usually written X, is a variable whose possible values are numerical outcomes of a
random phenomenon. There are two types of random variables, discrete and continuous.
DISCRETE RANDOM VARIABLES :
A discrete random variable is one which may take on only a countable number of distinct values such as
0,1,2,3,4,........ Discrete random variables are usually (but not necessarily) counts. If a random variable can
take only a finite number of distinct values, then it must be discrete. Examples of discrete random
variables include the number of children in a family, the Friday night attendance at a cinema, the number
of patients in a doctor's surgery, the number of defective light bulbs in a box of ten.
The probability distribution of a discrete random variable is a list of probabilities associated with each of
its possible values. It is also sometimes called the probability function or the probability mass function.

(Definitions taken from Valerie J. Easton and John H. McColl's Statistics Glossary v1.1)

Suppose a random variable X may take k different values, with the probability that X = xi defined to be P(X
= xi) = pi. The probabilities pi must satisfy the following:

1: 0 < pi < 1 for each i
2: p1 + p2 + ... + pk = 1.
Example :
Suppose a variable X can take the values 1, 2, 3, or 4.
The probabilities associated with each outcome are described by the following table:
Outcome      1     2     3     4
Probability  0.1   0.3   0.4   0.2
The probability that X is equal to 2 or 3 is the sum of the two probabilities: P(X = 2 or X = 3) = P(X = 2) +
P(X = 3) = 0.3 + 0.4 = 0.7. Similarly, the probability that X is greater than 1 is equal to 1 - P(X = 1) = 1 - 0.1
= 0.9, by the complement rule.
This distribution may also be described by a probability histogram.

CONTINUOUS RANDOM VARIABLES :
A continuous random variable is one which takes an infinite number of possible values. Continuous
random variables are usually measurements. Examples include height, weight, the amount of sugar in an
orange, the time required to run a mile.
(Definition taken from Valerie J. Easton and John H. McColl's Statistics Glossary v1.1)

A continuous random variable is not defined at specific values. Instead, it is defined over an interval of
values, and is represented by the area under a curve (in advanced mathematics, this is known as an
integral). The probability of observing any single value is equal to 0, since the number of values which
may be assumed by the random variable is infinite.
Suppose a random variable X may take all values over an interval of real numbers. Then the probability
that X is in the set of outcomes A, P(A), is defined to be the area above A and under a curve. The curve,
which represents a function p(x), must satisfy the following:
1: The curve has no negative values (p(x) > 0 for all x)
2: The total area under the curve is equal to 1.
A curve meeting these requirements is known as a density curve.
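A small numerical sketch of these two conditions, using the uniform density on [0, 2] as a made-up example:

```python
# Density p(x) = 0.5 on [0, 2], 0 elsewhere: non-negative, with total area 1.
def p(x):
    return 0.5 if 0.0 <= x <= 2.0 else 0.0

dx = 1e-4
total_area = sum(p(i * dx) * dx for i in range(int(2.0 / dx) + 1))
prob_A = sum(p(i * dx) * dx for i in range(int(0.5 / dx), int(1.5 / dx)))  # P(0.5 < X < 1.5)
print(round(total_area, 3), round(prob_A, 3))   # ~1.0 and ~0.5
```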




A Gaussian random variable is completely determined by its mean and variance.
The function that is frequently used for the area under the tail of the Gaussian pdf (probability density function) is denoted by Q(x):

Q(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-t^2/2} \, dt

The Q-function is a standard form for expressing error probabilities that do not have a closed-form expression.
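A sketch of Q(x) in terms of the complementary error function, using the standard identity Q(x) = 0.5 · erfc(x/√2):

```python
import math

def Q(x):
    """Gaussian tail probability Q(x) = P(Z > x) for Z ~ N(0, 1)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

print(Q(0.0))   # 0.5
print(Q(1.0))   # ~0.1587
print(Q(3.0))   # ~0.00135
```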


BOUNDS ON TAIL PROBABILITY :
General bounds on the tail probability of a random variable (that is, the probability that a random variable deviates far from its expectation) are discussed below.

In probability theory, the Chernoff bound, named after Herman Chernoff, gives exponentially decreasing
bounds on tail distributions of sums of independent random variables. It is a sharper bound than the known first- or second-moment-based tail bounds, such as Markov's inequality or Chebyshev's inequality, which only yield power-law bounds on tail decay. However, the Chernoff bound requires that the variates be independent, a condition that neither the Markov nor the Chebyshev inequality requires.
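A small numerical comparison of the three bounds for the upper tail of a sum of independent fair Bernoulli variables (a sketch; the numbers and the exponential-moment form of the Chernoff bound are standard choices made here for illustration):

```python
import math

# X = sum of n independent fair Bernoulli(1/2) variables; bound P(X >= a).
n, a = 100, 75
mean, var = n * 0.5, n * 0.25

markov = mean / a                             # Markov: P(X >= a) <= E[X]/a
chebyshev = var / (a - mean) ** 2             # Chebyshev: P(|X - mean| >= a - mean) <= var/(a - mean)^2
# Chernoff: P(X >= a) <= min over t > 0 of E[e^{tX}] e^{-ta} = ((1 + e^t)/2)^n e^{-ta}
chernoff = min(((1 + math.exp(t)) / 2) ** n * math.exp(-t * a)
               for t in (i * 0.01 for i in range(1, 500)))

print(markov, chebyshev, chernoff)   # ~0.667, ~0.04, ~2e-6 (Chernoff is far tighter)
```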


