11.1 Introduction
In 1988, Linsker (1988a) proposed a principle of self-organization that is rooted in
information theory. The principle is that the synaptic connections of a multilayered neural
network develop in such a way as to maximize the amount of information that is preserved
when signals are transformed at each processing stage of the network, subject to certain
constraints. The idea that information theory may offer an explanation for perceptual
processing is not new.1 For instance, we may mention an early paper by Attneave (1954),
in which the following information-theoretic function is proposed for the perceptual
system:
A major function of the perceptual machinery is to strip away some of the redundancy of
stimulation, to describe or encode information in a form more economical than that in which
it impinges on the receptors.
The main idea behind Attneave’s paper is the recognition that encoding of data from a
scene for the purpose of redundancy reduction is related to the identification of specific
features in the scene. This important insight is related to a view of the brain described
in Craik (1943), where a model of the external world is constructed so as to incorporate
the regularities and constraints of the world.
Indeed, the application of information-theoretic ideas to the perceptual system has been
a subject of research almost as far back as Shannon’s 1948 classic paper, which laid down
the foundations of information theory. Shannon’s original work on information theory,2
and its refinement by other researchers, was in direct response to the need of electrical
engineers to design communication systems that are both efficient and reliable. In spite
of its practical origins, information theory as we know it today is a deep mathematical
theory concerned with the very essence of the communication process. Indeed, the theory
provides a framework for the study of fundamental issues such as the efficiency of
information representation and the limitations involved in the reliable transmission of
information over a communication channel. Moreover, the theory encompasses a multitude
of powerful theorems for computing ideal bounds on the optimum representation and
transmission of information-bearing signals; these bounds are important because they
1 For a review of the literature on the relation between information theory and perception, see Linsker (1990b) and Atick (1992).
2 For detailed treatment of information theory, see the book by Cover and Thomas (1991); see also Gray (1990). For a collection of papers on the development of information theory (including the 1948 classic paper by Shannon), see Slepian (1973). Shannon’s paper is also reproduced, with minor revisions, in the book by Shannon and Weaver (1949).
For a brief review of the important principles of information theory with neural processing in mind, see Atick (1992). For a treatment of information theory from a biology perspective, see Yockey (1992).
11.2 Shannon’s Information Theory
Can we find a measure of how much “information” is produced by the random variable
x taking on the discrete value x_k? The idea of information is closely related to that of
“uncertainty” or “surprise,” as described next.
11 / Self-Organizing Systems III: Information-Theoretic Models
Suppose that the event x = x_k occurs with probability p_k = 1, which therefore requires
that p_i = 0 for all i ≠ k. In such a situation there is no “surprise” and therefore no
“information” conveyed by the occurrence of the event x = x_k, since we know what the
message must be. If, on the other hand, the various discrete levels were to occur with
different probabilities and, in particular, the probability p_k is low, then there is more
“surprise” and therefore “information” when x takes on the value x_k rather than another
value x_i with higher probability p_i, i ≠ k. Thus, the words “uncertainty,” “surprise,”
and “information” are all related. Before the occurrence of the event x = x_k, there is an
amount of uncertainty. When the event x = x_k occurs, there is an amount of surprise.
After the occurrence of the event x = x_k, there is a gain in the amount of information.
All three amounts are obviously the same. Moreover, the amount of information is
related to the inverse of the probability of occurrence.
Entropy
We define the amount of information gained after observing the event x = x_k with
probability p_k as the logarithmic function

I(x_k) = log(1/p_k) = −log p_k    (11.4)

where the base of the logarithm is arbitrary. When the natural logarithm is used, the units
for information are nats, and when the base 2 logarithm is used, the units are bits. In any
case, the definition of information given in Eq. (11.4) exhibits the following properties:
1.

I(x_k) = 0 for p_k = 1    (11.5)

Obviously, if we are absolutely certain of the outcome of an event, there is no
information gained by its occurrence.
2.

I(x_k) ≥ 0 for 0 ≤ p_k ≤ 1    (11.6)

That is, the occurrence of an event either provides some information or none, but
never brings about a loss of information; the less probable an event is, the more
information we gain through its occurrence.
The amount of information I(x_k) is a discrete random variable with probability p_k. The
mean value of I(x_k) over the complete range of 2K + 1 discrete values is given by

H(x) = E[I(x_k)]
     = −Σ_{k=−K}^{K} p_k log p_k    (11.8)
The quantity H(x) is called the entropy of a random variable x permitted to take on a
finite set of discrete values; it is so called in recognition of the analogy between the
definition given in Eq. (11.8) and that of entropy in statistical thermodynamics. The
entropy H(x) is a measure of the average amount of information conveyed per message.
Note, however, that the x in H(x) is not an argument of a function but rather a label for
a random variable. Note also that in the definition of Eq. (11.8) we take 0 log 0 to be 0.
The entropy H(x) is bounded as follows:

0 ≤ H(x) ≤ log(2K + 1)    (11.9)

where (2K + 1) is the total number of discrete levels. Furthermore, we may make the
following statements:
1. H(x) = 0 if and only if the probability p_k = 1 for some k, and the remaining
probabilities in the set are all zero; this lower bound on entropy corresponds to no
uncertainty.
2. H(x) = log(2K + 1) if and only if p_k = 1/(2K + 1) for all k (i.e., all the discrete
levels are equiprobable); this upper bound on entropy corresponds to maximum
uncertainty.
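The entropy bounds of Eq. (11.9) are easily verified numerically. The following Python sketch (an illustration added here, not part of the text; the probability values are arbitrary) computes H(x) in nats for the two extreme distributions and one intermediate one:

```python
import math

def entropy(p):
    """Shannon entropy H(x) = -sum_k p_k log p_k, in nats; 0 log 0 is taken as 0."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

K = 2                       # discrete levels x_{-K}, ..., x_{K}: 2K + 1 = 5 of them
n_levels = 2 * K + 1

# Certain outcome (p_k = 1 for one k): lower bound H(x) = 0.
h_min = entropy([1.0, 0.0, 0.0, 0.0, 0.0])

# Equiprobable levels (p_k = 1/(2K + 1)): upper bound H(x) = log(2K + 1).
h_max = entropy([1.0 / n_levels] * n_levels)

# Any other distribution lies strictly between the two bounds.
h_mid = entropy([0.5, 0.2, 0.15, 0.1, 0.05])
```
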
The proof of property 2 follows from the following lemma (Gray, 1990):
Given any two probability distributions {p_k} and {q_k} for a discrete random variable
x, then

Σ_k p_k log(p_k/q_k) ≥ 0    (11.10)
The quantity used in this lemma is of such fundamental importance that we pause to
recast it in a form suitable for use in the study of stochastic networks. Let P_α and Q_α
denote the probabilities that a network is in state α under two different operating conditions.
The relative entropy of the probability distribution P_α with respect to the second probability
distribution Q_α is defined by (Gray, 1990)

D = Σ_α P_α log(P_α/Q_α)    (11.11)

where the sum is over all possible states α of the network. The probability distribution
Q_α plays the role of a reference measure. The idea of relative entropy was introduced
by Kullback (1968); it is also referred to in the literature by other names, such as the
Kullback-Leibler information criterion, cross entropy, and directed divergence. The rela-
tive entropy of Eq. (11.11) was used in Chapter 6 to derive a special form of the back-
propagation algorithm, and also in Chapter 8 for the derivation of Boltzmann learning.
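The nonnegativity asserted by the lemma of Eq. (11.10), and the lack of symmetry of the relative entropy, can both be illustrated numerically. The Python sketch below (added here as an illustration; the two three-state distributions are arbitrary choices):

```python
import math

def relative_entropy(p, q):
    """D(P||Q) = sum_a P_a log(P_a / Q_a), with 0 log 0 taken as 0."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0)

# Two arbitrary distributions over the same three network states.
P = [0.5, 0.3, 0.2]
Q = [0.25, 0.25, 0.5]

d_pq = relative_entropy(P, Q)   # nonnegative, per the lemma of Eq. (11.10)
d_pp = relative_entropy(P, P)   # exactly zero when the distributions coincide
d_qp = relative_entropy(Q, P)   # generally differs from d_pq: not symmetric
```
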
Suppose now that we observe the system output y. How can we measure the remaining
uncertainty about x after observing y? To answer this question, we define the conditional
entropy of x given y as follows (Gray, 1990):

H(x|y) = H(x, y) − H(y)    (11.12)

with the property that

0 ≤ H(x|y) ≤ H(x)    (11.13)

The conditional entropy H(x|y) represents the amount of uncertainty remaining about the
system input x after the system output y has been observed.
Since the entropy H(x) represents our uncertainty about the system input before observ-
ing the system output, and the conditional entropy H(x|y) represents our uncertainty about
the system input after observing the system output, it follows that the difference
H(x) − H(x|y) must represent our uncertainty about the system input that is resolved by
observing the system output. This quantity is called the average mutual information
between x and y. Denoting it by I(x; y), we may thus write3

I(x; y) = H(x) − H(x|y)    (11.14)

Entropy is a special case of mutual information, since we have

H(x) = I(x; x)    (11.15)
The mutual information I(x; y) between two discrete random variables x and y has the
following properties (Cover and Thomas, 1991; Gray, 1990).
1. The mutual information between x and y is symmetric; that is,

I(y; x) = I(x; y)    (11.16)

where the mutual information I(y; x) is a measure of the uncertainty about the
system output y that is resolved by observing the system input x, and the mutual
information I(x; y) is a measure of the uncertainty about the system input that is
resolved by observing the system output.
2. The mutual information between x and y is always nonnegative; that is,

I(x; y) ≥ 0    (11.17)

This property, in effect, states that we cannot lose information, on the average, by
observing the system output y. Moreover, the mutual information is zero if and only
if the input and output of the system are statistically independent.
3. The mutual information between x and y may be expressed in terms of the entropy
of y as

I(x; y) = H(y) − H(y|x)    (11.18)

where H(y|x) is a conditional entropy. The right-hand side of Eq. (11.18) is the
ensemble average of the information conveyed by the system output y, minus the
ensemble average of the information conveyed by y given that we already know
the system input x; this latter term conveys information about the processing noise,
rather than about the system input x.
Figure 11.1 provides a diagrammatic interpretation of Eqs. (11.14) and (11.18). The
entropy of the system input x is represented by the circle on the left. The entropy of the
system output y is represented by the circle on the right. The mutual information between
x and y is represented by the overlap between these two circles.

FIGURE 11.1 Illustration of the relations among the mutual information I(x; y) and the
entropies H(x) and H(y).

3 The term I(x; y) was originally referred to as the rate of information transmission by Shannon (1948).
Today, however, this term is commonly referred to as the mutual information between the random variables x
and y.
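The identities of Eqs. (11.12), (11.14), and (11.18) can be checked on any small joint distribution. The Python sketch below (an added illustration; the joint probabilities are arbitrary values) computes all the entropies from a 2 × 2 joint pmf:

```python
import math

def H(p):
    """Entropy of a pmf given as a flat list of probabilities, in nats."""
    return -sum(v * math.log(v) for v in p if v > 0)

# An arbitrary joint pmf p(x, y) on a 2 x 2 alphabet (illustrative values).
pxy = [[0.30, 0.10],
       [0.20, 0.40]]

px = [sum(row) for row in pxy]              # marginal distribution of x
py = [sum(col) for col in zip(*pxy)]        # marginal distribution of y

Hx = H(px)
Hy = H(py)
Hxy = H([v for row in pxy for v in row])    # joint entropy H(x, y)

Hx_given_y = Hxy - Hy                       # Eq. (11.12)
Hy_given_x = Hxy - Hx

I_xy = Hx - Hx_given_y                      # Eq. (11.14)
I_yx = Hy - Hy_given_x                      # Eq. (11.18); equals I_xy by symmetry
```
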
= −∫_{−∞}^{∞} f(x) log f(x) dx − lim_{δx→0} log δx ∫_{−∞}^{∞} f(x) dx

= h(x) − lim_{δx→0} log δx
where, in the last line, we have made use of Eq. (11.19) and the fact that the total area
under the curve of the probability density function f(x) is unity. In the limit as δx
approaches zero, −log δx approaches infinity. This means that the entropy of a continuous
random variable is infinitely large. Intuitively, we would expect this to be true, because
a continuous random variable may assume a value anywhere in the open interval (−∞, ∞),
and the uncertainty associated with the variable is on the order of infinity. We avoid the
problem associated with the term log δx by adopting h(x) as a differential entropy, with
the term −log δx serving as reference. Moreover, since the information processed by a
dynamical system is actually the difference between two entropy terms that have a common
reference, the information will be the same as the difference between the corresponding
differential entropy terms. We are therefore perfectly justified in using the term h(x),
defined in Eq. (11.19), as the differential entropy of the continuous random variable x.
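The decomposition H = h(x) − log δx is easy to exhibit numerically. In the Python sketch below (our illustration), a uniform density on [0, 1), for which h(x) = 0, is quantized at successively finer resolutions; the discrete entropy grows like −log δx while the recovered differential entropy stays at zero:

```python
import math

# Quantize the uniform density f(x) = 1 on [0, 1) into bins of width dx.
# Each bin then carries probability p_k = f(x_k) dx = dx, and the discrete
# entropy splits as H = h(x) - log dx, with h(x) = 0 for this density.
H_values = []
for dx in (0.1, 0.01, 0.001):
    n_bins = int(round(1.0 / dx))
    p = [dx] * n_bins
    H = -sum(pk * math.log(pk) for pk in p)
    H_values.append(H)
    h_estimate = H + math.log(dx)   # recover the differential entropy
```
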
When we have a continuous random vector x consisting of n random variables x_1, x_2,
. . . , x_n, we define the differential entropy of x as the n-fold integral

h(x) = −∫ f(x) log f(x) dx    (11.22)

where f(x) is the joint probability density function of x.
Suppose now that we wish to find the probability density function f(x) of a scalar
random variable x for which the differential entropy h(x) is a maximum, subject to the
two constraints

∫_{−∞}^{∞} f(x) dx = 1    (11.23)

and

∫_{−∞}^{∞} (x − μ)² f(x) dx = σ² = constant    (11.24)

where μ is the mean of x and σ² is its variance. Using the method of Lagrange multipliers,
with the parameters λ_1 and λ_2 as the multipliers, h(x) is a maximum only when the
derivative of the integrand

−f(x) log f(x) + λ_1 f(x) + λ_2 (x − μ)² f(x)

with respect to f(x) is zero.
(11.26)
(1 1.29)
(11.33)
11.3 The Principle of Maximum Information Preservation

Consider the simple signal-plus-noise model of Fig. 11.2, in which the output y of a
single neuron in the output layer is given by

y = Σ_{i=1}^{L} w_i x_i + ν    (11.34)

where w_i is the synaptic weight from source node i in the input layer to the neuron in the
output layer, and ν is the processing noise, as modeled in Fig. 11.2. It is assumed that:
■ The output y of the neuron is a Gaussian random variable with variance σ_y².
■ The processing noise ν is also a Gaussian random variable with zero mean and
variance σ_ν².
■ The processing noise is uncorrelated with any of the input components; that is,

E[ν x_i] = 0 for all i    (11.35)
To proceed with the analysis, we first note from the second line of Eq. (11.30) that
the average mutual information I(y; x) between the output y of the neuron in the output
layer and the input vector x is given by

I(y; x) = h(y) − h(y|x)    (11.36)

In view of Eq. (11.34), we note that the probability density function of y, given the input
vector x, is the same as the probability density function of a constant plus a Gaussian-
distributed random variable ν. Accordingly, the conditional entropy h(y|x) is the “informa-
tion” that the output neuron conveys about the processing noise ν rather than about the
signal vector x. We may thus write

h(y|x) = h(ν)    (11.37)

and therefore rewrite Eq. (11.36) simply as

I(y; x) = h(y) − h(ν)    (11.38)
Applying Eq. (11.27) for the differential entropy of a Gaussian random variable to the
problem at hand, we have

h(y) = ½[1 + log(2πσ_y²)]    (11.39)

and

h(ν) = ½[1 + log(2πσ_ν²)]    (11.40)

Hence the use of Eqs. (11.39) and (11.40) in (11.38) yields, after simplification (Linsker,
1988a),

I(y; x) = ½ log(σ_y²/σ_ν²)    (11.41)
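The result of Eq. (11.41) can be made concrete with a small simulation. In the Python sketch below (added here; the weights and variances are assumed illustrative values, and the inputs are taken to be independent zero-mean Gaussian variables), the mutual information is computed from the formula and the output variance is confirmed by Monte Carlo:

```python
import math
import random

random.seed(0)

w = [0.5, -0.3, 0.8]    # illustrative synaptic weights (assumed values)
sigma_x = 1.0           # each input x_i ~ N(0, sigma_x^2), mutually independent
sigma_v = 0.2           # processing noise v ~ N(0, sigma_v^2)

# Output variance of y = sum_i w_i x_i + v:
var_y = sigma_x ** 2 * sum(wi ** 2 for wi in w) + sigma_v ** 2

# Eq. (11.41): I(y; x) = (1/2) log(sigma_y^2 / sigma_v^2), in nats.
I_formula = 0.5 * math.log(var_y / sigma_v ** 2)

# Sanity check by simulation: the sample variance of y should match var_y.
ys = []
for _ in range(100_000):
    x = [random.gauss(0.0, sigma_x) for _ in w]
    y = sum(wi * xi for wi, xi in zip(w, x)) + random.gauss(0.0, sigma_v)
    ys.append(y)
mean_y = sum(ys) / len(ys)
var_y_hat = sum((yi - mean_y) ** 2 for yi in ys) / len(ys)
```
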
Consider next a second noise model, in which the noise corrupts each input separately,
with the neuron output given by

y = Σ_{i=1}^{L} w_i (x_i + ν_i)    (11.42)

where each noise ν_i is assumed to be an independent Gaussian random variable with zero
mean and variance σ_ν². We may rewrite Eq. (11.42) in a form similar to that of Eq.
(11.34), as shown by

y = Σ_{i=1}^{L} w_i x_i + ν′    (11.43)

where the composite noise term ν′ is defined by

ν′ = Σ_{i=1}^{L} w_i ν_i    (11.44)

Since the ν_i are independent, the composite noise ν′ is Gaussian with zero mean and
variance

σ_ν′² = σ_ν² Σ_{i=1}^{L} w_i²    (11.45)
As before, we assume that the output y of the neuron in the output layer has a Gaussian
distribution with variance σ_y². The average mutual information I(y; x) between y and x
is still given by Eq. (11.36). This time, however, the conditional entropy h(y|x) is defined
by

h(y|x) = h(ν′)
       = ½[1 + log(2πσ_ν′²)]
       = ½[1 + log(2πσ_ν² Σ_{i=1}^{L} w_i²)]    (11.46)

Thus, using Eqs. (11.39) and (11.45) in (11.36), and simplifying terms, we get (Linsker,
1988a)

I(y; x) = ½ log[σ_y²/(σ_ν² Σ_{i=1}^{L} w_i²)]    (11.47)
Under the constraint that the noise variance σ_ν² is maintained constant, the mutual informa-
tion I(y; x) is now maximized by maximizing the ratio σ_y²/Σ_{i=1}^{L} w_i².
These two simple models illustrate that the consequences of the principle of maximum
information preservation are indeed problem-dependent. We next turn our attention to
other issues related to the principle.
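The contrast between the two models can be sketched numerically. In the Python fragment below (our illustration; the weights are assumed values, and the inputs are taken to be independent with a common variance), scaling all the weights up increases the mutual information of the first model, Eq. (11.41), but leaves that of the second model, Eq. (11.47), unchanged:

```python
import math

def info_model_1(w, var_x, var_v):
    """First model: noise v added after the summation, per Eq. (11.41)."""
    var_y = var_x * sum(wi ** 2 for wi in w) + var_v
    return 0.5 * math.log(var_y / var_v)

def info_model_2(w, var_x, var_v):
    """Second model: independent noise on each input, per Eq. (11.47)."""
    sw2 = sum(wi ** 2 for wi in w)
    var_y = (var_x + var_v) * sw2       # independent inputs, common variance
    return 0.5 * math.log(var_y / (var_v * sw2))

w = [0.5, -0.3, 0.8]    # illustrative weights (assumed values)

I1 = info_model_1(w, 1.0, 0.04)
I1_scaled = info_model_1([10.0 * wi for wi in w], 1.0, 0.04)  # grows with ||w||

I2 = info_model_2(w, 1.0, 0.04)
I2_scaled = info_model_2([10.0 * wi for wi in w], 1.0, 0.04)  # scale-invariant
```
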
Consider next the noise model of Fig. 11.4, in which two output neurons receive the
same set of input signals, producing the outputs

y_1 = Σ_i w_{1i} x_i + ν_1    (11.48)

and

y_2 = Σ_i w_{2i} x_i + ν_2    (11.49)

where the w_{1i} are the synaptic weights from the input layer to neuron 1 in the output
layer, and the w_{2i} are similarly defined. Note that the model of Fig. 11.4 may be expanded
to include lateral connections between the output nodes, the effect of which is merely to
modify the weights in the linear combinations of Eqs. (11.48) and (11.49). To proceed
with the analysis, we make the following assumptions (Linsker, 1988a):
with the analysis, we make the following assumptions (Linsker, 1988a):
■ The additive noise terms ν_1 and ν_2 are each Gaussian distributed, with zero mean
and common variance σ_ν². Moreover, these two noise terms are uncorrelated with
each other, as shown by

E[ν_1 ν_2] = 0    (11.50)

■ Each noise term is uncorrelated with any of the input signals, as shown by

E[ν_i x_j] = 0 for all i and j    (11.51)

■ The output signals y_1 and y_2 are both Gaussian random variables with zero mean.
The issue we wish to address is to see how the principle of maximum information
preservation may be used to understand the behavior of the model so described.
The average mutual information between the output vector y = [y_1, y_2]^T and the input
vector x = [x_1, x_2, . . . , x_L]^T is defined by [see the second line of Eq. (11.30)]

I(x; y) = h(y) − h(y|x)    (11.52)

The noise model shown in Fig. 11.4 is an extension of the noise model of Fig. 11.2.
Hence, using a rationale similar to that presented for the model of Fig. 11.2, we may
rewrite Eq. (11.52) in the form

I(x; y) = h(y) − h(ν)    (11.53)

The differential entropies h(ν) and h(y) are each considered in turn; the noise vector ν
is made up of the elements ν_1 and ν_2.
Since the noise terms ν_1 and ν_2 are assumed to be Gaussian and uncorrelated, it follows
that they are also statistically independent of each other. For the differential entropy
h(ν), we therefore have

h(ν) = h(ν_1, ν_2)
     = h(ν_1) + h(ν_2)
     = 1 + log(2πσ_ν²)    (11.54)
where, in the last line, we have made use of Eq. (11.40).
The differential entropy of the output vector y is given by

h(y) = −∫∫ f(y_1, y_2) log f(y_1, y_2) dy_1 dy_2    (11.55)

where f(y_1, y_2) is the joint probability density function of y_1 and y_2. According to the
model of Fig. 11.4, the output signals y_1 and y_2 are both dependent on the same set of
input signals; it follows, therefore, that y_1 and y_2 can be correlated with each other. Let
R denote the correlation matrix of the output vector:

R = E[y y^T]
  = | r_11  r_12 |
    | r_21  r_22 |    (11.56)
where

r_ij = E[y_i y_j],  i, j = 1, 2    (11.57)

Using the definitions given in Eqs. (11.48) and (11.49), and invoking the assumptions
made in Eqs. (11.50) and (11.51), we find that the individual elements of the correlation
matrix R have the following values:

r_11 = σ_1² + σ_ν²    (11.58)

r_22 = σ_2² + σ_ν²    (11.59)

r_12 = r_21 = ρ_12 σ_1 σ_2    (11.60)

where σ_1² and σ_2² are the variances of the noise-free outputs, and ρ_12 is their correlation
coefficient.
where det(R) is the determinant of the correlation matrix R, and R⁻¹ is its inverse. The
differential entropy of an N-dimensional Gaussian vector y so described is given by (Shannon,
1948)

h(y) = ½ log[(2πe)^N det(R)]    (11.62)

where e is the base of the natural logarithm. For the problem at hand, we have N = 2.
Hence, for the output vector y made up of the elements y_1 and y_2, Eq. (11.62) reduces to

h(y) = 1 + log[2π det^{1/2}(R)]    (11.63)

Accordingly, the use of Eqs. (11.54) and (11.63) in (11.53) yields (Linsker, 1988a)

I(x; y) = log[det^{1/2}(R)/σ_ν²]    (11.64)
To maximize the mutual information I(x; y) between the output vector y and the input
vector x for a fixed noise variance σ_ν², we must maximize the determinant of the 2-by-2
correlation matrix R. From Eqs. (11.58) to (11.60) we note that

det(R) = r_11 r_22 − r_12 r_21
       = σ_ν⁴ + σ_ν²(σ_1² + σ_2²) + σ_1²σ_2²(1 − ρ_12²)    (11.65)
Depending on the prevalent value of the noise variance σ_ν², we may identify two distinct
situations (Linsker, 1988a), as described here.
1. Large Noise Variance. When the noise variance σ_ν² is large, the third term on the
right-hand side of Eq. (11.65), which is independent of σ_ν², may be neglected. In such a
case, maximizing det(R) requires that we maximize (σ_1² + σ_2²). In the absence of any
specific constraints, this requirement may be achieved simply by maximizing the variance
σ_1² of the output y_1 or the variance σ_2² of the second output y_2, separately. Since the
variance of the output y_i, i = 1, 2, is equal to σ_i² in the absence of noise and σ_i² + σ_ν²
in the presence of noise, it follows that, in accordance with the principle of maximum
information preservation, the optimum solution for a fixed noise variance is to maximize
the variance of either output, y_1 or y_2.
2. Low Noise Variance. When, on the other hand, the noise variance σ_ν² is small, the
third term σ_1²σ_2²(1 − ρ_12²) on the right-hand side of Eq. (11.65) becomes particularly
important relative to the other two terms. The mutual information I(x; y) is then maximized
by making an optimal trade-off between two options: keeping the output variances σ_1²
and σ_2² large, and making the outputs y_1 and y_2 of the two output neurons uncorrelated.
Based on these observations, we may therefore make the following two statements
(Linsker, 1988a).
■ A high noise level favors redundancy of response, in which case the two output
neurons in the model of Fig. 11.4 compute the same linear combination of inputs,
given that there is only one such combination that yields a response with maximum
variance.
■ A low noise level favors diversity of response, in which case the two output neurons
in the model of Fig. 11.4 compute different linear combinations of inputs, even
though such a choice may result in a reduced output variance.
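These two statements follow directly from Eq. (11.65), as the following Python sketch illustrates (the variance and correlation values are illustrative choices, not derived from a particular input ensemble):

```python
def det_R(var1, var2, rho, var_noise):
    """Eq. (11.65): det(R) for the two-output-neuron model."""
    return (var_noise ** 2 + var_noise * (var1 + var2)
            + var1 * var2 * (1.0 - rho ** 2))

# Two candidate wiring choices for the output pair (illustrative numbers):
redundant = dict(var1=4.0, var2=4.0, rho=1.0)  # both compute the max-variance combination
diverse = dict(var1=3.0, var2=3.0, rho=0.0)    # different, decorrelated combinations

high, low = 100.0, 0.01    # noise variance sigma_v^2

# High noise: the redundant configuration gives the larger det(R).
det_red_high = det_R(var_noise=high, **redundant)
det_div_high = det_R(var_noise=high, **diverse)

# Low noise: the diverse (decorrelated) configuration wins instead.
det_red_low = det_R(var_noise=low, **redundant)
det_div_low = det_R(var_noise=low, **diverse)
```
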
(11.67)

where c_jk is the excitatory strength of the lateral connection from neuron k to neuron j;
it is specified, subject to two requirements:

c_jk ≥ 0 for all j and k    (11.68)

and

Σ_j c_jk = 1 for all k    (11.69)

Lateral inhibitory connections are implicit in the selection of a single firing neuron at
step 3.
The use of Eq. (11.67) in (11.70) yields

P(i|x) = Σ_k c_ik y_k / Σ_j Σ_k c_jk y_k
       = Σ_k c_ik y_k / Σ_k y_k    (11.71)
where, in the second line, we have made use of Eq. (11.69). According to Eq. (11.71),
the conditional probability that neuron i is selected for firing, given the input vector x,
equals the normalized average strength of the active lateral connections to neuron i, with
each excitatory lateral strength c_ik weighted by the feedforward response y_k of neuron k.
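Provided the strengths c_jk satisfy Eqs. (11.68) and (11.69), the quantities P(i|x) of Eq. (11.71) form a proper probability distribution, as the small Python sketch below illustrates (the responses and lateral strengths are assumed values):

```python
y = [0.2, 1.0, 0.5, 0.3]    # assumed feedforward responses y_k

# Excitatory lateral strengths c[i][k] from neuron k to neuron i, chosen so
# that c_ik >= 0 and each column sums to 1, the normalization of Eq. (11.69).
c = [[0.7, 0.1, 0.1, 0.2],
     [0.1, 0.6, 0.2, 0.1],
     [0.1, 0.2, 0.5, 0.2],
     [0.1, 0.1, 0.2, 0.5]]

# Eq. (11.71): P(i|x) = sum_k c_ik y_k / sum_k y_k.
total = sum(y)
P = [sum(cik * yk for cik, yk in zip(row, y)) / total for row in c]
```

Here neuron 1, which has the strongest feedforward response and the strongest self-connection, receives the largest firing probability.
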
The average mutual information I(x; {j}) between the input vector x and the set of
output neurons {j} is defined by

I(x; {j}) = Σ_j ∫ f(x) P(j|x) log[P(j|x)/P(j)] dx    (11.72)

The probability P(j) that node j has fired is defined in terms of the conditional probability
P(j|x) by

P(j) = ∫ f(x) P(j|x) dx    (11.73)

The principle of maximum information preservation requires that the centers of the neurons
be chosen so as to maximize the average mutual information I(x; {j}). By so doing, the
values of the output signals of the network discriminate optimally, in an information-theoretic
sense, among the possible sets of input signals applied to the network.
To be specific, we wish to derive a learning rule that, when averaged over the input
ensemble, performs gradient ascent on I(x; {j}). The derivative of I(x; {j}) with respect
to the center t_i of neuron i is given by (Linsker, 1989b)

(11.75)

(11.76)
In order to reduce boundary effects, periodic boundary conditions are imposed on the
formation of the map. Specifically, the input space and output space are each regarded
as the surface of a torus (Linsker, 1989b); this kind of assumption is commonly made in
image processing. Accordingly, in calculating the Euclidean distance ||x − t_j|| we go the
“short way around” the input space or output space. A similar remark applies to the
calculation of the derivative ∂y_j/∂t_j.
The learning rule for the map generation may now be formulated as follows (Linsker,
1989b).
■ To initialize the algorithm, select the center t_i of neuron i from a uniform distribution
on a square of small side s = 0.7 centered at (0.1, 0.1), say.
■ Select input presentations x from the ensemble X according to the probability density
function f(x).
■ For each input vector in turn, modify the center t_i of neuron i by a small amount
η z_i(x), where η is a small positive constant.
Note that if the centers t_j are permitted to change slowly over many presentations, the
firing incidence of neuron j averaged over an appropriate number of recent input presenta-
tions may provide a suitable approximation to P(j).
For an interpretation of Eq. (11.76), consider neuron j = 1, 2, . . . , N, where N is the
total number of neurons in the network. Let it be supposed that for neuron j we have
(Linsker, 1989b):
1. P(j|x) > P(j); that is, the occurrence of input pattern x is conducive to the firing
of neuron j.
2. c_ji > P(j|x); that is, the excitatory lateral connection from neuron i to neuron j is
stronger than the conditional probability that neuron j would fire given the input x.
Under these two conditions, term j in Eq. (11.76) has the effect of tending to move the
center t_i of neuron i in the direction of increase of the output signal y_i of that neuron. If,
on the other hand, the reverse of condition 2 holds, the effect of term j tends to decrease
the output signal y_i.
We may thus sum up the essence of the information-theoretic learning rule described
here as follows (Linsker, 1989b):
There are three distinct elements apparent in this statement: self-amplification (i.e., Hebb-
like modification), cooperation, and competition among the output nodes of the network
for “territory” in the input space (Linsker, 1989b). It is rather striking that a learning
rule rooted in information theory would embody the three basic principles of self-organization
that were identified in Section 9.2.
11.5 Discussion
Shannon’s information theory is endowed with some important concepts and powerful
theorems that suit it well for the study of different aspects of neural networks. In Chapter
6 we used the idea of relative entropy to derive a special form of the back-propagation
algorithm. In Chapter 8 we used it to derive the Boltzmann learning rule; this latter rule
is basic to the operation of the Boltzmann and mean-field-theory machines.
Information theory is well equipped to explain many aspects of sensory processing
(Atick, 1992). Consider, for example, the information bottleneck phenomenon that may
arise at some point along a sensory pathway, which means that somewhere along the
pathway there exists a restriction on the rate of data flow to higher levels. This phenome-
non may arise because of a limitation on the bandwidth or dynamic range of a neural
link, which is not unlikely given the limited resources available to neurons in the human
brain. Another way in which an information bottleneck can arise is through a
computational bottleneck at higher levels of the sensory pathway, which imposes a limitation
on the number of bits of data per second that can be processed in the requisite analysis.
In previous sections of this chapter we focused much of our attention on mutual
information, which provides the theoretical basis of Linsker’s principle of maximum
information preservation. According to this principle, any layer of a perceptual system
should adapt itself so as to maximize the information conveyed to that layer about the
input patterns. It is striking that, as demonstrated in Section 11.4, this principle exhibits
properties of Hebb-like modification, and cooperative and competitive learning, combined
in a particular way, even though no assumptions are made concerning the form of the
learning rule or its component properties (Linsker, 1989b).
Plumbley and Fallside (1988) change the emphasis in the principle of maximum
information preservation slightly by redefining it as a minimization of information loss.
Expressed in this form, they apply a minimax approach to the information loss, relating
it to the more familiar criterion of minimizing mean-squared error. In particular, they use
this approach to analyze a linear network with a single layer of neurons that performs
dimensionality reduction. The analysis assumes the presence of additive Gaussian noise.
The conclusion drawn from this analysis is that the information loss due to the input-output
mapping is upper-bounded by the entropy of the reconstruction error. Hence the entropy
is minimized by minimizing the mean-square value of the reconstruction error. The net
result is that the information loss limitation problem is changed into one related to principal
components analysis; the subject of principal components analysis was discussed in Chap-
ter 9.
In a subsequent paper, Plumbley and Fallside (1989) examine the goals of the early
stages of a perceptual system before the signal reaches the cortex, and describe their
operation in information-theoretic terms. It is argued that in such a system, optimal
transmission of information involves the temporal and spatial decorrelation of the signal
before transmission. That is, the temporal and spatial power spectra should be uniform
up to the available bandwidth. Recognizing that most real-world signals contain a large
proportion of low spatial and temporal frequencies, it is noted that the processes of
temporal adaptation and spatial lateral inhibition are required to filter the input signal so
as to produce the desired flat output power spectrum.
In Chapter 9 we also discussed how Linsker’s layered feedforward model, based
on Hebbian learning, may be used to explain the emergence of different receptive fields
in early visual systems. The results presented in that chapter were all based on computer
simulations. Tang (1991) has extended Linsker’s layered model by using information-theoretic
ideas to derive analytic solutions for the problems posed by Linsker’s
model. Tang’s theory builds on the principle of maximum information preservation.
Moreover, it is proposed that the layered network may be an active multimode
communication channel that participates in the decision-making process, with the result
that some modes of the input are selected for their relevant functional characteristics.
Applications of the theory to edge detection and motion detection are demonstrated in
Tang (1991).
Linsker (1993) extends the principle of maximum information preservation by applying
it to a class of weakly nonlinear input-output mappings. The weakly nonlinear input-
output relation considered by Linsker is defined by

y_j = v_j + ε u_j + Σ_i w_ji ν_i + ν_j′    (11.77)
where y_j is the output of neuron j and v_j is the internal activity level of neuron j, defined
in terms of the inputs x_i by

v_j = Σ_i w_ji x_i    (11.78)

The constant ε is assumed to be small. The noise at the ith input of neuron j is denoted
by ν_i, and that at its output by ν_j′. On the basis of this nonlinear, noisy model of a neuron,
it is shown that the application of the principle of maximum information preservation results
in the generation of optimum filters with some interrelated properties, as summarized here
(Linsker, 1993):
■ A type of sparse coding, exemplified by the concentration of activity among a
relatively small number of neurons
■ Sensitivity to higher-order statistics of the input
■ Emergence of a multiresolution capability, with subsampling at spatial frequencies, as
favored solutions for certain types of input ensembles
Such a model may be useful for a better understanding of the formation of place-coded
representations in biological systems (Linsker, 1993).
Contextual Supervision
Kay (1992) has considered a neural network for which the input data may be divided into
two sets: a primary input denoted by the vector x_1 of dimension p_1, and a contextual input
denoted by the vector x_2 of dimension p_2. In practice, there would be several realizations of
the overall input vector x = (x_1, x_2) available for use. The goal of the exercise is to
discover the structure of the information contained in the primary vector x_1 in light of
the contextual input vector x_2 used as a “teacher.” The key question is: Given the vectors
x_1 and x_2, how do we proceed? As a first step, we may wish to find those linear functions
of the primary input vector x_1 that are maximally correlated with the contextual input
vector x_2. The strength of this relationship is measured using the average mutual informa-
tion between x_1 and x_2. Let
y_i = a_i^T x_1    (11.79)

and

z_i = b_i^T x_2    (11.80)

A case of interest arises when the input vector x is random and follows a multivariate,
elliptically symmetric probability model of the form

c f((x − μ)^T Σ⁻¹(x − μ)),  x ∈ R^{p_1+p_2}    (11.81)

where c is a constant, μ and Σ are the mean vector and covariance matrix of x, respectively,
and f(·) is a real-valued function. The covariance matrix Σ may be written as
Σ = | Σ_11  Σ_12 |
    | Σ_21  Σ_22 |    (11.82)

where

Σ_11 = cov[x_1]
Σ_22 = cov[x_2]
Σ_12 = cov[x_1, x_2] = Σ_21^T
It is assumed that the covariance matrix Σ is positive definite, which implies that the
inverse matrices Σ_11⁻¹ and Σ_22⁻¹ exist. For the particular situation described here, it may be
shown that the mutual information I(x_1; x_2) between x_1 and x_2 is linked to the mutual
information I(y_i; z_i) between y_i and z_i as follows (Kay, 1992):

(11.83)

where r is the rank of the cross-covariance matrix Σ_12, and ρ_i is the normalized correlation
coefficient between y_i and z_i for i = 1, 2, . . . , r. The set of correlations {ρ_i | i = 1, 2, . . . ,
r} are called the canonical correlations between x_1 and x_2 (Rao, 1973). The definition of
canonical correlations hinges on the mutual uncorrelatedness of the pairs {y_i, z_i | i = 1, 2,
. . . , r}. The property described in Eq. (11.83) may be exploited to form a simple index
for finding the number of pairs required to summarize the structure of linear dependence
between the input vectors x_1 and x_2. Following the earlier work of Sanger (1989b) on
principal components analysis that was described in Section 9.7, Kay (1992) proposes a
neural network for computing the canonical correlations, based on the following pair of
recursive equations:
recursive equations:
ai(n + 1) = ai(n) + η(n) zi(n)[g_{i−1}(y) − yi²(n)] x1(n)   (11.84)

bi(n + 1) = bi(n) + η(n) yi(n)[h_{i−1}(z) − zi²(n)] x2(n)   (11.85)

where n denotes discrete time, η(n) is a learning-rate parameter that tends to zero at a
suitable rate as n → ∞, and

gi(y) = g_{i−1}(y) − yi²(n)   (11.86)

hi(z) = h_{i−1}(z) − zi²(n)   (11.87)

For the initialization of the algorithm, we set

g0(y) = h0(z) = 1   (11.88)
To realize a local implementation of the recursions described in Eqs. (11.84) and (11.85),
we may arrange the {yi(n), zi(n)} in pairs, and then update ai(n) and bi(n) using outputs
yi(n) and zi(n) plus a local lateral connection between neighboring output pairs. The ith
canonical correlation ρi(n) is obtained from the product yi(n)zi(n). The feature discovery
procedure under contextual supervision described here has potential applications involving
the fusion of data from two different two-dimensional projections of a three-dimensional
object, or the integration of visual and aural data recorded from the same stimulus (Kay,
1992).
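The batch computation that such an adaptive rule approximates can be written down directly: whiten each block with Σ11^(−1/2) and Σ22^(−1/2), and take the singular values of the whitened cross-covariance as the canonical correlations. The sketch below (an illustrative check of our own construction, not Kay's network; the synthetic data model is an assumption) also confirms that, for Gaussian data, summing −(1/2) log(1 − ρi²) over the canonical pairs agrees with the direct determinant formula for the mutual information:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p1, p2 = 20000, 3, 2

# Primary input x1 and contextual input x2 sharing a common latent signal.
shared = rng.standard_normal((n, p2))
x1 = np.hstack([shared, rng.standard_normal((n, p1 - p2))]) \
     + 0.5 * rng.standard_normal((n, p1))
x2 = shared + 0.5 * rng.standard_normal((n, p2))

# Partitioned sample covariance of the stacked input x = (x1, x2).
sigma = np.cov(np.hstack([x1, x2]), rowvar=False)
s11, s22, s12 = sigma[:p1, :p1], sigma[p1:, p1:], sigma[:p1, p1:]

# Whiten each block and take singular values of the whitened
# cross-covariance: these are the canonical correlations rho_i.
w1 = np.linalg.inv(np.linalg.cholesky(s11))
w2 = np.linalg.inv(np.linalg.cholesky(s22))
rho = np.linalg.svd(w1 @ s12 @ w2.T, compute_uv=False)

# For Gaussian data the mutual information decomposes over the canonical
# pairs (Eq. 11.83); compare against the determinant formula for I(x1; x2).
mi_blocks = -0.5 * np.sum(np.log(1.0 - rho**2))
mi_direct = 0.5 * (np.linalg.slogdet(s11)[1] + np.linalg.slogdet(s22)[1]
                   - np.linalg.slogdet(sigma)[1])
print(np.isclose(mi_blocks, mi_direct))  # → True
```

The agreement between the two mutual-information estimates holds because det Σ = det Σ11 · det Σ22 · Π(1 − ρi²), a matrix identity that is exact even for the sample covariances used here.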
Spatially Coherent Features
Becker and Hinton (1992) have developed an unsupervised learning procedure for discovering
higher-order features that exhibit simple coherence across space in such a way that the
representation of information in one spatially localized patch of the image makes it easy
to produce the representation of information in neighboring patches. Becker and Hinton
use this approach to design a network made up of a number of modules receiving inputs
from nearby patches of the image, with each module learning to discover higher-order
features for displaying spatial coherence.
Consider Fig. 11.5, which shows two modules receiving inputs from adjacent, nonover-
lapping patches of an image. Let ya and yb denote the outputs of the two modules.
The unsupervised learning procedure developed by Becker and Hinton (1992) for image
processing is to maximize the mutual information between the network outputs ya and yb.
This mutual information, in light of the last line of Eq. (11.30), is defined by

I(ya; yb) = h(ya) + h(yb) − h(ya, yb)   (11.89)

where h(ya) and h(yb) are the entropies of ya and yb, respectively, and h(ya, yb) is their
joint entropy. The requirement is to maximize the mutual information I(ya; yb) with respect
to the synaptic weights of both modules in Fig. 11.5. Under Gaussian assumptions, a
simple objective function based on I(ya; yb) can be derived for the self-organized learning
process. Specifically, it is assumed that both modules receive input that is produced by
some common signal s corrupted by independent Gaussian noise in each input patch, and
that the modules transform the input into outputs ya and yb that are noisy versions of the
common signal s, as shown by

ya = s + νa   (11.90)

and

yb = s + νb   (11.91)
where νa and νb are the additive noise components. According to this model, the two
modules make consistent assumptions about each other.

FIGURE 11.5 Two modules receiving inputs from adjacent image patches 1 and 2; the
synaptic weights of the modules are adjusted so as to maximize the mutual information
I(ya; yb) between the outputs ya and yb.

It may be readily shown that the mutual information between ya and yb is given by

I(ya; yb) = (1/2) log( var[ya] var[yb] / (var[ya] var[yb] − cov²[ya, yb]) )   (11.92)

where var[·] denotes the variance and cov[·, ·] the covariance of the enclosed random
variables.
Under the same model, a simpler measure of performance that appears to work in practice
as well as I(ya; yb) is the following (Becker, 1992):

I* = (1/2) log( var[ya + yb] / var[ya − yb] )   (11.93)

The mutual information I* defines the amount of information that the average of the outputs
ya and yb conveys about the common underlying signal s (i.e., the particular feature that
is coherent in the two input patches). By maximizing the mutual information I*, the two
modules are forced to extract as pure a version of the underlying common signal as
possible. An algorithm based on this maximization has been used by Becker and Hinton
(1992) to discover higher-order features such as stereo disparity and surface curvature in
random-dot stereograms. Moreover, under the assumption of a Gaussian mixture model
of the underlying coherent feature, the algorithm has been extended to develop population
codes of spatially coherent features such as stereo disparity (Becker and Hinton, 1992),
and to model the locations of discontinuities in depth (Becker and Hinton, 1993); Gaussian
mixture models were discussed in Section 7.10.
It is of interest to note that in the case of linear networks assuming a multidimensional
Gaussian distribution, the self-organizing model due to Becker and Hinton is equivalent
to canonical correlations between the inputs (Becker, 1992). A similar result is obtained
in the model described in Section 11.5 due to Kay (1992). However, while Kay's model
is an attractive model for self-organization because of its simple learning rule, the Becker-
Hinton model is more general in that it can be applied to multilayer networks.
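To make the role of the simpler measure concrete, here is a minimal, self-contained toy (not Becker and Hinton's original network; the patch data, weight vectors wa and wb, learning rate, and iteration count are all illustrative assumptions) that performs gradient ascent on I* = (1/2) log(var[ya + yb]/var[ya − yb]) for two linear modules viewing adjacent patches that share a common feature s:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 5000, 4

# Two adjacent "patches": each carries the same scalar feature s in its
# first component, buried in independent Gaussian noise.
s = rng.standard_normal(n)
xa = rng.standard_normal((n, p)); xa[:, 0] += s
xb = rng.standard_normal((n, p)); xb[:, 0] += s
xa -= xa.mean(axis=0); xb -= xb.mean(axis=0)

# Two linear modules with weight vectors wa and wb (toy initialization).
wa = np.full(p, 0.1)
wb = np.full(p, 0.1)
lr = 0.05

for _ in range(2000):
    ya, yb = xa @ wa, xb @ wb
    plus, minus = ya + yb, ya - yb
    vp, vm = plus.var(), minus.var()
    # Gradient ascent on I* = 0.5 * log(vp / vm) with respect to wa, wb.
    ga = (xa * plus[:, None]).mean(axis=0) / vp \
         - (xa * minus[:, None]).mean(axis=0) / vm
    gb = (xb * plus[:, None]).mean(axis=0) / vp \
         + (xb * minus[:, None]).mean(axis=0) / vm
    wa += lr * ga
    wb += lr * gb

rho = np.corrcoef(xa @ wa, xb @ wb)[0, 1]
print(round(float(rho), 3))
```

Because I* rewards agreement between the two outputs while penalizing their disagreement, both weight vectors converge on the input component that is coherent across the patches, and the output correlation approaches the ceiling set by the independent noise (about 0.5 for this data model).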
Signal Separation
Inspired by the earlier work of Becker and Hinton (1989, 1992) on the extraction of
spatially coherent features just described, Ukrainec and Haykin (1992) have successfully
developed an information-theoretic model for the enhancement of a polarization target in
dual-polarized radar images. The sample radar scene used in the study is described as
follows. An incoherent radar transmits in a horizontally polarized fashion, and receives
radar returns on both horizontal and vertical polarization channels. The target of interest
is a cooperative, polarization-twisting reflector designed to rotate the incident polarization
through 90°. In the normal operation of a radar system, the detection of such a target is
made difficult by imperfections in the system as well as reflections from unwanted
polarimetric targets on the ground (i.e., radar clutter). It is perceived that a nonlinear mapping
is needed to account for the non-Gaussian distribution common to radar returns. The
target enhancement problem is cast as a variational problem involving the minimization
of a quadratic cost functional with constraints. The net result is a processed cross-polarized
image that exhibits a significant improvement in target visibility, far more pronounced
than that attainable through the use of a linear technique such as principal components
analysis.
The model used by Ukrainec and Haykin assumes Gaussian statistics for the transformed
data, since a model-free estimate of the probability density function is a computationally
challenging task. The mutual information between two Gaussian variables ya and yb is
defined by (see Problem 11.4):
FIGURE 11.6 Block diagram of a neural processor, the goal of which is to suppress
background clutter using a pair of polarimetric, noncoherent radar inputs; clutter
suppression is attained by minimizing the average mutual information between the
outputs of the two modules.
I(ya; yb) = −(1/2) log(1 − ρab²)   (11.94)
where ρab is the correlation coefficient between the two random variables ya and yb. To
learn the synaptic weights of the two modules, a variational approach is taken. The
requirement is to suppress the radar clutter that is common to the horizontally polarized
and vertically polarized radar images. To satisfy this requirement, we minimize the mutual
information I(ya; yb), subject to a constraint imposed on the synaptic weights as shown
by

P = (tr[W^T W] − 1)²   (11.95)

where W is the overall weight matrix of the network, and tr[·] is the trace of the enclosed
matrix. A stationary point is reached when the gradient of the cost functional, augmented
by the constraint P, vanishes with respect to the synaptic weights.
FIGURE 11.7a Raw B-scan radar images (azimuth plotted versus range) for
horizontal-horizontal (top) and horizontal-vertical (bottom) polarizations.

The centers of the Gaussian radial-basis functions were chosen at evenly spaced intervals
to cover the entire input domain, and their widths were chosen using a heuristic.
Figure 11.7a shows the raw horizontally polarized and vertically polarized (both on
receive) radar images of a parklike setting on the shore of Lake Ontario. The range
coordinate is along the horizontal axis of each image, increasing from left to right; the
azimuth coordinate is along the vertical axis, increasing down the image. Figure 11.7b
shows the combined image obtained by minimizing the mutual information between the
horizontally and vertically polarized radar images, as just described. The bright spot
clearly visible in this image corresponds to the radar return from a cooperative,
polarization-twisting reflector placed along the lake shore. The clutter-suppression performance of
the information-theoretic model described here exceeds that of commonly used projections
using principal components analysis (Ukrainec and Haykin, 1992).
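A drastically simplified sketch of this constrained minimization is given below (illustrative only: the actual model operates on radar imagery through Gaussian radial-basis functions, whereas here two linear units, synthetic "clutter" data, and a generic optimizer stand in for it). Minimizing the Gaussian mutual information of Eq. (11.94) plus the weight constraint of Eq. (11.95) decorrelates the two outputs, suppressing the component they have in common:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n, p = 4000, 3

# Both channels observe the same "clutter" c plus independent components.
c = rng.standard_normal(n)
xa = rng.standard_normal((n, p)); xa[:, 0] += 2.0 * c
xb = rng.standard_normal((n, p)); xb[:, 0] += 2.0 * c

def cost(w):
    # Mutual information I(ya; yb) = -0.5*log(1 - rho^2), Eq. (11.94),
    # plus the weight-norm penalty P = (tr[W^T W] - 1)^2, Eq. (11.95).
    ya, yb = xa @ w[:p], xb @ w[p:]
    cmat = np.cov(ya, yb)
    rho2 = cmat[0, 1] ** 2 / (cmat[0, 0] * cmat[1, 1] + 1e-12)
    rho2 = min(rho2, 1.0 - 1e-9)
    return -0.5 * np.log(1.0 - rho2) + (np.dot(w, w) - 1.0) ** 2

w0 = rng.standard_normal(2 * p)
res = minimize(cost, w0, method="Nelder-Mead",
               options={"maxiter": 20000, "maxfev": 20000,
                        "xatol": 1e-9, "fatol": 1e-12})
rho = np.corrcoef(xa @ res.x[:p], xb @ res.x[p:])[0, 1]
print(abs(rho))
```

The minimizer steers both weight vectors away from the shared clutter direction, so the residual output correlation is small; in the radar application the analogous optimization leaves the polarization-twisting target as the dominant uncommon component.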
Another Information-Theoretic Model of the Perceptual System

FIGURE 11.8 Model of the perceptual system; the input signal s is transformed into an
output y that is carried by the optic nerve.

Atick and Redlich have developed a model of the perceptual system based on the principle
of minimum redundancy, which has its roots in Barlow's hypothesis
(Barlow, 1989). The idea behind the approach is based on the fact that light signals
arriving at the retina contain useful sensory information in a highly redundant form.
Moreover, it is hypothesized that the purpose of retinal signal processing is to reduce or
eliminate the “redundant” bits of data due to both correlations and noise, before sending
the signal along the optic nerve. To quantify this notion, Atick and Redlich introduced
the redundancy measure

R = 1 − I(y; s)/C(y)   (11.99)
where I(y; s) is the mutual information between y and s, and C(y) is the channel capacity
of the optic nerve (output channel). Equation (11.99) is justified on the grounds that the
information the brain is interested in is the ideal signal s, while the physical channel
through which this information must pass is in reality the optic nerve. It is assumed
that there is no dimensionality reduction in the input-output mapping performed by the
perceptual system, which means that C(y) > I(y; s). The requirement is to find an
input-output mapping (i.e., matrix A) that minimizes the redundancy measure R, subject
to the constraint of no information loss, as shown by

I(y; x) = I(x; x) − ε   (11.100)

where ε is some small positive parameter. The channel capacity C(y) is defined as the
maximum rate of information flow possible through the optic nerve, ranging over all
probability distributions of inputs applied to it, and keeping the average input power fixed.
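As a numerical illustration (a toy of our own construction, not from Atick and Redlich's papers), consider a scalar Gaussian channel in which only part of the available power budget carries useful signal; the redundancy measure then quantifies the fraction of the channel capacity not used to convey information about s:

```python
import numpy as np

def redundancy(signal_power, total_power, noise_power):
    """Eq. (11.99): R = 1 - I(y; s)/C(y) for a scalar Gaussian channel.

    I(y; s) is the information actually conveyed about the signal, while
    C(y) is the capacity of the same channel at the full power budget.
    """
    i_ys = 0.5 * np.log(1.0 + signal_power / noise_power)
    c_y = 0.5 * np.log(1.0 + total_power / noise_power)
    return 1.0 - i_ys / c_y

# Only half of the channel's power budget carries useful signal.
print(round(redundancy(signal_power=5.0, total_power=10.0, noise_power=1.0), 3))  # → 0.253

# Redundancy vanishes as the signal uses the entire budget.
print(round(redundancy(signal_power=10.0, total_power=10.0, noise_power=1.0), 3))  # → 0.0
```

Minimizing R while holding I(y; s) fixed, as in Eq. (11.100), amounts to shrinking the capacity actually consumed, which is the sense in which redundant structure is stripped from the signal before transmission.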
When the signal vector s and the output vector y have the same dimensionality, and
there is noise in the system, Atick and Redlich’s principle of minimum redundancy
and Linsker’s principle of maximum information preservation are essentially equivalent,
provided that a similar constraint is imposed on the computational capability of the output
neurons in both cases (Linsker, 1992). To be specific, suppose that [as in Atick and
Redlich (1990b)] the channel capacity is measured in terms of the dynamic range of the
output of each neuron in the model of Fig. 11.8. Then, according to the principle of
minimum redundancy, the quantity to be minimized is

1 − I(y; s)/C(y)

for a given permissible information loss, and therefore for a given I(y; s). Thus the quantity
to be minimized is essentially

F1(y; s) = C(y) − λI(y; s)   (11.101)

where λ is a Lagrange multiplier. On the other hand, according to the principle of maximum
information preservation, the quantity to be maximized in the model of Fig. 11.8 is

F2(y; s) = I(y; s) − λC(y)   (11.102)
Although the functions F1(y; s) and F2(y; s) are different, their optimizations yield identical
results: They are both formulations of the method of Lagrange multipliers, with the roles
of I(y; s) and C(y) being simply interchanged.
Thus, despite the differences between the principle of minimum redundancy and the
principle of maximum information preservation, they lead to similar results. It is of interest
to note that in an independent study, Nadal and Parga (1993) reach a similar conclusion;
they consider a single-layer perceptron (using the McCulloch-Pitts model of a neuron)
in the context of an unsupervised learning task.
PROBLEMS
11.1 Derive the properties of the average mutual information I(x; y) between two continuous-valued
random variables x and y as shown in Eqs. (11.30) through (11.32).
11.2 Consider two channels whose outputs are represented by the random variables x
and y. The requirement is to maximize the mutual information between x and y.
Show that this requirement is achieved by satisfying two conditions:
(a) The probability of occurrence of x or that of y is 0.5.
(b) The joint probability distribution of x and y is concentrated in a small region
of the probability space.
11.3 Prove the result described in Eq. (11.62) for the relative entropy of an N-dimensional
Gaussian distribution.
11.4 Consider two random variables, ya and yb, that are noisy versions of some underlying
signal s, as described by

ya = s + νa,   yb = s + νb

where νa and νb denote additive, independent, and zero-mean Gaussian noise compo-
nents.
(a) Show that the average mutual information between ya and yb is given by

I(ya; yb) = −(1/2) log(1 − ρab²)

where ρab is the correlation coefficient between ya and yb; that is,

ρab = cov[ya, yb] / (√var[ya] √var[yb])
(b) What is the condition for which the average mutual information I(ya; yb) is
(i) maximum and (ii) minimum?
11.5 Consider the average mutual information I(x; {j}) defined in Eq. (11.72). Show
that the derivative of I(x; {j}) with respect to center ti of node i is as shown in Eq.
(11.75) with vector zi(x) defined in Eq. (11.76).
11.6 Show that the topologically ordered map produced by applying the principle of
maximum information preservation satisfies the ideal condition of Eq. (10.29) for
the map magnification factor. You may refer to Linsker (1989b), where this issue
is discussed in detail.
11.7 In this problem we use the information-theoretic rule described in Section 11.5 to
generate a topologically ordered map for a two-dimensional uniform distribution.
At each iteration of the algorithm, the center tij of node (i, j) is changed by a small
amount

Δtij = (1/K) Σ_{k=1}^{K} zij(xk),   i, j = 1, 2, . . . , 10

where {xk} is an ensemble of input vectors; see Eq. (11.76). The parameter values
are as follows (Linsker, 1989b):
Ly = 20
p=;
K = 900
Starting with a random selection of centers tij, plot the map produced by the
learning rule after 0, 10, 15, and 40 iterations.
11.8 Consider an input vector x made up of a primary component x1 and a contextual
component x2. Define

yi = ai^T x1
zi = bi^T x2

Show that the mutual information between x1 and x2 is related to the mutual
information between yi and zi as described in Eq. (11.83). Assume the probability
model of Eq. (11.81) for the input vector x.
11.9 Equation (11.92) defines the mutual information between two continuous random
variables ya and yb, assuming the Gaussian model described by Eqs. (11.90) and
(11.91). Derive the formula of Eq. (11.92).
Equation (11.93) provides a simpler measure of performance than that of Eq.
(11.92). Derive the alternative formula of Eq. (11.93).