
Principal Components Analysis & Independent Components Analysis

Aaron Clarke SN: 206071237 Prof. Robert Cribbie Statistics 6130

Introduction

A common problem in information theory is that of representing a message space with the smallest possible set of message components (Cottrell et al., 1987; Oja, 1983). That is, given a particular set of possible messages, the aim is to find a basis set of message components that can be used to form every message. For example, Morse code forms a basis set for the set of possible messages that can be transmitted by Morse code. If one wanted to form the message SOS, one would simply combine the element for S (...) with the element for O (---), so that the full message would be ...---... (three dots, three dashes, three dots). If the only message that anyone ever sent by Morse code were ...---..., then ... and --- would be the basis set for the set of all messages sent by Morse code.

The question then arises: is there a basis set for the set of images that the human visual system was likely to encounter in its evolutionary environment? If so, one would expect the human visual system to be adapted to perceive this basis set optimally and to use it to reconstruct observed image information. It is already known that Fourier analysis can be used to decompose any given image into a set of spatial frequency components of varying phase, orientation and amplitude. However, the elements of the set of all possible spatial frequency components are not equally distributed in natural images (Field, 1987). Thus one would expect that there might exist a smaller basis set of spatial frequency components that could be used to compose the set of natural images that the human visual system is exposed to. Evidence supporting this theory can be found in the neurophysiological literature, where it has been shown that single neurons in the visual cortex respond to a finite set of Gaussian-enveloped Fourier components (called Gabor patches) of particular spatial frequencies and orientations (Hubel & Wiesel, 1968). Thus, it seems that the visual system somehow de-correlates the incoming visual information to produce a useful basis set of image components with which to filter the incoming images. A possible method for computing this basis set lies in principal components analysis (PCA).

PCA

The idea behind principal components analysis is that a given message set, or data set, is linearly transformed into a lower-dimensional data set with the property that the transformed variables are mutually uncorrelated (Gill, 2002). PCA generates a basis set for a set of messages by rotating the message data in the sample space of observed messages (Gill, 2002). For example, in the figure below, the original data are presented on the left and the PCA rotated data are presented on the right.

Figure 1: Left: original data. Right: PCA rotated data.

Note that the shape of the distribution is preserved, but the regression line through the PCA rotated data set is now aligned with the x-axis, thereby de-correlating the data. This is achieved mathematically by representing each message as a column vector U_i and by placing each column vector in a matrix X:

$$
U_i = \begin{bmatrix} U_{i1} \\ U_{i2} \\ \vdots \\ U_{im} \end{bmatrix}
\quad \text{and} \quad
X = \begin{bmatrix}
U_{11} & U_{21} & U_{31} & \cdots & U_{n1} \\
U_{12} & U_{22} & U_{32} & \cdots & U_{n2} \\
\vdots & \vdots & \vdots & & \vdots \\
U_{1m} & U_{2m} & U_{3m} & \cdots & U_{nm}
\end{bmatrix}
$$

Each row of a message column is treated as a separate variable, and each variable defines a separate axis (Gill, 2002). The variance-covariance matrix R for the matrix of message columns X is computed first, and then the eigenvector matrix E and the diagonal matrix of eigenvalues of R are computed (Gill, 2002). The matrix of eigenvalues is the variance-covariance matrix of the data after the rotation defined by the principal components (Gill, 2002). Each eigenvalue is therefore the variance of one principal component, with the first principal component accounting for the largest variance (Gill, 2002). The eigenvector matrix E provides the transformation of the data points from the original message matrix X to the PCA metric Y through simple matrix multiplication (Gill, 2002):

$$
Y = XE
$$

Here, the principal component scores matrix equals the original matrix multiplied by the eigenvector matrix.

For a practical example of how PCA can be used in images, suppose that one were given the set of faces illustrated in figure 2.
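The steps just described (variance-covariance matrix, eigendecomposition, Y = XE) can be sketched on a small synthetic data set before turning to the faces. This is only an illustrative sketch: the data values are made up, and it uses the base function `eig` rather than the `pcacov` call used in Appendix A.

```matlab
% Toy data (illustrative values): 6 observations (rows) of
% 2 correlated variables (columns).
X = [2.5 2.4; 0.5 0.7; 2.2 2.9; 1.9 2.2; 3.1 3.0; 2.3 2.7];

% Center each column so that the scores have zero mean.
Xc = X - repmat(mean(X), size(X, 1), 1);

% Variance-covariance matrix R of the data.
R = cov(Xc);

% Eigenvectors (columns of E) and eigenvalues (diagonal of D) of R.
[E, D] = eig(R);

% Sort so that the first component has the largest variance.
[lambda, order] = sort(diag(D), 'descend');
E = E(:, order);

% Principal component scores, Y = X*E (applied to the centered data).
Y = Xc * E;

% The variance-covariance matrix of the scores is diagonal, with the
% eigenvalues on the diagonal: the rotated variables are uncorrelated.
C = cov(Y);
disp(C);
disp(lambda');
```

The off-diagonal entries of `C` should be zero up to rounding error, and its diagonal should match `lambda`, which is the sense in which the eigenvalues are the variances of the principal components.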

Figure 2: Original set of faces.

If this set represented the full set of faces that a visual system was exposed to, then one could compute the set of basis faces required to fully represent those faces. This basis set is shown in figure 3.

Figure 3: Complete basis set (i.e. the principal components) for the set of faces given in figure 2. Going from left to right and from top to bottom, the percentages of variance accounted for by each face are: 72.9242%, 6.8400%, 4.7808%, 3.4802%, 2.9461%, 2.1323%, 2.0361%, 1.8905%, 1.6516% and 1.3180%.

One can also see what percentage of the total variance in the face set is accounted for by each individual basis face. The first basis face accounts for most of the variance in the face set (~73%). Subjectively, this face looks the most like a face of any in the set. The next face accounts for a much smaller percentage of the total variance in the face set, as do all of the subsequent faces. If one were to set an arbitrary cut-off of 2% for the variance explained by a basis face, then one could roughly represent all of the faces in the original face set using only the first 7 basis faces, as can be seen in figure 4.
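The 2% cut-off can be checked directly from the percentages listed in the caption of figure 3. The short sketch below uses those values; the variable names are illustrative and do not come from the appendix code.

```matlab
% Percent of variance explained by each basis face (values from figure 3).
explained = [72.9242 6.8400 4.7808 3.4802 2.9461 ...
             2.1323 2.0361 1.8905 1.6516 1.3180];

cutoff = 2;                          % cut-off, in percent
nKeep = sum(explained >= cutoff);    % number of basis faces retained
cumulative = cumsum(explained);      % running total of variance explained

fprintf('Basis faces kept: %d\n', nKeep);
fprintf('Variance retained: %.2f%%\n', cumulative(nKeep));
```

With these values, seven basis faces pass the cut-off and together account for roughly 95% of the variance in the face set.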

Figure 4: Formulation of the first face using the basis set. The top left-hand face was composed using only the first principal component. The face to the right of it was composed using the first two principal components. This pattern continues from left to right and from top to bottom. The last face uses all ten principal components and is exactly the same as the original face. Note that an excellent approximation is achieved using 7 or more of the 10 basis faces.

The Matlab code that I wrote to compute the basis faces can be found in Appendix A.

One benefit of using PCA, then, is that it allows information to be compressed without the loss of the subjective qualities of the information. Specifically, if one wanted to transmit the full set of faces given in figure 2, one would need only to transmit the first seven basis faces from figure 3 and the amplitudes by which each basis face must be multiplied in order to regenerate each original face. In this case, the use of PCA results in roughly a 30% reduction in the amount of information that would need to be sent. This procedure has been extended by other researchers to massive sets of natural images, where it was found that the components of the natural images tended to resemble the Gaussian-enveloped Fourier components noted by Hubel & Wiesel (1968) to be the optimal stimuli for exciting neurons in the visual cortex (Olshausen & Field, 1997).
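The figure of roughly 30% can be recovered with a simple count. This is a sketch under stated assumptions: P denotes the number of pixels per face (not given in the text), each pixel value and each amplitude is counted as one number, and any means or standard deviations needed to undo the normalization are ignored.

$$
\text{uncompressed} = 10P, \qquad \text{compressed} = \underbrace{7P}_{\text{basis faces}} + \underbrace{7 \times 10}_{\text{amplitudes}} = 7P + 70
$$

$$
\frac{\text{compressed}}{\text{uncompressed}} = \frac{7P + 70}{10P} = 0.7 + \frac{7}{P} \approx 0.7 \quad \text{for large } P
$$

Because an image contains many more pixels than there are faces, the amplitude term is negligible and the saving approaches 30%.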

ICA

Assume that one has the following neural network:

X1, X2, ..., Xm → Neural model → Y1, Y2, ..., Ym

where the column vector X represents the sensory inputs from an external stimulus U, and

$$
U = \begin{bmatrix} U_1 \\ U_2 \\ \vdots \\ U_m \end{bmatrix}
$$

If the external stimulus U is subject to mixing, where A is a mixing matrix of size m-by-m, then the sensory information received by the brain (X) is given as

$$
X = AU
$$

(Haykin, 1999). In this case, in order for the brain to pick out the original signal vector U, it is necessary to develop a neural model that unmixes the mixing done by A and transforms the inputs X into the output Y such that the elements of Y are as statistically independent as possible (Haykin, 1999). In order to do this, it is necessary to compute an unmixing matrix W that reverses the effects of A, such that

$$
Y = WX
\quad \text{and} \quad
Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_m \end{bmatrix}
$$

(Haykin, 1999). The unmixing matrix W in this case would be the m-by-m matrix that, when multiplying X, makes the elements of the resultant product Y as statistically independent as possible. Thus, the elements of Y would be the independent components present in the original signal U, although rescaled and permuted (Haykin, 1999).

In order to make the elements of Y as statistically independent as possible, it is necessary to minimize the mutual information conveyed by any pair of elements in Y (Haykin, 1999). Mutual information is a measure of the uncertainty about Yi that is resolved by observing Yj (Haykin, 1999). The mutual information I(Yi;Yj) between Yi and Yj, then, is the entropy of Yi minus the conditional entropy of Yi given Yj:

$$
I(Y_i; Y_j) = H(Y_i) - H(Y_i \mid Y_j)
$$

(Haykin, 1999). This situation is represented in the following Venn diagram.

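[Venn diagram: two overlapping circles representing H(Yi) and H(Yj). Their union is the joint entropy H(Yi,Yj), their overlap is the mutual information I(Yi;Yj), and the non-overlapping parts are the conditional entropies H(Yi|Yj) and H(Yj|Yi).]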
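The relation between entropy, conditional entropy and mutual information can be illustrated with a small numerical sketch. The joint distribution below is made up for illustration, and the code assumes that every entry of the joint distribution is strictly positive (a zero entry would need special handling because of log2(0)).

```matlab
% Joint distribution of two binary variables (illustrative values):
% rows index the value of Yi, columns index the value of Yj.
Pij = [0.4 0.1;
       0.1 0.4];

Pi = sum(Pij, 2);    % marginal distribution of Yi (column vector)
Pj = sum(Pij, 1);    % marginal distribution of Yj (row vector)

% Entropies in bits.
H_i  = -sum(Pi .* log2(Pi));              % H(Yi)
H_j  = -sum(Pj .* log2(Pj));              % H(Yj)
H_ij = -sum(Pij(:) .* log2(Pij(:)));      % joint entropy H(Yi,Yj)

% Conditional entropy: H(Yi|Yj) = H(Yi,Yj) - H(Yj).
H_i_given_j = H_ij - H_j;

% Mutual information: I(Yi;Yj) = H(Yi) - H(Yi|Yj).
I_ij = H_i - H_i_given_j;

fprintf('H(Yi) = %.4f bits\n', H_i);
fprintf('H(Yi|Yj) = %.4f bits\n', H_i_given_j);
fprintf('I(Yi;Yj) = %.4f bits\n', I_ij);
```

For this distribution the mutual information is about 0.278 bits. If the joint distribution is replaced by the product of its marginals (all entries 0.25), the mutual information drops to zero, which is the condition ICA tries to approach.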

In order for all of the elements of Y to be statistically independent, the Kullback-Leibler divergence between the joint probability density function of Y and the product of the marginal probability density functions of its elements Yi (where i goes from 1 to m) must be minimized (Haykin, 1999). Some Matlab code that I wrote to accomplish this objective using the set of faces in figure 2 can be found in Appendix B. In that code, it is implicitly assumed that the unmixing matrix W converges within 1200 iterations. This may not necessarily be the case; however, I have found it to work in some preliminary tests with the face stimuli. In order to obtain a quantitative index of the demixer's performance, one may calculate a global rejection index as:

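$$
\mathcal{I} = \sum_{i=1}^{m} \left( \sum_{j=1}^{m} \frac{|p_{ij}|}{\max_k |p_{ik}|} - 1 \right) + \sum_{j=1}^{m} \left( \sum_{i=1}^{m} \frac{|p_{ij}|}{\max_k |p_{kj}|} - 1 \right)
$$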

where P = {pij} = WA (Haykin, 1999). The performance index is a measure of the diagonality of the matrix P (Haykin, 1999). If the matrix P is perfectly diagonal, the index equals 0 (Haykin, 1999). For a matrix P whose elements are not concentrated on the principal diagonal, the performance index will be high (Haykin, 1999). A good performance index is around 0.05 (Haykin, 1999). This index could be used in the iterative code that I wrote for computing W. Instead of iterating the loop for 1200 cycles, one could use a while loop, evaluating the performance index at each iteration and exiting only when the performance index reached a certain threshold level. The calculation of this index, however, is computationally intensive, and so I didn't include it in my code, in the hope that any time I lose by iterating the loop past the threshold performance index I'll make up in the speed of my loop.

In the end, ICA may be viewed as an extension of PCA. Whereas PCA can only impose independence up to the second order while constraining the direction vectors to be orthogonal, ICA imposes statistical independence on the individual components of the output vector Y and has no orthogonality constraint. An example of the application of ICA to images can be found in figure 5.
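The global rejection index discussed above can be sketched in a few lines. This is an illustrative sketch, not part of the Appendix B code: the mixing matrix `A` and the two candidate unmixing matrices are made-up values, and the index is computable here only because `A` is known, which is not the case for the face images.

```matlab
% Rejection index of P = W*A, written as an anonymous function.
% First term: each row normalized by its largest absolute entry.
% Second term: each column normalized by its largest absolute entry.
rejIndex = @(P) ...
    sum(sum(abs(P) ./ repmat(max(abs(P), [], 2), 1, size(P, 2)), 2) - 1) + ...
    sum(sum(abs(P) ./ repmat(max(abs(P), [], 1), size(P, 1), 1), 1) - 1);

% Known mixing matrix (illustrative values).
A = [1.0 0.5;
     0.3 2.0];

% An ideal unmixing matrix: the inverse of A, with rows permuted and
% rescaled, since ICA recovers the sources only up to order and scale.
Wgood = [0 3; -2 0] * inv(A);

% A poor unmixing matrix: no unmixing at all.
Wbad = eye(2);

indexGood = rejIndex(Wgood * A);
indexBad  = rejIndex(Wbad * A);

fprintf('Index for ideal W: %.4f\n', indexGood);
fprintf('Index for W = I:   %.4f\n', indexBad);
```

The index is zero (up to rounding) for the ideal unmixing matrix even though W*A is a permuted, rescaled diagonal rather than the identity, and it is much larger (1.2 here) when no unmixing is done. A while loop of the kind described above would stop once this value fell below the chosen threshold.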

Figure 5: Independent components derived from the original image set given in figure 2. Note the marked differences between the independent components of the image set presented here and the principal components of the image set depicted in figure 3. Also note that since the independent components are maximally statistically independent, they all account for an equal percentage of the variance in the image set.

Here the same initial set of faces from figure 2 that was used in the PCA demonstration is used again. Each image was vectorized by taking each row of the image and concatenating it with the previous row to produce a row vector of length (image length × image height). Each row vector was then placed in a matrix X, providing the input to the neural network diagrammed above. In the algorithm for calculating the independent components, W is calculated, and the independent components matrix Y can then be calculated as Y = WX, where the rows of Y are the independent components of the original images. These components are presented in figure 5. Note that in order to reconstruct any of the original images, one must simply multiply the inverse of the unmixing matrix W by the matrix Y:

$$
X = W^{-1} Y
$$

The top left-hand image from figure 2 is reconstructed in this manner in figure 6.
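The vectorization and reconstruction steps can be sketched on tiny stand-in images. Everything below is illustrative: the 2-by-2 "images" are made up, and `W` is simply an arbitrary invertible matrix standing in for a learned unmixing matrix. Note also that `reshape` concatenates columns rather than rows, as it does in the appendix code; this does not matter as long as the same convention is used in both directions.

```matlab
% Three tiny 2x2 "images" (illustrative values).
imgs = {[1 2; 3 4], [0 1; 1 0], [2 0; 0 5]};
imgSize = size(imgs{1});

% Vectorize: one image per row of the observation matrix X.
X = zeros(numel(imgs), prod(imgSize));
for i = 1:numel(imgs)
    X(i, :) = reshape(imgs{i}, 1, []);
end

% Stand-in for a learned unmixing matrix (any invertible matrix).
W = [2 1 0;
     0 1 1;
     1 0 3];

Y = W * X;        % rows of Y play the role of the components
Winv = inv(W);

% Rebuild the first image using the first k components, k = 1..m,
% and record the reconstruction error at each step.
m = size(W, 1);
err = zeros(1, m);
for k = 1:m
    Xk = Winv(:, 1:k) * Y(1:k, :);
    err(k) = norm(Xk(1, :) - X(1, :));
end

% With all m components, X = inv(W)*Y holds exactly (up to rounding).
rebuilt = reshape(Xk(1, :), imgSize);
disp(err);
disp(rebuilt);
```

The last entry of `err` should be zero up to rounding error, and `rebuilt` should match the first stand-in image, mirroring the final panel of figure 6.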

Figure 6: Reconstitution of the top left-hand face from figure 2 using the independent components for the image set. Going from left to right and from top to bottom, each image uses incrementally more independent components in its reconstitution of the original face. Note that each component adds a lot of information, reflecting the high statistical independence of each component.

In contrast to the PCA reconstruction, each independent component here contributes a substantial amount to the subjective impression that the face resembles the original. This property reflects the statistical independence of the components derived from the original face set. In the end, ICA doesn't compress the image information as much as PCA does; however, it encodes the components more efficiently, making each component a valuable contributor to the original image set. This property is desirable in neural networks, where it is necessary to make the most efficient use possible of the neurons that are available for encoding information. That is, given a set of neurons that are to be used to represent information about images in the real world, it would be efficient to have the outputs of those neurons be as statistically independent as possible. This result also explains the neurophysiological findings of Hubel and Wiesel (1968), as noted in Bell and Sejnowski (1997).

References

Bell, A.J., and Sejnowski, T.J. (1997). The independent components of natural scenes are edge filters. Vision Research, 37, 3327-3338.

Cottrell, G.W., Munro, P.W., and Zipser, D. (1987). Image Compression by BackPropagation: A Demonstration of Extensional Programming. Technical Report 8720, University of California, San Diego, Institute of Cognitive Science.

Field, D.J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America A, 4, 2379-2394.

Gill, J. (2002). What Is Principle Components Analysis Anyway? Retrieved January 2, 2003, from http://www.clas.ufl.edu/~jgill/papers/pca.pdf

Haykin, S. (1999). Neural Networks: A Comprehensive Foundation (2nd ed.). New Jersey: Prentice Hall.

Oja, E. (1983). Subspace Methods of Pattern Recognition. Letchworth, England: Research Studies Press and Wiley.

Appendix A

% PCA
% Load the original set of faces
% (compliments of Prof. Jason Gould, University of Indiana)
load FaceStruct.mat;
names = fieldnames(images);
imgSize = size(images.andrea);

% Initialize the matrix of images (one column per face)
Raw = zeros(prod(imgSize), length(names));

% Put the images into the matrix of images
for i = 1:length(names)
    Raw(:,i) = reshape(images.(names{i}), [], 1);
end

% Normalize the image matrix to have zero mean
% and unit standard deviation
ColMeans = repmat(mean(Raw), size(Raw,1), 1);
ColStd = repmat(std(Raw), size(Raw,1), 1);
X = (Raw - ColMeans)./ColStd;

% Calculate the variance-covariance matrix for the
% normalized image matrix
R = cov(X);

% Calculate the eigenvectors and eigenvalues of the
% variance-covariance matrix
[E, LATENT, EXPLAINED] = pcacov(R);

% Calculate the principal component scores
% (These are the filters for the images)
Y = X*E;

% Calculate the inverse of the eigenvector matrix
Einv = inv(E);

% Build and display the first face using the principal components.
% (scale is a helper function that is not listed here; it presumably
% rescales its input to the range [0, 1].)
for i = 1:length(names)
    ReformFace = Y(:,1:i)*Einv(1:i,:);
    img = scale(reshape(ReformFace(:,1), imgSize));
    figure
    image(repmat(img, [1 1 3]));
    axis equal
    imwrite(img, ['MakeAnd', num2str(i), '.jpg'], 'jpg');
end

% Display the principal components
for i = 1:length(names)
    pc = 2*scale(reshape(Y(:,i), imgSize)) - 1;
    figure
    image(repmat(scale(pc), [1 1 3]));
    axis equal
    title(['Variance explained = ', num2str(EXPLAINED(i))]);
end

Appendix B

% ICA
% Load the original face set
% (Courtesy of Professor Jason Gould, University of Indiana)
load FaceStruct.mat
names = fieldnames(images);
imgSize = size(images.andrea);

% Calculate the length of the column vector composed of
% the concatenated columns of one image.
ColLength = prod(imgSize);

% Initialize the observation matrix (one row per image)
X = zeros(length(names), ColLength);

% Fill in the observation matrix
for i = 1:length(names)
    X(i,:) = reshape(images.(names{i}), [1 ColLength]);
end

% Initialize the unmixing matrix
W = rand(length(names))*0.05;

% Initialize the unmixed matrix
Y = W*X;

% Calculate the updating parameter phi for the given W and X
phi = 1/2*Y.^5 + 2/3*Y.^7 + 15/2*Y.^9 + 2/15*Y.^11 + 112/3*Y.^13 + ...
      128*Y.^15 - 512/3*Y.^17;

% Learning rate
eta = 0.1;

% Initialize waitbar (this isn't a necessary part of the code,
% it just lets you see how far along the algorithm is as it's
% iterating).
h = waitbar(0, 'Calculating matrix W...');
n = 1200;

% Main ICA loop, repeat n times, so that the matrix W converges
for i = 1:n
    W = W + eta*(eye(size(W)) - phi*Y')*W;
    Y = W*X;
    phi = 1/2*Y.^5 + 2/3*Y.^7 + 15/2*Y.^9 + 2/15*Y.^11 + 112/3*Y.^13 + ...
          128*Y.^15 - 512/3*Y.^17;
    waitbar(i/n, h);
end

% Close waitbar
close(h)

% Display and save the images for the independent components.
% (scale is a helper function that is not listed here.)
ImageMatrix = scale(Y);
for i = 1:length(names)
    img{i} = repmat(reshape(ImageMatrix(i,:), imgSize), [1 1 3]);
    figure
    image(img{i});
    axis equal
    imwrite(img{i}, ['IndComp', num2str(i), '.jpg'], 'jpg');
end

% Calculate the inverse of the unmixing matrix W
Winv = inv(W);

% Rebuild and display the first image using the independent components
for i = 1:length(names)
    A = Winv(:,1:i)*Y(1:i,:);
    ImageMatrix = scale(A);
    img{i} = repmat(reshape(ImageMatrix(1,:), imgSize), [1 1 3]);
    figure
    image(img{i});
    axis equal
    imwrite(img{i}, ['RebuiltUsing', num2str(i), '.jpg'], 'jpg');
end