
Principal Components Analysis Step by Step (Part I)

by terzi in Statistical Research

Hi everyone. For my first post I'll start a basic step-by-step tutorial on principal component analysis, an old technique that dates back to 1901 and the first approaches by Pearson; it was Hotelling who derived a definitive solution about three decades later. Although the mathematical background of the analysis isn't too complex, I'll try to avoid formulas and focus on a single application that should make the basics easier to follow. I hope it turns out to be helpful.

Mexico is divided into 32 states (actually 31 states and a federal district, but let us count 32). Some places have a better quality of life and a higher level of development than others, as happens in every country in the world. So one interesting subject of study would be a comparison of that development among the states. But what exactly is development? Well, it isn't just a number, right? It includes many things: life expectancy in the state, Gross Domestic Product, the percentage of people with access to fresh water, the portion of the population that is illiterate. It seems we should compare all of these at once in order to make a good comparison. And that's what PCA is good for. The data we will use has the following form:
[Table: the dataset, 32 Mexican states by six indicators: life expectancy, GDP, water access, sewer access, electricity access, illiteracy]
As you can see, there are 6 variables, all on a continuous scale. I chose those for convenience, but some analyses will include twenty, forty or even hundreds. How would you compare all of them at once? One way would be to produce an index and then compare that index, right? But another question arises: how do we get that index? We could take the sum of all the variables, like some psychological tests do. We could take an average, or sum all the variables and then subtract the year I was born. Those are all indexes, but which one would be the best? In PCA we want to obtain the best indexes, that is, the linear combinations that reproduce the highest possible amount of the variance present in our data. Hotelling suggested using a Singular Value Decomposition (SVD) of the covariance matrix to obtain that result. When used as a descriptive tool, PCA has no distributional assumptions. So, the first step requires a look at the covariance matrix. Here it is:
[Table: covariance matrix of the six variables, with the variances in bold on the diagonal]
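If you want to follow along outside a stats package, here is a minimal sketch in Python of this first step. The array X is a random placeholder for the real table, since the tutorial's actual numbers aren't reproduced here:

```python
import numpy as np

# Placeholder stand-in for the real dataset: 32 states x 6 indicators
# (life expectancy, GDP, water, sewer, electricity, illiteracy).
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 6))

# Covariance matrix of the six variables; rowvar=False means each
# column of X is a variable and each row an observation (a state).
S = np.cov(X, rowvar=False)
print(np.round(S, 2))
```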
Great! We just perform an SVD on this matrix and the result is a PCA! Well, it's not that simple. You can see the variances in bold. It is easy to notice that the one variable that involves money has a huge variance, simply because it is measured on a different scale. Is that a problem? Yes, because PCA tries to represent the total variance, and here the total variance is almost entirely due to GDP. The rest of the variables will be somewhat ignored by the analysis; more exactly, they will receive a lower weight in the results. We can perform a PCA of this matrix, but our indexes will be almost entirely about GDP and nothing more. In order to obtain results that give the same importance to every variable in the study, we should standardize the variables. That way no variable will have an outsized influence on our results. Here we have the standardized covariance matrix:
[Table: standardized covariance matrix, i.e. the correlation matrix of the six variables]
No, you're not wrong. The standardized covariances are the correlations. That is why your software will ask for a covariance or a correlation matrix, depending on whether you standardize your data. As you can see, the choice largely depends on the scales you are using. I'll take the correlation matrix, since I don't want the first variable to overwhelm the rest. So, SVD on the R matrix and, voilà! PCA! Your stats software will give you something like this:
[Table: the six eigenvalues, with the proportion and cumulative proportion of variance explained]
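In Python the same numbers can be obtained directly. A sketch, again on placeholder data, using an eigendecomposition of the correlation matrix (equivalent to the SVD here, since the matrix is symmetric):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 6))              # placeholder for the real data

R = np.corrcoef(X, rowvar=False)          # correlation matrix

# Eigendecomposition; eigh is the right tool for symmetric matrices.
eigvals, eigvecs = np.linalg.eigh(R)

# eigh returns eigenvalues in ascending order; flip so PC1 comes first.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Proportion of total variance explained by each component; with a
# correlation matrix the eigenvalues sum to the number of variables.
prop = eigvals / eigvals.sum()
print(np.round(prop, 4))
print(np.round(np.cumsum(prop), 4))       # cumulative proportions
```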
Six variables generated six eigenvalues. These eigenvalues come from the SVD and are attached to six principal components that together express the total amount of variation in our data. Every time you run a principal component analysis, you will get as many components as variables. But with PCA we want to reduce dimension, so why keep six indexes? The key is that each one contains information about several variables at once. As you can see, the first eigenvalue, attached to the first principal component, expresses 74.06% of our total variance. A single value expresses almost three quarters of the information in our six measures of well-being! Since the principal components are orthogonal, the amount of total variance expressed by the first two PCs is 87.37%, the sum of the proportions explained by them individually. Two indexes give us almost 90% of the information! So analyzing fewer pieces of information will give us almost the same results as analyzing the whole set of variables. In practice, you can work with as many components as you wish. A common rule of thumb says to keep only the components with an eigenvalue larger than 1 (only if you based your PCA on the correlation matrix). Others suggest looking for a breaking point in a screeplot of the eigenvalues, or simply retaining however many components give you an amount of explained variance you find suitable. Our screeplot for this case is shown below:
[Figure: screeplot of the six eigenvalues]
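For reference, a screeplot is only a couple of lines of matplotlib. A sketch on the same placeholder data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 6))              # placeholder data, as before
eigvals = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]

plt.plot(range(1, len(eigvals) + 1), eigvals, marker="o")
plt.axhline(1.0, linestyle="--")          # the eigenvalue-greater-than-1 rule
plt.xlabel("Component number")
plt.ylabel("Eigenvalue")
plt.title("Screeplot")
plt.show()
```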
We look for a breaking point, a moment where the line starts flattening. That happens after 2 eigenvalues. Some may say it would be enough to work with just the first principal component, but for the purpose of this tutorial I'll keep the first two in order to show some different interpretations. This is getting way longer than a first post should be, though, so I'll continue next time with the interpretation and understanding of principal components analysis.

Principal Components Analysis Step by Step (Part II)


by terzi in Statistical Research

Hi again! This time I'll continue with some results and interpretations from a principal component analysis, hoping this tiny tutorial can help you understand some of the basics. Although we won't cover every aspect, some of the ideas will be useful whenever you happen to need PCA.

The application we've been working on is a study aimed at comparing development among the states of Mexico. Last time, we started our analysis and decided to retain two principal components; you can check that post here. Now it is time to continue with the interpretation, so let's look at the first two eigenvectors that resulted from our Singular Value Decomposition. First of all, it is important that we retain only two principal components. I mean, it is important to let the software know it. By doing that, most programs will show us some important information regarding the unexplained variance. Recall that our first two components represent 87% of the variation, so there's still 13% hanging around. After adjusting the solution to two components, we can inspect them to understand the variables involved:
[Table: the first two eigenvectors, with the unexplained variance for each variable]
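The per-variable unexplained variance that the software reports can be reproduced by hand. A sketch, under the same placeholder setup as before, using the fact that with a correlation matrix each variable has variance 1:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 6))                  # placeholder data, as before
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                                         # components we retained
V, lam = eigvecs[:, :k], eigvals[:k]

# The variance of variable i captured by component j is lam[j] * V[i, j]**2,
# so the leftover after summing over the retained components is:
unexplained = 1.0 - ((V ** 2) * lam).sum(axis=1)
print(np.round(unexplained, 4))               # one value per variable
```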
Let's begin with the unexplained variance. As you can see, sewer access is the variable least explained by our two components. The rest have at most a mere 15% of unexplained variance. Since no single variable has a high unexplained rate, we can assume that most variables are correctly represented.

One of the most important interpretations in PCA is that of the components themselves, the eigenvectors. Each component requires an interpretation: based on the variables that matter most, we can even name our components. As you can see, in PC1 almost all variables have roughly the same value. Remember that one of our main reasons for analyzing a component is to understand the numbers in the index related to it, the scores that we will discuss below. Larger numbers, in absolute value, mean that those particular variables are more important to that component. In PC1 all the variables seem to contribute and most have a positive sign, which means that a state with larger values in GDP, access to services and life expectancy will get larger numbers in this index. Notice that illiteracy has a negative sign: a state with high illiteracy will get low values in this index. It is worth appreciating how cool PCA is here, since no one ever told the software that illiteracy was a bad sign for development. So our first PC is an index of overall development or well-being, where states with higher scores can be seen as those with a higher quality of life. One of the reasons we kept two components is that the interpretation of the first one was quite easy, don't you think? I'd bet most books have examples like it.

The second PC resembles situations you meet in real data, and I find its interpretation more instructive for the purpose of this tutorial. Two variables have almost zero values, water and illiteracy, which means that their effect on the component is almost nonexistent; we can ignore them. Sewer and electricity have positive values; GDP and life expectancy have negative values. This component is a contrast. The analysis detected some states with low economic performance but great infrastructure. Politics, certainly. The key is to understand what a score would mean for this index. Low numbers in this score mean great GDP and life expectancy in a state with poor infrastructure for its citizens. High values mean great infrastructure but a not-so-good economic performance. The ideal for any state would be a value near zero, meaning some balance, or maybe slightly negative (everyone wants to be rich, right?). This odd situation arose because GDP is the least correlated variable in our group. Check the correlation matrix and you'll see yet another proof that richer isn't always better.

From the principal components we can obtain one of the most important results in PCA: the principal component scores. Those are our indexes. Scores are the values that each unit, in this case each state, gets on each component. This is crucial, especially now that we understand what the values in the components mean. With the first scores, we can see the valuation each state got on our overall development index, our first PC:
[Table: scores on the first principal component, by state]
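Computing scores is just a projection of the standardized data onto the retained eigenvectors. A sketch under the same placeholder setup:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 6))                  # placeholder data, as before
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
eigvecs = eigvecs[:, np.argsort(eigvals)[::-1]]

# Standardize first, because the PCA was based on the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

scores = Z @ eigvecs[:, :2]                   # 32 rows, one per state;
print(np.round(scores, 3))                    # column 0 is PC1, column 1 is PC2
```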
It is easy to tell which are the five most developed states and the five least developed. I live in Veracruz, by the way, so I'm starting to regret analyzing this data. Anyway, remember that these first scores only give us 74% of the total variation. In order to get the 87% we want, we should check both indexes at once. That is usually done by graphing both scores in a scatterplot, commonly called a principal components score plot or biplot. We can see it here:
[Figure: score plot of PC1 vs. PC2, one point per state]
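One possible way to draw such a plot, with a hypothetical state_names list standing in for the 32 labels:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 6))                  # placeholder data, as before
Z = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
scores = Z @ eigvecs[:, np.argsort(eigvals)[::-1][:2]]

# Hypothetical labels; in practice, the 32 state names in row order.
state_names = [f"state_{i}" for i in range(32)]

plt.scatter(scores[:, 0], scores[:, 1])
for name, (x, y) in zip(state_names, scores):
    plt.annotate(name, (x, y), fontsize=7)    # label each point with its state
plt.xlabel("PC1 score (overall development)")
plt.ylabel("PC2 score (infrastructure vs. economy)")
plt.show()
```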
There is a lot to learn from this graph. First of all, who's enjoying the best life? Our first score is represented by the x-axis, so states on the left have the lowest values on our overall well-being index, our first PC. The best states for living should therefore be on the right side of the graph: too bad for Veracruz. Do you remember what we concluded about PC2? Low numbers mean great GDP and life expectancy but poor infrastructure, and high values mean great infrastructure but low economic performance. Since we said the ideal for any state would be near zero or slightly negative, the states above Jalisco may be considered less developed. One could say that Jalisco, Nayarit, Distrito Federal and the states in the lower right corner are the winners: the most developed areas. Another really interesting result is the clusters that appear in the graph. It is easy to notice three or maybe four different groups: PCA is a good way to start a cluster analysis. From the graph you can also read off the situation of any individual state. For instance, Distrito Federal, the official name of Mexico City, the capital, is certainly one of the most developed areas. Points that are near each other in the graph represent similar situations in those states. It is easy to compare states with each other, or even whole geographic areas. Notice that two states are colored gray in the graph. That is because a residual analysis showed that those two points were not properly represented: they are part of that 13% PCA could not get. But wait, residual analysis in PCA? Yes, we'll see that when we get to the final part of this tutorial: assessing the fit of the solution.

Principal Components Analysis Step by Step (Part III)


by terzi in Statistical Research

You can find the first two sections of this tutorial here: Principal Components Analysis Step by Step (Part I), Principal Components Analysis Step by Step (Part II).

Hi! This is the last part of this tutorial, a tiny introduction to principal components analysis. So far we have seen the basics of the analysis and some interpretations. Now we'll see some post-estimation statistics and values that are commonly used to judge the overall results of our PCA. In fact, these methods look more like pre-estimation checks than post-estimation ones, yet they are commonly run at the end to confirm the results.

I think the most common post-estimation measure for PCA is the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy, so let's start with that. KMO takes values between 0 and 1, with small values indicating that, overall, the variables have too little in common to warrant a PCA. It is computed by comparing the partial correlations and the plain correlations of our variables. Historically (or at least according to STATA's help file), the following labels are often given to values of KMO:

0.00 to 0.49 >>> Unacceptable
0.50 to 0.59 >>> Miserable
0.60 to 0.69 >>> Mediocre
0.70 to 0.79 >>> Middling
0.80 to 0.89 >>> Meritorious
0.90 to 1.00 >>> Marvelous

I wonder why almost all the labels start with an M. Anyway, for our example the KMO came out as 0.8153, which is Meritorious. Wow, it sounds like we should get a medal for that, so let's just call it OK. Usually, when KMO turns out to be low, that also shows up in the unexplained variance.
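If your software doesn't report KMO, it can be computed from the correlation matrix alone. A sketch of one standard formulation, comparing squared correlations against squared partial correlations:

```python
import numpy as np

def kmo(R):
    """Kaiser-Meyer-Olkin measure from a correlation matrix R."""
    Rinv = np.linalg.inv(R)
    d = 1.0 / np.sqrt(np.diag(Rinv))
    partial = -Rinv * np.outer(d, d)          # matrix of partial correlations
    off = ~np.eye(R.shape[0], dtype=bool)     # off-diagonal mask
    r2 = (R[off] ** 2).sum()
    p2 = (partial[off] ** 2).sum()
    return r2 / (r2 + p2)

# Example on placeholder data (the real data gave 0.8153):
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 6))
print(kmo(np.corrcoef(X, rowvar=False)))
```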

Analyses with low KMO values will usually need a higher number of components to obtain a good representation. Remember that KMO is calculated from correlations, so it is not influenced by the number of components retained in a PCA.

There are other ways to check whether the analysis properly described our data, such as using a post-estimation procedure to examine the residuals. Wait, residuals? In some forms, PCA can be seen as a model; in fact, Pearson first developed it as a form of regression model. It is possible to invert the equation that produces the PCs from the data, obtaining a formula that computes the data from the principal components. However, the original data matrix is only reproduced exactly when all the components are retained, which almost never happens. Because of that, it is useful to predict our data using only the retained components and then analyze the distance between those predicted values and the real ones. These residuals can be tested by means of their sum of squares per observation, which is called the Q-statistic or Rao statistic. The values of the Q-statistic for our problem are shown below for every observation:
[Table: Q-statistic for each of the 32 states]
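A sketch of that reconstruction and the resulting per-state Q values, under the same placeholder setup:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 6))                  # placeholder data, as before
Z = (X - X.mean(axis=0)) / X.std(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
V = eigvecs[:, order][:, :2]                  # the two retained eigenvectors

# Reconstruct the standardized data from only the retained components,
# then measure each state's squared distance from its reconstruction.
Z_hat = Z @ V @ V.T
Q = ((Z - Z_hat) ** 2).sum(axis=1)            # one Q value per state
print(np.round(Q, 3))
```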
There is a way to obtain a critical value in order to test for problems or outliers in the Q-statistics. In this case it is enough to notice that two states have remarkably higher values, which may indicate that they are not properly represented by the first two components. If any of these states were essential to our results, it might be necessary to retain more PCs. Since that's not our case, it is enough to note in our conclusions that the states of Tabasco and Yucatán were not properly represented by our analysis.

In a very similar way, it is also possible to analyze the fitted correlation matrix to understand which variables and relationships were not fully captured by these results. We can calculate the residual correlation matrix, whose elements are the differences between the actual and the fitted correlations. For our example, we get the following matrix of residuals:
[Table: residual correlation matrix, largest residuals in bold]
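The fitted correlation matrix is rebuilt from the retained eigenpairs, so the residual matrix is one line of algebra. A sketch continuing the setup above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 6))                  # placeholder data, as before
R = np.corrcoef(X, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
lam, V = eigvals[order][:2], eigvecs[:, order][:, :2]

# Correlation matrix implied by the two retained components, and what is
# left over; the diagonal of `resid` is exactly the per-variable
# unexplained variance we saw in Part II.
R_hat = V @ np.diag(lam) @ V.T
resid = R - R_hat
print(np.round(resid, 3))
```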
Most relationships have been properly represented, with most residuals below 0.06. The misrepresentation is mostly due to variation within the variables that is not fully accounted for. Some variables, like sewer coverage, water coverage and life expectancy, have somewhat large residuals (in bold). Does this sound familiar? When we analyzed the unexplained variance for each variable, we got these exact same numbers. Now you know where they came from.

This post-estimation analysis, or at least some of it, should be incorporated into any principal components analysis, since it is important for understanding the solution that was obtained. As you can imagine, most analyses will drop several components, and with these measures the analyst can easily understand the impact of that unexplained variance.

Well, that is about everything I can contribute on how to perform a good principal components analysis. Even though there are many things we didn't discuss, such as the analysis of characteristic roots, inferential procedures, rotations, and so on, I still hope this humble practical guide helps you understand the concepts that lie beneath this beautiful statistical tool. Don't forget to check out the references to gain some deeper knowledge about PCA, and remember that we have a whole forum and this great website to help those in statistical need. Have a nice day!

REFERENCES

Cahill, Miles B. et al. "Using principal components to produce an economic and social development index: An application to Latin America and the U.S." Atlantic Economic Journal, Volume 29, Number 3.

Jackson, J. Edward. A User's Guide to Principal Components. Wiley-Interscience, 1991.

Lattin, James. Analyzing Multivariate Data. Duxbury Press, 2002.

Rabe-Hesketh, Sophia. A Handbook of Statistical Analyses Using STATA. CRC Press, 2004.

Venables, W.N. & Ripley, B.D. Modern Applied Statistics with S. Springer, 2002.
