
Research Methods

Factor Analysis

Factor analysis is a technique used to uncover the latent structure (dimensions) of a set of variables.
It reduces attribute space from a larger number of variables to a
smaller number of factors and as such is a "non-dependent"
procedure (that is, it does not assume a dependent variable is
specified).
Factor analysis could be used for any of the following purposes:
To reduce a large number of variables to a smaller number of
factors for modeling purposes, where the large number of variables
precludes modeling all the measures individually.
As such, factor analysis is integrated in structural equation modeling
(SEM), helping confirm the latent variables modeled by SEM. However,
factor analysis can be and is often used on a stand-alone basis for
similar purposes.

To establish that multiple tests measure the same factor, thereby giving justification for administering fewer tests.
Factor analysis originated over a century ago with Charles Spearman's attempts to show that a wide variety of mental tests could be explained by a single underlying intelligence factor (an idea we reject in contemporary psychometrics).

To validate a scale or index by demonstrating that its constituent items load on the same factor, and to drop proposed scale items which cross-load on more than one factor.
To select a subset of variables from a larger set, based on which
original variables have the highest correlations with the principal
component factors.

A non-technical analogy
A mother sees various bumps and shapes under a blanket at the
bottom of a bed. When one shape moves toward the top of the bed,
all the other bumps and shapes move toward the top also, so the
mother concludes that what is under the blanket is a single thing,
most likely her child. Similarly, factor analysis takes as input a
number of measures and tests, analogous to the bumps and
shapes. Those that move together are considered a single thing,
which it labels a factor. That is, in factor analysis the researcher is
assuming that there is a "child" out there in the form of an underlying
factor, and he or she takes simultaneous movement (correlation) as
evidence of its existence. If correlation is spurious for some reason,
this inference will be mistaken, of course, so it is important when
conducting factor analysis that possible variables which might
introduce spuriousness, such as anteceding causes, be included in
the analysis and taken into account.

Factor analysis is part of the general linear model (GLM) family of
procedures and makes many of the same assumptions as multiple
regression: linear relationships, interval or near-interval data,
untruncated variables, proper specification (relevant variables
included, extraneous ones excluded), lack of high multicollinearity,
and multivariate normality for purposes of significance testing.
Factor analysis generates a table in which the rows are the
observed raw indicator variables and the columns are the factors or
latent variables which explain as much of the variance in these
variables as possible.
The cells in this table are factor loadings, and the meaning of the factors must be induced from seeing which variables load most heavily on which factors.
This inferential labeling process is inherently subjective, because different researchers may impute different labels.
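To make the idea of this table concrete, here is a minimal Python sketch (not the SPSS procedure itself) of fitting a factor model and printing a loadings table in which rows are observed variables and columns are latent factors. The DataFrame df of numeric questionnaire items and the choice of four factors are assumptions for illustration.

```python
# Minimal sketch of a factor-loadings table, assuming `df` is a pandas DataFrame
# of numeric survey items (one column per item, one row per subject).
import pandas as pd
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(df)                 # standardize the items
fa = FactorAnalysis(n_components=4, random_state=0).fit(X)

# Rows = observed indicator variables, columns = factors, cells = loadings.
loadings = pd.DataFrame(
    fa.components_.T,
    index=df.columns,
    columns=[f"Factor {i + 1}" for i in range(4)],
)
print(loadings.round(2))
```

Interpreting (and labeling) the factors then amounts to reading down each column and asking which items load most heavily on it.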

Initial Considerations
Sample Size
Correlation coefficients fluctuate from sample to sample, much more
so in small samples than in large.
Therefore, the reliability of factor analysis is also dependent on
sample size.
Much has been written about the necessary sample size for factor
analysis resulting in many rules-of-thumb.
The most common rule of thumb is that a researcher should have at least 10-15
subjects per variable.
Although I've heard this rule bandied about on numerous occasions, its
empirical basis is unclear (although Nunnally, 1978, did recommend
having 10 times as many subjects as variables).
Kass and Tinsley (1979) recommended having between 5 and 10
subjects per variable up to a total of 300 (beyond which test parameters
tend to be stable regardless of the subject to variable ratio).
In fact, Tabachnick and Fidell (1996) agree that 'it is comforting to have
at least 300 cases for factor analysis' (p. 640), and Comrey and Lee
(1992) class 300 as a good sample size, 100 as poor and 1000 as
excellent.

Fortunately, recent years have seen empirical research done in the form of experiments using simulated data (so-called Monte Carlo studies).
Arrindell and van der Ende (1985) used real-life data to investigate
the effect of different subject to variable ratios.
They concluded that changes in this ratio made little difference to the
stability of factor solutions.
More recently, Guadagnoli and Velicer (1988) found that the most important
factors in determining reliable factor solutions were the absolute sample size
and the absolute magnitude of the factor loadings.
Basically, they argued that if a factor has four or more loadings greater
than 0.6 then it is reliable regardless of sample size.
Also, factors with 10 or more loadings greater than 0.40 are reliable if
the sample size is greater than 150.
Finally, factors with a few low loadings should not be interpreted unless
the sample size is 300 or more.

So, what's clear from this work is that a sample of 300 or more
will probably provide a stable factor solution, but a wise
researcher will measure enough variables to adequately capture
all of the factors that they theoretically expect to find.

Data Screening
If any variables do not correlate with any other variables (or
correlate with only a very few), you should consider excluding
them before the factor analysis is run.
In the extreme case, such variables correlate only with themselves and
all other correlation coefficients are close to zero.

SPSS tests this using Bartlett's test of sphericity (see the next slides).

The correlations between variables can be checked using the correlate procedure to create a correlation matrix of all variables.
This matrix can also be created as part of the main factor analysis.

The opposite problem is when variables correlate too highly.
Although mild multicollinearity is not a problem for factor analysis, it is
important to avoid extreme multicollinearity (i.e. variables that are very
highly correlated) and singularity (variables that are perfectly correlated).
Therefore, at this early stage we look to eliminate any variables that don't
correlate with any other variables, or that correlate very highly with other
variables (R > 0.9).
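As a rough sketch of this screening step outside SPSS, the correlation matrix can be scanned for items that hardly correlate with anything and for pairs that correlate too highly. The DataFrame df and the 0.3 "weak" cut-off are illustrative assumptions, not part of the SPSS procedure.

```python
# Rough correlation-screening sketch, assuming `df` holds the questionnaire items.
import numpy as np

R = df.corr()
off_diag = R.where(~np.eye(len(R), dtype=bool))        # blank out the diagonal

# Items whose largest correlation with any other item is still small.
max_r = off_diag.abs().max()
weak_items = max_r[max_r < 0.3].index.tolist()

# Pairs that correlate very highly (|r| > 0.9): extreme multicollinearity/singularity.
high_pairs = [
    (a, b, round(R.loc[a, b], 2))
    for i, a in enumerate(R.columns)
    for b in R.columns[i + 1:]
    if abs(R.loc[a, b]) > 0.9
]
print("Weakly correlated items:", weak_items)
print("Very highly correlated pairs:", high_pairs)
```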

As well as looking for interrelations, you should ensure that variables have roughly normal distributions and are measured at an interval level (which Likert scales are, perhaps wrongly, assumed to be!).
The assumption of normality is important only if you wish to generalize the results of your analysis beyond the sample collected.

Our Example: The SPSS Stress Test

The Data from the SPSS Stress Test

Running the Analysis


Access the main dialog box via Analyze > Dimension Reduction > Factor.
Simply select the variables you want to include in the analysis (remember to exclude any variables that were identified as problematic during the data screening) and transfer them to the box labeled Variables by clicking on the arrow button.

There are several options available, the first of which can be accessed by clicking
on the Descriptives button to open the dialog box shown in the preceding Figure.
The Univariate descriptives option provides means and standard deviations for each
variable.
The choice of which of the two variables to eliminate will be fairly arbitrary and finding
multicollinearity in the data should raise questions about the choice of items within your
questionnaire.
KMO and Bartlett's test of sphericity produces the Kaiser-Meyer-Olkin measure of
sampling adequacy and Bartlett's test.
With a sample of 2571 we shouldn't have cause to worry about the sample size.
The value of KMO should be greater than 0.5 if the sample is adequate.
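Outside SPSS, the same two diagnostics can be obtained from the third-party factor_analyzer package; this is a hedged sketch assuming that package's API and a DataFrame df of the items, not the SPSS output itself.

```python
# Illustrative KMO and Bartlett's test using the third-party `factor_analyzer`
# package, assuming `df` is a pandas DataFrame of the questionnaire items.
from factor_analyzer.factor_analyzer import (
    calculate_bartlett_sphericity,
    calculate_kmo,
)

chi_square, p_value = calculate_bartlett_sphericity(df)  # H0: R is an identity matrix
kmo_per_item, kmo_total = calculate_kmo(df)              # sampling adequacy

print(f"Bartlett chi-square = {chi_square:.1f}, p = {p_value:.4f}")
print(f"Overall KMO = {kmo_total:.2f}")                  # want > 0.5, ideally much higher
```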

Optional for our classes.

When you have finished with this dialog box, click on Continue to return to the main dialog
box.

Factor Extraction on SPSS


To access the extraction dialog box, click on the Extraction button in the main dialog box.
There are a number of ways of conducting a factor analysis; when and where you use the various methods depends on what you hope to do with the analysis.
There are two things to consider: whether you want to generalize the
findings from your sample to a population and whether you are
exploring your data or testing a specific hypothesis.
Here, we're looking at techniques for exploring data using factor analysis.
Hypothesis testing (confirmatory factor analysis) is considerably more complex and can be done with
computer programs such as LISREL and others.

Assuming we want to explore our data, we then need to consider whether we want to apply our findings to the sample collected (descriptive method) or to generalize our findings to a population (inferential methods).
When factor analysis was originally developed it was assumed that it
would be used to explore data to generate future hypotheses.
As such, it was assumed that the technique would be applied to the
entire population of interest.
Today, we assume that subjects are randomly selected and that
the variables measured constitute the population of interest.
By assuming this, it is possible to develop techniques from which
the results can be generalized from the sample subjects to a
larger population.

In the Analyze box there are two options: to analyze the Correlation matrix or to analyze the Covariance matrix.
You should be happy with the idea that these two matrices are
actually different versions of the same thing: the correlation
matrix is the standardized version of the covariance matrix.
Analyzing the correlation matrix is a useful default method
because it takes the standardized form of the matrix; therefore,
if variables have been measured using different scales this will
not affect the analysis.
In our example, all variables have been measured using the
same measurement scale (a five-point Likert scale), but often
you will want to analyze variables that use different
measurement scales.
Analyzing the correlation matrix ensures that differences in
measurement scales are accounted for.
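The relationship between the two matrices can be shown in a few lines of numpy: rescaling the covariance matrix by each variable's standard deviation recovers the correlation matrix exactly. The simulated data below are purely illustrative.

```python
# Tiny numpy illustration: the correlation matrix is the standardized covariance matrix.
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 3)) * [1, 10, 100]  # deliberately different scales
S = np.cov(X, rowvar=False)                  # covariance matrix
D = np.diag(1 / np.sqrt(np.diag(S)))         # 1 / standard deviation on the diagonal
R = D @ S @ D                                # standardized covariance ...
print(np.allclose(R, np.corrcoef(X, rowvar=False)))   # ... equals the correlation matrix
```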

The Display box has two options within it: to display the Unrotated factor solution and a Scree plot.
The scree plot is a useful way of establishing how many
factors should be retained in an analysis.
The unrotated factor solution is useful in assessing the
improvement of interpretation due to rotation.
If the rotated solution is little better than the unrotated
solution then it is possible that an inappropriate (or less
optimal) rotation method has been used.

The Extract box provides options pertaining to the retention of factors.
You have the choice of either selecting factors with eigenvalues greater than
a user-specified value or retaining a fixed number of factors.
For the Eigenvalues over option the default is Kaiser's recommendation of
eigenvalues over 1, but you could change this to Jolliffe's recommendation
of 0.7 or any other value you want.
It is probably best to run a primary analysis with the Eigenvalues over
1 option selected, select a scree plot, and compare the results.
If looking at the scree plot and the eigenvalues over 1 lead you to retain
the same number of factors then continue with the analysis and be
happy.
If the two criteria give different results then examine the communalities
and decide for yourself which of the two criteria to believe.
If you decide to use the scree plot then you may want to redo the
analysis specifying the number of factors to extract.
The number of factors to be extracted can be specified by selecting
Number of factors and then typing the appropriate number in the space
provided (e.g. 4).
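The same comparison between Kaiser's criterion and the scree plot can be sketched directly from the eigenvalues of the correlation matrix; this mirrors what SPSS reports but is only an illustration, with df again an assumed DataFrame of items.

```python
# Sketch of Kaiser's criterion (eigenvalues > 1), Jolliffe's variant (> 0.7),
# and a scree plot, assuming `df` holds the questionnaire items.
import numpy as np
import matplotlib.pyplot as plt

eigenvalues = np.linalg.eigvalsh(df.corr().to_numpy())[::-1]   # sorted descending
print("Factors retained by Kaiser (> 1):   ", int((eigenvalues > 1.0).sum()))
print("Factors retained by Jolliffe (> 0.7):", int((eigenvalues > 0.7).sum()))

plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, "o-")
plt.axhline(1.0, linestyle="--")             # Kaiser cut-off for reference
plt.xlabel("Factor number")
plt.ylabel("Eigenvalue")
plt.title("Scree plot")
plt.show()
```

If the count of eigenvalues over 1 and the point of inflexion on the plot disagree, that is the situation described above where you examine the communalities and decide which criterion to trust.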

Rotation Techniques
The interpretability of factors can be improved through rotation.
Rotation maximizes the loading of each variable on one of the extracted factors while minimizing
the loading on all other factors.
This process makes it much clearer which variables relate to which factors.
Rotation works through changing the absolute values of the variables while keeping their
differential values constant.
Click on the Rotation button to access the dialog box.

Varimax, quartimax and equamax are all orthogonal rotations while direct oblimin and promax are oblique
rotations.
Quartimax rotation attempts to maximize the spread of factor loadings for a variable across all factors.
Therefore, interpreting variables becomes easier.
However, this often results in lots of variables loading highly onto a single factor.

Varimax is the opposite in that it attempts to maximize the dispersion of loadings within factors.
Therefore, it tries to load a smaller number of variables highly onto each factor, resulting in more interpretable clusters of factors.

Equamax is a hybrid of the other two approaches and is reported to behave fairly erratically.
In most circumstances the default of 25 (the maximum number of iterations for convergence) is more than adequate for SPSS to find a solution for a given data set.
However, if you have a large data set (like we have here) then the computer might have difficulty finding a solution (especially for oblique rotation). To allow for the large data set we are using, change the value to 30.
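For reference, a rotated solution equivalent in spirit to the SPSS analysis can be sketched with the third-party factor_analyzer package; the DataFrame df, the four-factor choice, and the principal-axis method are assumptions for illustration only.

```python
# Hedged sketch of an orthogonal (varimax) rotation with the third-party
# factor_analyzer package, assuming `df` holds the items and 4 factors are wanted.
import pandas as pd
from factor_analyzer import FactorAnalyzer

fa = FactorAnalyzer(n_factors=4, rotation="varimax", method="principal")
fa.fit(df)

rotated = pd.DataFrame(fa.loadings_, index=df.columns,
                       columns=["F1", "F2", "F3", "F4"])
print(rotated.round(2))

# rotation="oblimin" would give an oblique solution in which factors may correlate.
```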

Scores
The factor scores dialog box can be accessed by clicking on the Scores button in the
main dialog box.
This option allows you to save factor scores for each subject in the data
editor.
SPSS creates a new column for each factor extracted and then places
the factor score for each subject within that column.
These scores can then be used for further analysis, or simply to identify
groups of subjects who score highly on particular factors.

There are three methods of obtaining these scores.
If you want to ensure that factor scores are uncorrelated, select the Anderson-Rubin method.
Options
This set of options can be obtained by clicking on the Options button in the main dialog box.
Missing data are a problem for factor analysis just like most other procedures, and SPSS
provides a choice of excluding cases or estimating a value for a case.
You should consider the distribution of missing data.

If the missing data are non-normally distributed or the sample size after exclusion is
too small then estimation is necessary.
SPSS uses the mean as an estimate (Replace with mean).
These procedures lower the standard deviation of variables and so can lead to
significant results that would otherwise be non-significant.
Therefore, if missing data are random, you might consider excluding cases. SPSS
allows you to either Exclude cases listwise, in which case any subject with missing
data for any variable is excluded, or to Exclude cases pairwise, in which case a
subject's data are excluded only from calculations for which a datum is missing.
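For readers working outside SPSS, these three missing-data choices have rough pandas equivalents; df is again an assumed DataFrame with NaN marking missing responses.

```python
# Illustrative pandas analogues of the SPSS missing-data options.
listwise = df.dropna()                # Exclude cases listwise: drop any subject with a missing value
pairwise_R = df.corr()                # pandas .corr() already uses pairwise deletion per pair of items
mean_imputed = df.fillna(df.mean())   # Replace with mean: shrinks SDs, as warned above
```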
The final two options relate to how coefficients are displayed.
By default SPSS will list variables in the order in which they are entered into the data editor. Usually, this
format is most convenient.
However, when interpreting factors it is sometimes useful to list variables by size.
By selecting Sorted by size, SPSS will order the variables by their factor loadings.
In fact, it does this sorting fairly intelligently so that all of the variables that load highly onto the same
factor are displayed together.
The second option is to Suppress absolute values less than a specified value (by default 0.1).

This option ensures that factor loadings within ±0.1 are not displayed in the output.
Again, this option is useful for assisting in interpretation.
The default value is not that useful and I recommend changing it either to 0.4 (for
interpretation purposes) or to a value reflecting the expected value of a significant
factor loading given the sample size.
For this example set the value at .40.
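A rough pandas equivalent of "Sorted by size" plus "Suppress absolute values less than 0.4" can be sketched on the hypothetical rotated loadings table from the earlier rotation sketch (the sorting here is by each variable's largest loading, which only approximates SPSS's grouping by factor).

```python
# Rough sketch of sorting loadings by size and suppressing those below 0.40,
# applied to the `rotated` DataFrame from the earlier rotation sketch.
display = rotated.where(rotated.abs() >= 0.40)          # blank out small loadings
order = display.abs().max(axis=1).sort_values(ascending=False).index
display = display.reindex(order)                         # largest loadings first
print(display.round(2).fillna(""))
```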

Interpreting Output from SPSS

Preliminary Analysis
The first body of output concerns data screening, assumption testing and sampling adequacy.
You'll find several large tables (or matrices) that tell us interesting things about our data.
If you selected the Univariate descriptives option then the first table will contain descriptive
statistics for each variable (the mean, standard deviation and number of cases).
The table also includes the number of missing cases; this summary is a useful way to
determine the extent of missing data.
In summary, all questions in the SPSS Stress Test correlate fairly well with all others (this is partly
because of the large sample) and none of the correlation coefficients are particularly large; therefore,
there is no need to consider eliminating any questions at this stage.

The KMO statistic can be calculated for individual and multiple variables and represents the ratio of the squared correlation between variables to the squared partial correlation between variables. In this instance, the statistic is calculated for all 23 variables simultaneously.
The KMO statistic varies between 0 and 1. A value of 0 indicates that
the sum of partial correlations is large relative to the sum of
correlations, indicating diffusion in the pattern of correlations (bad
news).
A value close to 1 indicates that patterns of correlations are relatively
compact and so factor analysis should yield distinct and reliable
factors.
Kaiser (1974) recommends accepting values greater than 0.5 as
acceptable (values below this should lead you to either collect more data
or rethink which variables to include).
Furthermore, values between 0.5 and 0.7 are mediocre, values between
0.7 and 0.8 are good, values between 0.8 and 0.9 are great and values
above 0.9 are superb.
For these data the value is 0.93 which falls into the range of being superb:
so, we should be confident that factor analysis is appropriate for these
data.
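The ratio described above can also be computed from scratch: the partial correlations come from the inverse of the correlation matrix (the anti-image). This is a sketch under the assumption that R is the items' correlation matrix as a numpy array.

```python
# From-scratch sketch of the overall KMO statistic, assuming `R` is the
# item correlation matrix as a square numpy array.
import numpy as np

def kmo_overall(R: np.ndarray) -> float:
    R_inv = np.linalg.inv(R)
    d = np.sqrt(np.outer(np.diag(R_inv), np.diag(R_inv)))
    partial = -R_inv / d                          # partial (anti-image) correlations
    mask = ~np.eye(R.shape[0], dtype=bool)        # off-diagonal elements only
    r2 = (R[mask] ** 2).sum()                     # sum of squared correlations
    p2 = (partial[mask] ** 2).sum()               # sum of squared partial correlations
    return r2 / (r2 + p2)                         # close to 1 = compact pattern (good)

# Example use: kmo = kmo_overall(df.corr().to_numpy())
```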

Factor Extraction
The first part of the factor extraction process is to determine the linear components within the data set (the eigenvectors) by calculating the eigenvalues of the R-matrix.
We know that there are as many components (eigenvectors) in the R-matrix as there are variables, but most will be unimportant.
To determine the importance of a particular vector we look at the magnitude of the associated eigenvalue.
We can then apply criteria to determine which factors to retain and which to discard.

By default SPSS uses Kaiser's criterion of retaining factors with eigenvalues greater than 1.

The eigenvalues associated with each factor represent the variance explained by that particular linear component, and SPSS also displays the eigenvalue in terms of the percentage of variance explained (so, factor 1 explains 31.696% of total variance).
It should be clear that the first few factors explain
relatively large amounts of variance (especially
factor 1) whereas subsequent factors explain
only small amounts of variance.
SPSS then extracts all factors with eigenvalues
greater than 1, which leaves us with four factors.
The eigenvalues associated with these factors
are again displayed (and the percentage of
variance explained) in the columns labelled
Extraction Sums of Squared Loadings.
The values in this part of the table are the same
as the values before extraction, except that the
values for the discarded factors are ignored
(thus, the table is blank after the fourth factor).
In the final part of the table (labeled Rotation
Sums of Squared Loadings), the eigenvalues of
the factors after rotation are displayed.

In optimizing the factor structure, one consequence for these data is that the relative importance of the four factors is equalized.
Before rotation, factor 1 accounted for considerably more variance than the remaining three (31.696% compared to 7.560, 5.725, and 5.336%); however, after rotation it accounts for only 16.219% of variance (compared to 14.523, 11.099 and 8.475% respectively).

The scree plot is shown in SPSS with an arrow indicating the point of inflexion on the curve.
This curve is difficult to interpret because it begins to tail off after three factors, but there is another drop after four factors before a stable plateau is reached.
Therefore, we could probably justify retaining either two or four factors.
Given the large sample, it is probably safe to rely on Kaiser's criterion; however, you might like to rerun the analysis specifying that SPSS extract only two factors and compare the results.

Factor Rotation
The first analysis we ran used an orthogonal rotation (Varimax).
However, we could have run the analysis using an oblique rotation.
The results can sometimes differ depending on the rotation method.

Orthogonal Rotation (Varimax)


SPSS Output shows the rotated component matrix (also called the rotated
factor matrix in factor analysis) which is a matrix of the factor loadings for each
variable onto each factor.
This matrix contains the same information as the component matrix in SPSS except
that it is calculated after rotation.
There are several things to consider about the format of this matrix.

First, factor loadings less than 0.4 have not been displayed because we asked
for these loadings to be suppressed using the Options dialog box.
If you didn't select this option, or didn't adjust the criterion value to 0.4, then your
output will differ.

Second, the variables are listed in the order of the size of their factor loadings.
By default, SPSS orders the variables as they are in the data editor; however, we
asked for the output to be Sorted by size using the Options dialog box.
If this option was not selected, your output will look different.
I have allowed the variable labels to be printed to aid interpretation.

The original logic behind suppressing loadings less than 0.4 was based on
Stevens' (1992; a stats guru) suggestion that this cut-off point is appropriate
for interpretative purposes (i.e. loadings greater than 0.4 represent substantive
values).

Unrotated Solution (Before rotation, most variables loaded highly onto the first factor and the remaining factors didn't really get a look-in).

Rotated Solution (However, the rotation of the factor structure has clarified things considerably: there are four factors and variables load very highly onto only one factor, with the exception of one question).

The next step is to look at the content of questions that load onto the same
factor to try to identify common themes.
If the mathematical factor produced by the analysis represents some real-world
construct then common themes among highly loading questions can help us
identify what the construct might be.
The questions that load highly on factor 1 seem to all relate to using computers or
SPSS.
Therefore, we might label this factor fear of computers.
The questions that load highly on factor 2 all seem to relate to different aspects of
statistics; therefore, we might label this factor fear of statistics.
The three questions that load highly on factor 3 all seem to relate to mathematics;
therefore, we might label this factor fear of mathematics.
Finally, the questions that load highly on factor 4 all contain some component of social
evaluation from friends; therefore, we might label this factor peer evaluation.

This analysis seems to reveal that the initial questionnaire, in reality, is composed
of four sub-scales: fear of computers, fear of statistics, fear of mathematics, and
fear of negative peer evaluation.
There are two possibilities here.
The first is that the SPSS Stress Test failed to measure what it set out to (namely
SPSS anxiety) but does measure some related constructs.
The second is that these four constructs are sub-components of SPSS anxiety;
however, the factor analysis does not indicate which of these possibilities is true.

In the original analysis I asked for scores to be calculated based on the Anderson-Rubin method (which is why they are uncorrelated).
You will find these scores in the data editor.
There should be four new columns of data (one for each factor) labeled
FAC1_1, FAC2_1, FAC3_1 and FAC4_1 respectively.
If you asked for factor scores in the oblique rotation then these
scores will appear in the data editor in four further columns labeled
FAC1_2 and so on.
These factor scores can be listed in the output viewer using the
Analyze > Reports > Case Summaries... menu path.
Given that there are over 1500 cases you might like to restrict the
output to the first 10 or 20.

It should be pretty clear that subject 9 scored highly on all four factors
and so this person is very anxious about statistics, computing and
maths, but less so about peer evaluation (factor 4).

Factor scores can be used in this way to assess the relative fear of
one person compared to another, or we could add the scores up to
obtain a single score for each subject (that we might assume
represents SPSS anxiety as a whole).
We can also use factor scores in regression when groups of
predictors correlate so highly that there is multicollinearity.
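To illustrate that final point, the factor scores can replace the original, collinear items as regression predictors. This sketch assumes the scores DataFrame from the factor-scores sketch above and a hypothetical outcome variable exam_performance; neither is part of the original analysis.

```python
# Sketch: regress a hypothetical outcome on factor scores instead of the raw,
# highly correlated items, sidestepping multicollinearity among predictors.
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(scores, exam_performance)   # `scores`: FAC1_1 ... FAC4_1
print(model.coef_)    # one coefficient per factor, with (near-)uncorrelated predictors
```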
