Lesson 12: Factor Analysis
Overview
Factor Analysis is a method for modeling observed variables, and their covariance structure, in terms of a smaller number of underlying unobservable (latent) “factors.” The factors typically are viewed as broad concepts or ideas that may describe an observed phenomenon. For example, a basic desire of obtaining a certain social level might explain most consumption behavior. These unobserved factors are more interesting to the social scientist than the observed quantitative measurements.
Factor analysis is generally an exploratory/descriptive method that requires many subjective judgments. It is a widely used tool, and often a controversial one, because the models, methods, and subjective choices are so flexible that debates about interpretations frequently occur.
The method is similar to principal components although, as the textbook points out, factor analysis is more elaborate. In one sense, factor analysis is an inversion of principal components. In factor analysis, we model the observed variables as linear functions of the “factors.” In principal components, we create new variables that are linear combinations of the observed variables. In both PCA and FA, the dimension of the data is reduced. Recall that in PCA, the interpretation of the principal components is often not very clean. A particular variable may, on occasion, contribute significantly to more than one of the components. Ideally, we would like each variable to contribute significantly to only one component. A technique called factor rotation is employed toward that goal. Examples of fields where factor analysis is used include physiology, health, intelligence, sociology, and sometimes ecology, among others.
Objectives
 Understand the terminology of factor analysis, including the interpretation of factor loadings, specific variances, and communalities;
 Understand how to apply both principal component and maximum likelihood methods for estimating the parameters of a factor model;
 Understand factor rotation, and interpret rotated factor loadings.
12.1  Notations and Terminology
Notation
Collect all of the variables \(X_i\) into a vector \(\mathbf{X}\) for each individual subject. Here \(X_i\) denotes observable trait i. These are the data from each subject and are collected into a vector of traits.
\(\textbf{X} = \left(\begin{array}{c}X_1\\X_2\\\vdots\\X_p\end{array}\right) = \text{vector of traits}\)
This is a random vector, with a population mean. Assume that vector of traits \(\mathbf{X}\) is sampled from a population with population mean vector:
\(\boldsymbol{\mu} = \left(\begin{array}{c}\mu_1\\\mu_2\\\vdots\\\mu_p\end{array}\right) = \text{population mean vector}\)
Here, \(\mathrm { E } \left( X _ { i } \right) = \mu _ { i }\) denotes the population mean of variable i.
Consider m unobservable common factors \(f _ { 1 } , f _ { 2 } , \dots , f _ { m }\). The \(i^{th}\) common factor is \(f _ { i } \). Generally, m is going to be substantially less than p .
The common factors are also collected into a vector,
\(\mathbf{f} = \left(\begin{array}{c}f_1\\f_2\\\vdots\\f_m\end{array}\right) = \text{vector of common factors}\)
Model
Our factor model can be thought of as a series of multiple regressions, predicting each of the observable variables \(X_{i}\) from the values of the unobservable common factors \(f_{i}\) :
\begin{align} X_1 & = \mu_1 + l_{11}f_1 + l_{12}f_2 + \dots + l_{1m}f_m + \epsilon_1\\ X_2 & = \mu_2 + l_{21}f_1 + l_{22}f_2 + \dots + l_{2m}f_m + \epsilon_2 \\ & \vdots \\ X_p & = \mu_p + l_{p1}f_1 + l_{p2}f_2 + \dots + l_{pm}f_m + \epsilon_p \end{align}
Here, the variable means \(\mu_{1}\) through \(\mu_{p}\) can be regarded as the intercept terms for the multiple regression models.
The regression coefficients \(l_{ij}\) (the partial slopes) for all of these multiple regressions are called factor loadings. Here, \(l_{ij}\) = loading of the \(i^{th}\) variable on the \(j^{th}\) factor. These are collected into a matrix as shown here:
\(\mathbf{L} = \left(\begin{array}{cccc}l_{11}& l_{12}& \dots & l_{1m}\\l_{21} & l_{22} & \dots & l_{2m}\\ \vdots & \vdots & & \vdots \\l_{p1} & l_{p2} & \dots & l_{pm}\end{array}\right) = \text{matrix of factor loadings}\)
And finally, the errors \(\varepsilon _{i}\) are called the specific factors. Here, \(\varepsilon _{i}\) = specific factor for variable i. The specific factors are also collected into a vector:
\(\boldsymbol{\epsilon} = \left(\begin{array}{c}\epsilon_1\\\epsilon_2\\\vdots\\\epsilon_p\end{array}\right) = \text{vector of specific factors}\)
In summary, the basic model is like a regression model. Each of our response variables X is predicted as a linear function of the unobserved common factors \(f_{1}\) , \(f_{2}\) through \(f_{m}\) . Thus, our explanatory variables are \(f_{1}\) , \(f_{2}\) through \(f_{m}\). We have m unobserved factors that control the variation among our data.
We will generally reduce this into matrix notation as shown in this form here:
\(\textbf{X} = \boldsymbol{\mu} + \textbf{Lf}+\boldsymbol{\epsilon}\)
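To make the model concrete, here is a small simulation sketch in NumPy; the means, loadings, and specific variances below are made-up illustrative values, not quantities from this lesson. Data generated as \(\mathbf{X} = \boldsymbol{\mu} + \mathbf{Lf} + \boldsymbol{\epsilon}\) have a sample covariance matrix that approximates \(\mathbf{LL'} + \boldsymbol{\Psi}\), the structure derived in the next section.

```python
import numpy as np

rng = np.random.default_rng(0)
p, m, n = 4, 2, 10_000                       # variables, factors, subjects

mu = np.array([10.0, 5.0, 0.0, 2.0])         # hypothetical means
L = np.array([[0.9, 0.0],                    # hypothetical loadings
              [0.8, 0.3],
              [0.1, 0.7],
              [0.0, 0.6]])
psi = np.array([0.19, 0.27, 0.50, 0.64])     # hypothetical specific variances

f   = rng.standard_normal((n, m))                  # common factors: mean 0, var 1
eps = rng.standard_normal((n, p)) * np.sqrt(psi)   # specific factors: var psi_i
X   = mu + f @ L.T + eps                           # one row per subject

# The sample covariance of X approximates LL' + Psi
S = np.cov(X, rowvar=False)
print(np.abs(S - (L @ L.T + np.diag(psi))).max())  # small, shrinking as n grows
```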
12.2  Model Assumptions
Mean

The specific factors or random errors all have mean zero: \(E(\epsilon_i) = 0\) ; i = 1, 2, ... , p

The common factors, the f's, also have mean zero: \(E(f_i) = 0\); i = 1, 2, ... , m
A consequence of these assumptions is that the mean response of the ith trait is \(\mu_i\). That is,
\(E(X_i) = \mu_i\)
Variance

The common factors have variance one: \(\text{var}(f_i) = 1\); i = 1, 2, ... , m
The specific factors have variance \(\psi_i\): \(\text{var}(\epsilon_i) = \psi_i\); i = 1, 2, ... , p. Here, \(\psi_i\) is called the specific variance.
Correlation

The common factors are uncorrelated with one another: \(\text{cov}(f_i, f_j) = 0\) for i ≠ j

The specific factors are uncorrelated with one another: \(\text{cov}(\epsilon_i, \epsilon_j) = 0\) for i ≠ j

The specific factors are uncorrelated with the common factors: \(\text{cov}(\epsilon_i, f_j) = 0\); i = 1, 2, ... , p; j = 1, 2, ... , m
These assumptions are necessary to uniquely estimate the parameters. An infinite number of equally well-fitting models with different values for the parameters may be obtained unless these assumptions are made.
Under this model, the variance for the ith observed variable is equal to the sum of the squared loadings for that variable plus the specific variance:
The variance of trait i is: \(\sigma^2_i = \text{var}(X_i) = \sum_{j=1}^{m}l^2_{ij}+\psi_i\)
This derivation is based on the previous assumptions. \(\sum_{j=1}^{m}l^2_{ij}\) is called the communality for variable i. Later on, we will see how this is a measure of how well the model performs for that particular variable. The larger the communality, the better the model performance for the ith variable.
The covariance between pairs of traits i and j is: \(\sigma_{ij}= \text{cov}(X_i, X_j) = \sum_{k=1}^{m}l_{ik}l_{jk}\)
The covariance between trait i and factor j is: \(\text{cov}(X_i, f_j) = l_{ij}\)
In matrix notation, our model for the variance-covariance matrix is expressed as shown below:
\(\Sigma = \mathbf{LL'} + \boldsymbol{\Psi}\)
This is the matrix of factor loadings times its transpose, plus a diagonal matrix containing the specific variances.
Here \(\boldsymbol{\Psi}\) equals:
\(\boldsymbol{\Psi} = \left(\begin{array}{cccc}\psi_1 & 0 & \dots & 0 \\ 0 & \psi_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \dots & \psi_p \end{array}\right)\)
A parsimonious (simplified) model for the variance-covariance matrix is obtained and used for estimation.
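As a quick numerical check of these variance and covariance formulas, here is a sketch with hypothetical loadings and specific variances (chosen so that each variable has variance one):

```python
import numpy as np

# Hypothetical loadings: p = 3 variables on m = 2 factors
L = np.array([[0.7, 0.2],
              [0.5, 0.5],
              [0.1, 0.8]])
psi = np.array([0.47, 0.50, 0.35])   # hypothetical specific variances

communality = (L**2).sum(axis=1)     # sum_j l_ij^2: 0.53, 0.50, 0.65
var_X = communality + psi            # sigma_i^2 = communality + specific variance
Sigma = L @ L.T + np.diag(psi)       # implied variance-covariance matrix

print(var_X)        # each variance equals 1.0 by construction
print(Sigma[0, 1])  # cov(X_1, X_2) = 0.7*0.5 + 0.2*0.5 = 0.45
```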
Notes
 The model assumes that the data are a linear function of the common factors. However, because the common factors are not observable, we cannot check for linearity.
 The variance-covariance matrix is a symmetric matrix; that is, the covariance between variables i and j is the same as the covariance between j and i. For this model:
\(\Sigma = \mathbf{LL'} + \boldsymbol{\Psi}\)
The variance-covariance matrix is going to have p(p + 1)/2 unique elements of \(\Sigma\), approximated by:
 mp factor loadings in the matrix \(\mathbf{L}\), and
 p specific variances
This means that there are mp + p parameters in the factor model. Ideally, mp + p should be substantially smaller than p(p + 1)/2. However, if m is too small, the mp + p parameters may not be adequate to describe \(\Sigma\). It may also be the case that this is not the right model, and the data cannot be reduced to a linear combination of factors.
 If we have more than one variable in our analysis, that is if p > 1, the model is inherently ambiguous. To explain that, let \(\mathbf{T}\) be any m x m orthogonal matrix. A matrix is orthogonal if its inverse is equal to the transpose of the original matrix.
\(\mathbf{T'T = TT' = I} \)
We can write our factor model in matrix notation:
\(\textbf{X} = \boldsymbol{\mu} + \textbf{Lf}+ \boldsymbol{\epsilon} = \boldsymbol{\mu} + \mathbf{LTT'f}+ \boldsymbol{\epsilon} = \boldsymbol{\mu} + \mathbf{L^*f^*}+\boldsymbol{\epsilon}\)
Note that this does not change the calculation, because the identity matrix times any matrix is the original matrix. This results in an alternative factor model, where the relationship between the new factor loadings and the original factor loadings is:
\(\mathbf{L^*} = \textbf{LT}\)
and the relationship between the new common factors and the original common factors is:
\(\mathbf{f^*} = \textbf{T'f}\)
This gives a model that fits equally well. Moreover, because there are infinitely many orthogonal matrices, there are infinitely many alternative models. This model, as it turns out, satisfies all of the assumptions discussed earlier.
Note...
\(E(\mathbf{f^*}) = E(\textbf{T'f}) = \textbf{T'}E(\textbf{f}) = \mathbf{T'0} =\mathbf{0}\),
\(\text{var}(\mathbf{f^*}) = \text{var}(\mathbf{T'f}) = \mathbf{T'}\text{var}(\mathbf{f})\mathbf{T} = \mathbf{T'IT} = \mathbf{T'T} = \mathbf{I}\)
and
\(\text{cov}(\mathbf{f^*, \boldsymbol{\epsilon}}) = \text{cov}(\mathbf{T'f, \boldsymbol{\epsilon}}) = \mathbf{T'}\text{cov}(\mathbf{f, \boldsymbol{\epsilon}}) = \mathbf{T'0} = \mathbf{0}\)
So f* satisfies all of the assumptions, and hence f* is an equally valid collection of common factors. There is a certain apparent ambiguity to these models. This ambiguity is later used to justify a factor rotation to obtain a more parsimonious description of the data.
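A short numerical illustration of this ambiguity, using hypothetical loadings and an arbitrary 2 × 2 rotation for \(\mathbf{T}\):

```python
import numpy as np

L = np.array([[0.9, 0.1],           # hypothetical loadings
              [0.7, 0.4],
              [0.2, 0.8]])
Psi = np.diag([0.18, 0.35, 0.32])   # hypothetical specific variances

theta = np.pi / 6                   # any orthogonal T works; here a 30-degree rotation
T = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

L_star = L @ T                      # rotated loadings L* = LT

# Both loading matrices imply exactly the same covariance structure
same = np.allclose(L @ L.T + Psi, L_star @ L_star.T + Psi)
print(same)   # True, since T T' = I
```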
12.3  Principal Component Method
We consider two different methods to estimate the parameters of a factor model:
 Principal Component Method
 Maximum Likelihood Estimation
A third method, principal factor method, is also available but not considered in this class.
Principal Component Method
Let \(X_i\) be a vector of observations for the \(i^{th}\) subject:
\(\mathbf{X_i} = \left(\begin{array}{c}X_{i1}\\ X_{i2}\\ \vdots \\ X_{ip}\end{array}\right)\)
\(\mathbf{S}\) denotes our sample variance-covariance matrix and is expressed as:
\(\textbf{S} = \dfrac{1}{n-1}\sum\limits_{i=1}^{n}\mathbf{(X_i - \bar{x})(X_i - \bar{x})'}\)
We have p eigenvalues for this variancecovariance matrix as well as corresponding eigenvectors for this matrix.
Eigenvalues of \(\mathbf{S}\):
\(\hat{\lambda}_1, \hat{\lambda}_2, \dots, \hat{\lambda}_p\)
Eigenvectors of \(\mathbf{S}\):
\(\hat{\mathbf{e}}_1, \hat{\mathbf{e}}_2, \dots, \hat{\mathbf{e}}_p\)
Recall that the variance-covariance matrix can be re-expressed in the following form as a function of the eigenvalues and the eigenvectors:
Spectral Decomposition of \(\Sigma\)
\(\Sigma = \sum_{i=1}^{p}\lambda_i \mathbf{e}_i\mathbf{e}'_i \cong \sum_{i=1}^{m}\lambda_i \mathbf{e}_i\mathbf{e}'_i = \left(\begin{array}{cccc}\sqrt{\lambda_1}\mathbf{e}_1 & \sqrt{\lambda_2}\mathbf{e}_2 & \dots & \sqrt{\lambda_m}\mathbf{e}_m\end{array}\right) \left(\begin{array}{c}\sqrt{\lambda_1}\mathbf{e}'_1\\ \sqrt{\lambda_2}\mathbf{e}'_2\\ \vdots\\ \sqrt{\lambda_m}\mathbf{e}'_m\end{array}\right) = \mathbf{LL'}\)
The idea behind the principal component method is to approximate this expression. Instead of summing from 1 to p, we now sum from 1 to m, ignoring the last p - m terms in the sum, and obtain the third expression. We can rewrite this as shown in the fourth expression, which is used to define the matrix of factor loadings \(\mathbf{L}\), yielding the final expression in matrix notation.
This yields the following estimator for the factor loadings:
\(\hat{l}_{ij} = \hat{e}_{ji}\sqrt{\hat{\lambda}_j}\)
This forms the matrix \(\mathbf{L}\) of factor loadings in the factor analysis, which is then multiplied by its transpose. To estimate the specific variances, recall that our factor model for the variance-covariance matrix is
\(\boldsymbol{\Sigma} = \mathbf{LL'} + \boldsymbol{\Psi}\)
in matrix notation. \(\Psi\) is then equal to the variance-covariance matrix minus \(\mathbf{LL'}\):
\( \boldsymbol{\Psi} = \boldsymbol{\Sigma}  \mathbf{LL'}\)
This in turn suggests that the specific variances, the diagonal elements of \(\Psi\), are estimated with this expression:
\(\hat{\Psi}_i = s^2_i - \sum\limits_{j=1}^{m}\hat{\lambda}_j \hat{e}^2_{ji}\)
We take the sample variance for the ith variable and subtract the sum of the squared factor loadings (i.e., the communality).
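These two estimators can be sketched generically in NumPy (this is not the SAS or Minitab routine used below, and the small correlation matrix is hypothetical):

```python
import numpy as np

def pc_factor_estimates(S, m):
    """Principal component estimates of loadings and specific variances
    from a covariance or correlation matrix S, retaining m factors."""
    eigvals, eigvecs = np.linalg.eigh(S)            # ascending order
    idx = np.argsort(eigvals)[::-1][:m]             # m largest eigenvalues
    lam, E = eigvals[idx], eigvecs[:, idx]
    L_hat = E * np.sqrt(lam)                        # l_ij = e_ji * sqrt(lambda_j)
    psi_hat = np.diag(S) - (L_hat**2).sum(axis=1)   # s_i^2 minus communality
    return L_hat, psi_hat

# Hypothetical 3x3 correlation matrix for illustration
R = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])
L1, psi1 = pc_factor_estimates(R, m=1)
# The diagonal of L1 @ L1.T + diag(psi1) reproduces diag(R) exactly by construction
```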
12.4  Example: Places Rated Data  Principal Component Method
Example 12-1: Places Rated
Let's revisit the Places Rated Example from Lesson 11. Recall that the Places Rated Almanac (Boyer and Savageau) rates 329 communities according to nine criteria:
 Climate and Terrain
 Housing
 Health Care & Environment
 Crime
 Transportation
 Education
 The Arts
 Recreation
 Economic
Except for housing and crime, the higher the score, the better. For housing and crime, the lower the score, the better.
Our objective here is to describe the relationships among the variables.
Before carrying out a factor analysis, we need to determine m. How many common factors should be included in the model? This requires a determination of how many parameters will be involved.
For p = 9, the variancecovariance matrix \(\Sigma\) contains
\(\dfrac{p(p+1)}{2} = \dfrac{9 \times 10}{2} = 45\)
unique elements or entries. For a factor analysis with m factors, the number of parameters in the factor model is equal to
\(p(m+1) = 9(m+1)\)
Taking m = 4 gives 45 parameters in the factor model, which equals the number of unique elements in \(\Sigma\) and so would result in no dimension reduction. So in this case, we select m = 3, yielding 36 parameters in the factor model and thus a dimension reduction in our analysis.
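The parameter counting is simple enough to verify directly:

```python
p = 9
unique_entries = p * (p + 1) // 2      # 45 unique elements in Sigma
for m in range(1, 5):
    n_params = p * (m + 1)             # mp loadings + p specific variances
    print(m, n_params, n_params < unique_entries)
# m = 3 gives 36 < 45 (a reduction); m = 4 gives 45 (no reduction)
```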
It is also common to look at the results of the principal components analysis. The output from Lesson 11.6 is below. The first three components explain 62% of the variation. We consider this to be sufficient for the current example and will base future analyses on three components.
Component  Eigenvalue  Proportion  Cumulative 
1  3.2978  0.3664  0.3664 
2  1.2136  0.1348  0.5013 
3  1.1055  0.1228  0.6241 
4  0.9073  0.1008  0.7249 
5  0.8606  0.0956  0.8205 
6  0.5622  0.0625  0.8830 
7  0.4838  0.0538  0.9368 
8  0.3181  0.0353  0.9721 
9  0.2511  0.0279  1.0000 
We need to select m so that a sufficient amount of variation in the data is explained. What is sufficient is, of course, subjective and depends on the example at hand.
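As a check, the proportions in the table above follow directly from the eigenvalues, which sum to p = 9 for standardized data:

```python
import numpy as np

# Eigenvalues from the principal components output above
lam = np.array([3.2978, 1.2136, 1.1055, 0.9073, 0.8606,
                0.5622, 0.4838, 0.3181, 0.2511])

prop = lam / lam.sum()      # lam.sum() is 9.0 here
cum = np.cumsum(prop)
print(np.round(cum, 4))     # 0.3664, 0.5013, 0.6241, ... matching the table
```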
Alternatively, often in social sciences, the underlying theory within the field of study indicates how many factors to expect. In psychology, for example, a circumplex model suggests that mood has two factors: positive affect and arousal. So a two-factor model may be considered for questionnaire data regarding the subjects' moods. In many respects, this is a better approach because then you are letting the science drive the statistics rather than the statistics drive the science! If you can, use your or a field expert's scientific understanding to determine how many factors should be included in your model.
Using SAS
The factor analysis is carried out using the program as shown below:
Download the SAS Program here: places2.sas
View the video explanation of the SAS code.
Using Minitab
View the video below to see how to perform a factor analysis using the Minitab statistical software application.
Initially, we will look at the factor loadings. The factor loadings are obtained by using this expression
\(\hat{l}_{ij} = \hat{e}_{ji}\sqrt{\hat{\lambda}_j}\)
These are summarized in the table below. The factor loadings are only recorded for the first three factors because we set m=3. We should also note that the factor loadings are the correlations between the factors and the variables. For example, the correlation between the Arts and the first factor is about 0.86. Similarly, the correlation between climate and that factor is only about 0.28.
Factor  
Variable  1  2  3 
Climate  0.286  0.076  0.841 
Housing  0.698  0.153  0.084 
Health  0.744  0.410  0.020 
Crime  0.471  0.522  0.135 
Transportation  0.681  0.156  0.148 
Education  0.498  0.498  0.253 
Arts  0.861  0.115  0.011 
Recreation  0.642  0.322  0.044 
Economics  0.298  0.595  0.533 
Interpreting factor loadings is similar to interpreting the coefficients for principal component analysis. We want to determine some inclusion criterion, which in many instances, may be somewhat arbitrary. In the above table, the values that we consider large are in boldface, using about .5 as the cutoff. The following statements are based on this criterion:

Factor 1 is correlated most strongly with Arts (0.861) and also correlated with Health, Housing, Transportation, Recreation, and to a lesser extent Crime and Education. You can say that the first factor is primarily a measure of these variables.

Similarly, Factor 2 is correlated most strongly with Crime, Education, and Economics. You can say that the second factor is primarily a measure of these variables.

Likewise, Factor 3 is correlated most strongly with Climate and Economics. You can say that the third factor is primarily a measure of these variables.
The interpretation above is very similar to that obtained in the standardized principal component analysis.
12.5  Communalities
Example 12-1: Continued...
The communalities for the \(i^{th}\) variable are computed by taking the sum of the squared loadings for that variable. This is expressed below:
\(\hat{h}^2_i = \sum\limits_{j=1}^{m}\hat{l}^2_{ij}\)
To understand the computation of communalities, recall the table of factor loadings:
Factor  
Variable  1  2  3 
Climate  0.286  0.076  0.841 
Housing  0.698  0.153  0.084 
Health  0.744  0.410  0.020 
Crime  0.471  0.522  0.135 
Transportation  0.681  0.156  0.148 
Education  0.498  0.498  0.253 
Arts  0.861  0.115  0.011 
Recreation  0.642  0.322  0.044 
Economics  0.298  0.595  0.533 
Let's compute the communality for Climate, the first variable. We square the factor loadings for climate (given in boldface in the table above), then add the results:
\(\hat{h}^2_1 = 0.28682^2 + 0.07560^2 + 0.84085^2 = 0.7950\)
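The same computation extends to all nine variables; here is a short NumPy sketch using the (rounded) loadings from the table above:

```python
import numpy as np

# Factor loadings from the table above, rounded to three decimals
L_hat = np.array([
    [0.286, 0.076, 0.841],   # Climate
    [0.698, 0.153, 0.084],   # Housing
    [0.744, 0.410, 0.020],   # Health
    [0.471, 0.522, 0.135],   # Crime
    [0.681, 0.156, 0.148],   # Transportation
    [0.498, 0.498, 0.253],   # Education
    [0.861, 0.115, 0.011],   # Arts
    [0.642, 0.322, 0.044],   # Recreation
    [0.298, 0.595, 0.533],   # Economics
])

h2 = (L_hat**2).sum(axis=1)       # communality of each variable
print(np.round(h2, 3))            # 0.795 for Climate, 0.518 for Housing, ...
print(round(float(h2.sum()), 3))  # total communality, close to the reported 5.617
```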
Using SAS
The communalities of the 9 variables can be obtained from page 4 of the SAS output as shown below:
Final Communality Estimates: Total = 5.616885  

Climate  housing  health  crime  trans  educate  arts  recreate  econ 
0.79500707  0.51783185  0.72230182  0.51244913  0.50977159  0.56073895  0.75382091  0.51725940  0.72770402 
The value 5.616885 (located just above the individual communalities) is the total communality.
Using Minitab
View the video below to see how to get the communalities using the Minitab statistical software application.
In summary, the communalities are placed into a table:
Variable  Communality 
Climate  0.795 
Housing  0.518 
Health  0.722 
Crime  0.512 
Transportation  0.510 
Education  0.561 
Arts  0.754 
Recreation  0.517 
Economics  0.728 
Total  5.617 
You can think of these values as multiple \(R^{2}\) values for regression models predicting the variables of interest from the 3 factors. The communality for a given variable can be interpreted as the proportion of variation in that variable explained by the three factors. In other words, if we perform multiple regression of climate against the three common factors, we obtain an \(R^{2} = 0.795\), indicating that about 79% of the variation in climate is explained by the factor model. The results suggest that the factor analysis does the best job of explaining variation in climate, the arts, economics, and health.
One assessment of how well this model performs can be obtained from the communalities. We want to see values that are close to one. This indicates that the model explains most of the variation for those variables. In this case, the model does better for some variables than it does for others. The model explains Climate the best, and is not bad for other variables such as Economics, Health and the Arts. However, for other variables such as Crime, Recreation, Transportation and Housing the model does not do a good job, explaining only about half of the variation.
The sum of all communality values is the total communality value:
\(\sum\limits_{i=1}^{p}\hat{h}^2_i = \sum\limits_{j=1}^{m}\hat{\lambda}_j\)
Here, the total communality is 5.617. The proportion of the total variation explained by the three factors is
\(\dfrac{5.617}{9} = 0.624\)
This is the percentage of variation explained in our model. This could be considered an overall assessment of the performance of the model. However, this percentage is the same as the proportion of variation explained by the first three eigenvalues, obtained earlier. The individual communalities tell how well the model is working for the individual variables, and the total communality gives an overall assessment of performance. These are two different assessments.
Because the data are standardized, the variance of each standardized variable is equal to one. The specific variances are computed by subtracting the communality from this variance, as expressed below:
\(\hat{\Psi}_i = 1\hat{h}^2_i\)
For example, the specific variance for Climate is computed as follows:
\(\hat{\Psi}_1 = 1 - 0.795 = 0.205\)
The specific variances are found in the SAS output as the diagonal elements in the table on page 5 as seen below:
Climate  Housing  Health  Crime  Trans  Educate  Arts  Recreate  Econ  

Climate  0.20499  0.00924  0.01476  0.06027  0.03720  0.18537  0.07518  0.12475  0.21735 
Housing  0.00924  0.48217  0.02317  0.28063  0.12119  0.04803  0.07518  0.04032  0.04249 
Health  0.01476  0.02317  0.27770  0.05007  0.15480  0.11537  0.00929  0.09108  0.06527 
Crime  0.06027  0.28063  0.05007  0.48755  0.05497  0.11562  0.00009  0.18377  0.10288 
Trans  0.03720  0.12119  0.15480  0.05497  0.49023  0.14318  0.05439  0.01041  0.12641 
Educate  0.18537  0.04803  0.11537  0.11562  0.14318  0.43926  0.13515  0.05531  0.14197 
Arts  0.07518  0.07552  0.00929  0.00009  0.05439  0.13515  0.24618  0.01926  0.04687 
Recreate  0.12475  0.04032  0.09108  0.18377  0.01041  0.05531  0.01926  0.48274  0.18326 
Econ  0.21735  0.04249  0.06527  0.10288  0.12641  0.14197  0.04687  0.18326  0.27230 
For example, the specific variance for housing is 0.482.
This model provides an approximation to the correlation matrix. We can assess the model's appropriateness with the residuals obtained from the following calculation:
\(s_{ij} - \sum\limits_{k=1}^{m}\hat{l}_{ik}\hat{l}_{jk}; \quad i \ne j = 1, 2, \dots, p\)
This is basically the difference between R and LL', or the correlation between variables i and j minus the expected value under the model. Generally, these residuals should be as close to zero as possible. For example, the residual between Housing and Climate is 0.00924 which is pretty close to zero. However, there are some that are not very good. The residual between Climate and Economy is 0.217. These values give an indication of how well the factor model fits the data.
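Since the full Places Rated correlation matrix lives in the SAS/Minitab output, here is the residual computation sketched on a small hypothetical correlation matrix instead:

```python
import numpy as np

# Hypothetical correlation matrix standing in for the Places Rated R
R = np.array([[1.00, 0.50, 0.30],
              [0.50, 1.00, 0.40],
              [0.30, 0.40, 1.00]])

# One-factor principal component fit (the estimator from Section 12.3)
lam, E = np.linalg.eigh(R)                      # ascending eigenvalues
L_hat = E[:, -1:] * np.sqrt(lam[-1])            # loadings on the first factor
psi_hat = np.diag(R) - (L_hat**2).sum(axis=1)   # specific variances

residual = R - (L_hat @ L_hat.T + np.diag(psi_hat))
print(np.round(residual, 4))  # zero diagonal; off-diagonals measure lack of fit
```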
One disadvantage of the principal component method is that it does not provide a test for lackoffit. We can examine these numbers and determine if we think they are small or close to zero, but we really do not have a test for this. Such a test is available for the maximum likelihood method.
12.6  Final Notes about the Principal Component Method
Unlike competing methods (e.g., maximum likelihood), the estimated factor loadings under the principal component method do not change as the number of factors is increased. However, the communalities and the specific variances do depend on the number of factors in the model. In general, as you increase the number of factors, the communalities increase toward one and the specific variances decrease toward zero.
The diagonal elements of the variance-covariance matrix \(\mathbf{S}\) (or \(\mathbf{R}\)) are equal to the diagonal elements of the model:
\(\mathbf{\hat{L}\hat{L}' + \mathbf{\hat{\Psi}}}\)
The off-diagonal elements are not exactly reproduced. This is in part due to variability in the data (just random chance). Therefore, we want to select the number of factors that makes the off-diagonal elements of the residual matrix small:
\(\mathbf{S  (\hat{L}\hat{L}' + \hat{\Psi})}\)
Here, we have a tradeoff between two conflicting desires. For a parsimonious model, we wish to select the number of factors m to be as small as possible, but for such a model, the residuals could be large. Conversely, by selecting m to be large, we may reduce the sizes of the residuals but at the cost of producing a more complex and less interpretable model (there are more factors to interpret).
Another result to note is that the sum of the squared elements of the residual matrix is no larger than the sum of the squared eigenvalues that were left out:
\(\sum\limits_{j=m+1}^{p}\hat{\lambda}^2_j\)
General Methods used in determining the number of Factors
Below are three common techniques used to determine the number of factors to extract:
 Cumulative proportion of at least 0.80 (or 80% explained variance)
 Eigenvalues of at least one
 Scree plot based on the "elbow" of the plot; that is, where the plot turns and begins to flatten out
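Applied to the Places Rated eigenvalues, the first two rules can be checked mechanically (the scree elbow remains a visual judgment):

```python
import numpy as np

lam = np.array([3.2978, 1.2136, 1.1055, 0.9073, 0.8606,
                0.5622, 0.4838, 0.3181, 0.2511])

cum = np.cumsum(lam) / lam.sum()
m_variance = int(np.argmax(cum >= 0.80)) + 1   # first m reaching 80% explained
m_kaiser = int((lam >= 1.0).sum())             # number of eigenvalues >= 1

print(m_variance)   # 5 (five components reach about 82%)
print(m_kaiser)     # 3
```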
12.7  Maximum Likelihood Estimation Method
Assumption
Maximum likelihood estimation requires that the data are sampled from a multivariate normal distribution. This is a drawback of the method: data are often collected on a Likert scale, especially in the social sciences, and because a Likert scale is discrete and bounded, such data cannot be normally distributed.
Using the maximum likelihood estimation method, we must assume that the data are independently sampled from a multivariate normal distribution with mean vector \(\mu\) and variance-covariance matrix of the form:
\(\boldsymbol{\Sigma} = \mathbf{LL' +\boldsymbol{\Psi}}\)
where \(\mathbf{L}\) is the matrix of factor loadings and \(\Psi\) is the diagonal matrix of specific variances.
As usual, the data vectors for the n subjects are denoted:
\(\mathbf{X_1},\mathbf{X_2}, \dots, \mathbf{X_n}\)
Maximum likelihood estimation involves estimating the mean, the matrix of factor loadings, and the specific variance.
The maximum likelihood estimators for the mean vector \(\mu\), the factor loadings \(\mathbf{L}\), and the specific variances \(\Psi\) are obtained by finding \(\hat{\mathbf{\mu}}\), \(\hat{\mathbf{L}}\), and \(\hat{\mathbf{\Psi}}\) that maximize the log-likelihood given by the following expression:
\(l(\mathbf{\mu, L, \Psi}) = -\dfrac{np}{2}\log{2\pi} - \dfrac{n}{2}\log{|\mathbf{LL' + \Psi}|} - \dfrac{1}{2}\sum_{i=1}^{n}\mathbf{(X_i-\mu)'(LL'+\Psi)^{-1}(X_i-\mu)}\)
The log of the joint probability distribution of the data is maximized. We want to find the values of the parameters (\(\mu\), \(\mathbf{L}\), and \(\Psi\)) that are most compatible with what we see in the data. As was noted earlier, the solutions for these factor models are not unique: equivalent models can be obtained by rotation. If \(\mathbf{L'\Psi^{-1}L}\) is a diagonal matrix, then we may obtain a unique solution.
Computationally this process is complex. In general, there is no closedform solution to this maximization problem so iterative methods are applied. Implementation of iterative methods can run into problems as we will see later.
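To make the objective concrete, here is a sketch that simply evaluates this log-likelihood in NumPy; the data and parameter values are hypothetical, and a real routine would maximize this iteratively rather than evaluate it once:

```python
import numpy as np

def factor_loglik(X, mu, L, psi):
    """Evaluate the factor-model log-likelihood l(mu, L, Psi) above,
    with Sigma = LL' + Psi. A sketch of the objective that iterative
    ML routines maximize, not SAS's actual algorithm."""
    n, p = X.shape
    Sigma = L @ L.T + np.diag(psi)
    _, logdet = np.linalg.slogdet(Sigma)
    diff = X - mu
    quad = np.einsum('ij,jk,ik->', diff, np.linalg.inv(Sigma), diff)
    return -0.5 * (n * p * np.log(2 * np.pi) + n * logdet + quad)

# Hypothetical data generated from a one-factor model
rng = np.random.default_rng(2)
L_true = np.array([[0.9], [0.7], [0.5]])
psi_true = np.array([0.19, 0.51, 0.75])
mu_true = np.zeros(3)
f = rng.standard_normal((2000, 1))
X = mu_true + f @ L_true.T + rng.standard_normal((2000, 3)) * np.sqrt(psi_true)

# The likelihood is higher at the generating parameters than at a bad guess
print(factor_loglik(X, mu_true, L_true, psi_true) >
      factor_loglik(X, mu_true, L_true, 10 * psi_true))
```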
12.8  Example: Places Rated Data
Example 12-2: Places Rated
Using SAS
This method of factor analysis is carried out using the program shown below:
Download the SAS Program here: places3.sas
Here we have specified the Maximum Likelihood Method by setting method=ml. Again, we need to specify the number of factors.
You will notice that this program produces errors and does not complete the factor analysis. We will start out without the Heywood or priors options discussed below to see the error that occurs and how to remedy it.
For m = 3 factors, maximum likelihood estimation fails to converge. An examination of the records of each iteration reveals that the communality of the first variable (Climate) exceeds one during the first iteration. Because a communality must lie between 0 and 1, this is the cause of the failure.
SAS provides a number of different fixes for this kind of error. Most fixes adjust the initial guess, or starting value, for the communalities.
Remedies
 Attempt adding a priors option to the procedure to set an initial guess for the communalities. Available priors include:
 priors=smc: Sets the prior communality of each variable proportional to the \(R^2\) of that variable with all other variables as an initial guess.
 priors=asmc: As above, with an adjustment so that the sum of the communalities is equal to the maximum of the absolute correlations.
 priors=max: Sets the prior communality of each variable to the maximum absolute correlation with any other variable.
 priors=random: Sets the prior communality of each variable to a random number between 0 and 1.
This option is added within the proc factor line of code (proc factor method=ml nfactors=3 priors=smc;). If we begin with better starting values, then we may have better luck at convergence. Unfortunately, in trying each of these options (including running the random option multiple times), we find that they are ineffective for the Places Rated data, so the second option must be considered.
 Attempt adding the Heywood option to the procedure (proc factor method=ml nfactors=3 heywood;). This sets communalities greater than one back to one, allowing iterations to proceed. In other words, if a communality value falls out of bounds, it is replaced by a value of one. This will always yield a solution, but frequently the solution will not adequately fit the data.
We start with the same values for the communalities, and at each iteration we obtain new values. The criterion is the value that the algorithm tries to minimize in order to obtain the estimates; we can see that it decreases with each iteration.
Iteration  Criterion  Ridge  Change  Communalities
1  0.3291161  0.0000  0.2734  0.47254  0.40913  0.73500  0.22107  0.38516  0.26178  0.75125  0.46384  0.15271
2  0.2946707  0.0000  0.5275  1.00000  0.37872  0.75101  0.20469  0.36111  0.26155  0.75298  0.48979  0.11995
3  0.2877116  0.0000  0.0577  1.00000  0.41243  0.80868  0.22168  0.38551  0.26263  0.74546  0.53277  0.11601
4  0.2876330  0.0000  0.0055  1.00000  0.41336  0.81414  0.21647  0.38365  0.26471  0.74493  0.53724  0.11496
5  0.2876314  0.0000  0.0007  1.00000  0.41392  0.81466  0.21595  0.38346  0.26475  0.74458  0.53794  0.11442
You can see in the second iteration that, rather than report a communality greater than one, SAS replaces it with the value one and then proceeds as usual through the iterations.
After five iterations the algorithm converges, as indicated by the statement on the second page of the output. The algorithm converged to a setting where the communality for Climate is equal to one.
Using Minitab
Click on the video below to see how to perform a factor analysis with maximum likelihood using the Minitab statistical software application.
12.9  Goodness-of-Fit
Before we proceed, we would like to determine whether the model adequately fits the data. The goodness-of-fit test in this case compares the variance-covariance matrix under the parsimonious model to the variance-covariance matrix without any restriction, i.e., under the assumption that the variances and covariances can take any values. The variance-covariance matrix under the assumed model can be expressed as:
\(\mathbf{\Sigma = LL' + \Psi}\)
\(\mathbf{L}\) is the matrix of factor loadings, and the diagonal elements of \(\boldsymbol{\Psi}\) are equal to the specific variances. This is a very specific structure for the variance-covariance matrix. A more general structure would allow those elements to take any values. To assess goodness-of-fit, we use the Bartlett-corrected likelihood ratio test statistic:
\(X^2 = \left(n-1-\frac{2p+4m+5}{6}\right)\log \frac{|\mathbf{\hat{L}\hat{L}'}+\mathbf{\hat{\Psi}}|}{|\hat{\mathbf{\Sigma}}|}\)
The test is a likelihood ratio test, where two likelihoods are compared: one under the parsimonious model and the other without any restrictions. The multiplicative constant in the statistic is called the Bartlett correction. The log is the natural log. In the numerator we have the determinant of the fitted factor model for the variance-covariance matrix, and in the denominator, a sample estimate of the variance-covariance matrix assuming no structure, where:
\(\hat{\boldsymbol{\Sigma}} = \frac{n-1}{n}\mathbf{S}\)
and \(\mathbf{S}\) is the sample variance-covariance matrix. This is just another estimate of the variance-covariance matrix, one that includes a small bias. If the factor model fits well, then these two determinants should be about the same and \(X^2\) will be small. However, if the model does not fit well, then the determinants will be different and \(X^2\) will be large.
Under the null hypothesis that the factor model adequately describes the relationships among the variables,
\(X^2 \sim \chi^2_{\frac{(p-m)^2-p-m}{2}} \)
That is, the test statistic has a chi-square distribution with an unusual number of degrees of freedom, as shown above. The degrees of freedom are the difference between the numbers of unique parameters in the two models. We reject the null hypothesis that the factor model adequately describes the data if \(X^2\) exceeds the critical value from the chi-square table.
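As an illustrative sketch (not part of the lesson's SAS workflow), the statistic and its degrees of freedom can be computed directly from the estimated loadings, specific variances, and sample covariance matrix. The function name and inputs below are hypothetical:

```python
import numpy as np

def factor_model_lrt(L_hat, psi_hat, S, n):
    """Bartlett-corrected likelihood ratio statistic for an m-factor model.
    L_hat: p x m estimated loadings; psi_hat: length-p specific variances;
    S: p x p sample variance-covariance matrix; n: sample size.
    Returns (X2, df); compare X2 to a chi-square critical value with df."""
    p, m = L_hat.shape
    sigma_model = L_hat @ L_hat.T + np.diag(psi_hat)   # LL' + Psi
    sigma_hat = (n - 1) / n * S                        # slightly biased estimate
    bartlett = n - 1 - (2 * p + 4 * m + 5) / 6         # Bartlett correction
    x2 = bartlett * np.log(np.linalg.det(sigma_model)
                           / np.linalg.det(sigma_hat))
    df = ((p - m) ** 2 - p - m) / 2                    # the unusual d.f.
    return x2, df
```

If the fitted model reproduces \(\hat{\boldsymbol{\Sigma}}\) exactly, the determinant ratio is one and \(X^2\) is zero, matching the intuition described above.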
Back to the Output... Looking just past the iteration results, we have...
Test  DF  Chi-Square  Pr > ChiSq
\(H_{o}\colon\) No common factors vs. \(H_{A}\colon\) At least one common factor  36  839.4268  < 0.0001
\(H_{o}\colon\) 3 Factors are sufficient vs. \(H_{A}\colon\) More factors are needed  12  92.6652  < 0.0001
For our Places Rated dataset, we find a significant lack of fit (\(X^2 = 92.67;\ \text{d.f.} = 12;\ p < 0.0001\)). We conclude that the relationships among the variables are not adequately described by the factor model. This suggests that we do not have the correct model.
The only remedy that we can apply in this case is to increase the number m of factors until an adequate fit is achieved. Note, however, that m must satisfy
\(p(m+1) \le \frac{p(p+1)}{2}\)
In the present example, this means that m ≤ 4.
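The constraint depends only on p and m, so the largest usable number of factors can be computed directly. This small helper is purely illustrative:

```python
def max_factors(p):
    """Largest number of factors m satisfying p(m + 1) <= p(p + 1)/2,
    which simplifies to m <= (p - 1)/2."""
    m = 0
    while p * ((m + 1) + 1) <= p * (p + 1) / 2:   # does m + 1 still satisfy it?
        m += 1
    return m
```

For the Places Rated data, p = 9 variables gives max_factors(9) = 4, matching the bound m ≤ 4 stated above.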
Let's return to the SAS program and change the "nfactors" value from 3 to 4:
Test  DF  Chi-Square  Pr > ChiSq
\(H_{o}\colon\) No common factors vs. \(H_{A}\colon\) At least one common factor  36  839.4268  < 0.0001
\(H_{o}\colon\) 4 Factors are sufficient vs. \(H_{A}\colon\) More factors are needed  6  41.6867  < 0.0001
We find that the factor model with m = 4 does not fit the data adequately either (\(X^2 = 41.69;\ \text{d.f.} = 6;\ p < 0.0001\)). We cannot properly fit a factor model to describe this particular dataset. There is something else going on here, perhaps some nonlinearity. Whatever the case, this does not yield a good-fitting factor model. A next step could be to drop variables from the dataset to obtain a better-fitting model.
12.10  Factor Rotations
From our experience with the Places Rated data, it does not look like the factor model works well. There is no guarantee that any model will fit the data well.
The first motivation of factor analysis was to try to discern some underlying factors describing the data. The Maximum Likelihood Method failed to find such a model to describe the Places Rated data. The second motivation is still valid, that is to try to obtain a better interpretation of the data. In order to do this, let's take a look at the factor loadings obtained before from the principal component method.
Factor  
Variable  1  2  3 
Climate  0.286  0.076  0.841 
Housing  0.698  0.153  0.084 
Health  0.744  0.410  0.020 
Crime  0.471  0.522  0.135 
Transportation  0.681  0.156  0.148 
Education  0.498  0.498  0.253 
Arts  0.861  0.115  0.011 
Recreation  0.642  0.322  0.044 
Economics  0.298  0.595  0.533 
The problem with this analysis is that some of the variables are highlighted in more than one column. For instance, Education appears significant to Factor 1 AND Factor 2. The same is true for Economics in Factors 2 AND 3. This does not provide a very clean, simple interpretation of the data. Ideally, each variable would appear as a significant contributor in only one column.
In fact the above table may indicate contradictory results. Looking at some of the observations, it is conceivable that we will find an observation that takes a high value on both Factors 1 and 2. If this occurs, a high value for Factor 1 suggests that the community has quality education, whereas the high value for Factor 2 suggests the opposite, that the community has poor education.
Factor rotation is motivated by the fact that factor models are not unique. Recall that the factor model for the data vector, \(\mathbf{X = \boldsymbol{\mu} + LF + \boldsymbol{\epsilon}}\), is a function of the mean \(\boldsymbol{\mu}\), plus a matrix of factor loadings times a vector of common factors, plus a vector of specific factors.
Moreover, we should note that this is equivalent to a rotated factor model, \(\mathbf{X = \boldsymbol{\mu} + L^*F^* + \boldsymbol{\epsilon}}\), where we have set \(\mathbf{L^* = LT}\) and \(\mathbf{F^* = T'F}\) for some orthogonal matrix \(\mathbf{T}\) with \(\mathbf{T'T = TT' = I}\). Note that there are an infinite number of possible orthogonal matrices, each corresponding to a particular factor rotation.
We plan to find an appropriate rotation, defined through an orthogonal matrix \(\mathbf{T}\), that yields the most easily interpretable factors.
To understand this, consider a scatter plot of factor loadings. The orthogonal matrix \(\mathbf{T}\) rotates the axes of this plot. We wish to find a rotation such that each of the p variables has a high loading on only one factor.
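The equivalence of rotated and unrotated solutions can be checked numerically. In this sketch (using a made-up loading matrix, not the Places Rated results), any orthogonal \(\mathbf{T}\) leaves \(\mathbf{LL'}\), and hence the fitted covariance \(\mathbf{LL' + \Psi}\) and the communalities, unchanged:

```python
import numpy as np

# Hypothetical 3 x 2 loading matrix (for illustration only)
L = np.array([[0.8, 0.1],
              [0.7, 0.3],
              [0.2, 0.9]])

theta = 0.7                                      # arbitrary rotation angle
T = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # orthogonal: T'T = I

L_star = L @ T                                   # rotated loadings L* = LT

# LL', the common-variance part of Sigma, is identical for both solutions
assert np.allclose(L @ L.T, L_star @ L_star.T)
# ...and so are the communalities (row sums of squared loadings)
assert np.allclose((L ** 2).sum(axis=1), (L_star ** 2).sum(axis=1))
```

Because the fit is unchanged, we are free to choose whichever \(\mathbf{T}\) gives the cleanest interpretation.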
Using SAS
We will return to the program below to obtain a plot. In looking at the program, there are a number of options (marked in blue under proc factor) that we did not yet explain.
Download the SAS program here: places2.sas
One of the options above is labeled 'preplot'. We will use this to plot the values for factor 1 against factor 2.
In the output these values are plotted, with the loadings for factor 1 on the y-axis and the loadings for factor 2 on the x-axis.
For example, the second variable, labeled with the letter B, has a factor 1 loading of about 0.7 and a factor 2 loading of about 0.15. Each letter on the plot corresponds to a single variable. SAS also provides plots of the other combinations of factors: factor 1 against factor 3, and factor 2 against factor 3.
Three factors appear in this model, so we might also consider a three-dimensional plot of all three factors together.
Using Minitab
Click on the video below to see how to obtain the scree plot and the loading plot using the Minitab statistical software application.
The selection of the orthogonal matrix \(\mathbf{T}\) corresponds to a rotation of these axes. Think about rotating the axes about the center; each rotation corresponds to an orthogonal matrix \(\mathbf{T}\). We want to rotate the axes to obtain a cleaner interpretation of the data. Ideally, we would define a new coordinate system so that, after the rotation, the points fall close to the vertices (end points) of the new axes.
If we were only looking at two factors, then we would like to find each of the plotted points at the four tips (corresponding to all four directions) of the rotated axes. This is what rotation is about, taking the factor pattern plot and rotating the axes in such a way so that the points fall close to the axes.
12.11  Varimax Rotation
 Varimax rotation is the most common rotation. It involves scaling the loadings by dividing each by \(\hat{h}_i\), the square root of the corresponding communality, as shown below:
 \(\tilde{l}^*_{ij}= \hat{l}^*_{ij}/\hat{h}_i\)
 The varimax procedure, as defined below, selects the rotation that maximizes
 \(V = \frac{1}{p}\sum\limits_{j=1}^{m}\left\{\sum\limits_{i=1}^{p}(\tilde{l}^*_{ij})^4 - \frac{1}{p}\left(\sum\limits_{i=1}^{p}(\tilde{l}^*_{ij})^2 \right)^2 \right\}\)
This is the sum, over the m factors, of the sample variances of the squared scaled loadings.
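The criterion above translates directly into code. This sketch is illustrative only (SAS and Minitab perform the maximization internally); here h holds the square roots of the communalities:

```python
import numpy as np

def varimax_criterion(L, h):
    """V = (1/p) * sum_j { sum_i l~_ij^4 - (1/p)(sum_i l~_ij^2)^2 },
    where l~_ij = l_ij / h_i are the scaled loadings.
    L: p x m loadings; h: length-p square-root communalities."""
    Lt = L / h[:, None]                        # scale each row by h_i
    p = L.shape[0]
    return np.sum(np.sum(Lt ** 4, axis=0)
                  - (1 / p) * np.sum(Lt ** 2, axis=0) ** 2) / p
```

A rotation with "simple structure" (each variable loading heavily on a single factor) yields a larger V than one that spreads loadings evenly across factors, which is why maximizing V tends to produce cleaner interpretations.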
Using SAS
Returning to the options of the factor procedure (marked in blue):
"rotate" asks for factor rotation and we specified the Varimax rotation of our factor loadings.
"plot" asks for the same kind of plot that we just looked at for the rotated factors. The result of our rotation is a new factor pattern given below (page 11 of SAS output):
Here is a copy of page 10 from the SAS output:
At the top of page 10 of the output, above, we have our orthogonal matrix \(\mathbf{T}\).
Using Minitab
View the video below to see how to use the Varimax rotation using the Minitab statistical software application.
The values of the rotated factor loadings are:
Factor  
Variable  1  2  3 
Climate  0.021  0.239  0.859 
Housing  0.438  0.547  0.166 
Health  0.829  0.127  0.137 
Crime  0.031  0.702  0.139 
Transportation  0.652  0.289  0.028 
Education  0.734  0.094  0.117 
Arts  0.738  0.432  0.150 
Recreation  0.301  0.656  0.099 
Economics  0.022  0.651  0.551 
Let us now interpret the data based on the rotation. We highlighted the values that are large in magnitude and make the following interpretation.
 Factor 1: primarily a measure of Health, but also increases with increasing scores for Transportation, Education, and the Arts.
 Factor 2: primarily a measure of Crime, Recreation, the Economy, and Housing.
 Factor 3: primarily a measure of Climate alone.
This is just the pattern that exists in the data and no causal inferences should be made from this interpretation. It does not tell us why this pattern exists. It could very well be that there are other essential factors that are not seen at work here.
Let us look at the amount of variation explained by our factors under the rotated model and compare it to the original model. Consider the variance explained by each factor under the original analysis and the rotated factors:
Analysis  
Factor  Original  Rotated 
1  3.2978  2.4798 
2  1.2136  1.9835 
3  1.1055  1.1536 
Total  5.6169  5.6169 
The total amount of variation explained by the 3 factors remains the same. Rotations, among a fixed number of factors, do not change how much of the variation is explained by the model. The fit is equally good regardless of what rotation is used.
However, notice what happened to the first factor. We see a fairly large decrease in the amount of variation explained by the first factor. We obtained a cleaner interpretation of the data but it costs us something somewhere. The cost is that the variation explained by the first factor is distributed among the latter two factors, in this case mostly to the second factor.
The total amount of variation explained by the rotated factor model is the same, but the contributions are not the same from the individual factors. We gain a cleaner interpretation, but the first factor does not explain as much of the variation. However, this would not be considered a particularly large cost if we are still interested in these three factors.
Rotation cleans up the interpretation. Ideally, we should find that the numbers in each column are either far away from zero or close to zero. Numbers close to +1 or 1 or 0 in each column give the ideal or cleanest interpretation. If a rotation can achieve this goal, then that is wonderful. However, observed data are seldom this cooperative!
Nevertheless, recall that the objective is data interpretation. The success of the analysis can be judged by how well it helps you make sense of your data. If the result gives you some insight into the pattern of variability in the data, even without being perfect, then the analysis was successful.
12.12  Estimation of Factor Scores
Factor scores are similar to the principal components of the previous lesson. Just as we plotted principal components against each other, a similar scatter plot of factor scores can be helpful. We might also use factor scores as explanatory variables in future analyses, or even as the dependent variable in a future analysis.
The methods for estimating factor scores depend on the method used to estimate the factor loadings. The vector of common factors, \(\mathbf{f}\), is of interest. There are m unobserved factors in our model and we would like to estimate them. Therefore, given the factor model:
\(\mathbf{Y_i = \boldsymbol{\mu} + Lf_i + \boldsymbol{\epsilon_i}}; i = 1,2,\dots, n,\)
we may wish to estimate the vectors of factor scores
\(\mathbf{f_1, f_2, \dots, f_n}\)
for each observation.
Methods
There are a number of different methods for estimating factor scores from the data. These include:
 Ordinary Least Squares
 Weighted Least Squares
 Regression method
Ordinary Least Squares
By default, this is the method that SAS uses if you use the principal component method. The difference between the \(j^{th}\) variable on the \(i^{th}\) subject and its value under the factor model is computed. The \(l_{jk}\)'s are the factor loadings and the \(f_k\)'s are the unobserved common factors. The vector of common factors for subject \(i\), \( \hat{\mathbf{f}}_i \), is found by minimizing the sum of the squared residuals:
\[\sum_{j=1}^{p}\epsilon^2_{ij} = \sum_{j=1}^{p}(y_{ij}-\mu_j-l_{j1}f_1 - l_{j2}f_2 - \dots - l_{jm}f_m)^2 = (\mathbf{Y_i - \boldsymbol{\mu} - Lf_i})'(\mathbf{Y_i - \boldsymbol{\mu} - Lf_i})\]
This is like a least squares regression, except in this case we already have estimates of the parameters (the factor loadings), but wish to estimate the explanatory common factors. In matrix notation the solution is expressed as:
\(\mathbf{\hat{f}_i = (L'L)^{-1}L'(Y_i-\boldsymbol{\mu})}\)
In practice, we substitute our estimated factor loadings into this expression, as well as the sample mean for the data:
\(\mathbf{\hat{f}_i = \left(\hat{L}'\hat{L}\right)^{-1}\hat{L}'(Y_i-\bar{y})}\)
Using the principal component method with the unrotated factor loadings, this yields:
\[\mathbf{\hat{f}_i} = \left(\begin{array}{c} \frac{1}{\sqrt{\hat{\lambda}_1}}\mathbf{\hat{e}'_1(Y_i-\bar{y})}\\ \frac{1}{\sqrt{\hat{\lambda}_2}}\mathbf{\hat{e}'_2(Y_i-\bar{y})}\\ \vdots \\ \frac{1}{\sqrt{\hat{\lambda}_m}}\mathbf{\hat{e}'_m(Y_i-\bar{y})}\end{array}\right)\]
\(\hat{e}_1\) through \(\hat{e}_m\) are our first m eigenvectors, and \(\hat{\lambda}_1\) through \(\hat{\lambda}_m\) are the corresponding eigenvalues.
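The principal component expression above can be sketched in code as follows (illustrative Python; the function name and data are made up):

```python
import numpy as np

def pc_factor_scores(Y, m):
    """OLS factor scores under the unrotated principal component solution:
    the k-th score for subject i is e_k'(y_i - ybar) / sqrt(lambda_k).
    Y: n x p data matrix; m: number of factors. Returns an n x m matrix."""
    ybar = Y.mean(axis=0)
    S = np.cov(Y, rowvar=False)              # sample covariance matrix
    eigval, eigvec = np.linalg.eigh(S)       # eigh returns ascending order
    order = np.argsort(eigval)[::-1][:m]     # keep the m largest eigenvalues
    lam, E = eigval[order], eigvec[:, order]
    return (Y - ybar) @ E / np.sqrt(lam)
```

This is equivalent to the general OLS formula with \(\hat{\mathbf{L}}\) having columns \(\sqrt{\hat{\lambda}_k}\,\hat{e}_k\), since then \(\hat{L}'\hat{L}\) is diagonal with entries \(\hat{\lambda}_k\).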
Weighted Least Squares (Bartlett)
The difference between WLS and OLS is that the squared residuals are divided by the specific variances as shown below. This is going to give more weight, in this estimation, to variables that have low specific variances. The factor model fits the data best for variables with low specific variances. The variables with low specific variances should give us more information regarding the true values for the specific factors.
Therefore, for the factor model:
\(\mathbf{Y_i = \boldsymbol{\mu} + Lf_i + \boldsymbol{\epsilon_i}}\)
we want to find \(\boldsymbol{f_i}\) that minimizes
\( \sum\limits_{j=1}^{p}\frac{\epsilon^2_{ij}}{\Psi_j} = \sum\limits_{j=1}^{p}\frac{(y_{ij}-\mu_j - l_{j1}f_1 - l_{j2}f_2 - \dots - l_{jm}f_m)^2}{\Psi_j} = \mathbf{(Y_i-\boldsymbol{\mu}-Lf_i)'\Psi^{-1}(Y_i-\boldsymbol{\mu}-Lf_i)}\)
The solution is given by this expression where \(\mathbf{\Psi}\) is the diagonal matrix whose diagonal elements are equal to the specific variances:
\(\mathbf{\hat{f}_i = (L'\Psi^{-1}L)^{-1}L'\Psi^{-1}(Y_i-\boldsymbol{\mu})}\)
and can be estimated by substituting in the following:
\(\mathbf{\hat{f}_i = (\hat{L}'\hat{\Psi}^{-1}\hat{L})^{-1}\hat{L}'\hat{\Psi}^{-1}(Y_i-\bar{y})}\)
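The WLS estimator is a small generalization of the OLS version. This sketch assumes the loadings and specific variances have already been estimated (the inputs here are hypothetical):

```python
import numpy as np

def bartlett_scores(Y, L, psi):
    """Weighted least squares (Bartlett) factor scores:
    f_i = (L' Psi^{-1} L)^{-1} L' Psi^{-1} (y_i - ybar).
    Y: n x p data; L: p x m loadings; psi: length-p specific variances."""
    psi_inv = np.diag(1.0 / psi)                           # Psi is diagonal
    W = np.linalg.solve(L.T @ psi_inv @ L, L.T @ psi_inv)  # m x p weights
    return (Y - Y.mean(axis=0)) @ W.T
```

With all specific variances equal, the weighting has no effect and the Bartlett scores reduce to the OLS scores, consistent with the discussion above: only variables with relatively low specific variances receive extra weight.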
Regression Method
This method is used for maximum likelihood estimates of factor loadings. The vector of observed data is supplemented by the vector of common factors for the \(i^{th}\) subject.
The joint distribution of the data \(\boldsymbol{Y}_i\) and the factor \(\boldsymbol{f}_i\) is
\(\left(\begin{array}{c}\mathbf{Y_i} \\ \mathbf{f_i}\end{array}\right) \sim N \left[\left(\begin{array}{c}\mathbf{\boldsymbol{\mu}} \\ 0 \end{array}\right), \left(\begin{array}{cc}\mathbf{LL'+\Psi} & \mathbf{L} \\ \mathbf{L'} & \mathbf{I}\end{array}\right)\right]\)
Using this we can calculate the conditional expectation of the common factor score \(\boldsymbol{f}_i\) given the data \(\boldsymbol{Y}_i\) as expressed here:
\(E(\mathbf{f_i | Y_i}) = \mathbf{L'(LL'+\Psi)^{-1}(Y_i-\boldsymbol{\mu})}\)
This suggests the following estimator by substituting in the estimates for L and \(\mathbf{\Psi}\):
\(\mathbf{\hat{f}_i = \hat{L}'\left(\hat{L}\hat{L}'+\hat{\Psi}\right)^{-1}(Y_i-\bar{y})}\)
A small fix is often applied to reduce the effects of an incorrect determination of the number of factors: the fitted matrix \(\hat{\mathbf{L}}\hat{\mathbf{L}}' + \hat{\boldsymbol{\Psi}}\) is replaced by the sample variance-covariance matrix \(\mathbf{S}\). This tends to give somewhat more stable results.
\(\mathbf{\tilde{f}_i = \hat{L}'S^{-1}(Y_i-\bar{y})}\)
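Both forms of the regression estimator can be sketched as follows (illustrative; the stabilized version simply substitutes \(\mathbf{S}\) for \(\hat{\mathbf{L}}\hat{\mathbf{L}}' + \hat{\boldsymbol{\Psi}}\)):

```python
import numpy as np

def regression_scores(Y, L, psi, use_sample_cov=False):
    """Regression-method factor scores for ML loadings:
    f_i = L'(LL' + Psi)^{-1}(y_i - ybar). If use_sample_cov is True,
    the sample covariance S replaces LL' + Psi (the stabilized version)."""
    Yc = Y - Y.mean(axis=0)
    M = np.cov(Y, rowvar=False) if use_sample_cov else L @ L.T + np.diag(psi)
    return Yc @ np.linalg.solve(M, L)   # each row is L' M^{-1} (y_i - ybar)
```

Because \(\mathbf{M}\) is symmetric, computing \(\mathbf{Y}_c\mathbf{M}^{-1}\mathbf{L}\) via a linear solve gives the same scores as forming the inverse explicitly, but is numerically preferable.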
12.13  Summary
In this lesson we learned about:
 The interpretation of factor loadings;
 The principal component and maximum likelihood methods for estimating factor loadings and specific variances;
 How communalities can be used to assess the adequacy of a factor model;
 A likelihood ratio test for the goodness-of-fit of a factor model;
 Factor rotation;
 Methods for estimating common factors.