# Lesson 12: Factor Analysis

Lesson 12: Factor Analysis

## Overview

Factor Analysis is a method for modeling observed variables, and their covariance structure, in terms of a smaller number of underlying unobservable (latent) “factors.” The factors typically are viewed as broad concepts or ideas that may describe an observed phenomenon. For example, a basic desire of obtaining a certain social level might explain most consumption behavior. These unobserved factors are more interesting to the social scientist than the observed quantitative measurements.

Factor analysis is generally an exploratory/descriptive method that requires many subjective judgments. It is a widely used tool and often controversial because the models, methods, and subjectivity are so flexible that debates about interpretations can occur.

The method is similar to principal components although, as the textbook points out, factor analysis is more elaborate. In one sense, factor analysis is an inversion of principal components. In factor analysis we model the observed variables as linear functions of the “factors.” In principal components, we create new variables that are linear combinations of the observed variables.  In both PCA and FA, the dimension of the data is reduced. Recall that in PCA, the interpretation of the principal components is often not very clean. A particular variable may, on occasion, contribute significantly to more than one of the components. Ideally we like each variable to contribute significantly to only one component. A technique called factor rotation is employed towards that goal. Examples of fields where factor analysis is involved include physiology, health, intelligence, sociology, and sometimes ecology among others.

## Objectives

Upon completion of this lesson, you should be able to:

• Understand the terminology of factor analysis, including the interpretation of factor loadings, specific variances, and communalities;
• Understand how to apply both principal component and maximum likelihood methods for estimating the parameters of a factor model;

# 12.1 - Notations and Terminology

12.1 - Notations and Terminology

## Notation

Collect all of the variables X 's into a vector $$\mathbf{X}$$ for each individual subject. Let $$\mathbf{X_i}$$ denote observable trait i. These are the data from each subject, and are collected into a vector of traits.

$$\textbf{X} = \left(\begin{array}{c}X_1\\X_2\\\vdots\\X_p\end{array}\right) = \text{vector of traits}$$

This is a random vector, with a population mean. Assume that vector of traits $$\mathbf{X}$$ is sampled from a population with population mean vector:

$$\boldsymbol{\mu} = \left(\begin{array}{c}\mu_1\\\mu_2\\\vdots\\\mu_p\end{array}\right) = \text{population mean vector}$$

Here, $$\mathrm { E } \left( X _ { i } \right) = \mu _ { i }$$ denotes the population mean of variable i.

Consider m unobservable common factors $$f _ { 1 } , f _ { 2 } , \dots , f _ { m }$$. The $$i^{th}$$ common factor is $$f _ { i }$$. Generally, m is going to be substantially less than p .

The common factors are also collected into a vector,

$$\mathbf{f} = \left(\begin{array}{c}f_1\\f_2\\\vdots\\f_m\end{array}\right) = \text{vector of common factors}$$

## Model

Our factor model can be thought of as a series of multiple regressions, predicting each of the observable variables $$X_{i}$$ from the values of the unobservable common factors $$f_{i}$$ :

\begin{align} X_1 & =  \mu_1 + l_{11}f_1 + l_{12}f_2 + \dots + l_{1m}f_m + \epsilon_1\\ X_2 & =  \mu_2 + l_{21}f_1 + l_{22}f_2 + \dots + l_{2m}f_m + \epsilon_2 \\ &  \vdots \\ X_p & =  \mu_p + l_{p1}f_1 + l_{p2}f_2 + \dots + l_{pm}f_m + \epsilon_p \end{align}

Here, the variable means $$\mu_{1}$$ through $$\mu_{p}$$ can be regarded as the intercept terms for the multiple regression models.

The regression coefficients $$l_{ij}$$ (the partial slopes) for all of these multiple regressions are called factor loadings. Here, $$l_{ij}$$ = loading of the $$i^{th}$$ variable on the $$j^{th}$$ factor. These are collected into a matrix as shown here:

$$\mathbf{L} = \left(\begin{array}{cccc}l_{11}& l_{12}& \dots & l_{1m}\\l_{21} & l_{22} & \dots & l_{2m}\\ \vdots & \vdots & & \vdots \\l_{p1} & l_{p2} & \dots & l_{pm}\end{array}\right) = \text{matrix of factor loadings}$$

And finally, the errors $$\varepsilon _{i}$$ are called the specific factors. Here, $$\varepsilon _{i}$$ = specific factor for variable i. The specific factors are also collected into a vector:

$$\boldsymbol{\epsilon} = \left(\begin{array}{c}\epsilon_1\\\epsilon_2\\\vdots\\\epsilon_p\end{array}\right) = \text{vector of specific factors}$$

In summary, the basic model is like a regression model. Each of our response variables X is predicted as a linear function of the unobserved common factors $$f_{1}$$ , $$f_{2}$$ through $$f_{m}$$ . Thus, our explanatory variables are $$f_{1}$$ , $$f_{2}$$ through $$f_{m}$$. We have m unobserved factors that control the variation among our data.

We will generally reduce this into matrix notation as shown in this form here:

$$\textbf{X} = \boldsymbol{\mu} + \textbf{Lf}+\boldsymbol{\epsilon}$$

Note! In general we want m << p.

# 12.2 - Model Assumptions

12.2 - Model Assumptions

## Mean

1. The specific factors or random errors all have mean zero: $$E(\epsilon_i) = 0$$ ; i = 1, 2, ... , p

2. The common factors, the f's, also have mean zero: $$E(f_i) = 0$$; i = 1, 2, ... , m

A consequence of these assumptions is that the mean response of the ith trait is $$\mu_i$$. That is,

$$E(X_i) = \mu_i$$

## Variance

1. The common factors have variance one: $$\text{var}(f_i) = 1$$; i = 1, 2, ... , m

2. i is $$\psi_i$$: $$\text{var}(\epsilon_i) = \psi_i$$ ; i = 1, 2, ... , p Here, $$\psi_i$$ is called the specific variance.

## Correlation

1. The common factors are uncorrelated with one another: $$\text{cov}(f_i, f_j) = 0$$  for ij

2. The specific factors are uncorrelated with one another: $$\text{cov}(\epsilon_i, \epsilon_j) = 0$$  for ij

3. The specific factors are uncorrelated with the common factors: $$\text{cov}(\epsilon_i, f_j) = 0$$;  i = 1, 2, ... , p; j = 1, 2, ... , m

These assumptions are necessary to uniquely estimate the parameters. An infinite number of equally well-fitting models with different values for the parameters may be obtained unless these assumptions are made.

Under this model the variance for the ith observed variable is equal to the sum of the squared loadings for that variable and specific variance:

The variance of trait i is: $$\sigma^2_i = \text{var}(X_i) = \sum_{j=1}^{m}l^2_{ij}+\psi_i$$

This derivation is based on the previous assumptions. $$\sum_{j=1}^{m}l^2_{ij}$$ is called the Communality for variable i.  Later on, we will see how this is a measure for how well the model performs for that particular variable. The larger the communality, the better the model performance for the ith variable.

The covariance between pairs of traits i and j is: $$\sigma_{ij}= \text{cov}(X_i, X_j) = \sum_{k=1}^{m}l_{ik}l_{jk}$$

The covariance between trait i and factor j is: $$\text{cov}(X_i, f_j) = l_{ij}$$

In matrix notation, our model for the variance-covariance matrix is expressed as shown below:

$$\Sigma = \mathbf{LL'} + \boldsymbol{\Psi}$$

This is the matrix of factor loadings times its transpose, plus a diagonal matrix containing the specific variances.

Here $$\boldsymbol{\Psi}$$ equals:

$$\boldsymbol{\Psi} = \left(\begin{array}{cccc}\psi_1 & 0 & \dots & 0 \\ 0 & \psi_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \dots & \psi_p \end{array}\right)$$

A parsimonious (simplified) model for the variance-covariance matrix is obtained and used for estimation.

## Notes

1. The model assumes that the data is a linear function of the common factors. However, because the common factors are not observable, we cannot check for linearity.
2. The variance-covariance matrix is a symmetric matrix, that is the variance between variables i and j is the same as the variance between j and i.  For this model:

$$\Sigma = \mathbf{LL'} + \boldsymbol{\Psi}$$

The variance-covariance matrix is going to have p(p +1)/2 unique elements of $$\Sigma$$ approximated by:

• mp factor loadings in the matrix $$\mathbf{L}$$, and
• p specific variances

This means that there are mp plus p parameters in the variance-covariance matrix. Ideally, mp + p is substantially smaller than p(p +1)/2. However, if mp is too small, the mp + p parameters may not be adequate to describe $$\Sigma$$. There may always be the case that this is not the right model and you cannot reduce the data to a linear combination of factors.

3. If we have more than one variable in our analysis, that is if p > 1, the model is inherently ambiguous. To explain that, let $$\mathbf{T}$$ be any m x m orthogonal matrix. A matrix is orthogonal if its inverse is equal to the transpose of the original matrix.

$$\mathbf{T'T = TT' = I}$$

We can write our factor model in matrix notation:

$$\textbf{X} = \boldsymbol{\mu} + \textbf{Lf}+ \boldsymbol{\epsilon} = \boldsymbol{\mu} + \mathbf{LTT'f}+ \boldsymbol{\epsilon} = \boldsymbol{\mu} + \mathbf{L^*f^*}+\boldsymbol{\epsilon}$$

Note that This does not change the calculation because the identity matrix times any matrix is the original matrix. This results in an alternative factor model, where the relationship between the new factor loadings and the original factor loadings is:

$$\mathbf{L^*} = \textbf{LT}$$

and the relationship between the new common factors and the original common factors is:

$$\mathbf{f^*} = \textbf{T'f}$$

This gives a model that fits equally well. Moreover, because there is an infinite number of orthogonal matrices, then there is an infinite number of alternative models. This model, as it turns out, satisfies all of the assumptions discussed earlier.

Note...

$$E(\mathbf{f^*}) = E(\textbf{T'f}) = \textbf{T'}E(\textbf{f}) = \mathbf{T'0} =\mathbf{0}$$,

$$\text{var}(\mathbf{f^*}) = \text{var}(\mathbf{T'f}) = \mathbf{T'}\text{var}(\mathbf{f})\mathbf{T} = \mathbf{T'IT} = \mathbf{T'T} = \mathbf{I}$$

and

$$\text{cov}(\mathbf{f^*, \boldsymbol{\epsilon}}) = \text{cov}(\mathbf{T'f, \boldsymbol{\epsilon}}) = \mathbf{T'}\text{cov}(\mathbf{f, \boldsymbol{\epsilon}}) = \mathbf{T'0} = \mathbf{0}$$

So f* satisfies all of the assumptions, and hence f* is an equally valid collection of common factors.  There is a certain apparent ambiguity to these models. This ambiguity is later used to justify a factor rotation to obtain a more parsimonious description of the data.

# 12.3 - Principal Component Method

12.3 - Principal Component Method

We consider two different methods to estimate the parameters of a factor model:

• Principal Component Method
• Maximum Likelihood Estimation

A third method, principal factor method, is also available but not considered in this class.

## Principal Component Method

Let $$X_i$$ be a vector of observations for the $$i^{th}$$ subject:

$$\mathbf{X_i} = \left(\begin{array}{c}X_{i1}\\ X_{i2}\\ \vdots \\ X_{ip}\end{array}\right)$$

$$\mathbf{S}$$ denotes our sample variance-covariance matrix and is expressed as:

$$\textbf{S} = \dfrac{1}{n-1}\sum\limits_{i=1}^{n}\mathbf{(X_i - \bar{x})(X_i - \bar{x})'}$$

We have p eigenvalues for this variance-covariance matrix as well as corresponding eigenvectors for this matrix.

Eigenvalues of $$\mathbf{S}$$:

$$\hat{\lambda}_1, \hat{\lambda}_2, \dots, \hat{\lambda}_p$$

Eigenvectors of $$\mathbf{S}$$:

$$\hat{\mathbf{e}}_1, \hat{\mathbf{e}}_2, \dots, \hat{\mathbf{e}}_p$$

Recall that the variance-covariance matrix can be re-expressed in the following form as a function of the eigenvalues and the eigenvectors:

## Spectral Decomposition of $$Σ$$

$$\Sigma = \sum_{i=1}^{p}\lambda_i \mathbf{e_{ie'_i}} \cong \sum_{i=1}^{m}\lambda_i \mathbf{e_{ie'_i}} = \left(\begin{array}{cccc}\sqrt{\lambda_1}\mathbf{e_1} & \sqrt{\lambda_2}\mathbf{e_2} & \dots & \sqrt{\lambda_m}\mathbf{e_m}\end{array}\right) \left(\begin{array}{c}\sqrt{\lambda_1}\mathbf{e'_1}\\ \sqrt{\lambda_2}\mathbf{e'_2}\\ \vdots\\ \sqrt{\lambda_m}\mathbf{e'_m}\end{array}\right) = \mathbf{LL'}$$

The idea behind the principal component method is to approximate this expression. Instead of summing from 1 to p, we now sum from 1 to m, ignoring the last p - m terms in the sum, and obtain the third expression. We can rewrite this as shown in the fourth expression, which is used to define the matrix of factor loadings $$\mathbf{L}$$, yielding the final expression in matrix notation.

Note! If standardized measurements are used, we replace $$\mathbf{S}$$ with the sample correlation matrix $$\mathbf{R}$$.

$$\hat{l}_{ij} = \hat{e}_{ji}\sqrt{\hat{\lambda}_j}$$

This forms the matrix $$\mathbf{L}$$ of factor loading in the factor analysis. This is followed by the transpose of $$\mathbf{L}$$.  To estimate the specific variances, recall that our factor model for the variance-covariance matrix is

$$\boldsymbol{\Sigma} = \mathbf{LL'} + \boldsymbol{\Psi}$$

in matrix notation. $$\Psi$$ is now going to be equal to the variance-covariance matrix minus $$\mathbf{LL'}$$.

$$\boldsymbol{\Psi} = \boldsymbol{\Sigma} - \mathbf{LL'}$$

This in turn suggests that the specific variances, the diagonal elements of $$\Psi$$, are estimated with this expression:

$$\hat{\Psi}_i = s^2_i - \sum\limits_{j=1}^{m}\lambda_j \hat{e}^2_{ji}$$

We take the sample variance for the ith variable and subtract the sum of the squared factor loadings (i.e., the communality).

# 12.4 - Example: Places Rated Data - Principal Component Method

12.4 - Example: Places Rated Data - Principal Component Method

## Example 12-1: Places Rated

Let's revisit the Places Rated Example from Lesson 11.  Recall that the Places Rated Almanac (Boyer and Savageau) rates 329 communities according to nine criteria:

1. Climate and Terrain
2. Housing
3. Health Care & Environment
4. Crime
5. Transportation
6. Education
7. The Arts
8. Recreation
9. Economic

Except for housing and crime, the higher the score the better.For housing and crime, the lower the score the better.

Our objective here is to describe the relationships among the variables.

Before carrying out a factor analysis we need to determine m. How many common factors should be included in the model? This requires a determination of how may parameters will be involved.

For p = 9, the variance-covariance matrix $$\Sigma$$ contains

$$\dfrac{p(p+1)}{2} = \dfrac{9 \times 10}{2} = 45$$

unique elements or entries. For a factor analysis with m factors, the number of parameters in the factor model is equal to

$$p(m+1) = 9(m+1)$$

Taking m = 4, we have 45 parameters in the factor model, this is equal to the number of original parameters, This would result in no dimension reduction. So in this case, we will select m = 3, yielding 36 parameters in the factor model and thus a dimension reduction in our analysis.

It is also common to look at the results of the principal components analysis. The output from Lesson 11.6 is below. The first three components explain 62% of the variation. We consider this to be sufficient for the current example and will base future analyses on three components.

 Component Eigenvalue Proportion Cumulative 1 3.2978 0.3664 0.3664 2 1.2136 0.1348 0.5013 3 1.1055 0.1228 0.6241 4 0.9073 0.1008 0.7249 5 0.8606 0.0956 0.8205 6 0.5622 0.0625 0.8830 7 0.4838 0.0538 0.9368 8 0.3181 0.0353 0.9721 9 0.2511 0.0279 1.0000

We need to select m so that a sufficient amount of variation in the data is explained. What is sufficient is, of course, subjective and depends on the example at hand.

Alternatively, often in social sciences, the underlying theory within the field of study indicates how many factors to expect. In psychology, for example, a circumplex model suggests that mood has two factors: positive affect and arousal. So a two-factor model may be considered for questionnaire data regarding the subjects' moods. In many respects, this is a better approach because then you are letting the science drive the statistics rather than the statistics drive the science! If you can, use your or a field expert's scientific understanding to determine how many factors should be included in your model.

#### Using SAS

The factor analysis is carried out using the program as shown below:

View the video explanation of the SAS code.

#### Using Minitab

View the video below to see how to perform a factor analysis using the Minitab statistical software application.

$$\hat{e}_{i}\sqrt{ \hat{\lambda}_{i}}$$

These are summarized in the table below. The factor loadings are only recorded for the first three factors because we set m=3. We should also note that the factor loadings are the correlations between the factors and the variables. For example, the correlation between the Arts and the first factor is about 0.86. Similarly, the correlation between climate and that factor is only about 0.28.

 Factor Variable 1 2 3 Climate 0.286 0.076 0.841 Housing 0.698 0.153 0.084 Health 0.744 -0.410 -0.020 Crime 0.471 0.522 0.135 Transportation 0.681 -0.156 -0.148 Education 0.498 -0.498 -0.253 Arts 0.861 -0.115 0.011 Recreation 0.642 0.322 0.044 Economics 0.298 0.595 -0.533

Interpreting factor loadings is similar to interpreting the coefficients for principal component analysis. We want to determine some inclusion criterion, which in many instances, may be somewhat arbitrary. In the above table, the values that we consider large are in boldface, using about .5 as the cutoff. The following statements are based on this criterion:

1. Factor 1 is correlated most strongly with Arts (0.861) and also correlated with Health, Housing, Recreation, and to a lesser extent Crime and Education. You can say that the first factor is primarily a measure of these variables.

2. Similarly, Factor 2 is correlated most strongly with Crime, Education, and Economics. You can say that the second factor is primarily a measure of these variables.

3. Likewise, Factor 3 is correlated most strongly with Climate and Economics. You can say that the first factor is primarily a measure of these variables.

The interpretation above is very similar to that obtained in the standardized principal component analysis.

# 12.5 - Communalities

12.5 - Communalities

## Example 12-1: Continued...

The communalities for the $$i^{th}$$ variable are computed by taking the sum of the squared loadings for that variable. This is expressed below:

$$\hat{h}^2_i = \sum\limits_{j=1}^{m}\hat{l}^2_{ij}$$

To understand the computation of communulaties, recall the table of factor loadings:

 Factor Variable 1 2 3 Climate 0.286 0.076 0.841 Housing 0.698 0.153 0.084 Health 0.744 -0.410 -0.020 Crime 0.471 0.522 0.135 Transportation 0.681 -0.156 -0.148 Education 0.498 -0.498 -0.253 Arts 0.861 -0.115 0.011 Recreation 0.642 0.322 0.044 Economics 0.298 0.595 -0.533

Let's compute the communality for Climate, the first variable. We square the factor loadings for climate (given in bold-face in the table above), then add the results:

$$\hat{h}^2_1 = 0.28682^2 + 0.07560^2 + 0.84085^2 = 0.7950$$

#### Using SAS

The communalities of the 9 variables can be obtained from page 4 of the SAS output as shown below:

Final Communality Estimates: Total = 5.616885
Climate housing health crime trans educate arts recreate econ
0.79500707 0.51783185 0.72230182 0.51244913 0.50977159 0.56073895 0.75382091 0.51725940 0.72770402

5.616885, (located just above the individual communalities), is the "Total Communality".

#### Using Minitab

View the video below to see hhow to get the communalities using the Minitab statistical software application.

>

In summary, the communalities are placed into a table:

 Variable Communality Climate 0.795 Housing 0.518 Health 0.722 Crime 0.512 Transportation 0.510 Education 0.561 Arts 0.754 Recreation 0.517 Economics 0.728 Total 5.617

You can think of these values as multiple $$R^{2}$$ values for regression models predicting the variables of interest from the 3 factors. The communality for a given variable can be interpreted as the proportion of variation in that variable explained by the three factors. In other words, if we perform multiple regression of climate against the three common factors, we obtain an $$R^{2} = 0.795$$, indicating that about 79% of the variation in climate is explained by the factor model. The results suggest that the factor analysis does the best job of explaining variation in climate, the arts, economics, and health.

One assessment of how well this model performs can be obtained from the communalities.  We want to see values that are close to one. This indicates that the model explains most of the variation for those variables. In this case, the model does better for some variables than it does for others. The model explains Climate the best, and is not bad for other variables such as Economics, Health and the Arts. However, for other variables such as Crime, Recreation, Transportation and Housing the model does not do a good job, explaining only about half of the variation.

The sum of all communality values is the total communality value:

$$\sum\limits_{i=1}^{p}\hat{h}^2_i = \sum\limits_{i=1}^{m}\hat{\lambda}_i$$

Here, the total communality is 5.617. The proportion of the total variation explained by the three factors is

$$\dfrac{5.617}{9} = 0.624$$

This is the percentage of variation explained in our model. This could be considered an overall assessment of the performance of the model. However, this percentage is the same as the proportion of variation explained by the first three eigenvalues, obtained earlier. The individual communalities tell how well the model is working for the individual variables, and the total communality gives an overall assessment of performance. These are two different assessments.

Because the data are standardized, the variance for the standardized data is equal to one. The specific variances are computed by subtracting the communality from the variance as expressed below:

$$\hat{\Psi}_i = 1-\hat{h}^2_i$$

Recall that the data were standardized before analysis, so the variances of the standardized variables are all equal to one. For example, the specific variance for Climate is computed as follows:

$$\hat{\Psi}_1 = 1-0.795 = 0.205$$

The specific variances are found in the SAS output as the diagonal elements in the table on page 5 as seen below:

Residual Correlation with Uniqueness on the Diagonal

Climate Housing Health crime Trans Educate Arts Recreate Econ
Climate 0.20499 -0.00924 -0.01476 -0.06027 -0.03720 0.18537 -0.07518 -0.12475 0.21735
Housing -0.00924 0.48217 -0.02317 -0.28063 -0.12119 -0.04803 -0.07518 -0.04032 0.04249
Health -0.01476 -0.02317 0.27770 0.05007 -0.15480 -0.11537 -0.00929 -0.09108 0.06527
Crime -0.06027 -0.28063 0.05007 0.48755 0.05497 0.11562 0.00009 -0.18377 -0.10288
Trans -0.03720 -0.12119 -0.15480 0.05497 0.49023 -0.14318 -0.05439 0.01041 -0.12641
Educate 0.18537 -0.04803 -0.11537 0.11562 -0.14318 0.43926 -0.13515 -0.05531 0.14197
Arts -0.07518 -0.07552 -0.00929 0.00009 -0.05439 -0.13515 0.24618 -0.01926 -0.04687
Recreate -0.12475 -0.04032 -0.09108 -0.18377 0.01041 -0.05531 -0.01926 0.48274 -0.18326
Econ 0.21735 0.04249 0.06527 -0.10288 -0.12641 0.14197 -0.04687 -0.18326 0.27230

For example, the specific variance for housing is 0.482.

This model provides an approximation to the correlation matrix.  We can assess the model's appropriateness with the residuals obtained from the following calculation:

$$s_{ij}- \sum\limits_{k=1}^{m}l_{ik}l_{jk}; i \ne j = 1, 2, \dots, p$$

This is basically the difference between R and LL', or the correlation between variables i and j minus the expected value under the model. Generally, these residuals should be as close to zero as possible. For example, the residual between Housing and Climate is -0.00924 which is pretty close to zero. However, there are some that are not very good. The residual between Climate and Economy is 0.217.  These values give an indication of how well the factor model fits the data.

One disadvantage of the principal component method is that it does not provide a test for lack-of-fit. We can examine these numbers and determine if we think they are small or close to zero, but we really do not have a test for this.  Such a test is available for the maximum likelihood method.

# 12.6 - Final Notes about the Principal Component Method

12.6 - Final Notes about the Principal Component Method

Unlike the competing methods, the estimated factor loadings under the principal component method do not change as the number of factors is increased. This is not true of the remaining methods (e.g., maximum likelihood). However, the communalities and the specific variances will depend on the number of factors in the model. In general, as you increase the number of factors, the communalities increase towards one and the specific variances will decrease towards zero.

The diagonal elements of the variance-covariance matrix $$\mathbf{S}$$ (or $$\mathbf{R}$$) are equal to the diagonal elements of the model:

$$\mathbf{\hat{L}\hat{L}' + \mathbf{\hat{\Psi}}}$$

The off-diagonal elements are not exactly reproduced. This is in part due to variability in the data - just random chance. Therefore, we want to select the number of factors to make the off-diagonal elements of the residual matrix small:

$$\mathbf{S - (\hat{L}\hat{L}' + \hat{\Psi})}$$

Here, we have a trade-off between two conflicting desires. For a parsimonious model, we wish to select the number of factors m to be as small as possible, but for such a model, the residuals could be large. Conversely, by selecting m to be large, we may reduce the sizes of the residuals but at the cost of producing a more complex and less interpretable model (there are more factors to interpret).

Another result to note is that the sum of the squared elements of the residual matrix is equal to sum of the squared values of the eigenvalues left out of the matrix:

$$\sum\limits_{j=m+1}^{p}\hat{\lambda}^2_j$$

### General Methods used in determining the number of Factors

Below are three common techniques used to determine the number of factors to extract:

1. Cumulative proportion of at least 0.80 (or 80% explained variance)
2. Eigenvalues of at least one
3. Scree plot based on the "elbow" of the plot; that is, where the plot turns and begins to flatten out

# 12.7 - Maximum Likelihood Estimation Method

12.7 - Maximum Likelihood Estimation Method

## Assumption

Maximum Likelihood Estimation requires that the data are sampled from a multivariate normal distribution. This is a drawback of this method. Data is often collected on a Likert scale, especially in the social sciences. Because a Likert scale is discrete and bounded, these data cannot be normally distributed.

Using the Maximum Likelihood Estimation Method, we must assume that the data are independently sampled from a multivariate normal distribution with mean vector $$\mu$$ and variance-covariance matrix of the form:

$$\boldsymbol{\Sigma} = \mathbf{LL' +\boldsymbol{\Psi}}$$

where $$\mathbf{L}$$ is the matrix of factor loadings and $$\Psi$$ is the diagonal matrix of specific variances.

We define additional notation: As usual, the data vectors for n subjects are represented as shown:

$$\mathbf{X_1},\mathbf{X_2}, \dots, \mathbf{X_n}$$

Maximum likelihood estimation involves estimating the mean, the matrix of factor loadings, and the specific variance.

The maximum likelihood estimator for the mean vector $$\mu$$, the factor loadings $$\mathbf{L}$$, and the specific variances $$\Psi$$ are obtained by finding $$\hat{\mathbf{\mu}}$$, $$\hat{\mathbf{L}}$$, and $$\hat{\mathbf{\Psi}}$$ that maximize the log likelihood given by the following expression:

$$l(\mathbf{\mu, L, \Psi}) = - \dfrac{np}{2}\log{2\pi}- \dfrac{n}{2}\log{|\mathbf{LL' + \Psi}|} - \dfrac{1}{2}\sum_{i=1}^{n}\mathbf{(X_i-\mu)'(LL'+\Psi)^{-1}(X_i-\mu)}$$

The log of the joint probability distribution of the data is maximized. We want to find the values of the parameters, ($$\mu$$, $$\mathbf{L}$$, and $$\Psi$$), that is most compatible with what we see in the data. As was noted earlier the solutions for these factor models are not unique. Equivalent models can be obtained by rotation. If $$\mathbf{L'\Psi^{-1}L}$$ is a diagonal matrix, then we may obtain a unique solution.

Computationally this process is complex. In general, there is no closed-form solution to this maximization problem so iterative methods are applied. Implementation of iterative methods can run into problems as we will see later.

# 12.8 - Example: Places Rated Data

12.8 - Example: Places Rated Data

## Example 12-2: Places Rated

#### Using SAS

This method of factor analysis is being carried out using the program shown below:

Here we have specified the Maximum Likelihood Method by setting method=ml. Again, we need to specify the number of factors.

You will notice that this program produces errors and does not complete the factor analysis. We will start out without the Heywood or priors options discussed below to see the error that occurs and how to remedy it.

For m = 3 factors, maximum likelihood estimation fails to converge.  An examination of the records of each iteration reveals that the commonality of the first variable (climate) exceeds one during the first iteration.  Because the communality must lie between 0 and 1, this is the cause for failure.

SAS provides a number of different fixes for this kind of error.  Most fixes adjust the initial guess, or starting value, for the commonalities.

#### Remedies

1. Attempt adding a priors option to the procedure to set an initial guess for the commonalities. Available priors include:
• priors=smc: Sets the prior commonality of each variable proportional to the R2 of that variable with all other variables as an initial guess.
• priors=asmc: As above with an adjustment so that the sum of the commonalities is equal to the maximum of the absolute correlations.
• priors=max: Sets the prior commonality of each variable to the maximum absolute correlation within any other variable.
• priors=random: Sets the prior commonality of each variable to a random number between 0 and 1.

This option is added within the proc factor line of code (proc factor method=ml nfactors=3 priors=smc;).  If we begin with better starting values, then we might have better luck at convergence. Unfortunately, in trying each of these options, (including running the random option multiple times), we find that these options are ineffective for our Places Rated Data. The second option needs to be considered.

2. Attempt adding the Heywood option to the procedure (proc factor method=ml nfactors=3 heywood;). This sets communalities greater than one back to one, allowing iterations to proceed. In other words, if the commonality value falls out of bounds, then it will be replaced by a value of one. This will always yield a solution, but frequently the solution will not adequately fit the data.

We start with the same values for the commonalities and then at each iteration we obtain new values for the commonalities. The criterion is a value that we are trying to minimize in order to obtain our estimates. We can see that the convergence criterion decreases with each iteration of the algorithm.

Iteration Criterion Ridge Change Communalities
1 0.3291161 0.0000 0.2734 0.47254 0.40913 0.73500 0.22107
0.38516 0.26178 0.75125 0.46384
0.15271
2 0.2946707 0.0000 0.5275 1.00000 0.37872 0.75101 0.20469
0.36111 0.26155 0.75298 0.48979
0.11995
3 0.2877116 0.0000 0.0577 1.00000 0.41243 0.80868 0.22168
0.38551 0.26263 0.74546 0.53277
0.11601
4 0.2876330 0.0000 0.0055 1.00000 0.41336 0.81414 0.21647
0.38365 0.26471 0.74493 0.53724
0.11496
5 0.2876314 0.0000 0.0007 1.00000 0.41392 0.81466 0.21595
0.38346 0.26475 0.74458 0.53794
0.11442

You can see in the second iteration that rather than report a commonality greater than one, SAS replaces it with the value one and then proceeds as usual through the iterations.

After five iterations the algorithm converges, as indicated by the statement on the second page of the output.  The algorithm converged to a setting where the commonality for Climate is equal to one.

#### Using Minitab

Click on the video below to see how to perform a factor analysis with maximum likelihood using the Minitab statistical software application.

# 12.9 - Goodness-of-Fit

12.9 - Goodness-of-Fit

Before we proceed, we would like to determine if the model adequately fits the data. The goodness-of-fit test in this case compares the variance-covariance matrix under a parsimonious model to the variance-covariance matrix without any restriction, i.e. under the assumption that the variances and covariances can take any values. The variance-covariance matrix under the assumed model can be expressed as:

$$\mathbf{\Sigma = LL' + \Psi}$$

$$\mathbf{L}$$ is the matrix of factor loadings, and the diagonal elements of $$Ψ$$ are equal to the specific variances. This is a very specific structure for the variance-covariance matrix. A more general structure would allow those elements to take any value. To assess goodness-of-fit, we use the Bartlett-Corrected Likelihood Ratio Test Statistic:

$$X^2 = \left(n-1-\frac{2p+4m-5}{6}\right)\log \frac{|\mathbf{\hat{L}\hat{L}'}+\mathbf{\hat{\Psi}}|}{|\hat{\mathbf{\Sigma}}|}$$

The test is a likelihood ratio test, where two likelihoods are compared, one under the parsimonious model and the other without any restrictions. The constant is the statistic is called the Bartlett correction. The log is the natural log. In the numerator we have the determinant of the fitted factor model for the variance-covariance matrix, and below, we have a sample estimate of the variance-covariance matrix assuming no structure where:

$$\hat{\boldsymbol{\Sigma}} = \frac{n-1}{n}\mathbf{S}$$

and $$\mathbf{S}$$ is the sample variance-covariance matrix. This is just another estimate of the variance-covariance matrix which includes a small bias. If the factor model fits well then these two determinants should be about the same and you will get a small value for $$X_{2}$$. However, if the model does not fit well, then the determinants will be difference and $$X_{2}$$ will be large.

Under the null hypothesis that the factor model adequately describes the relationships among the variables,

$$\mathbf{X}^2 \sim \chi^2_{\frac{(p-m)^2-p-m}{2}}$$

Under the null hypothesis, that the factor model adequately describes the data, this test statistic has a chi-square distribution with an unusual set of degrees of freedom as shown above. The degrees of freedom are the difference in the number of unique parameters in the two models. We reject the null hypothesis that the factor model adequately describes the data if $$X_{2}$$ exceeds the critical value from the chi-square table.

Back to the Output... Looking just past the iteration results, we have...

Significance Tests based on 329 Observations

Test DF Chi-Square Pr > ChiSq
$$H_{o}\colon$$ No common factors 36 839.4268 < 0.0001
$$H_{A}\colon$$ At least one common factor
$$H_{o}\colon$$ 3 Factors are sufficient 12 92.6652 < 0.0001
$$H_{A}\colon$$ More Factors are needed

For our Places Rated dataset, we find a significant lack of fit. $$X _ { 2 } = 92.67 ; d . f = 12 ; p < 0.0001$$. We conclude that the relationships among the variables is not adequately described by the factor model. This suggests that we do not have the correct model.

The only remedy that we can apply in this case is to increase the number m of factors until an adequate fit is achieved. Note, however, that m must satisfy

$$p(m+1) \le \frac{p(p+1)}{2}$$

In the present example, this means that m ≤ 4.

Let's return to the SAS program and change the "nfactors" value from 3 to 4:

Significance Tests based on 329 Observations

Test DF Chi-Square Pr > ChiSq
$$H_{o}\colon$$ No common factors 36 839.4268 < 0.0001
$$H_{A}\colon$$ At least one common factor
$$H_{o}\colon$$ 4 Factors are sufficient 6 41.6867 < 0.0001
$$H_{A}\colon$$ More Factors are needed

We find that the factor model with m = 4 does not fit the data adequately either, $$X _ { 2 } = 41.69 ; d . f . = 6 ; p < 0.0001$$. We cannot properly fit a factor model to describe this particular data and conclude that a factor model does not work with this particular dataset. There is something else going on here, perhaps some non-linearity. Whatever the case, it does not look like this yields a good-fitting factor model. A next step could be to drop variables from the data set to obtain a better fitting model.

# 12.10 - Factor Rotations

12.10 - Factor Rotations

From our experience with the Places Rated data, it does not look like the factor model works well. There is no guarantee that any model will fit the data well.

The first motivation of factor analysis was to try to discern some underlying factors describing the data. The Maximum Likelihood Method failed to find such a model to describe the Places Rated data. The second motivation is still valid, that is to try to obtain a better interpretation of the data. In order to do this, let's take a look at the factor loadings obtained before from the principal component method.

 Factor Variable 1 2 3 Climate 0.286 0.076 0.841 Housing 0.698 0.153 0.084 Health 0.744 -0.410 -0.020 Crime 0.471 0.522 0.135 Transportation 0.681 -0.156 -0.148 Education 0.498 -0.498 -0.253 Arts 0.861 -0.115 0.011 Recreation 0.642 0.322 0.044 Economics 0.298 0.595 -0.533

The problem with this analysis is that some of the variables are highlighted in more than one column. For instance, Education appears significant to Factor 1 AND Factor 2. The same is true for Economics in both Factors 2 AND 3. This does not provide a very clean, simple interpretation of the data. Ideally, each variable would appear as a significant contributer in one column.

In fact the above table may indicate contradictory results. Looking at some of the observations, it is conceivable that we will find an observation that takes a high value on both Factors 1 and 2. If this occurs, a high value for Factor 1 suggests that the community has quality education, whereas the high value for Factor 2 suggests the opposite, that the community has poor education.

Factor rotation is motivated by the fact that factor models are not unique. Recall that the factor model for the data vector, $$\mathbf{X = \boldsymbol{\mu} + LF + \boldsymbol{\epsilon}}$$, is a function of the mean $$\boldsymbol{\mu}$$, plus a matrix of factor loadings times a vector of common factors, plus a vector of specific factors.

Moreover, we should note that this is equivalent to a rotated factor model, $$\mathbf{X = \boldsymbol{\mu} + L^*F^* + \boldsymbol{\epsilon}}$$, where we have set $$\mathbf{L^* = LT}$$ and $$\mathbf{f^* = T'f}$$ for some orthogonal matrix $$\mathbf{T}$$ where $$\mathbf{T'T = TT' = I}$$. Note that there are an infinite number of possible orthogonal matrices, each corresponding to a particular factor rotation.

We plan to find an appropriate rotation, defined through an orthogonal matrix $$\mathbf{T}$$, that yields the most easily interpretable factors.

To understand this, consider a scatter plot of factor loadings. The orthogonal matrix $$\mathbf{T}$$ rotates the axes of this plot. We wish to find a rotation such that each of the p variables has a high loading on only one factor.

#### Using SAS

We will return to the program below to obtain a plot.  In looking at the program, there are a number of options (marked in blue under proc factor) that we did not yet explain.

One of the options above is labeled 'preplot'. We will use this to plot the values for factor 1 against factor 2.

In the output these values are plotted, the loadings for factor 1 on the y-axis, and the loadings for factor 2 on the x-axis. Similarly, the second variable, labeled with the letter B, has a factor 1 loading of about 0.7 and a factor 2 loading of about 0.15.  Each letter on the plot corresponds to a single variable. SAS provides plots of the other combinations of factors, factor 1 against factor 3 as well as factor 2 against factor 3.

Three factors appear in this model so what we might consider a three-dimensional plot of all three factors together.

#### Using Minitab

To obtain the scree plot and the loading plot using the Minitab statistical software application.

The selection of the orthogonal matrixes $$\mathbf{T}$$ corresponds to our rotation of these axis. Think about rotating the axis about the center. Each rotation will correspond to an orthogonal matrix $$\mathbf{T}$$. We want to rotate the axes to obtain a cleaner interpretation of the data. We would really like to define new coordinate systems so that when we rotate everything, the points fall close to the vertices (end points) of the new axes.

If we were only looking at two factors, then we would like to find each of the plotted points at the four tips (corresponding to all four directions) of the rotated axes. This is what rotation is about, taking the factor pattern plot and rotating the axes in such a way so that the points fall close to the axes.

# 12.11 - Varimax Rotation

12.11 - Varimax Rotation
Varimax Rotation
Varimax rotation is the most common. It involves scaling the loadings by dividing them by the corresponding communality as shown below:
$$\tilde{l}^*_{ij}= \hat{l}^*_{ij}/\hat{h}_i$$
Varimax rotation finds the rotation that maximizes this quantity. The Varimax procedure, as defined below, selects the rotation in order to maximize
$$V = \frac{1}{p}\sum\limits_{j=1}^{m}\left\{\sum\limits_{i=1}^{p}(\tilde{l}^*_{ij})^4 - \frac{1}{p}\left(\sum\limits_{i=1}^{p}(\tilde{l}^*_{ij})^2 \right)^2 \right\}$$

This is the sample variances of the standardized loadings for each factor summed over the m factors.

#### Using SAS

Returning to the options of the factor procedure (marked in blue):

"plot" asks for the same kind of plot that we just looked at for the rotated factors. The result of our rotation is a new factor pattern given below (page 11 of SAS output): Here is a copy of page 10 from the SAS output: At the top of page 10 of the output, above, we have our orthogonal matrix T .

#### Using Minitab

View the video below to see how to use the Varimax rotation using the Minitab statistical software application.

 Factor Variable 1 2 3 Climate 0.021 0.239 0.859 Housing 0.438 0.547 0.166 Health 0.829 0.127 0.137 Crime 0.031 0.702 0.139 Transportation 0.652 0.289 -0.028 Education 0.734 -0.094 -0.117 Arts 0.738 0.432 0.150 Recreation 0.301 0.656 0.099 Economics -0.022 0.651 -0.551

Let us now interpret the data based on the rotation. We highlighted the values that are large in magnitude and make the following interpretation.

Note! The interpretation is much cleaner than that of the original analysis.
• Factor 1: primarily a measure of Health, but also increases with increasing scores for Transportation, Education, and the Arts.
• Factor 2: primarily a measure of Crime, Recreation, the Economy, and Housing.
• Factor 3: primarily a measure of Climate alone.

This is just the pattern that exists in the data and no causal inferences should be made from this interpretation. It does not tell us why this pattern exists. It could very well be that there are other essential factors that are not seen at work here.

Let us look at the amount of variation explained by our factors under the rotated model and compare it to the original model. Consider the variance explained by each factor under the original analysis and the rotated factors:

 Analysis Factor Original Rotated 1 3.2978 2.4798 2 1.2136 1.9835 3 1.1055 1.1536 Total 5.6169 5.6169

The total amount of variation explained by the 3 factors remains the same. Rotations, among a fixed number of factors, do not change how much of the variation is explained by the model. The fit is equally good regardless of what rotation is used.

However, notice what happened to the first factor. We see a fairly large decrease in the amount of variation explained by the first factor. We obtained a cleaner interpretation of the data but it costs us something somewhere. The cost is that the variation explained by the first factor is distributed among the latter two factors, in this case mostly to the second factor.

The total amount of variation explained by the rotated factor model is the same, but the contributions are not the same from the individual factors. We gain a cleaner interpretation, but the first factor does not explain as much of the variation. However, this would not be considered a particularly large cost if we are still interested in these three factors.

Rotation cleans up the interpretation. Ideally, we should find that the numbers in each column are either far away from zero or close to zero. Numbers close to +1 or -1 or 0 in each column give the ideal or cleanest interpretation. If a rotation can achieve this goal, then that is wonderful. However, observed data are seldom this cooperative!

Nevertheless, recall that the objective is data interpretation. The success of the analysis can be judged by how well it helps you to make sense of your data If the result gives you some insight as to the pattern of variability in the data, even without being perfect, then the analysis was successful.

# 12.12 - Estimation of Factor Scores

12.12 - Estimation of Factor Scores

Factor scores are similar to the principal components in the previous lesson. Just we plotted principal components against each other, a similar scatter plot of factor scores is also helpful. We also might use factor scores as explanatory variables in future analyses. It may even be of interest to use the factor score as the dependent variable in a future analysis.

The methods for estimating factor scores depend on the method used to carry out the principal components analysis. The vectors of common factors f is of interest. There are m unobserved factors in our model and we would like to estimate those factors. Therefore, given the factor model:

$$\mathbf{Y_i = \boldsymbol{\mu} + Lf_i + \boldsymbol{\epsilon_i}}; i = 1,2,\dots, n,$$

we may wish to estimate the vectors of factor scores

$$\mathbf{f_1, f_2, \dots, f_n}$$

for each observation.

## Methods

There are a number of different methods for estimating factor scores from the data. These include:

• Ordinary Least Squares
• Weighted Least Squares
• Regression method

## Ordinary Least Squares

By default, this is the method that SAS uses if you use the principal component method. The difference between the $$j^{th}$$ variable on the $$i^{th}$$ subject and its value under the factor model is computed. The $$\mathbf{L}$$'s are factor loadings and the f's are the unobserved common factors. The vector of common factors for subject i, or $$\hat{\mathbf{f}}_i$$, is found by minimizing the sum of the squared residuals:

$\sum_{j=1}^{p}\epsilon^2_{ij} = \sum_{j=1}^{p}(y_{ij}-\mu_j-l_{j1}f_1 - l_{j2}f_2 - \dots - l_{jm}f_m)^2 = (\mathbf{Y_i - \boldsymbol{\mu} - Lf_i})'(\mathbf{Y_i - \boldsymbol{\mu} - Lf_i})$

This is like a least squares regression, except in this case we already have estimates of the parameters (the factor loadings), but wish to estimate the explanatory common factors. In matrix notation the solution is expressed as:

$$\mathbf{\hat{f}_i = (L'L)^{-1}L'(Y_i-\boldsymbol{\mu})}$$

In practice, we substitute in our estimated factor loadings into this expression as well as the sample mean for the data:

$$\mathbf{\hat{f}_i = \left(\hat{L}'\hat{L}\right)^{-1}\hat{L}'(Y_i-\bar{y})}$$

Using the principal component method with the unrotated factor loadings, this yields:

$\mathbf{\hat{f}_i} = \left(\begin{array}{c} \frac{1}{\sqrt{\hat{\lambda}_1}}\mathbf{\hat{e}'_1(Y_i-\bar{y})}\\ \frac{1}{\sqrt{\hat{\lambda}_2}}\mathbf{\hat{e}'_2(Y_i-\bar{y})}\\ \vdots \\ \frac{1}{\sqrt{\hat{\lambda}_m}}\mathbf{\hat{e}'_m(Y_i-\bar{y})}\end{array}\right)$

$$e_i$$ through $$e_m$$ are our first m eigenvectors.

## Weighted Least Squares (Bartlett)

The difference between WLS and OLS is that the squared residuals are divided by the specific variances as shown below. This is going to give more weight, in this estimation, to variables that have low specific variances.  The factor model fits the data best for variables with low specific variances.  The variables with low specific variances should give us more information regarding the true values for the specific factors.

Therefore, for the factor model:

$$\mathbf{Y_i = \boldsymbol{\mu} + Lf_i + \boldsymbol{\epsilon_i}}$$

we want to find $$\boldsymbol{f_i}$$ that minimizes

$$\sum\limits_{j=1}^{p}\frac{\epsilon^2_{ij}}{\Psi_j} = \sum\limits_{j=1}^{p}\frac{(y_{ij}-\mu_j - l_{j1}f_1 - l_{j2}f_2 -\dots - l_{jm}f_m)^2}{\Psi} = \mathbf{(Y_i-\boldsymbol{\mu}-Lf_i)'\Psi^{-1}(Y_i-\boldsymbol{\mu}-Lf_i)}$$

The solution is given by this expression where $$\mathbf{\Psi}$$ is the diagonal matrix whose diagonal elements are equal to the specific variances:

$$\mathbf{\hat{f}_i = (L'\Psi^{-1}L)^{-1}L'\Psi^{-1}(Y_i-\boldsymbol{\mu})}$$

and can be estimated by substituting in the following:

$$\mathbf{\hat{f}_i = (\hat{L}'\hat{\Psi}^{-1}\hat{L})^{-1}\hat{L}'\hat{\Psi}^{-1}(Y_i-\bar{y})}$$

## Regression Method

This method is used for maximum likelihood estimates of factor loadings. A vector of the observed data, supplemented by the vector of factor loadings for the ith subject, is considered.

The joint distribution of the data $$\boldsymbol{Y}_i$$ and the factor $$\boldsymbol{f}_i$$ is

$$\left(\begin{array}{c}\mathbf{Y_i} \\ \mathbf{f_i}\end{array}\right) \sim N \left[\left(\begin{array}{c}\mathbf{\boldsymbol{\mu}} \\ 0 \end{array}\right), \left(\begin{array}{cc}\mathbf{LL'+\Psi} & \mathbf{L} \\ \mathbf{L'} & \mathbf{I}\end{array}\right)\right]$$

Using this we can calculate the conditional expectation of the common factor score $$\boldsymbol{f}_i$$ given the data $$\boldsymbol{Y}_i$$ as expressed here:

$$E(\mathbf{f_i|Y_i}) = \mathbf{L'(LL'+\Psi)^{-1}(Y_i-\boldsymbol{\mu})}$$

This suggests the following estimator by substituting in the estimates for L and $$\mathbf{\Psi}$$:

$$\mathbf{\hat{f}_i = \hat{L}'\left(\hat{L}\hat{L}'+\hat{\Psi}\right)^{-1}(Y_i-\bar{y})}$$

There is a little bit of a fix that often takes place to reduce the effects of incorrect determination of the number of factors. This tends to give you results that are a bit more stable.

$$\mathbf{\tilde{f}_i = \hat{L}'S^{-1}(Y_i-\bar{y})}$$

# 12.13 - Summary

12.13 - Summary

In this lesson we learned about: