8.4 - Example: Pottery Data - Checking Model Assumptions

Example 8-1 Pottery Data (MANOVA)

Before carrying out a MANOVA, first check the model assumptions:

1. The data from group i have a common mean vector \(\boldsymbol{\mu}_{i}\).
2. The data from all groups have a common variance-covariance matrix \(\Sigma\).
3. Independence: The subjects are independently sampled.
4. Normality: The data are multivariate normally distributed.

Assumptions

Assumption 1: The data from group i have a common mean vector \(\boldsymbol{\mu}_{i}\)

This assumption says that there are no subpopulations with different mean vectors. Here, this assumption might be violated if pottery collected from the same site had inconsistencies.

Assumption 3: Independence: The subjects are independently sampled.

This assumption is satisfied if the assayed pottery is obtained by randomly sampling the pottery collected from each site. This assumption would be violated if, for example, pottery samples were collected in clusters. In other applications, this assumption may be violated if the data were collected over time or space.

Assumption 4: Normality: The data are multivariate normally distributed.

Note!

For large samples, the Central Limit Theorem says that the sample mean vectors are approximately multivariate normally distributed, even if the individual observations are not.
For the pottery data, however, we have a total of only N = 26 observations, including only two samples from Caldicot. With a small N, we cannot rely on the Central Limit Theorem.

Diagnostic procedures are based on the residuals, computed by taking the differences between the individual observations and the group means for each variable:

\(\hat{\epsilon}_{ijk} = Y_{ijk}-\bar{Y}_{i.k}\)

Thus, for each subject (or pottery sample in this case), residuals are defined for each of the p variables. Then, to assess normality, we apply the following graphical procedures:

Plot the histograms of the residuals for each variable. Look for a symmetric distribution.
Plot a matrix of scatter plots. Look for elliptical distributions and outliers.
Plot three-dimensional scatter plots. Look for elliptical distributions and outliers.

If the histograms are not symmetric or the scatter plots are not elliptical, this would be evidence that the data are not sampled from a multivariate normal distribution, in violation of Assumption 4. In this case, a normalizing transformation should be considered.

Download the text file containing the data here: pottery.csv

Example

The SAS program below will help us check this assumption.

Download the SAS Program here: potterya.sas

options ls=78;
title "Check for normality - Pottery Data";

data pottery;
  infile "D:\Statistics\STAT 505\data\pottery.csv" delimiter=',' firstobs=2;
  input site $ al fe mg ca na;
  run;

 /* The class statement specifies the categorical variable site.
  * The model statement specifies the five responses to the left
  * and the categorical predictor to the right of the = sign.
  * The output statement is optional and can be used to save the
  * residuals, which are named with the r= option for later use.
  */

proc glm data=pottery;
  class site;
  model al fe mg ca na = site;
  output out=resids r=ral rfe rmg rca rna;
  run;

proc print data=resids;
  run;
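
The graphical checks listed earlier (histograms and a matrix of scatter plots) can be produced from the saved residuals. The following is a minimal sketch, assuming the resids data set and the residual names ral, rfe, rmg, rca, and rna created by the program above. PROC UNIVARIATE and PROC SGSCATTER are one possible choice of procedures, not necessarily the ones used to produce the plots discussed in this lesson.

```sas
 /* Histograms of the five residual variables, each with a fitted
  * normal curve overlaid for visual reference.
  * The noprint option suppresses the descriptive statistics tables.
  */

proc univariate data=resids noprint;
  var ral rfe rmg rca rna;
  histogram ral rfe rmg rca rna / normal;
  run;

 /* Matrix of pairwise scatter plots of the residuals.
  * The group= option marks points by site, and diagonal=(histogram)
  * places a histogram of each variable on the diagonal.
  */

proc sgscatter data=resids;
  matrix ral rfe rmg rca rna / group=site diagonal=(histogram);
  run;
```

Look for roughly symmetric histograms and roughly elliptical point clouds; isolated points far from the rest are candidate outliers.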

MANOVA normality assumption

To fit the MANOVA model and assess the normality of residuals in Minitab:

Open the ‘pottery’ data set in a new worksheet
For convenience, rename the columns: site, al, fe, mg, ca, and na from left to right.
Stat > ANOVA > General MANOVA
1. Highlight and select all five variables (al through na) to move them to the Responses window
2. Highlight and select 'site' to move it to the Model window.
3. Graphs > Individual plots, check Histogram and Normal plot, then 'OK'.
4. Choose 'OK' again. The MANOVA table, along with the residual plots, is displayed in the results area.

Histograms suggest that, except for sodium, the distributions are relatively symmetric. However, the histogram for sodium suggests that there are two outliers in the data. Both of these outliers are in Llanadyrn.
Two outliers can also be identified from the matrix of scatter plots.
Removal of the two outliers results in a more symmetric distribution for sodium.

The results of MANOVA can be sensitive to the presence of outliers. One approach to assessing this would be to analyze the data twice, once with the outliers and once without them. The results may then be compared for consistency. The following analyses use all of the data, including the two outliers.
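
One way to set up such a comparison in SAS is sketched below. It assumes the resids data set from the normality program and assumes that the two outliers are the two observations with the largest absolute sodium residuals; the data set names flagged and ranked and the variables abs_rna and na_rank are hypothetical names introduced for illustration.

```sas
 /* Absolute sodium residuals; resids comes from the earlier proc glm step. */

data flagged;
  set resids;
  abs_rna = abs(rna);
  run;

 /* Rank the observations so that the largest absolute residual gets rank 1. */

proc rank data=flagged out=ranked descending;
  var abs_rna;
  ranks na_rank;
  run;

 /* MANOVA using all observations. */

title "MANOVA - all observations";
proc glm data=ranked;
  class site;
  model al fe mg ca na = site;
  manova h=site;
  run;

 /* MANOVA after dropping the two largest sodium residuals
  * (assumed here to be the two outliers).
  */

title "MANOVA - two largest sodium residuals removed";
proc glm data=ranked;
  where na_rank > 2;
  class site;
  model al fe mg ca na = site;
  manova h=site;
  run;
  quit;
```

If the two sets of test statistics lead to the same conclusion, the outliers have little influence on the analysis.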

Assumption 2: The data from all groups have a common variance-covariance matrix \(\Sigma\).

This assumption can be checked using Box's test for homogeneity of variance-covariance matrices. To obtain Box's test, let \(\Sigma_{i}\) denote the population variance-covariance matrix for group i. Consider testing:

\(H_0\colon \Sigma_1 = \Sigma_2 = \dots = \Sigma_g\)

against

\(H_a\colon \Sigma_i \ne \Sigma_j\) for at least one \(i \ne j\)

Under the alternative hypothesis, at least two of the variance-covariance matrices differ on at least one of their elements. Let:

\(\mathbf{S}_i = \dfrac{1}{n_i-1}\sum\limits_{j=1}^{n_i}\mathbf{(Y_{ij}-\bar{y}_{i.})(Y_{ij}-\bar{y}_{i.})'}\)

denote the sample variance-covariance matrix for group i. Compute the pooled variance-covariance matrix

\(\mathbf{S}_p = \dfrac{\sum_{i=1}^{g}(n_i-1)\mathbf{S}_i}{\sum_{i=1}^{g}(n_i-1)}= \dfrac{\mathbf{E}}{N-g}\)

Box's test is based on the following test statistic:

\(L' = c\left\{(N-g)\log |\mathbf{S}_p| - \sum_{i=1}^{g}(n_i-1)\log|\mathbf{S}_i|\right\}\)

where the correction factor is

\(c = 1-\dfrac{2p^2+3p-1}{6(p+1)(g-1)}\left\{\sum\limits_{i=1}^{g}\dfrac{1}{n_i-1}-\dfrac{1}{N-g}\right\}\)

The version of Box's test considered in the lesson on the two-sample Hotelling's T-square is a special case where g = 2. Under the null hypothesis of homogeneous variance-covariance matrices, L' is approximately chi-square distributed with

\(\dfrac{1}{2}p(p+1)(g-1)\)

degrees of freedom. Reject \(H_0\) at level \(\alpha\) if

\(L' > \chi^2_{\frac{1}{2}p(p+1)(g-1),\alpha}\)
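
As a worked check for the pottery data, take p = 5 response variables (al, fe, mg, ca, na) and g = 4 sites. The value of g is an assumption here, chosen because it agrees with the d.f. = 45 reported in the analysis of this example. The degrees of freedom are then

\(\dfrac{1}{2}p(p+1)(g-1) = \dfrac{1}{2}(5)(6)(3) = 45\)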

Here we will use the Pottery SAS program.

Download the SAS Program here: pottery2.sas

options ls=78;
title "Box's Test - Pottery Data";

data pottery;
  infile "D:\Statistics\STAT 505\data\pottery.csv" delimiter=',' firstobs=2;
  input site $ al fe mg ca na;
  run;

 /* The pool=test option requests a test of homogeneity of the within-group
  * covariance matrices; its result determines whether they are pooled.
  * The class statement specifies the categorical variable site.
  * The var statement specifies the five variables to be used to 
  * calculate the covariance matrix.
  */

proc discrim data=pottery pool=test;
  class site;
  var al fe mg ca na;
  run;

Minitab does not perform this function at this time.

Analysis

We find no statistically significant evidence against the null hypothesis that the variance-covariance matrices are homogeneous (L' = 27.58; d.f. = 45; p = 0.98).
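
The p-value and a critical value for this test can be recomputed from the chi-square distribution. The following is a minimal sketch using the SAS functions PROBCHI and CINV; the variable names and the choice of \(\alpha = 0.05\) are assumptions made for illustration.

```sas
 /* Chi-square p-value and critical value for Box's test.
  * lprime and df are the statistic and degrees of freedom reported above;
  * alpha = 0.05 is an assumed significance level.
  */

data _null_;
  lprime = 27.58;
  df = 45;
  alpha = 0.05;
  crit = cinv(1 - alpha, df);        /* upper-alpha critical value */
  pval = 1 - probchi(lprime, df);    /* upper-tail probability */
  put crit= pval=;
  run;
```

The values are written to the SAS log. The null hypothesis is rejected at level \(\alpha\) only if L' exceeds the critical value.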

Notes

If we were to reject the null hypothesis of homogeneity of variance-covariance matrices, then we would conclude that assumption 2 is violated.
MANOVA is not robust to violations of the assumption of homogeneous variance-covariance matrices.
If the variance-covariance matrices are determined to be unequal, then one solution is to find a variance-stabilizing transformation.
- Note that the assumptions of homogeneous variance-covariance matrices and multivariate normality are often violated together.
- Therefore, a normalizing transformation may also be a variance-stabilizing transformation.
