7.2.6 - Model Assumptions and Diagnostics

Assumptions

In carrying out any statistical analysis it is always important to consider the assumptions for the analysis and confirm that all assumptions are satisfied.

Let's recall the four assumptions underlying Hotelling's T-square test.

1. The data from population i are sampled from a population with mean vector \(\boldsymbol{\mu}_{i}\).
2. The data from both populations have a common variance-covariance matrix \(\Sigma\).
3. Independence. The subjects from both populations are independently sampled.
   Note! This does not mean that the variables are independent of one another.
4. Normality. Both populations are multivariate normally distributed.

The following will consider each of these assumptions separately, and methods for diagnosing their validity.

Assumption 1: The data from population i are sampled from a population with mean vector \(\boldsymbol{\mu}_{i}\).
- This assumption essentially means that there are no subpopulations with different population mean vectors.
- In our current example, this might be violated if the counterfeit notes were produced by more than one counterfeiter.
- Generally, if you have randomized experiments, this assumption is not of any concern. However, in the current application, we would have to ask the police investigators whether more than one counterfeiter might be present.
Assumption 2: For now we will skip Assumption 2 and return to it at a later time.
Assumption 3: Independence
- This assumption says that the subjects from each population were independently sampled. It does not mean that the variables are independent of one another.
- This assumption may be violated for three different reasons:
  - Clustered data: If bank notes are produced in batches, then the data may be clustered. In this case, the notes sampled within a batch may be correlated with one another.
  - Time-series data: If the notes are produced in some order over time, notes produced at times close to one another may be more similar than notes produced far apart. This temporal correlation would violate the assumptions of the analysis.
  - Spatial data: If the data were collected over space, we may encounter some spatial correlation.
  Note! The results of Hotelling's T-square are not generally robust to violations of independence.
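One simple numeric check for the time-series case above is the lag-1 autocorrelation of a measurement taken in production order. The following is a minimal Python sketch and not part of the course's SAS program: the function name `lag1_autocorrelation` and the data values are hypothetical, and it assumes the production order of the notes is actually known.

```python
def lag1_autocorrelation(values):
    """Lag-1 sample autocorrelation of a sequence listed in time order."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    # Total sum of squares around the mean.
    denom = sum((v - mean) ** 2 for v in values)
    if denom == 0:
        raise ValueError("values are constant")
    # Sum of cross-products between each observation and the next one.
    num = sum((values[t] - mean) * (values[t + 1] - mean) for t in range(n - 1))
    return num / denom


# Hypothetical measurements, listed in the order the notes were produced.
measurements = [9.1, 9.3, 9.2, 9.6, 9.8, 9.7, 10.1, 10.4, 10.2, 10.6]
print("lag-1 autocorrelation:", round(lag1_autocorrelation(measurements), 3))
```

Values near zero are consistent with independence over time; values well away from zero suggest temporal correlation. This check says nothing about clustered or spatial dependence.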
Assumption 4: Multivariate Normality

To assess this assumption we can use the following diagnostic procedures:
- Produce histograms for each variable. We should look for a symmetric distribution.
- Produce scatter plots for each pair of variables. Under multivariate normality, we should see an elliptical cloud of points.
- Produce a three-dimensional rotating scatter plot. Again, we should see an elliptical cloud of points.

Note! The Central Limit Theorem implies that, for sufficiently large samples, the sample mean vectors are approximately multivariate normally distributed regardless of the distribution of the original variables.

So, in general, Hotelling's T-square is not sensitive to violations of this assumption when the sample sizes are large.
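The effect of the Central Limit Theorem can be illustrated with a small simulation. The sketch below is a univariate illustration only, written in Python with the standard library: the exponential distribution, the sample size of 50, the number of replicates, and the seed are all arbitrary choices made for this example, not values from the bank note data.

```python
import random


def skewness(values):
    """Sample skewness: the average cubed standardized deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return sum(((v - mean) / sd) ** 3 for v in values) / n


rng = random.Random(505)   # fixed seed so the run is reproducible
n_per_sample = 50          # hypothetical sample size
n_replicates = 2000        # number of simulated samples

# Raw observations from a strongly right-skewed distribution.
raw = [rng.expovariate(1.0) for _ in range(n_replicates)]

# Sample means, each computed from n_per_sample skewed observations.
means = []
for _ in range(n_replicates):
    sample = [rng.expovariate(1.0) for _ in range(n_per_sample)]
    means.append(sum(sample) / n_per_sample)

skew_raw = skewness(raw)
skew_means = skewness(means)
print("skewness of raw data:    ", round(skew_raw, 2))
print("skewness of sample means:", round(skew_means, 2))
```

The sample means should be far closer to symmetric than the raw observations, which is why inference about mean vectors tolerates non-normal data when samples are large.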

Now let us return to assumption 2.

Assumption 2. The data from both populations have a common variance-covariance matrix \(\Sigma\).

This assumption may be assessed using Box's Test.

Box's Test

Suppose that the data from population i have variance-covariance matrix \(\Sigma_i\), for populations i = 1, 2. We need to test the null hypothesis that \(\Sigma_1\) is equal to \(\Sigma_2\) against the general alternative that they are not equal, as shown below:

\(H_0\colon \Sigma_1 = \Sigma_2\) against \(H_a\colon \Sigma_1 \ne \Sigma_2\)

Here, the alternative is that the variance-covariance matrices differ in at least one of their elements.

The test statistic for Box's Test is given by L-prime as shown below:

\(L' = c\{(n_1+n_2-2)\log{|\mathbf{S}_p|}- (n_1-1)\log{|\mathbf{S}_1|} - (n_2-1)\log{|\mathbf{S}_2|}\}\)

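Here \(\mathbf{S}_1\) and \(\mathbf{S}_2\) are the sample variance-covariance matrices of the two samples, and \(\mathbf{S}_p\) is the pooled variance-covariance matrix.

Note! In this formula, the logs are all natural logs.

The statistic involves a correction factor, c, which is given below: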

\(c = 1-\dfrac{2p^2+3p-1}{6(p+1)}\left\{\dfrac{1}{n_1-1}+\dfrac{1}{n_2-1} - \dfrac{1}{n_1+n_2-2}\right\}\)

It is a function of the number of variables p, and the sample sizes \(n_{1}\) and \(n_{2}\).
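As a worked illustration of the correction factor, take p = 6 variables and assume, purely for illustration, sample sizes of \(n_1 = n_2 = 100\) (these sample sizes are an assumption made for this example, not values stated above):

\(c = 1-\dfrac{2(6)^2+3(6)-1}{6(6+1)}\left\{\dfrac{1}{99}+\dfrac{1}{99} - \dfrac{1}{198}\right\} = 1 - \dfrac{89}{42}\cdot\dfrac{1}{66} \approx 0.9679\)

With samples of this size the correction is small: multiplying by c shrinks the quantity in braces by only about 3%.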

Under the null hypothesis, \(H_{0}\colon \Sigma_{1} = \Sigma_{2} \), Box's test statistic is approximately chi-square distributed with p(p + 1)/2 degrees of freedom. That is,

\(L' \overset{\cdot}{\sim} \chi^2_{\dfrac{p(p+1)}{2}}\)

The degrees of freedom is equal to the number of unique elements in the variance-covariance matrix (taking into account that this matrix is symmetric). We will reject \(H_0\) at level \(\alpha\) if the test statistic exceeds the critical value from the chi-square table evaluated at level \(\alpha\).

\(L' > \chi^2_{\dfrac{p(p+1)}{2}, \alpha}\)
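The formulas above can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch, not the SAS implementation used in this lesson: the function names `log_det` and `box_test` and the 2 x 2 matrices in the usage example are hypothetical, and the pooled matrix is assumed to be the usual weighted average of \(\mathbf{S}_1\) and \(\mathbf{S}_2\) with weights \(n_1 - 1\) and \(n_2 - 1\).

```python
import math


def log_det(matrix):
    """Natural log of |det| via Gaussian elimination with partial pivoting."""
    a = [[float(x) for x in row] for row in matrix]
    n = len(a)
    total = 0.0
    for k in range(n):
        # Pick the largest remaining entry in column k as the pivot.
        pivot = max(range(k, n), key=lambda r: abs(a[r][k]))
        if a[pivot][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[pivot] = a[pivot], a[k]
        total += math.log(abs(a[k][k]))
        for r in range(k + 1, n):
            factor = a[r][k] / a[k][k]
            for col in range(k, n):
                a[r][col] -= factor * a[k][col]
    return total


def box_test(s1, s2, n1, n2):
    """Return (L', degrees of freedom, c) for two sample covariance matrices."""
    p = len(s1)
    nu = n1 + n2 - 2
    # Pooled covariance matrix: weighted average of s1 and s2 (assumed form).
    s_pooled = [
        [((n1 - 1) * s1[i][j] + (n2 - 1) * s2[i][j]) / nu for j in range(p)]
        for i in range(p)
    ]
    # Correction factor c from the formula above.
    c = 1 - (2 * p * p + 3 * p - 1) / (6 * (p + 1)) * (
        1 / (n1 - 1) + 1 / (n2 - 1) - 1 / nu
    )
    stat = c * (
        nu * log_det(s_pooled)
        - (n1 - 1) * log_det(s1)
        - (n2 - 1) * log_det(s2)
    )
    df = p * (p + 1) // 2
    return stat, df, c


# Hypothetical 2 x 2 sample covariance matrices and sample sizes.
s1_example = [[1.0, 0.3], [0.3, 2.0]]
s2_example = [[1.5, -0.2], [-0.2, 1.0]]
stat, df, c = box_test(s1_example, s2_example, n1=30, n2=40)
print("L' =", round(stat, 3), " df =", df, " c =", round(c, 4))
```

The statistic would then be compared with the chi-square critical value on `df` degrees of freedom, as in the rejection rule above.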

Example

Box's Test may be carried out using the SAS program as shown below:

Download the SAS program here: swiss15.sas


options ls=78;
title "Bartlett's Test - Swiss Bank Notes";

data swiss;
  infile "D:\Statistics\STAT 505\data\swiss3.csv" firstobs=2 delimiter=',';
  input type $ length left right bottom top diag;
  run;

 /* The discrim procedure is called with the pool=test
  * option to produce Bartlett's test for equal 
  * covariance matrices. The remaining parts of the
  * output are not used.
  * The class statement defines the grouping variable,
  * which is the type of note, and all response
  * variables are specified  in the var statement.
  */

proc discrim pool=test;
  class type;
  var length left right bottom top diag;
  run;

The output can be downloaded here: swiss15.lst

At this time Minitab does not support this procedure.

Analysis

Under the null hypothesis that the variance-covariance matrices for the two populations are equal, the natural logs of the determinants of the variance-covariance matrices should be approximately the same for the fake and the real notes.

The results of Box's Test are on the bottom of page two of the output. The test statistic is 121.90 with 21 degrees of freedom; recall that p=6. The p-value for the test is less than 0.0001 indicating that we reject the null hypothesis.

The conclusion here is that the two populations of banknotes have different variance-covariance matrices in at least one of their elements. This is backed up by the evidence given by the test statistic \(\left( L' = 121.899;\ \mathrm{d.f.} = 21;\ p < 0.0001 \right)\). Therefore, the assumption of homogeneous variance-covariance matrices is violated.
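The reported p-value can be checked directly from the chi-square upper tail. The sketch below is a standard-library Python illustration, separate from the SAS output: the helper name `chi2_sf` is hypothetical, and it handles integer degrees of freedom only, using the recurrence \(Q(a+1, y) = Q(a, y) + y^{a}e^{-y}/\Gamma(a+1)\) for the regularized upper incomplete gamma function.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for chi-square with integer df >= 1."""
    if df < 1 or df != int(df):
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)              # Q(1, y) = exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))   # Q(1/2, y) = erfc(sqrt(y))
    # Step a up to df / 2, adding one recurrence term each time.
    while a < df / 2.0:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q


# Test statistic and degrees of freedom reported in the output above.
p_value = chi2_sf(121.899, 21)
print("p-value below 0.0001:", p_value < 0.0001)
```

The same function can be applied to any \(L'\) and degrees of freedom obtained from Box's Test.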

Notes

One should be aware that, even though Hotelling's T-square test is robust to violations of the assumption of multivariate normality, the results of Box's Test are not robust to normality violations. Box's Test should not be used if there is any indication that the data are not multivariate normally distributed.

In general, the two-sample Hotelling's T-square test is sensitive to violations of the assumption of homogeneity of variance-covariance matrices. This is especially the case when the sample sizes are unequal, i.e., \(n_1 \ne n_2\). If the sample sizes are equal, the test tends to be much less sensitive, and the ordinary two-sample Hotelling's T-square test can be used as usual.