7.2.6  Model Assumptions and Diagnostics
In carrying out any statistical analysis it is always important to consider the assumptions for the analysis and confirm that all assumptions are satisfied.
Let's recall the four assumptions underlying Hotelling's T-square test.
 The data from population i is sampled from a population with mean vector \(\boldsymbol{\mu}_{i}\).
 The data from both populations have a common variance-covariance matrix \(\Sigma\).
 Independence. The subjects from both populations are independently sampled.
Note! This does not mean that the variables are independent of one another.
 Normality. Both populations are multivariate normally distributed.
The following considers each of these assumptions separately, along with methods for diagnosing its validity.

Assumption 1: The data from population i is sampled from a population with mean vector \(\boldsymbol{\mu}_{i}\).
 This assumption essentially means that there are no subpopulations with different population mean vectors.
 In our current example, this might be violated if the counterfeit notes were produced by more than one counterfeiter.
 Generally, if you have randomized experiments, this assumption is not of any concern. However, in the current application, we would have to ask the police investigators whether more than one counterfeiter might be present.

Assumption 2: For now we will skip Assumption 2 and return to it at a later time.

Assumption 3: Independence
 This assumption says that the subjects from each population were independently sampled. It does not mean that the variables are independent of one another.
 This assumption may be violated for three different reasons:
 Clustered data: If bank notes are produced in batches, then the data may be clustered. In this case, the notes sampled within a batch may be correlated with one another.
 Time-series data: If the notes are produced in some order over time, there might be temporal correlation between them. Notes produced at times close to one another may be more similar, and this temporal correlation would violate the assumptions of the analysis.
 Spatial data: If the data were collected over space, we may encounter some spatial correlation.
Note! The results of Hotelling's T-square test are not generally robust to violations of independence.

Assumption 4: Multivariate Normality
To assess this assumption we can produce the following diagnostic procedures:
 Produce histograms for each variable. We should look for a symmetric distribution.
 Produce scatter plots for each pair of variables. Under multivariate normality, we should see an elliptical cloud of points.
 Produce a three-dimensional rotating scatter plot. Again, we should see an elliptical cloud of points.
So, in general, Hotelling's T-square is not going to be sensitive to violations of this assumption.
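As a numerical companion to the graphical checks above (this check is our own addition, not part of the original procedure), one can compare the squared Mahalanobis distances of the observations with chi-square quantiles: under multivariate normality, these squared distances are approximately chi-square distributed with p degrees of freedom, so their sorted values should track the theoretical quantiles closely. A minimal sketch, using simulated data as a stand-in for a real data set:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated stand-in for a data set with n observations on p variables.
n, p = 200, 3
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)

# Squared Mahalanobis distances of each observation from the sample mean.
xbar = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
D = X - xbar
d2 = np.einsum('ij,jk,ik->i', D, S_inv, D)

# Under multivariate normality, d2 is approximately chi-square(p).
# Compare sorted distances against the corresponding chi-square quantiles;
# a near-linear relationship (correlation close to 1) supports normality.
probs = (np.arange(1, n + 1) - 0.5) / n
q_chi2 = stats.chi2.ppf(probs, df=p)
r = np.corrcoef(np.sort(d2), q_chi2)[0, 1]
print(round(r, 3))
```

Plotting the sorted distances against the chi-square quantiles gives the familiar chi-square Q-Q plot; marked curvature or outlying points suggests a departure from multivariate normality.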
Now let us return to Assumption 2.
Assumption 2: The data from both populations have a common variance-covariance matrix \(\Sigma\).
This assumption may be assessed using Box's Test.
Box's Test
Suppose that the data from population i have variance-covariance matrix \(\Sigma_i\), for population i = 1, 2. We need to test the null hypothesis that \(\Sigma_1\) is equal to \(\Sigma_2\) against the general alternative that they are not equal:
\(H_0\colon \Sigma_1 = \Sigma_2\) against \(H_a\colon \Sigma_1 \ne \Sigma_2\)
Here, the alternative is that the variancecovariance matrices differ in at least one of their elements.
The test statistic for Box's Test, \(L'\), is given below:
\(L' = c\{(n_1+n_2-2)\log{|\mathbf{S}_p|} - (n_1-1)\log{|\mathbf{S}_1|} - (n_2-1)\log{|\mathbf{S}_2|}\}\)
This involves a finite population correction factor, c, which is given below:
\(c = 1-\dfrac{2p^2+3p-1}{6(p+1)}\left\{\dfrac{1}{n_1-1}+\dfrac{1}{n_2-1} - \dfrac{1}{n_1+n_2-2}\right\}\)
It is a function of the number of variables p, and the sample sizes \(n_{1}\) and \(n_{2}\).
Under the null hypothesis, \(H_{0}\colon \Sigma_{1} = \Sigma_{2}\), Box's test statistic is approximately chi-square distributed with p(p + 1)/2 degrees of freedom. That is,
\(L' \overset{\cdot}{\sim} \chi^2_{\dfrac{p(p+1)}{2}}\)
The degrees of freedom is equal to the number of unique elements in the variance-covariance matrix (taking into account that this matrix is symmetric). We will reject \(H_0\) at level \(\alpha\) if the test statistic exceeds the critical value from the chi-square table evaluated at level \(\alpha\):
\(L' > \chi^2_{\dfrac{p(p+1)}{2}, \alpha}\)
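The statistic above is straightforward to compute directly from two sample covariance matrices. The following Python sketch (the function name and variable names are our own, not from the original notes) implements the pooled covariance matrix, the correction factor c, the statistic \(L'\), and the chi-square p-value:

```python
import numpy as np
from scipy import stats

def box_test(S1, S2, n1, n2):
    """Box's test of H0: Sigma_1 = Sigma_2, given the two sample
    covariance matrices S1, S2 and sample sizes n1, n2."""
    p = S1.shape[0]
    # Pooled variance-covariance matrix.
    Sp = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)
    # Finite population correction factor c.
    c = 1 - (2 * p**2 + 3 * p - 1) / (6 * (p + 1)) * (
        1 / (n1 - 1) + 1 / (n2 - 1) - 1 / (n1 + n2 - 2))
    # Natural logs of the determinants (slogdet is numerically stable).
    logdet = lambda S: np.linalg.slogdet(S)[1]
    L = c * ((n1 + n2 - 2) * logdet(Sp)
             - (n1 - 1) * logdet(S1) - (n2 - 1) * logdet(S2))
    df = p * (p + 1) // 2
    p_value = stats.chi2.sf(L, df)
    return L, df, p_value

# Sanity check: identical covariance matrices give L' = 0 and p-value = 1.
L, df, pv = box_test(np.eye(2), np.eye(2), 100, 100)
print(L, df, pv)   # 0.0 3 1.0
```

In practice S1 and S2 would be the sample variance-covariance matrices of the real and counterfeit notes; a large \(L'\) relative to the chi-square critical value leads to rejection of the common-covariance assumption.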
Box's Test may be carried out using the SAS program shown below:
Download the SAS program here: swiss15.sas
options ls=78;
title "Bartlett's Test - Swiss Bank Notes";
data swiss;
  infile "D:\Statistics\STAT 505\data\swiss3.csv" firstobs=2 delimiter=',';
  input type $ length left right bottom top diag;
run;
/* The discrim procedure is called with the pool=test
 * option to produce Bartlett's test for equal
 * covariance matrices. The remaining parts of the
 * output are not used.
 * The class statement defines the grouping variable,
 * which is the type of note, and all response
 * variables are specified in the var statement.
 */
proc discrim data=swiss pool=test;
  class type;
  var length left right bottom top diag;
run;
The output can be downloaded here: swiss15.lst
At this time Minitab does not support this procedure.
Analysis
Under the null hypothesis that the variance-covariance matrices for the two populations are equal, the natural logs of the determinants of the variance-covariance matrices should be approximately the same for the fake and the real notes.
The results of Box's Test appear at the bottom of page two of the output. The test statistic is 121.90 with 21 degrees of freedom; recall that p = 6. The p-value for the test is less than 0.0001, indicating that we reject the null hypothesis.
The conclusion here is that the two populations of bank notes have different variance-covariance matrices in at least one of their elements. This is supported by the test statistic \(\left( L' = 121.899; \mathrm{d.f.} = 21; p < 0.0001 \right)\). Therefore, the assumption of homogeneous variance-covariance matrices is violated.
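The reported p-value can be verified directly from the test statistic and degrees of freedom given in the output, by evaluating the chi-square tail probability:

```python
from scipy import stats

# Box's test statistic and degrees of freedom reported in the output.
L_prime = 121.899
df = 21

# Upper-tail probability of a chi-square(21) beyond the observed statistic;
# consistent with the reported p < 0.0001.
p_value = stats.chi2.sf(L_prime, df)
print(p_value < 0.0001)   # True
```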
Notes
One should be aware that, although Hotelling's T-square test is robust to violations of the assumption of multivariate normality, the results of Box's test are not. Box's Test should not be used if there is any indication that the data are not multivariate normally distributed.
In general, the two-sample Hotelling's T-square test is sensitive to violations of the assumption of homogeneity of variance-covariance matrices; this is especially the case when the sample sizes are unequal, i.e., \(n_{1} \ne n_{2}\). If the sample sizes are equal, there does not tend to be much sensitivity, and the ordinary two-sample Hotelling's T-square test can be used as usual.