7.1.15 - The Two-Sample Hotelling's T-Square Test Statistic

Now we are ready to define the two-sample Hotelling's T-square test statistic. As the expression below shows, it involves the difference between the two sample mean vectors. It also involves the pooled variance-covariance matrix multiplied by the sum of the inverses of the sample sizes; the resulting matrix is then inverted.

\(T^2 = \mathbf{(\bar{x}_1 - \bar{x}_2)}^T\{\mathbf{S}_p(\frac{1}{n_1}+\frac{1}{n_2})\}^{-1} \mathbf{(\bar{x}_1 - \bar{x}_2)}\)

For large samples, this test statistic is approximately chi-square distributed with \(p\) degrees of freedom. However, as before, this approximation does not account for the variation due to estimating the variance-covariance matrix. So, as before, we transform Hotelling's T-square statistic into an F-statistic using the following expression.

Note! This is a function of the sample sizes of the two populations and the number of variables measured p.

\(F = \dfrac{n_1+n_2-p-1}{p(n_1+n_2-2)}T^2 \sim F_{p, n_1+n_2-p-1}\)

Under the null hypothesis, \(H_{o}\colon \mu_{1} = \mu_{2}\), this F-statistic will be F-distributed with \(p\) and \(n_{1} + n_{2} - p - 1\) degrees of freedom. We reject \(H_{o}\) at level \(\alpha\) if it exceeds the critical value from the F-table evaluated at \(\alpha\).

\(F > F_{p, n_1+n_2-p-1, \alpha}\)
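The definitions above can be sketched in a few lines of code. What follows is a minimal, standard-library Python sketch, not the course's SAS program: the function names (`mean_vector`, `covariance`, `solve`, `hotelling_two_sample`) are illustrative assumptions, and the two small data sets are made up purely so that the arithmetic can be checked by hand. The p-value is omitted because the F distribution is not available in the Python standard library.

```python
def mean_vector(rows):
    """Column means of a list of observations (one list per observation)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def covariance(rows):
    """Sample variance-covariance matrix with divisor n - 1."""
    n = len(rows)
    m = mean_vector(rows)
    p = len(m)
    return [[sum((r[i] - m[i]) * (r[j] - m[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.
    Assumes a is square and nonsingular (no check is made)."""
    n = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def hotelling_two_sample(x1, x2):
    """Return (T2, F, df1, df2) for the pooled two-sample Hotelling test."""
    n1, n2, p = len(x1), len(x2), len(x1[0])
    s1, s2 = covariance(x1), covariance(x2)
    scale = 1.0 / n1 + 1.0 / n2
    # Pooled covariance matrix times (1/n1 + 1/n2)
    m = [[((n1 - 1) * s1[i][j] + (n2 - 1) * s2[i][j]) / (n1 + n2 - 2) * scale
          for j in range(p)] for i in range(p)]
    d = [a - b for a, b in zip(mean_vector(x1), mean_vector(x2))]
    # d' M^{-1} d, computed by solving M w = d instead of inverting M
    t2 = sum(di * wi for di, wi in zip(d, solve(m, d)))
    df1, df2 = p, n1 + n2 - p - 1
    f = df2 * t2 / (p * (n1 + n2 - 2))
    return t2, f, df1, df2

# Made-up toy data (p = 2 variables, 4 observations per group)
group1 = [[0, 0], [2, 0], [0, 2], [2, 2]]
group2 = [[3, 1], [5, 1], [3, 3], [5, 3]]
t2, f, df1, df2 = hotelling_two_sample(group1, group2)
print(round(t2, 6), round(f, 6), df1, df2)  # prints 15.0 6.25 2 5
```

Solving the linear system rather than forming the inverse explicitly is a design choice for numerical stability; it gives the same quadratic form as the formula for \(T^2\).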

Example 7-13: Swiss Banknotes (Two-Sample Hotelling's)

The two-sample Hotelling's \(T^{2}\) test can be carried out on the Swiss Banknotes data using the SAS program shown below:

Data file:  swiss3.csv

Download the SAS Program: swiss10.sas

Download the output: swiss10.lst.

Explore the code below to see how to compute the Two Sample Hotelling's \(T^2\) using the SAS statistical software application.

 


options ls=78;
title "2-Sample Hotellings T2 - Swiss Bank Notes";

data swiss;
  infile "D:\Statistics\STAT 505\data\swiss3.csv" firstobs=2 delimiter=',';
  input type $ length left right bottom top diag;
  run;

 /* The iml code below defines and executes the 'hotel2' module 
  * for calculating the two-sample Hotelling T2 test statistic. 
  * The commands between 'start' and 'finish' define the 
  * calculations of the module for two input vectors 'x1' and 'x2',
  * which have the same variables but correspond to two separate groups.
  * The 'use' statement makes the 'swiss' data set available, from 
  * which all the variables are taken. The variables are then read 
  * separately into the vectors 'x1' and 'x2' for each group, and
  * finally the 'hotel2' module is called.
  */

proc iml;
  start hotel2;
    n1=nrow(x1);
    n2=nrow(x2);
    k=ncol(x1);
    one1=j(n1,1,1);
    one2=j(n2,1,1);
    ident1=i(n1);
    ident2=i(n2);
    ybar1=x1`*one1/n1;
    s1=x1`*(ident1-one1*one1`/n1)*x1/(n1-1.0);
    print n1 ybar1;
    print s1;
    ybar2=x2`*one2/n2;
    s2=x2`*(ident2-one2*one2`/n2)*x2/(n2-1.0);
    print n2 ybar2;
    print s2;
    spool=((n1-1.0)*s1+(n2-1.0)*s2)/(n1+n2-2.0);
    print spool;
    t2=(ybar1-ybar2)`*inv(spool*(1/n1+1/n2))*(ybar1-ybar2);
    f=(n1+n2-k-1)*t2/k/(n1+n2-2);
    df1=k;
    df2=n1+n2-k-1;
    p=1-probf(f,df1,df2);
    print t2 f df1 df2 p;
  finish;
  use swiss;
    read all var{length left right bottom top diag} where (type="real") into x1;
    read all var{length left right bottom top diag} where (type="fake") into x2;
  run hotel2;
  

At the top of the first output page, you see that N1 is equal to 100, indicating that we have 100 banknotes in the first sample: in this case, the 100 real (genuine) notes.

Computing the 2-sample Hotelling's T2

To compute the two-sample (pooled variance) Hotelling’s T2 statistic:

  1. Open the ‘swiss3’ data set in a new worksheet.
  2. Rename the columns type, length, left, right, bottom, top, and diag.
  3. Stat > ANOVA > General MANOVA
    1. Highlight and select all six response variables (length through diag) to move them to the Responses window.
    2. Highlight and select 'type' to move it to the Model window.
    3. Choose 'OK'. The corresponding F-statistic 391.92 is shown in the results area.
  4. To calculate the specific value of the T2 statistic:
    1. Name new columns in the worksheet p, n1, n2, and F, and enter the values corresponding to this example (6, 100, 100, and 391.92) in the first row of each column.
    2. Calc > Calculator
    3. Enter C12 (or any new column name) for the 'Store result' window.
    4. In the Expression window, enter ‘p’ * (‘n1’ + ‘n2’ - 2) * ‘F’ / (‘n1’ + ‘n2’ - ‘p’ - 1).
    5. Choose 'OK'. The T2 value 2412.44 is reported in the new worksheet column. (It differs in the last digit from the 2412.45 reported in the analysis because it is computed from the rounded F value.)
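The calculator expression in the last step simply inverts the F transformation to recover \(T^2\) from the reported F-statistic. As a quick cross-check outside the statistics package, the same arithmetic can be sketched in Python; the function name `t2_from_f` is an illustrative assumption, and the inputs are the values quoted in the steps above.

```python
def t2_from_f(f_stat, n1, n2, p):
    """Invert F = (n1 + n2 - p - 1) / (p (n1 + n2 - 2)) * T2 to recover T2."""
    return p * (n1 + n2 - 2) * f_stat / (n1 + n2 - p - 1)

# Values from the banknote example: F = 391.92, n1 = n2 = 100, p = 6
t2_value = t2_from_f(391.92, 100, 100, 6)
print(round(t2_value, 2))  # prints 2412.44
```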

Analysis

The sample mean vectors are copied into the table below:

Variable         Genuine mean   Counterfeit mean
Length           214.969        214.823
Left Width       129.943        130.300
Right Width      129.720        130.193
Bottom Margin      8.305         10.530
Top Margin        10.168         11.133
Diagonal         141.517        139.450
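The test statistic is built from the difference between these two mean vectors. As a small illustration, the differences (genuine minus counterfeit) can be tabulated in Python from the rounded means above; the dictionary layout and variable names are assumptions made for this sketch.

```python
# (genuine mean, counterfeit mean), copied from the table of sample means
means = {
    "Length":        (214.969, 214.823),
    "Left Width":    (129.943, 130.300),
    "Right Width":   (129.720, 130.193),
    "Bottom Margin": (8.305, 10.530),
    "Top Margin":    (10.168, 11.133),
    "Diagonal":      (141.517, 139.450),
}

# Difference of sample means, genuine minus counterfeit, to 3 decimals
diffs = {name: round(g - c, 3) for name, (g, c) in means.items()}
for name, d in diffs.items():
    print(f"{name:14s} {d:+.3f}")
```

The largest raw differences are in the bottom margin and the diagonal, but raw differences ignore the variances and covariances, which is why the test weights them by the inverse of the pooled variance-covariance matrix.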

The sample variance-covariance matrix for the real or genuine notes appears below:

\(S_1 = \left(\begin{array}{rrrrrr}0.150& 0.058& 0.057 &0.057&0.014&0.005\\0.058&0.133&0.086&0.057&0.049&-0.043\\0.057&0.086&0.126&0.058&0.031&-0.024\\0.057&0.057&0.058&0.413&-0.263&-0.000\\0.014&0.049&0.031&-0.263&0.421&-0.075\\0.005&-0.043&-0.024&-0.000&-0.075&0.200\end{array}\right)\)

The sample variance-covariance matrix for the second sample of notes, the counterfeit notes, is given below:

\(S_2 = \left(\begin{array}{rrrrrr}0.124&0.032&0.024&-0.101&0.019&0.012\\0.032&0.065&0.047&-0.024&-0.012&-0.005\\0.024&0.047&0.089&-0.019&0.000&0.034\\-0.101&-0.024&-0.019&1.281&-0.490&0.238\\ 0.019&-0.012&0.000&-0.490&0.404&-0.022\\0.012&-0.005&0.034&0.238&-0.022&0.311\end{array}\right)\)

This is followed by the pooled variance-covariance matrix for the two samples.

\(S_p = \left(\begin{array}{rrrrrr}0.137&0.045&0.041&-0.022&0.017&0.009\\0.045&0.099&0.066&0.016&0.019&-0.024\\0.041&0.066&0.108&0.020&0.015&0.005\\-0.022&0.016&0.020&0.847&-0.377&0.119\\0.017&0.019&0.015&-0.377&0.413&-0.049\\0.009&-0.024&0.005&0.119&-0.049&0.256\end{array}\right)\)
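The pooled matrix is the weighted average \(\{(n_1-1)S_1 + (n_2-1)S_2\}/(n_1+n_2-2)\), matching the `spool` line in the SAS module. The Python sketch below recomputes it from the rounded \(S_1\) and \(S_2\) printed above and compares it with the printed \(S_p\). The function and variable names are illustrative assumptions, and because the inputs are rounded to three decimals, agreement is only expected to within rounding.

```python
S1 = [[0.150, 0.058, 0.057, 0.057, 0.014, 0.005],
      [0.058, 0.133, 0.086, 0.057, 0.049, -0.043],
      [0.057, 0.086, 0.126, 0.058, 0.031, -0.024],
      [0.057, 0.057, 0.058, 0.413, -0.263, -0.000],
      [0.014, 0.049, 0.031, -0.263, 0.421, -0.075],
      [0.005, -0.043, -0.024, -0.000, -0.075, 0.200]]

S2 = [[0.124, 0.032, 0.024, -0.101, 0.019, 0.012],
      [0.032, 0.065, 0.047, -0.024, -0.012, -0.005],
      [0.024, 0.047, 0.089, -0.019, 0.000, 0.034],
      [-0.101, -0.024, -0.019, 1.281, -0.490, 0.238],
      [0.019, -0.012, 0.000, -0.490, 0.404, -0.022],
      [0.012, -0.005, 0.034, 0.238, -0.022, 0.311]]

SP_REPORTED = [[0.137, 0.045, 0.041, -0.022, 0.017, 0.009],
               [0.045, 0.099, 0.066, 0.016, 0.019, -0.024],
               [0.041, 0.066, 0.108, 0.020, 0.015, 0.005],
               [-0.022, 0.016, 0.020, 0.847, -0.377, 0.119],
               [0.017, 0.019, 0.015, -0.377, 0.413, -0.049],
               [0.009, -0.024, 0.005, 0.119, -0.049, 0.256]]

def pooled_covariance(s1, s2, n1, n2):
    """Weighted average of two covariance matrices, weights n1-1 and n2-1."""
    w1, w2 = n1 - 1, n2 - 1
    return [[(w1 * a + w2 * b) / (w1 + w2) for a, b in zip(r1, r2)]
            for r1, r2 in zip(s1, s2)]

sp = pooled_covariance(S1, S2, 100, 100)

# Largest entrywise gap between the recomputed and the printed matrix
max_gap = max(abs(a - b)
              for row_a, row_b in zip(sp, SP_REPORTED)
              for a, b in zip(row_a, row_b))
print(max_gap < 0.001)  # prints True
```

With equal sample sizes the weights are equal, so each pooled entry is just the simple average of the corresponding entries of \(S_1\) and \(S_2\).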

The two-sample Hotelling's \(T^{2}\) statistic is 2412.45. The F-value is about 391.92 with 6 and 193 degrees of freedom.  The p-value is close to 0 and so we will write this as \(< 0.0001\).
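As a check on these figures, substituting \(n_1 = n_2 = 100\), \(p = 6\), and \(T^2 = 2412.45\) into the F transformation gives:

\(F = \dfrac{100+100-6-1}{6(100+100-2)} \times 2412.45 = \dfrac{193}{1188} \times 2412.45 \approx 391.92\)

with \(p = 6\) and \(n_1 + n_2 - p - 1 = 193\) degrees of freedom.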

In this case, we can reject the null hypothesis that the mean vector for the counterfeit notes equals the mean vector for the genuine notes, reporting the evidence as usual: (\(T^{2} = 2412.45\); \(F = 391.92\); \(d. f. = 6, 193\); \(p< 0.0001\))

 

Conclusion

The counterfeit notes can be distinguished from the genuine notes on at least one of the measurements.

After concluding that the counterfeit notes can be distinguished from the genuine notes, the next step in our analysis is to determine on which variables they differ.

