7.1.14 - The Multivariate Case

In this case we replace the random variables \(X_{ij}\), for the \(j^{th}\) sample unit from the \(i^{th}\) population, with random vectors \(\mathbf{X}_{ij}\). Each vector contains the observations on the \(p\) variables for that unit.

In our notation, we will have our two populations:

  • The data from Population 1 will be denoted: \(\mathbf{X}_{11}, \mathbf{X}_{12}, \ldots, \mathbf{X}_{1n_1}\)
  • The data from Population 2 will be denoted: \(\mathbf{X}_{21}, \mathbf{X}_{22}, \ldots, \mathbf{X}_{2n_2}\)

Here the vector \(\mathbf{X}_{ij}\) contains all of the data, across the \(p\) variables, for sample unit \(j\) from population \(i\):

\(\mathbf{X}_{ij} = \left(\begin{array}{c}X_{ij1}\\X_{ij2}\\\vdots\\X_{ijp}\end{array}\right)\)

This vector contains elements \(X_{ijk}\) where k runs from 1 to p, for p different observed variables. So, \(X_{ijk}\) is the observation for variable k of subject j from population i.
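As a concrete illustration of this layout, the data for one population can be stored as an \(n_i \times p\) array whose \(j^{th}\) row is the transposed vector \(\mathbf{X}_{ij}'\). A minimal sketch with made-up numbers (the array name and values are our own, purely for illustration):

```python
import numpy as np

# Hypothetical data for population 1: n_1 = 3 sample units, p = 2 variables.
# Row j holds the transposed vector X_{1j}' = (X_{1j1}, X_{1j2}).
X1 = np.array([[2.0, 5.0],
               [3.0, 4.0],
               [4.0, 6.0]])

# The vector X_{12}: all p observations for sample unit j = 2.
x_12 = X1[1]

# The scalar X_{121}: variable k = 1 for sample unit j = 2.
x_121 = X1[1, 0]
```

With this convention, indexing a row recovers a whole observation vector, and indexing a row and column recovers a single \(X_{ijk}\).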

The assumptions here will be analogous to the assumptions in the univariate setting.

Assumptions

  1. The data from population \(i\) are sampled from a population with mean vector \(\boldsymbol{\mu}_i\). Again, this corresponds to the assumption that there are no sub-populations.
  2. Instead of assuming homoskedasticity, we now assume that the data from both populations share a common variance-covariance matrix \(\Sigma\).
  3. Independence. The subjects from both populations are independently sampled.
  4. Normality. Both populations are normally distributed.

Consider testing the null hypothesis that the two populations have identical population mean vectors. This is represented below, along with the general alternative that the mean vectors are not equal.

\(H_0\colon \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2\) against \(H_a\colon \boldsymbol{\mu}_1 \ne \boldsymbol{\mu}_2\)

So here what we are testing is:

\(H_0\colon \left(\begin{array}{c}\mu_{11}\\\mu_{12}\\\vdots\\\mu_{1p}\end{array}\right) = \left(\begin{array}{c}\mu_{21}\\\mu_{22}\\\vdots\\\mu_{2p}\end{array}\right)\) against \(H_a\colon \left(\begin{array}{c}\mu_{11}\\\mu_{12}\\\vdots\\\mu_{1p}\end{array}\right) \ne \left(\begin{array}{c}\mu_{21}\\\mu_{22}\\\vdots\\\mu_{2p}\end{array}\right)\)

Or, in other words...

\(H_0\colon \mu_{11}=\mu_{21}\) and \(\mu_{12}=\mu_{22}\) and \(\dots\) and \(\mu_{1p}=\mu_{2p}\)

The null hypothesis is satisfied if and only if the population means are identical for all of the variables.

The alternative is that at least one pair of these means is different. This is expressed below:

\(H_a\colon \mu_{1k}\ne \mu_{2k}\) for at least one \( k \in \{1,2,\dots, p\}\)

To carry out the test, for each population \(i\) we define the sample mean vector, calculated the same way as before, using data only from the \(i^{th}\) population:

\(\mathbf{\bar{x}}_i = \dfrac{1}{n_i}\sum_{j=1}^{n_i}\mathbf{X}_{ij}\)
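This is just an element-wise average of the \(n_i\) observation vectors, so it can be sketched in NumPy as a column mean of the data array (hypothetical data; the names are our own):

```python
import numpy as np

# Hypothetical sample from population i: n_i = 4 units, p = 2 variables.
X = np.array([[2.0, 5.0],
              [3.0, 4.0],
              [4.0, 6.0],
              [3.0, 5.0]])

# Sample mean vector: average the n_i observation vectors,
# giving one sample mean per variable.
xbar = X.mean(axis=0)
```

The result is a length-\(p\) vector whose \(k^{th}\) entry is the ordinary univariate sample mean of variable \(k\).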

Similarly, using data only from the \(i^{th}\) population, we will define the sample variance-covariance matrices:

\(\mathbf{S}_i = \dfrac{1}{n_i-1}\sum_{j=1}^{n_i}\mathbf{(X_{ij}-\bar{x}_i)(X_{ij}-\bar{x}_i)'}\)
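The sum of outer products in this formula can be computed directly, and agrees with NumPy's built-in unbiased covariance estimate. A minimal sketch with hypothetical data:

```python
import numpy as np

# Hypothetical sample from population i: n_i = 4 units, p = 2 variables.
X = np.array([[2.0, 5.0],
              [3.0, 4.0],
              [4.0, 6.0],
              [3.0, 5.0]])
n = X.shape[0]
xbar = X.mean(axis=0)

# Sum of outer products (X_ij - xbar)(X_ij - xbar)', divided by n_i - 1.
dev = X - xbar
S = dev.T @ dev / (n - 1)

# np.cov with rowvar=False (variables in columns) gives the same estimate.
S_check = np.cov(X, rowvar=False)
```

The diagonal of \(\mathbf{S}_i\) holds the sample variances of the \(p\) variables; the off-diagonal entries hold the sample covariances.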

Under our assumption of homogeneous variance-covariance matrices, both \(\mathbf{S}_1\) and \(\mathbf{S}_2\) are estimators of the common variance-covariance matrix \(\Sigma\). A better estimate can be obtained by pooling the two estimates using the expression below:

\(\mathbf{S}_p = \dfrac{(n_1-1)\mathbf{S}_1+(n_2-1)\mathbf{S}_2}{n_1+n_2-2}\)

Again, each sample variance-covariance matrix is weighted by the sample size minus 1.
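The pooling step can be sketched as a degrees-of-freedom-weighted average of the two sample covariance matrices (hypothetical data; names are our own):

```python
import numpy as np

# Hypothetical samples from two populations, p = 2 variables each.
X1 = np.array([[2.0, 5.0], [3.0, 4.0], [4.0, 6.0]])
X2 = np.array([[1.0, 3.0], [2.0, 4.0], [3.0, 5.0], [2.0, 4.0]])
n1, n2 = len(X1), len(X2)

# Each S_i uses divisor n_i - 1 (unbiased estimate).
S1 = np.cov(X1, rowvar=False)
S2 = np.cov(X2, rowvar=False)

# Pool: weight each S_i by its degrees of freedom, n_i - 1.
S_p = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)
```

Note that the denominator \(n_1 + n_2 - 2\) is the sum of the two weights, so \(\mathbf{S}_p\) reduces to \(\mathbf{S}_1\) or \(\mathbf{S}_2\) when the other sample contributes nothing.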

