7.1.13 - The Univariate Case

Suppose we have data from a single variable from two populations:

  • The data will be denoted in Population 1 as: \(X_{11}\),\(X_{12}\), ... , \(X_{1n_{1}}\)
  • The data will be denoted in Population 2 as: \(X_{21}\), \(X_{22}\), ... , \(X_{2n_{2}}\)

For both populations, the first subscript denotes which population the observation is from. The second subscript denotes which observation we are looking at within each population.

Here we will make the standard assumptions:

  1. The data from population \(i\) are sampled from a population with mean \(\mu_{i}\). This assumption simply means that each population is homogeneous, with no sub-populations having different means.
  2. Homoskedasticity: The data from both populations have common variance \(\sigma^{2}\).
  3. Independence: The subjects from both populations are independently sampled.
  4. Normality: The data from both populations are normally distributed.

Here we are going to consider testing the null hypothesis that the populations have equal means, \(H_0\colon \mu_1 = \mu_2\), against the alternative hypothesis that the means are not equal, \(H_a\colon \mu_1 \ne \mu_2\).

We shall define the sample means for each population using the following expression:

\(\bar{x}_i = \dfrac{1}{n_i}\sum_{j=1}^{n_i}X_{ij}\)

We will let \(s^2_i\) denote the sample variance for the \(i^{th}\) population, again calculating this using the usual formula below:

\(s^2_i = \dfrac{\sum_{j=1}^{n_i}X^2_{ij}-(\sum_{j=1}^{n_i}X_{ij})^2/n_i}{n_i-1}\)
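As a quick check, the computational (shortcut) formula above gives the same result as the definitional formula based on squared deviations from the mean. A minimal sketch in Python (the data are illustrative):

```python
# Compare the shortcut formula for the sample variance with the
# definitional formula. The data below are made up for illustration.
x = [4.2, 5.1, 3.8, 4.9, 5.5]
n = len(x)

sum_x = sum(x)                        # sum of the observations
sum_x2 = sum(v * v for v in x)        # sum of the squared observations

# Shortcut formula: (sum of squares - (sum)^2 / n) / (n - 1)
s2_shortcut = (sum_x2 - sum_x**2 / n) / (n - 1)

# Definitional formula: sum of squared deviations from the mean / (n - 1)
xbar = sum_x / n
s2_definitional = sum((v - xbar) ** 2 for v in x) / (n - 1)
```

Both expressions yield the same sample variance; the shortcut form is convenient for hand calculation since it only needs the running sums of \(X_{ij}\) and \(X^2_{ij}\).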

Assuming homoskedasticity, both of these sample variances, \(s^2_1\) and \(s^2_2\), are estimates of the common variance \(\sigma^{2}\). A better estimate can be obtained, however, by pooling these two different estimates, yielding the pooled variance as given in the expression below:

\(s^2_p = \dfrac{(n_1-1)s^2_1+(n_2-1)s^2_2}{n_1+n_2-2}\)
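The pooled variance is just a weighted average of the two sample variances, with weights given by the degrees of freedom \(n_i - 1\). A short Python sketch, using illustrative data:

```python
# Pooled variance from two samples, assuming equal population variances.
# The data below are made up for illustration.
x1 = [4.2, 5.1, 3.8, 4.9, 5.5]
x2 = [6.0, 5.4, 6.3, 5.8]

def sample_var(x):
    """Usual sample variance with n - 1 in the denominator."""
    n = len(x)
    xbar = sum(x) / n
    return sum((v - xbar) ** 2 for v in x) / (n - 1)

n1, n2 = len(x1), len(x2)

# Degrees-of-freedom-weighted average of the two sample variances
s2_p = ((n1 - 1) * sample_var(x1) + (n2 - 1) * sample_var(x2)) / (n1 + n2 - 2)
```

Because it is a weighted average, \(s^2_p\) always lies between \(s^2_1\) and \(s^2_2\), closer to the variance of the larger sample.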

Our test statistic is Student's t-statistic, which is calculated by dividing the difference in the sample means by the standard error of that difference. Here the standard error of that difference is given by the square root of the pooled variance times the sum of the inverses of the sample sizes, as shown below:

\(t = \dfrac{\bar{x}_1-\bar{x}_2}{\sqrt{s^2_p\left(\dfrac{1}{n_1}+\dfrac{1}{n_2}\right)}} \sim t_{n_1+n_2-2}\)

Under the null hypothesis \(H_{0}\) of equality of the population means, this test statistic is t-distributed with \(n_{1} + n_{2} - 2\) degrees of freedom.

We will reject \(H_{0}\) at level \(\alpha\) if the absolute value of this test statistic exceeds the critical value from the t-table with \(n_{1} + n_{2} - 2\) degrees of freedom evaluated at \(\alpha/2\):

\(|t| > t_{n_1+n_2-2, \alpha/2}\)
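Putting the steps together, the whole procedure can be sketched in Python (the data and \(\alpha = 0.05\) below are illustrative; SciPy is used only to look up the critical value):

```python
# Two-sample pooled t-test, following the steps above.
# The data are made up for illustration.
from scipy import stats

x1 = [4.2, 5.1, 3.8, 4.9, 5.5]
x2 = [6.0, 5.4, 6.3, 5.8]
n1, n2 = len(x1), len(x2)

# Sample means and sample variances
xbar1 = sum(x1) / n1
xbar2 = sum(x2) / n2
s2_1 = sum((v - xbar1) ** 2 for v in x1) / (n1 - 1)
s2_2 = sum((v - xbar2) ** 2 for v in x2) / (n2 - 1)

# Pooled variance and t-statistic
s2_p = ((n1 - 1) * s2_1 + (n2 - 1) * s2_2) / (n1 + n2 - 2)
t = (xbar1 - xbar2) / (s2_p * (1 / n1 + 1 / n2)) ** 0.5

# Reject H0 at level alpha if |t| exceeds the two-sided critical value
alpha = 0.05
df = n1 + n2 - 2
t_crit = stats.t.ppf(1 - alpha / 2, df)
reject = abs(t) > t_crit
```

The same result can be obtained directly from `scipy.stats.ttest_ind` with `equal_var=True`, which implements exactly this pooled-variance test.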

All of this should be familiar to you from your introductory statistics course.

Next, let's consider the multivariate case...
