7.1.4 - Example: Women’s Survey Data and Associated Confidence Intervals

Example 7-1: Woman's Survey Data (Hotelling's \(T^{2}\) Test) Section

The data are stored in the file that can be downloaded: nutrient.csv

Example
Example

The Hotelling's \(T^{2}\) test is calculated using the SAS program as shown below.

It turns out that SAS does not have any procedures for calculating Hotelling's \(T^{2}\). So, in this case, we are going to have to rely on the simple procedure, which can carry out matrix manipulation. And in this case, it can be used for carrying out Hotelling's \(T^{2}\) test.

Download the SAS Program: nutrient4.sas

To use this program, I recommend copying the program here and then making the necessary changes to use it with your dataset on your homework assignment. Again, only three entries are all we really need to change to use this with any dataset. First the specified value of μ_o. Second, which dataset you want to use, and third, the specification of what variable you want to analyze.

Download the results printed in the output: nutrient4.lst.

Explore the code below to see how to find the Hotelling's \(T^2\) value using the SAS statistical software application.

Note: In the upper right-hand corner of the code block you will have the option of copying () the code to your clipboard or downloading () the file to your computer.

options ls=78;
title "Hotellings T2 - Women's Nutrition Data";

data nutrient;
  infile "D:\Statistics\STAT 505\data\nutrient.csv" firstobs=2 delimiter=',';
  input id calcium iron protein a c;
  run;
  
  /* iml code to compute the hotelling t2 statistic
  * hotel is the name of the module we define here
  * mu0 is the null vector
  * one is a vector of 1s
  * ident is the identity matrix
  * ybar is the vector of sample means
  * s is the sample covariance matrix
  * t2 is the squared statistical distance between ybar and mu0
  * f is the final form of the t2 statistic after scaling
  * to have an f-distribution
  * the module definition is ended with the 'finish' statement
  * use nutrient makes the data set 'nutrient' available
  * the variables from nutrient are input to x and hotel module is called
  */

proc iml;
  start hotel;
    mu0={1000, 15, 60, 800, 75};
    one=j(nrow(x),1,1);
    ident=i(nrow(x));
    ybar=x`*one/nrow(x);
    s=x`*(ident-one*one`/nrow(x))*x/(nrow(x)-1.0);
    print mu0 ybar;
    print s;
    t2=nrow(x)*(ybar-mu0)`*inv(s)*(ybar-mu0);
    f=(nrow(x)-ncol(x))*t2/ncol(x)/(nrow(x)-1);
    df1=ncol(x);
    df2=nrow(x)-ncol(x);
    p=1-probf(f,df1,df2);
    print t2 f df1 df2 p;
  finish;
  use nutrient;
  read all var{calcium iron protein a c} into x;
  run hotel;

Finding the Hotelling's T2 value

To calculate the one-sample Hotelling T2 statistic:

Analysis

The recommended intake and sample mean are given below

Variable	Recommended Intake (\(μ_{o}\))	Mean
Calcium	1000 mg	624.0 mg
Iron	15mg	11.1 mg
Protein	60g	65.8 g
Vitamin A	800 μg	839.6 μg
Vitamin C	75 mg	78.9 mg

as well as the sample variance-covariance matrix:

\(S = \left(\begin{array}{rrrrr}157829.4 & 940.1 & 6075.8 & 102411.1 & 6701.6 \\ 940.1 & 35.8 & 114.1 & 2383.2 & 137.7 \\ 6075.8 & 114.1 & 934.9 & 7330.1 & 477.2 \\ 102411.1 & 2383.2 & 7330.1 & 2668452.4 & 22063.3 \\ 6701.6 & 137.7 & 477.2 & 22063.3 & 5416.3 \end{array}\right)\)

Hotelling’s T-square comes out to be:

\(T^2 = 1758.54\)

The F-statistic is:

\(F = 349.80 > 3.042 = F_{5,732,0.01}\)

For an 0.01 level test, the critical value is approximately 3.02. Because 349.80 is greater than this value, we can reject the null hypothesis that the average dietary intake meets recommendations.

\((T^2 = 1758.54; F = 349.80; d.f. = 5,732; p < 0.0001)\)

The SAS program reports the p-value as 0.00. In this case, the p-values can never equal zero. It is preferable to state that the p-value is less than 0.0001.

Conclusion

For all women between 25 and 50 years old, the average dietary intake does not meet recommended standards. Returning to the table of sample means and recommended dietary intake, it appears that women fail to meet nutritional standards for calcium and iron, and perhaps exceed intakes for protein, vitamin A, and vitamin C.

Such a statement, however, is not entirely backed up by the evidence.

A Question Emerges...

For which nutrients do women fall significantly below recommended nutritional intake levels? Or, conversely, for what nutrients do women fall significantly above recommended nutritional intake levels?

A naive approach to addressing the above is to calculate Confidence Intervals for each of the nutritional intake levels, one at a time, using the univariate method as shown below:

\(\bar{x}_j \pm t_{n-1, \alpha/2} \sqrt{s^2_j/n}\)

If we consider only a single variable, we can say with \((1 - α) × 100\%\) confidence that the interval includes the corresponding population mean.

Example 7-2: Women’s Health Survey: Section

A one-at-a-time 95% confidence interval for calcium is given by the following where values are substituted into the formula and the calculations are worked out as shown below:

\(624.04925 \pm t_{736, 0.025}\sqrt{157829.4/737}\)

\(624.04925 \pm 1.96 \times 14.63390424\)

\(624.04925 \pm 28.68245\)

\((595.3668, 652.7317)\)

The one-at-a-time confidence intervals are summarized in the table below:

Variable	\(μ_{0}\)	95% Confidence Interval
Calcium	1000 mg	595.3, 652.8
Iron	15mg	10.7, 11.6
Protein	60g	63.6, 68.0
Vitamin A	800 μg	721.5, 957.8
Vitamin C	75 mg	73.6, 84.2

Looking at this table, it appears that the average daily intakes of calcium and iron are too low (because the intervals fall below the recommended intakes of these variables), and the average daily intake of protein is too high (because the interval falls above the recommended intake of protein).

Problem: The problem with these one-at-a-time intervals is that they do not control for a family-wide error rate.

Consequence: We are less than 95% confident that all of the intervals simultaneously cover their respective means.

To fix this problem we can calculate a \((1 - \alpha) × 100\%\) Confidence Ellipse for the population mean vector \(\boldsymbol{\mu}\). To calculate this confidence ellipse we must recall that for independent random observations from a multivariate normal distribution with mean vector \(\mu\) and variance-covariance matrix \(Σ\), the F-statistic, (shown below), is F-distributed with p and n-p degrees of freedom:

\(F = n\mathbf{(\bar{x} - \boldsymbol{\mu})'S^{-1}(\bar{x} - \boldsymbol{\mu})}\dfrac{(n-p)}{p(n-1)} \sim F_{p,n-p}\)

This next expression says that the probability that n times the squared Mahalanobis distance between the sample mean vector, \(\boldsymbol{\bar{x}}\), and the population mean vector \(\boldsymbol{\mu}\) is less than or equal to p times n-1 times the critical value from the F-table divided by n-p is equal to \(1 - α\).

\(\text{Pr}\{n\mathbf{(\bar{x} - \boldsymbol{\mu})'S^{-1}(\bar{x} - \boldsymbol{\mu})} \le \dfrac{p(n-1)}{(n-p)}F_{p,n-p,\alpha}\} = 1-\alpha\)

Here the squared Mahalanobis distance between \(\bar{\mathbf{x}}\) and \(\boldsymbol{\mu}\) is being used.

Note! A closely-related equation for a hyper-ellipse is:

\(n\mathbf{(\bar{x} -\boldsymbol{\mu})'S^{-1}(\bar{x} - \boldsymbol{\mu})} = \dfrac{p(n-1)}{(n-p)}F_{p,n-p, \alpha}\)

In particular, this is the \((1 - \alpha) × 100%\) confidence ellipse for the population mean, \(\boldsymbol{\mu}\).

Interpretation

The \((1 - \alpha) × 100\%\) confidence ellipse is very similar to the prediction ellipse that we discussed earlier in our discussion of the multivariate normal distribution. A \((1 - \alpha) × 100\%\) confidence ellipse yields simultaneous \((1 - \alpha) × 100\%\) confidence intervals for all linear combinations of the variable means. Consider linear combinations of population means as below:

\(c_1\mu_1 + c_2\mu_2 + \dots c_p \mu_p = \sum_{j=1}^{p}c_j\mu_j = \mathbf{c'\boldsymbol{\mu}}\)

The simultaneous \((1 - \alpha) × 100\%\) confidence intervals for the above are given by the expression below

\(\sum_{j=1}^{p}c_j\bar{x}_j \pm \sqrt{\frac{p(n-1)}{(n-p)}F_{p, n-p, \alpha}}\sqrt{\frac{1}{n}\sum_{j=1}^{p}\sum_{k=1}^{p}c_jc_ks_{jk}}\)

In terms of interpreting the \((1 - \alpha) × 100\%\) confidence ellipse, we can say that we are \((1 - \alpha) × 100%\) confident that all such confidence intervals cover their respective linear combinations of the treatment means, regardless of what linear combinations we may wish to consider. In particular, we can consider the trivial linear combinations which correspond to the individual variables. So this says that we going to be also \((1 - \alpha) × 100\%\) confident that all of the intervals given in the expression below:

\(\bar{x}_j \pm \sqrt{\frac{p(n-1)}{(n-p)}F_{p, n-p, \alpha}}\sqrt{\frac{s^2_j}{n}}\)

cover their respective treatment population means. These intervals are called simultaneous confidence intervals.

Example 7-3: Women’s Health Survey Section

Simultaneous confidence intervals are computed using hand calculations. For calcium, we substituted the following values: The sample mean was 624.04925. We have p = 5 variables, a total sample size of n = 737, and if we look up in the F-table for 5 and 732 degrees of freedom for alpha = 0.05, the critical value is 2.21. The standard error of the sample mean for calcium is equal to the square root of the sample variance for calcium, 157,829.44, divided by the sample size, 737. The math is carried out to obtain an interval running from 575.27 to approximately 672.83 as shown below:

\(624.04925 \pm \sqrt{\frac{5(737-1)}{737-5}\underset{2.21}{\underbrace{F_{5,737-5,0.05}}}}\sqrt{157829.4/737}\)

\(624.04925 \pm 3.333224 \times 14.63390424\)

\(624.04925 \pm 48.77808\)

\((575.27117, 672.82733)\)

Example
Example

These calculations may also be carried out using the SAS program.

SAS Program can be downloaded: nutrient5.sas

In terms of using this program, there is only a couple of things you need to change: the value of p in line five and what appears in the data step at the top of the page.

Explore the code below to see how to find the Hotelling's \(T^2\) value using the SAS statistical software application.

Note: In the upper right-hand corner of the code block you will have the option of copying () the code to your clipboard or downloading () the file to your computer.

options ls=78;
title "Confidence Intervals - Women's Nutrition Data";

/* %let allows the p variable to be used throughout the code below
  * After reading in the nutrient data, where each variable is
  * originally in its own column, the next statements stack the data
  * so that all variable names are in one column called 'variable',
  * and all response values are in another column called 'x'.
  * This format is used for the calculations that follow.
  */

%let p=5;
data nutrient;
  infile "D:\Statistics\STAT 505\data\nutrient.csv" firstobs=2 delimiter=',';
  input id calcium iron protein a c;
  variable="calcium"; x=calcium; output;
  variable="iron";    x=iron;    output;
  variable="protein"; x=protein; output;
  variable="vit a";   x=a;       output;
  variable="vit c";   x=c;       output;
  keep variable x;
  run;

proc sort;
  by variable;
  run;

 /* The means procedure calculates and saves the sample size,
  * mean, and variance for each variable. It then saves these results 
  * in a new data set 'a' for use in the final step below.
  * /

proc means noprint;
  by variable;
  var x;
  output out=a n=n mean=xbar var=s2;
  run;

 /* The data step here is used to calculate the confidence interval
  * limits from the statistics calculated in the data set 'a'.
  * The values 't1', 'tb', and 'f' are the critical values used in the 
  * one-at-a-time, Bonferroni, and F intervals, respectively.
  * /

data b;
  set a;
  t1=tinv(1-0.025,n-1);
  tb=tinv(1-0.025/&p,n-1);
  f=finv(0.95,&p,n-&p);
  loone=xbar-t1*sqrt(s2/n);
  upone=xbar+t1*sqrt(s2/n);
  losim=xbar-sqrt(&p*(n-1)*f*s2/(n-&p)/n);
  upsim=xbar+sqrt(&p*(n-1)*f*s2/(n-&p)/n);
  lobon=xbar-tb*sqrt(s2/n);
  upbon=xbar+tb*sqrt(s2/n);
  run;

proc print data=b;
  run;

In terms of using the program with other datasets, basically what you need to do is create a separate line starting with the variable for each variable that is included in the analysis. In this case, we have five nutritional variables so we have five of these lines in the first data step. Then you set this equal to, in quotes, the name you wish to give that variable. After the semicolon, we set x = to the name that we specified for that variable in the input statement and finally after another semicolon, we type in "output;".

The output is available to download here: nutrient5.lst.

The output contains the sample means for each of the variables, and gives the results of the calculations under data step "b".

Confidence intervals are given by the columns for "losim" and "upsim"

Minitab does not support this function.

The following table gives the confidence intervals:

Variable	\(μ_{0}\)	95% Confidence Interval
Calcium	1000 mg	575.1, 673.0
Iron	15 mg	10.4, 11.9
Protein	60 g	62.0, 69.6
Vitamin A	800 μg	638.3, 1040.9
Vitamin C	75 mg	69.9, 88.0

Looking at these simultaneous confidence intervals we can see that the upper bound of the interval for calcium falls below the recommended daily intake of 1000 mg. Similarly, the upper bound for iron also falls below the recommended daily intake of 15 mg. Conversely, the lower bound for protein falls above the recommended daily intake of 60 g of protein. The intervals for both vitamins A and C both contain the recommended daily intake for these two vitamins.

Therefore, we can conclude that the average daily intake of calcium and iron fall below the recommended levels, and the average daily intake of protein exceeds the recommended level.

Problem with Simultaneous Confidence Intervals

The problem with the simultaneous confidence intervals is that if we are not interested in all possible linear combinations of variables or anything other than the means by themselves, then the simultaneous confidence interval may be too wide, and hence, too conservative. As an alternative to the simultaneous confidence intervals, we can use the Bonferroni intervals.

Bonferroni Intervals

The Bonferroni intervals are given in the expression below:

\(\bar{x}_j \pm t_{n-1, \frac{\alpha}{2p}}\sqrt{s^2_j/n}\)

An example from the USDA Women’s Health Survey data will illustrate this calculation.

Example 7-4: Women’s Health Survey Section

Here, the 95% confidence interval for calcium under the Bonferroni correction is calculated:

\(624.04925 \pm t_{737-1, \frac{0.05}{2 \times 5}}\sqrt{157829.4/737}\)

\(624.04925 \pm 2.576 \times 14.63390424\)

\(624.04925 \pm 37.69694\)

\((586.35231, 661.74619)\)

This calculation uses the values for the sample mean for calcium, 624.04925, the critical value from the t-table, with n-1 degrees of freedom, evaluated at alpha over 2 times p, (0.05 divided 2 times 5, or .005). This critical value turns out to be 2.576 from the t-table. The standard error is calculated by taking the square root of the sample variance, (157,829.44), divided by the sample size, 737.

Carrying out the math, the interval goes from 586.35 to 661.75.

Example
Example

These calculations can also be obtained in the SAS program with the downloadable code below. The calculations of the upper and lower Bonferroni intervals are given by "lobon" and "upbon" at the end of data step "b". They involve the calculations:

lobon=xbar-tb*sqrt(s2/n);
upbon=xbar+tb*sqrt(s2/n);

SAS Program can be downloaded here: nutrient5.sas

At this time Minitab does not support this procedure.

The table below shows the results of the computation

Variable	\(\mu_{0}\)	95% Confidence Interval
Calcium	1000 mg	586.3, 661.8
Iron	15mg	10.6, 11.7
Protein	60g	62.9, 68.7
Vitamin A	800 μg	684.2, 995.0
Vitamin C	75 mg	71.9, 85.9

When compared to the simultaneous intervals, we see that the Bonferroni intervals are narrower. However, in this case, the conclusions will be the same. The confidence intervals for both vitamins A and C both cover their respective recommended daily intakes. Intervals for calcium and iron fall below the recommended intake, while the interval for protein falls above it.