Lesson 9 - Statistical Analysis Methods

Lesson 9 - Statistical Analysis Methods

Lesson 9 Objectives

Upon completion of this lesson, you should be able to:

  • Use plots, tables, and summary statistics to describe variables and relationships between variables
  • Identify which modeling strategy to use based on the type of data (continuous, categorical, time-to-event)
  • Interpret results of statistical analyses
  • Differentiate between odds ratios, risk ratios, and hazard ratios

 

Epidemiologic data can be analyzed using a variety of statistical methods.  Here we outline the fundamentals according to the type of outcome measure.  We can generally think of outcome data as one of three types: 1) continuous, 2) categorical, and 3) time-to-event.  While it is true that time-to-event is continuous, we do not always observe the true time for each person, so special consideration needs to be given in that scenario.  Once the type of outcome data is known, there are standard techniques one can use to provide descriptive statistics, look at bivariable associations, and use modeling to describe the association between multiple covariates and the outcome.  

Example: a subset of data from the Framingham Heart Study

Our motivating example will be based on the SAS-provided dataset “Heart” which includes a small subset of data from participants in the Framingham Heart Study.  This dataset contains over 5000 patients from the cohort study and provides data on their baseline age, sex, weight, smoking status, cholesterol, blood pressure, and coronary heart disease (CHD) development.  Patients were contacted every 2 years for over 30 years. 


9.1 - Continuous outcome

9.1 - Continuous outcome

From our example, we may be interested in the relationship of age with cholesterol, and want to consider a possible confounder (or effect modifier) of sex.

  • The outcome is cholesterol and is a continuous value.
  • The predictors/covariates to be considered are age and sex.  Age can be either continuous, or put into categories, and sex is a categorical variable.  

Descriptive

For the continuous outcome of cholesterol, first, we can look at the distribution of the data via a histogram and by calculating descriptive statistics:

distribution of cholesterol graph

 

Analysis Variable: Cholesterol
N Mean Std
Dev
Lower
95% CL
for Mean
Upper
95% CL
for Mean
Minimum 25th
Pctl
Median 75th
Pctl
Maximum
5057 227.42 44.94 226.18 228.66 96.00 196.00 223.00 255.00 268.00

Here, we see that cholesterol appears normally distributed, with a mean of 227.4 and confidence interval around the mean of (226.18, 228.66).  This CI is very narrow due to the large sample size.  

Since we have a continuous outcome, we will likely plan to use linear regression.  We can do a  test for normality, but with such a large sample size, even if there appears to be a deviation from normality, it is still reasonable to use linear regression. With smaller datasets, or highly skewed data, a transformation may be necessary.  The Kolmogorov-SMirnov test for normality for cholesterol does result in a significant p-value (p<0.01), but since we have such a large sample size, we will still proceed with linear regression. 

Bivariable Associations

We hypothesize that age is related to cholesterol, with cholesterol increasing with increasing age.  Since age is continuous, we can use it as a continuous predictor, and we may want to categorize it to help with visualization or interpretability.

Treating age as continuous would lead us to look at a scatter plot between the two continuous variables, as well as estimate a correlation coefficient as a measure of association. 

cholesterol regression analysis graph

We see that cholesterol does appear to increase as age increases, and this best fit line suggests a positive slope.  The correlation coefficient between the two variables is 0.27.  A correlation coefficient ranges from -1 to 1 with values closest to 0 indicating no relationship.  The closer to 1 (or -1) the correlation coefficient is, the stronger the correlation.  A correlation coefficient of 1 (or -1) would indicate perfect correlation - as demonstrated by all points falling along a single line. Values closer to 0 indicate no relationship and the graph would just appear to be a random cloud of points. The positive or negative sign of the correlation coefficient indicates if it is a positive or negative correlation.  Positive correlation means that as one variable increases, so does the other, and negative means that as one variable increases, the other decreases.  

We could also group age into categories and look at the relationship.  Here, we would calculate means per group, and could visualize the relationship with boxplots. 

agegrp Frequency Percent Cumulative
Frequency
Cumulative
Percent
<40 1877 36.03% 1877 36.03%
[40-50] 1740 33.42% 3618 69.46%
>=50 1591 30.54% 52.09 100.00%
Analysis Variable: Cholesterol
agegrp N Mean Std
Dev
Lower
95% CL
for Mean
Upper
95% CL
for Mean
Minimum 25th
Pctl
Median 75th
Pctl
Maximum
<40 1819 213.18 41.97 211.25 215.11 115.00 183.00 209.00 235.00 534.00
[40-50] 1690 229.63 42.92 227.59 231.68 117.00 200.00 226.00 253.00 568.00
>=50 1548 241.73 45.49 239.46 244.00 96.00 210.00 238.00 270.00 425.00
cholesterol vs age group boxplot graph

We see that about a third of patients are in each age group (<40, 40 - 50, and 50 and older), and that for each increasing age group, the mean cholesterol is higher.  For the boxplot, the box indicates the 25th, 50th (median), and 75th percentiles as the bottom, middle, and top of the box, respectively.  The marker inside the box shows the mean, which is often close to the median for large sample sizes with normally distributed data.  The whiskers extend out relative to the interquartile range, and data points that fall out of that limit are shown with dots.  

Since we are also interested in sex, we should summarize that vairable as well. Females have higher cholesterol on average than males, but only by about 2 points:

cholesterol vs sex boxplot graph
Analysis Variable: Cholesterol
sex N Mean Std
Dev
Lower
95% CL
for Mean
Upper
95% CL
for Mean
Minimum 25th
Pctl
Median 75th
Pctl
Maximum
Female 2774 228.54 46.92 226.79 230.29 117.00 196.00 224.00 257.00 493.00
Male 2283 226.05 42.37 224.31 227.79 96.00 198.00 223.00 250.00 568.00

Modeling (Multivariable Associations)

In order to look at the relationship of multiple variables with our outcome, we need to move to modeling.  With a continuous outcome, we can use linear regression.  

First we want to see if the differences in cholesterol by age group are significant.  Our model can then be fit with just age group as a covariate and we see:

Analysis of Maximum Likelihood of Parameter Estimates
Parameter   DF Estimate Standard
Error
Wald 95% Confidence
Limits
Wald
Chi-Square
Pr > ChiSq
Intercept   1 213.1781 1.0170 211.1848 215.1715 43934.9 <.0001
agegrp >=50 1 28.5525 1.4999 25.6127 31.4923 362.36 <.0001
agegrp [40-50] 1 16.4550 1.4655 13.5827 19.3273 126.07 <.0001
agegrp <40 0 0.0000 0.0000 0.0000 0.0000    

The estimate for the difference in cholesterol between the oldest and youngest age group is 28.6 (which we can confirm from our earlier descriptive table), the CI for this estimate is (25.6 - 31.5), and the p-value is <0.0001, all clearly providing evidence that there is a significant difference in cholesterol between the oldest and youngest age groups.  A similar conclusion is seen with significantly higher cholesterol in the middle age group compared to the younger - on average about 16.5 points higher.  

Next, we may want to see if this relationship still holds after controlling for sex.  The model including both covariates in the model shows this:

Analysis of Maximum Likelihood of Parameter Estimates
Parameter   DF Estimate Standard
Error
Wald 95% Confidence
Limits
Wald
Chi-Square
Pr > ChiSq
Intercept   1 211.7290 1.2206 209.3367 214.1213 30090.1 <.0001
agegrp >=50 1 28.5721 1.4993 25.6336 31.5107 363.18 <.0001
agegrp [40-50] 1 16.4595 1.4648 13.5885 19.3305 126.26 <.0001
agegrp <40 0 0.0000 0.0000 0.0000 0.0000    
Sex Female 1 2.6280 1.2252 0.2266 5.0293 4.60 0.0320
Sex Male 0 0.0000 0.0000 0.0000 0.0000    

The estimates for differences by age group are still about the same: 28 points higher for oldest vs youngest age group, and 16 points higher for the middle vs youngest group, even after controlling for sex.  Thus, it does not appear that sex is a confounder.  This model is also consistent with the simple descriptives of cholesterol by sex that showed on average females have slightly higher cholesterol (about 2.5 points).

Finally, we may want to investigate if sex is an effect modifier, and thus we also include the interaction term of agegrp*sex.  The p-value for this is significant, and the model estimates show that these are the estimated means per group:

  female male
<40 206.1 221.9
[40-50] 230.2 228.9
>=50 253.4 227.8

We can see that as age group increases, so does cholesterol, but much more dramatically in females.  Thus age group is an effect modifier.  Males have an average cholesterol around 220-230, and this does not seem to change with age.  Females, on the other hand, have a greater change in cholesterol with increasing age.  We can see this better by graphing the means by group and seeing that the mean cholesterol for males is mainly flat line, but the line connecting the means for females has a slope.

mean cholesterol by age group graph

9.2 - Categorical outcome

9.2 - Categorical outcome

From our example, we may be interested in the relationship of BMI with high blood pressure, and want to consider a possible confounder (or effect modifier) of sex.

The outcome is high blood pressure and is a dichotomous value (either present or not).
The predictors/covariates to be considered are BMI and sex.  BMI can be either continuous or put into categories, and sex is a categorical variable.

Descriptive

First, we want to report the percentage of patients who have high blood pressure with a 95% confidence interval.  We find that 43.5% of patients report high blood pressure.  The exact CI for this estimate is (42.2% - 44.8%), again very narrow due to the large sample size.  

high_BP Frequency Percent Cumulative
Frequency
Cumulative
Percent
low/normal 2942 56.48% 2942 56.48%
high 2267 43.52% 5209 100.00%

Bivariable Associations

Next we want to look at the relationship with BMI, and can consider BMI as both continuous and categorical variables

Table of BMIGrp by high_BP
BMIgrp high_BP
low/normal high Total
[18/5-25] normal 1727
(70.32%)
729
(29.68%)
2456
[25-30] overwght 953
(48.38%)
1017
(51.62%)
1970
>= obese 188
(27.09%)
506
(72.91%)
694
Total 2868 2252 5120
Frequency Missing = 89
Percent of Frequency vs BMIgrp divided against high_BP groups graph

We see that as BMI level increases, so does the rate of high BP (30%, 52%, and 73% for increasing levels of BMI).  We can use a chi-squared test here to test the association between the two variables.  It is highly significant, and not surprisingly so, due to the large sample size.

Considerations specifically related to Non-matched Case-Control Studies:

  1. Chi-squared tests can be used for the bivariable association of exposure and outcome.  If any cell counts are less than 5, Fisher’s Exact tests should be used instead.  
  2. If we want to evaluate potential effect modifiers using these types of bivariable association tables, we can use the Mantel-Haenszel statistic, which essentially breaks the exposure * outcome table up by potential effect modifier to evaluate if there are different effects for different strata. 

We can also look at a boxplot or histogram for the continuous version of BMI and see that on average, patients with high BP tend to have higher BMI compared to those without high BP.  

boxplot comparison of high_BP vs BMI
Distribution plots of BMI vs. high BP and low/normal BP
Analysis Variable: bmi
high_BP N Mean Std
Dev
Lower
95% CL
for Mean
Upper
95% CL
for Mean
Minimum 25th
Pctl
Median 75th
Pctl
Maximum
low/normal 2936 24.37 3.61 24.24 24.50 14.12 21.87 24.01 26.49 51.96
high 2263 27.15 4.48 26.97 27.34 15.77 24.05 26.73 29.52 56.68

We can use a two group t-test to compare the means by group, but it is often more streamlined to consider the modeling technique you plan to use, and use that for both bivariable and multivariable associations.  The model with just a single covariate in the model will provide an unadjusted result, and the model with multiple covariates will provide an adjusted result. 

Modeling (Multivariable Associations)

For a dichotomous outcome we may want to estimate odds ratios or risk ratios, and thus will use logistic or log-binomial regression, respectively.  

Logistic regression to estimate Odds Ratio:

Using the table which shows the raw counts in each BMI group who have high BP, we can calculate the OR of high BP for the overweight vs normal BMI group as (1017*1727)/(729*953) = 2.52, and similarly for the obese vs normal groups as (506*1727)/(729*188) = 6.38.

From the logisitic regression model, we get these same estimates, along with 95% CIs:

Label Estimate Standard
Error
Confidence Limits
(OR overwght vs. normal) 2.5281 0.1596 2.2339 2.8610
(OR obese vs. normal) 6.3761 0.6131 5.2809 7.6985

Considerations specifically related to Case-Control Studies:

Remember that for non-matched case-control studies, OR must be calculated since the distribution of exposure is not necessarily representative of the population.  The sampling fractions cancel out in the OR calculation, but not in the RR.  
These logistic regression models can be considered unconditional, which is appropriate for non-matched case control studies, but not MATCHed case control studies. For matched case control studies conditional logistic regression modeling should be used, and the OR is calculated based on concordant and discordant pairs.  

Log-binomial regression to estimate Risk Ratio:

Using the table which shows the raw counts in each BMI group who have high BP, we can calculate the RR of high BP for the overweight vs normal BMI group as (1017/1970)/(729/2456) = 1.74, and similarly for the obese vs normal groups as (506/694)/1017/2456)) = 1.41.

From the log-binomial regression model, we get these same estimates, along with 95% CIs: (note that these are the unadusted RR, which can also be calculated just from the table in the previous section).  Note that if the log-binomial model does not converge, modified poisson regression modeling can be used. 

Label Estimate Standard
Error
Confidence Limits
(RR overwght vs. normal) 1.4536 0.0267 1.3794 1.5317
(RR obese vs. normal) 2.5958 0.0636 2.2914 2.9406

As stated earlier, we often want the RR, so we’ll proceed with those estimates.  But notice how the RR are less extreme than the OR, which is often the case.  And if readers don’t know the distinction between OR and RR, and assume the OR can be interpreted as the RR, they will incorrectly overestimate the difference in risk between groups.

If we want adjusted RR, we can simply add the other covariates to the model. In this case we want to see if sex is a possible confounder or effect modifier.  Adding sex to the model, does not meaningfully change the RR based on BMI (the estimates are essentially the same), thus sex is not a confounder.

  • RR overwght v normal = 1.44
  • RR for obese v normal = 2.59

The model also shows that sex is a significant predictor of blood pressure.  (In the unadjusted setting, we see that the rates of high BP in males is 46% compared to 41% in females - not a huge clinical difference).  The adjusted RR for female v males is 1.06 (95% CI: 1.01 - 1.11).  Suggesting a small increased risk of high BP for females compared to males.

To evaluate sex as a potential effect modifier, we can include an interaction term in the model.  Doing so shows no statistical evidence of an interaction, thus we can assume the relationship between BMI and high blood is similar for both males and females.  If the interaction had been significant, the next step would be to provide stratified analyses, where we estimate RR estimates for BMI with high blood separately for females and males. 


9.3 - Time-to-event outcome

9.3 - Time-to-event outcome

Descriptive

Examples of time-to-event data are:

  • Time to death
  • Time to development of a disease
  • Time to first hospitalization
  • And many others

One may think that time-to-event data is simply continuous, but since we do not observe the true time for each person in the dataset, this is not the case.  The people who do not experience the event still contribute valuable information, and we refer to these patients as “censored”.  We use the time they contribute until they are censored, which is the time they stop being followed because the study has ended, they are lost to follow-up, or they withdraw from the study.   

For our example, we are interested in the time to development of coronary heart disease (CHD).  No patients had CHD upon study entry, and patients were surveyed every 2 years to see if they had developed CHD.  Each patient’s “time-to-CHD” will fall into one of these categories:

  1. They develop CHD within the 30-year study period

    Time = years until they develop CHD
    Status = event

  2. They do not develop CHD within the 30-year study period, and they stay in the study until the end

    Time = 30 years
    Status = censored

  3. They do not develop CHD within the 30-year study period, and they leave the study before the 30-year study period is finished (due to death, moving, lost contact, voluntarily withdraw, etc.)

    Time = time on study
    Status = censored

The best way to describe time-to-event data is by the Kaplan-Meier method.  This uses information from all patients, and differentiates between patients who did and did not experience the event.  A Kaplan Meier (KM) plot is how we visualize time-to-event data and starts with all patients being event-free at time 0.  The KM method uses the number of patients still at risk over time, and patients drop out once they experience the event or are censored.  A Kaplan Meier plot and a Cumulative Incidence plot are inverses of each other, so you can choose which best fits your data.  Often for “Overall Survival” we use KM plots, which start at 100%, and decrease over time as patients either die or are censored.  This can really be considered as plotting the percentage of patients still alive.  For our example, it makes more sense to look at a cumulative incidence plot, which starts at 0% and shows how the incidence of CHD increases over time.  (A KM plot would plot the percent of people who are CHD-free, and this would decrease over time.)  

Product-limit failure curve for CHDtimeyrs

This plot shows that over time CHD is increasing, and we can get estimates of rates of CHD at different time points using the KM estimate.  

Bivariable

When comparing time-to-event data between groups, we can use the KM method again, as well as perform a log-rank test.  For our example, suppose we want to compare time to CHD by BP status.   

Product-limit failure curves for high BP and low/normal BP

This plot shows that those with high BP at study entry (blue line) have higher rates of CHD than those with low or normal BP (red line).  The KM estimates of CHD at 10 years are 12.7% for the high BP group and 4.7% for the low/normal group.  At 20 years, these estimates are 26.1% and 12.0%.  The log-rank test is essentially a comparison of lines, not specifically comparing estimates at any single point, and is highly significant here (p<0.0001). 

Modeling (Multivariable Associations)

We can use Cox Proportional Hazards modeling to estimate the hazard ratio.  This model uses the hazard function which is the probability that if a person survives to time t, they will experience the event in the next instant.
Just from eyeballing the previous plot, it appears that the risk of CHD is about twice as high for those with high BP compared to those with low/normal. Actually fitting a Cox model with high BP as a single covariate shows that the estimated hazard ratio is 1.87 (95% CI: 1.69 - 2.08), which fits with what we see in the plot.  
The Cox models can also include multiple covariates to test for confounding and interaction terms to evaluate effect modification, similar to those in previous sections.  With additional terms in the model, we can estimate adjusted hazard ratios. 


9.4 - Summary

9.4 - Summary

When planning analyses for a study it is important to be clear about what type of data you’ll have.  Once you know if the outcome measure is continuous, categorical, or time-to-event, you can choose the appropriate methods.  Understanding your data is very important, so do not skip the step of looking at descriptive statistics first, including looking at distributions and graphs whenever possible.  Next, you can start to look at associations between variables (bivariable) to get a sense of how variables relate to one another.  This step can and should also use graphs and tables to visualize data whenever helpful.  Once these relationships are understood, modeling techniques can be used.  Models allow for both unadjusted and adjusted estimates to be calculated and can include more than one covariate.  Modeling can be used to evaluate potential confounding along with effect modification.


Legend
[1]Link
Has Tooltip/Popover
 Toggleable Visibility