From our example, we may be interested in the relationship of age with cholesterol, and want to consider a possible confounder (or effect modifier) of sex.

- The outcome is cholesterol and is a continuous value.
- The predictors/covariates to be considered are age and sex. Age can be either continuous, or put into categories, and sex is a categorical variable.

## Descriptive

For the continuous outcome of cholesterol, first, we can look at the distribution of the data via a histogram and by calculating descriptive statistics:

Analysis Variable: Cholesterol

| N | Mean | Std Dev | Lower 95% CL for Mean | Upper 95% CL for Mean | Minimum | 25th Pctl | Median | 75th Pctl | Maximum |
|---|---|---|---|---|---|---|---|---|---|
| 5057 | 227.42 | 44.94 | 226.18 | 228.66 | 96.00 | 196.00 | 223.00 | 255.00 | 568.00 |

Here, we see that cholesterol appears normally distributed, with a mean of 227.4 and confidence interval around the mean of (226.18, 228.66). This CI is very narrow due to the large sample size.
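To make the link between sample size and interval width concrete, the following minimal sketch recomputes the confidence limits from the N, mean, and standard deviation in the table. It assumes a normal-approximation critical value of 1.96, which is not stated in the output; with more than 5,000 observations a t-based value would be nearly identical.

```python
import math

# Summary statistics copied from the descriptive table above.
n, mean, sd = 5057, 227.42, 44.94

# Standard error of the mean shrinks with the square root of n.
se = sd / math.sqrt(n)

# Assumption: normal-approximation critical value for a 95% interval.
z = 1.96
lower, upper = mean - z * se, mean + z * se

print(f"SE = {se:.3f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Because the standard error divides the standard deviation by the square root of N, the same spread in a sample of 50 patients would give an interval roughly ten times wider.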

Since we have a continuous outcome, we will likely plan to use linear regression. We can test for normality, but with such a large sample size it is still reasonable to use linear regression even if there appears to be a deviation from normality. With smaller datasets, or highly skewed data, a transformation may be necessary. The Kolmogorov-Smirnov test for normality of cholesterol does give a significant p-value (p<0.01), but because the sample is so large, we will still proceed with linear regression.

## Bivariable Associations

We hypothesize that age is related to cholesterol, with cholesterol increasing with increasing age. Since age is continuous, we can use it as a continuous predictor, or we can categorize it to help with visualization or interpretability.

Treating age as continuous would lead us to look at a scatter plot between the two continuous variables, as well as estimate a correlation coefficient as a measure of association.

We see that cholesterol does appear to increase as age increases, and the best-fit line has a positive slope. The correlation coefficient between the two variables is 0.27. A correlation coefficient ranges from -1 to 1. The closer it is to 1 (or -1), the stronger the correlation; a value of exactly 1 (or -1) indicates perfect correlation, with all points falling along a single line. Values close to 0 indicate no linear relationship, and the graph would appear to be a random cloud of points. The sign of the coefficient gives the direction: a positive correlation means that as one variable increases, so does the other, and a negative correlation means that as one variable increases, the other decreases.
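The behavior of the correlation coefficient at its extremes can be illustrated with a short sketch. The function below computes a Pearson correlation from first principles; the `age` and cholesterol lists are hypothetical toy values chosen for illustration, not the study data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-products and sums of squares around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical toy values, not the study data.
age = [30, 40, 50, 60]
perfect_pos = [200, 210, 220, 230]  # rises exactly linearly with age
perfect_neg = [230, 220, 210, 200]  # falls exactly linearly with age
noisy = [215, 205, 230, 220]        # rises on average, with scatter

r_pos = pearson_r(age, perfect_pos)
r_neg = pearson_r(age, perfect_neg)
r_noisy = pearson_r(age, noisy)

print(f"perfect positive: {r_pos:.2f}")
print(f"perfect negative: {r_neg:.2f}")
print(f"noisy positive:   {r_noisy:.2f}")
```

The two exactly linear series give coefficients of 1 and -1, while the noisy series gives a positive value well below 1, which is the situation seen for age and cholesterol.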

We could also group age into categories and look at the relationship. Here, we would calculate means per group, and could visualize the relationship with boxplots.

| agegrp | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| <40 | 1877 | 36.03% | 1877 | 36.03% |
| [40-50] | 1741 | 33.42% | 3618 | 69.46% |
| >=50 | 1591 | 30.54% | 5209 | 100.00% |

Analysis Variable: Cholesterol

| agegrp | N | Mean | Std Dev | Lower 95% CL for Mean | Upper 95% CL for Mean | Minimum | 25th Pctl | Median | 75th Pctl | Maximum |
|---|---|---|---|---|---|---|---|---|---|---|
| <40 | 1819 | 213.18 | 41.97 | 211.25 | 215.11 | 115.00 | 183.00 | 209.00 | 235.00 | 534.00 |
| [40-50] | 1690 | 229.63 | 42.92 | 227.59 | 231.68 | 117.00 | 200.00 | 226.00 | 253.00 | 568.00 |
| >=50 | 1548 | 241.73 | 45.49 | 239.46 | 244.00 | 96.00 | 210.00 | 238.00 | 270.00 | 425.00 |

We see that about a third of patients are in each age group (<40, 40 - 50, and 50 and older), and that for each increasing age group, the mean cholesterol is higher. For the boxplot, the box indicates the 25th, 50th (median), and 75th percentiles as the bottom, middle, and top of the box, respectively. The marker inside the box shows the mean, which is often close to the median for large sample sizes with normally distributed data. The whiskers extend out relative to the interquartile range, and data points that fall out of that limit are shown with dots.
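As a rough illustration of how the whiskers relate to the interquartile range, the sketch below applies the common 1.5 × IQR rule to the quartiles of the <40 group from the table above. The 1.5 multiplier is an assumption: the text does not say which rule the plotting software used.

```python
# Quartiles, minimum, and maximum for the <40 age group,
# copied from the descriptive table above.
q1, q3 = 183.0, 235.0
group_min, group_max = 115.0, 534.0

# Assumption: whiskers reach at most 1.5 * IQR beyond the box edges.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond a fence would be drawn as individual dots.
min_is_outlier = group_min < lower_fence
max_is_outlier = group_max > upper_fence

print(f"IQR = {iqr}, fences = ({lower_fence}, {upper_fence})")
print(f"minimum plotted as a dot: {min_is_outlier}")
print(f"maximum plotted as a dot: {max_is_outlier}")
```

Under this rule the group's minimum of 115 sits inside the lower fence, while the maximum of 534 lies far above the upper fence and would appear as a separate dot.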

Since we are also interested in sex, we should summarize that variable as well. Females have higher cholesterol on average than males, but only by about 2.5 points:

Analysis Variable: Cholesterol

| sex | N | Mean | Std Dev | Lower 95% CL for Mean | Upper 95% CL for Mean | Minimum | 25th Pctl | Median | 75th Pctl | Maximum |
|---|---|---|---|---|---|---|---|---|---|---|
| Female | 2774 | 228.54 | 46.92 | 226.79 | 230.29 | 117.00 | 196.00 | 224.00 | 257.00 | 493.00 |
| Male | 2283 | 226.05 | 42.37 | 224.31 | 227.79 | 96.00 | 198.00 | 223.00 | 250.00 | 568.00 |

## Modeling (Multivariable Associations)

In order to look at the relationship of multiple variables with our outcome, we need to move to modeling. With a continuous outcome, we can use linear regression.

First we want to see if the differences in cholesterol by age group are significant. Our model can then be fit with just age group as a covariate and we see:

Analysis of Maximum Likelihood of Parameter Estimates

| Parameter | Level | DF | Estimate | Standard Error | Wald 95% Lower Confidence Limit | Wald 95% Upper Confidence Limit | Wald Chi-Square | Pr > ChiSq |
|---|---|---|---|---|---|---|---|---|
| Intercept | | 1 | 213.1781 | 1.0170 | 211.1848 | 215.1715 | 43934.9 | <.0001 |
| agegrp | >=50 | 1 | 28.5525 | 1.4999 | 25.6127 | 31.4923 | 362.36 | <.0001 |
| agegrp | [40-50] | 1 | 16.4550 | 1.4655 | 13.5827 | 19.3273 | 126.07 | <.0001 |
| agegrp | <40 | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | | |

The estimate for the difference in cholesterol between the oldest and youngest age groups is 28.6 (which we can confirm from our earlier descriptive table), the CI for this estimate is (25.6, 31.5), and the p-value is <0.0001, all providing clear evidence of a significant difference in cholesterol between the oldest and youngest age groups. A similar conclusion holds for the middle age group, whose cholesterol is significantly higher than that of the youngest group, on average by about 16.5 points.
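The arithmetic behind this table can be checked by hand. The sketch below confirms that the age-group estimates equal differences in the group means from the descriptive table, and rebuilds the Wald interval and chi-square for the oldest group from its estimate and standard error. It assumes the usual normal critical value of about 1.96; small discrepancies from the printed output come from rounding of the inputs.

```python
# Group means from the descriptive table by age group.
mean_lt40, mean_40_50, mean_ge50 = 213.18, 229.63, 241.73

# Estimate and standard error for >=50 vs <40 from the model output.
estimate, std_err = 28.5525, 1.4999

# With one categorical predictor, each coefficient is the difference
# between that group's mean and the reference group's mean.
diff_old = mean_ge50 - mean_lt40
diff_mid = mean_40_50 - mean_lt40

# Assumption: normal critical value for a two-sided 95% Wald interval.
z = 1.959964
wald_lower = estimate - z * std_err
wald_upper = estimate + z * std_err

# Wald chi-square is the squared ratio of estimate to standard error.
wald_chisq = (estimate / std_err) ** 2

print(f"mean differences: {diff_old:.2f}, {diff_mid:.2f}")
print(f"Wald 95% CI: ({wald_lower:.2f}, {wald_upper:.2f})")
print(f"Wald chi-square: {wald_chisq:.1f}")
```

The intercept of 213.18 is likewise the mean of the reference group (<40), which is why the reference row carries an estimate of zero.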

Next, we may want to see if this relationship still holds after controlling for sex. The model including both covariates shows this:

Analysis of Maximum Likelihood of Parameter Estimates

| Parameter | Level | DF | Estimate | Standard Error | Wald 95% Lower Confidence Limit | Wald 95% Upper Confidence Limit | Wald Chi-Square | Pr > ChiSq |
|---|---|---|---|---|---|---|---|---|
| Intercept | | 1 | 211.7290 | 1.2206 | 209.3367 | 214.1213 | 30090.1 | <.0001 |
| agegrp | >=50 | 1 | 28.5721 | 1.4993 | 25.6336 | 31.5107 | 363.18 | <.0001 |
| agegrp | [40-50] | 1 | 16.4595 | 1.4648 | 13.5885 | 19.3305 | 126.26 | <.0001 |
| agegrp | <40 | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | | |
| Sex | Female | 1 | 2.6280 | 1.2252 | 0.2266 | 5.0293 | 4.60 | 0.0320 |
| Sex | Male | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | | |

The estimates for differences by age group are essentially unchanged after controlling for sex: about 28.6 points higher for the oldest vs the youngest age group, and about 16.5 points higher for the middle vs the youngest group. Thus, it does not appear that sex is a confounder. This model is also consistent with the simple descriptives of cholesterol by sex, which showed that on average females have slightly higher cholesterol (about 2.5 points).
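One way to quantify "about the same" is the percent change in each age-group estimate when sex is added to the model. The sketch below computes it from the two tables of estimates. The 10% cutoff is a common rule of thumb that is assumed here for illustration; it does not come from this analysis.

```python
# Age-group estimates from the model without sex (unadjusted)
# and from the model with sex (adjusted).
unadjusted = {">=50": 28.5525, "[40-50]": 16.4550}
adjusted = {">=50": 28.5721, "[40-50]": 16.4595}

# Percent change in each estimate after adjusting for sex.
pct_change = {
    level: 100 * (adjusted[level] - unadjusted[level]) / unadjusted[level]
    for level in unadjusted
}

# Assumption: a change of 10% or more is treated as a sign of confounding.
threshold = 10.0
suggests_confounding = any(abs(v) >= threshold for v in pct_change.values())

for level, change in pct_change.items():
    print(f"{level}: {change:.2f}% change")
print(f"suggests confounding: {suggests_confounding}")
```

Both estimates move by well under a tenth of a percent, far below any reasonable cutoff, which agrees with the conclusion that sex is not confounding the age-cholesterol relationship.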

Finally, we may want to investigate whether sex is an effect modifier, so we also include the interaction term agegrp*sex. The p-value for the interaction is significant, and the model gives these estimated mean cholesterol values per group:

| agegrp | Female | Male |
|---|---|---|
| <40 | 206.1 | 221.9 |
| [40-50] | 230.2 | 228.9 |
| >=50 | 253.4 | 227.8 |

We can see that as age group increases, so does cholesterol, but much more dramatically in females. Thus sex is an effect modifier of the relationship between age and cholesterol. Males have an average cholesterol of around 220-230, and this does not seem to change with age. Females, on the other hand, have a much greater change in cholesterol with increasing age. Graphing the means by group makes this clearer: the line connecting the means for males is mostly flat, while the line connecting the means for females has a clear positive slope.
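The size of the effect modification can be read directly from the table of estimated means. The following sketch computes the change from the youngest to the oldest age group within each sex, and then the difference between those two changes. The dictionary layout and variable names are illustrative choices; the numbers are the estimated means reported above.

```python
# Estimated mean cholesterol by sex and age group, from the table above.
cell_means = {
    "female": {"<40": 206.1, "[40-50]": 230.2, ">=50": 253.4},
    "male": {"<40": 221.9, "[40-50]": 228.9, ">=50": 227.8},
}

# Change in mean cholesterol from the youngest to the oldest group,
# computed separately within each sex.
change_by_sex = {
    sex: means[">=50"] - means["<40"] for sex, means in cell_means.items()
}

# The interaction is the difference between those two changes.
diff_in_diff = change_by_sex["female"] - change_by_sex["male"]

for sex, change in change_by_sex.items():
    print(f"{sex}: {change:+.1f} points from <40 to >=50")
print(f"difference between sexes: {diff_in_diff:.1f} points")
```

If sex did not modify the age effect, the two within-sex changes would be similar and their difference would be close to zero; here the increase is about 47 points for females and about 6 points for males.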