Lesson 28: Choosing Appropriate Statistical Methods

If we look back at where we've been this semester, it can quickly feel as if we've hiked the entire length of the 2,180-mile-long Appalachian Trail. Just think about it! Among other things, we've learned about:

Point estimation, including maximum likelihood estimation, method of moments, and sufficiency
Confidence intervals for means, differences in two means, variances, proportions, and differences in two proportions
Determining the sample size necessary to estimate a parameter with a certain error \(\epsilon\)
Linear regression as a way of estimating and testing for the existence of a linear relationship between two continuous variables
Hypothesis testing, including best critical regions and likelihood ratio tests
Hypothesis tests for means, the equality of two means, variances, proportions and the equality of two proportions
Determining the sample size necessary to conduct a hypothesis test for a parameter with a certain power
One-factor analysis of variance as a way of testing for the equality of three or more population means
Two-factor analysis of variance as a way of testing for the effects of two qualitative factors on a continuous variable
Chi-square goodness-of-fit tests and contingency tables
Using order statistics to derive distribution-free confidence intervals for percentiles
Nonparametric methods, such as the sign test, the Wilcoxon signed rank test, the run test, and the test for randomness
Using the Kolmogorov-Smirnov test statistic to test whether data come from a particular distribution function \(F_{0}(x)\)
Bayesian methods

That's all well and good, but we haven't yet had much practice putting it all together to choose which of the above statistical methods is most appropriate for a given situation. For example, suppose we were interested in learning how many times each semester Penn State students go "home." What statistical method(s) would be most appropriate for answering our research question? Or, suppose we were interested in determining whether or not a higher percentage of Alaskans commit suicide than non-Alaskans. What statistical methods could we use? These are the kinds of questions we'll tackle in this lesson. The algorithm that I propose in this lesson is perhaps not flawless, but by using it, I can almost always figure out what kind of analysis is appropriate for a given situation. Choosing the correct analysis depends, at the very least, on the answers to the following four questions:

What type of response variable do we have? More specifically, is it a continuous or categorical variable?
How many groups are being studied or compared? Is it one, two, or more?
What is the research question? Are we asking "is it this," so that we need to conduct a hypothesis test? Or are we asking "what is it," so that we need to calculate a point estimate or a confidence interval?
What assumptions can we safely make about the data? Can we assume that the data are normally distributed? Can we assume the variances of two populations are equal? Are the groups dependent or independent?
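To make the idea concrete, here is a minimal Python sketch of how the answers to the four questions might be turned into a suggested method. It is only an illustration, not the algorithm developed in the rest of this lesson: the function name `suggest_method`, its arguments, and the particular branches are all assumptions made for this example, and the returned labels are simply drawn from the list of topics above.

```python
def suggest_method(response, groups, goal, normal=True, paired=False):
    """Suggest a candidate analysis from the four questions.

    Hypothetical helper for illustration only; a real decision
    involves more assumptions than are encoded here.

    response: "continuous" or "categorical" (question 1)
    groups:   number of groups studied or compared (question 2)
    goal:     "test" for "is it this?", "estimate" for "what is it?" (question 3)
    normal, paired: simplified stand-ins for the assumptions (question 4)
    """
    if response not in ("continuous", "categorical"):
        raise ValueError("response must be 'continuous' or 'categorical'")
    if goal not in ("test", "estimate"):
        raise ValueError("goal must be 'test' or 'estimate'")
    if groups < 1:
        raise ValueError("groups must be at least 1")

    testing = goal == "test"

    if response == "categorical":
        if groups == 1:
            return ("hypothesis test for one proportion" if testing
                    else "confidence interval for one proportion")
        if groups == 2:
            return ("hypothesis test for the equality of two proportions" if testing
                    else "confidence interval for the difference in two proportions")
        # Three or more groups: this sketch only covers the testing case.
        return "chi-square test on a contingency table"

    # Continuous response from here on.
    if groups == 1:
        if normal:
            return ("hypothesis test for one mean" if testing
                    else "confidence interval for one mean")
        # Without normality, fall back on distribution-free methods.
        return ("sign test or Wilcoxon signed rank test" if testing
                else "distribution-free confidence interval for a percentile")
    if groups == 2:
        if paired:
            # Dependent groups: analyze the within-pair differences.
            return ("hypothesis test for the mean of paired differences" if testing
                    else "confidence interval for the mean of paired differences")
        return ("hypothesis test for the equality of two means" if testing
                else "confidence interval for the difference in two means")
    # Three or more groups: this sketch only covers the testing case.
    return "one-factor analysis of variance"


# The two motivating examples, under the assumptions of this sketch:
print(suggest_method("continuous", 1, "estimate"))  # trips "home" per semester
print(suggest_method("categorical", 2, "test"))     # Alaskans vs. non-Alaskans
```

In this sketch, the trips "home" question is treated as one group with a numeric response and a "what is it" goal, while the Alaskan question is treated as two groups with a binary response and an "is it this" goal. Treating a count of trips as a continuous response is itself a simplifying assumption.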

As you'll soon see while working through the material in this lesson, choosing the correct analysis hinges on the answers to these questions. We'll start by considering the methods that are available to us when we have one categorical (or perhaps, more specifically, binary) variable. Then, we'll move to the situation in which we have one continuous response variable. After that, we'll consider the situation in which we have two continuous measurements, before concluding with some practice questions.