# 11: Overview of Advanced Statistical Topics

## Overview

#### Case-Study

Jin continues to recognize more and more of the statistical terminology in the research articles for his community development master’s degree. While his confidence is growing, he isn’t quite sure about some of the techniques he is encountering. He recognizes pieces of the material but can’t quite connect them to what he has already learned. Let’s take a closer look at four of his articles and help him draw those connections.

Most introductory statistics classes, including this one, leave off at the point where most applied statistics actually start. As a consumer of statistics, you might notice the foundational principles of introductory statistics embedded in journal articles or research reports; however, the main statistical technique often goes well beyond the level covered in the course. Much like a kiddie roller coaster is a miniaturized version of a real roller coaster, the material to this point in the course has prepared you to begin to ride the “real” version of the ride. This lesson is a very high-level overview of some of the more common advanced statistical techniques.

## Objectives

Upon completion of this lesson, you should be able to:

• Identify the similarities and differences between simple linear regression and advanced regression
• Identify the appropriate application of factor analysis
• Identify the similarities and differences between one-way ANOVA and repeated measures ANOVA
• Identify violations of the assumptions of parametric techniques that lead to the application of non-parametric techniques

# 11.1 - Multiple Regression

#### Case-Study: First Article

Let’s take a look at Jin’s first article, which examines the impact of several housing demographics on unemployment in Midwestern towns. Specifically, the article predicts unemployment from three other variables.

Foundational Concepts: The key foundational concepts this article builds upon are:

• Predicting a quantitative response variable from a quantitative predictor variable
• Interpreting the significance of the slope of a predictor variable
• Interpreting the model significance from an Analysis of Variance table

Jin recognizes that unemployment is measured as a quantitative variable. The three predictor variables in the article are the number of senior citizens, the number of high school graduates, and the number of businesses in the town. Jin recognizes some of the output as regression output; however, in this format there are multiple lines of output, one for each of the predictors.

We can show Jin that each row of output contains the same information he saw in simple linear regression with one predictor. Each line contains the slope for that predictor (b1, b2, b3) and the significance of the slope. We remind Jin that the null hypothesis is the same for all three (that the slope is zero) and that the regression technique uses a t-test to test the significance of each slope.

The difference with multiple linear regression is that each coefficient has a slightly different interpretation. When interpreting any one coefficient, we assume the other variables are held constant. For example, the coefficient for high school graduates describes how unemployment changes as the number of high school graduates changes, holding the number of senior citizens and the number of businesses constant.

##### Coefficients

| Predictor | Coef | SE Coef | T-Value | P-Value | VIF |
| --- | --- | --- | --- | --- | --- |
| Constant | 4.285 | 0.824 | 5.20 | 0.000 | |
| Number of Seniors | 0.06033 | 0.00870 | 6.93 | 0.000 | 1.49 |
| Number of Businesses | -0.1315 | 0.0464 | -2.84 | 0.005 | 1.48 |
| Number of High School Graduates | -0.000348 | 0.000148 | -2.36 | 0.019 | 1.07 |
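
To make the table concrete, here is a minimal sketch in plain Python using only the numbers reported above. The town with 150 seniors, 40 businesses, and 900 high school graduates is a hypothetical example, not a case from the article.

```python
# Each t-value in the coefficients table is the estimated slope divided
# by its standard error (t = coef / SE); a large |t| and small p-value
# mean the slope differs significantly from zero.
coef_seniors, se_seniors = 0.06033, 0.00870
t_seniors = coef_seniors / se_seniors
print(round(t_seniors, 2))  # 6.93, matching the Number of Seniors row

# The fitted model is the intercept plus each slope times its predictor.
# Hypothetical town (values invented for illustration): 150 seniors,
# 40 businesses, 900 high school graduates.
b0, b_seniors, b_businesses, b_grads = 4.285, 0.06033, -0.1315, -0.000348
predicted = b0 + b_seniors * 150 + b_businesses * 40 + b_grads * 900
print(round(predicted, 2))  # 7.76
```

Notice that changing any one predictor by one unit moves the prediction by exactly that predictor's coefficient, which is the "holding the others constant" interpretation in action.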

The F test also has a slightly different interpretation. In simple linear regression, the F test tested whether the single slope was zero (just like the t test), and with only one predictor the t and F tests always agreed. In Jin’s article, the F test still tests the betas, only now the null hypothesis is that all of the betas are zero and the alternative hypothesis is that at least one of the betas is not zero.

##### Analysis of Variance

| Source | DF | Adj SS | Adj MS | F-Value | P-Value |
| --- | --- | --- | --- | --- | --- |
| Regression | 3 | 1795.0 | 598.34 | 16.95 | 0.000 |
| Number of Seniors | 1 | 1696.5 | 1696.54 | 48.07 | 0.000 |
| Number of Businesses | 1 | 284.2 | 284.15 | 8.05 | 0.005 |
| Number of High School Graduates | 1 | 196.9 | 196.89 | 5.58 | 0.019 |
| Error | 318 | 11223.5 | 35.29 | | |
| Total | 321 | 13018.5 | | | |
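
As a quick check on how the pieces of an ANOVA table fit together, the model F statistic is just the regression mean square divided by the error mean square. A sketch using the numbers above (each mean square is a sum of squares divided by its degrees of freedom):

```python
# Mean squares are sums of squares divided by their degrees of freedom;
# the model F statistic is the ratio of the two mean squares.
ss_regression, df_regression = 1795.0, 3
ss_error, df_error = 11223.5, 318

ms_regression = ss_regression / df_regression  # ~598.33
ms_error = ss_error / df_error                 # ~35.29
f_stat = ms_regression / ms_error
print(round(f_stat, 2))  # 16.95, matching the reported model F
```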

Other than that, the assumptions are the same, so Jin will have a good understanding of how to interpret this advanced form of regression!

# 11.2 - Factor Analysis

#### Case-Study: Second Article

Jin’s second article uses a survey to ask town residents about their perceptions of unemployment. The survey contains over 60 items and reports overall attitudes about unemployment along with subscales measuring perceived locus of control, the importance of sustainability, and optimism about the future. The article reports descriptive statistics for the overall score along with descriptive statistics for the three subscales. Jin recognizes the descriptive statistics but wonders how the researchers arrived at the three subscales based on the 60 original items. Let’s help Jin better understand how the authors of the article moved from a 60-item survey to the three subscales.

Foundational Concepts: The key foundational concepts this article builds upon are:

• Covariance (correlations) among quantitative variables
• Descriptive statistics

When researchers, particularly those who use surveys, encounter many variables in their data, such as those produced by the 60-item survey in Jin’s article, they need to find ways to reduce the number of variables. Presenting the means and standard deviations of every item, and graphing each of them, is overwhelming; it also inflates the number of predictor variables in any model!

The idea of “reducing” the number of variables is grounded in the concept of reliability. While reliability is not something this course focuses on, it basically means that items measuring similar ideas behave the same way. In Jin’s example, the survey might have three items measuring locus of control (how empowered people feel to take action to avoid unemployment), yet each item asks the question in a different way. For example, these are three questions from the survey in Jin’s article.

1. There are enough jobs in this town I am qualified to apply for
2. I have the skills that would qualify me for jobs in this town
3. Industry in this town is related to my skillset

If you have ever taken a survey, you may have noticed it asking very similar questions, just as the three questions related to empowerment above do. This is done on purpose. The researcher can then analyze the questions to see how they relate to each other, through a technique called factor analysis.

As a general overview, factor analysis uses the correlations (actually the covariances) of items with one another. Items that co-vary can be grouped together into “factors.” This reduces the number of variables a researcher needs to include in any analysis. So instead of 60 items, the researchers in Jin’s article only have to deal with three subscale scores and the overall score! Jin is happy now that he understands that, in reading this article, he really only needs to focus on these three subscales, making it much easier to understand the perceptions of unemployment.
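
To see the intuition behind grouping items into factors, here is a minimal sketch in plain Python. The five respondents’ 1–5 agreement ratings are invented for illustration, not data from Jin’s article; the point is that items that correlate strongly are candidates for the same factor.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: the covariance of x and y scaled by their
    standard deviations, so it always lands between -1 and 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(ssx * ssy)

# Hypothetical 1-5 agreement ratings from five respondents:
jobs_available = [1, 2, 3, 4, 5]   # "There are enough jobs..."
have_skills    = [2, 2, 3, 5, 5]   # "I have the skills..." (same idea, reworded)
unrelated_item = [5, 1, 4, 2, 3]   # an item about something else entirely

print(round(pearson_r(jobs_available, have_skills), 2))    # high: same factor
print(round(pearson_r(jobs_available, unrelated_item), 2))  # low: different factor
```

A real factor analysis works from the full covariance matrix of all 60 items at once, but this pairwise view is the core idea it builds on.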

# 11.3 - Repeated ANOVA

#### Case-Study: Third Article

Jin’s third article looks at three different sizes of region (small town, suburb, and city) over the past 10 years. The article records the percentage of total land set aside for recreational use in each area each year. The article concludes that suburbs have been the most successful at setting aside recreational land. Jin is a little confused: he recognizes that the article uses an ANOVA, with the three region sizes as the independent variable, but doesn’t understand how the researchers included the 10-year perspective.

Foundational Concepts: The key foundational concepts this article builds upon are:

• One-way and two-way (factorial) ANOVA
• Assumption of independence from ANOVA
• Paired and independent t tests
• Covariance

To understand this idea of a time element, we need to return to the statistical concept of independence we saw in the assumptions for linear models as well as for paired t tests. Independence means that two measures are not related, which is primarily a function of the research methods used to collect the data. Through random selection we can usually assume measures are independent: think of drawing numbers out of a hat, where the second number drawn is independent of the first.

However, in the real world we often encounter research where the concept of independence cannot be applied, and Jin’s article is such an example. We can see from the data presented below that each region has multiple entries (time 1, time 2, time 3). Whenever you are taking measurements from the same unit over time, the measures are no longer independent. In Jin’s example, the percentage of land set aside for Shamrock Town at time 1 is related to the percentage set aside at time 2: the local beliefs about land use, the demand for land, and the people living in the town will remain relatively constant across the two time points.

| Time | Region | Percent land |
| --- | --- | --- |
| 1 | 0 | 11 |
| 1 | 1 | 26 |
| 1 | 2 | 20 |
| 2 | 0 | 56 |
| 2 | 1 | 83 |
| 2 | 2 | 71 |
| 3 | 0 | 15 |
| 3 | 1 | 34 |
| 3 | 2 | 41 |
| 4 | 0 | 6 |

While violating this assumption of independence would normally invalidate a technique such as the general linear model, modern computing, together with awareness of the research design used to collect the data, allows researchers, like the ones authoring Jin’s article, to properly account for these “repeated,” non-independent measures by using a repeated measures ANOVA.

Like factor analysis, the repeated measures ANOVA focuses on the covariance of the repeated observations within a unit. In Jin’s example, the repeated measures ANOVA takes into account the covariance of the time 1 and time 2 measurements for Shamrock Town. By taking this relationship into account, the model can appropriately compensate for the violation of independence and calculate the best fit. The snippet of output below demonstrates how Minitab takes the time variable into account; however, the output of interest is the “Test of Fixed Effects” for the region variable (just like the one-way ANOVA output)!

##### Variance Components

| Source | Var | % of Total | SE Var | Z-Value | P-Value |
| --- | --- | --- | --- | --- | --- |
| time | 590.222222 | 91.43% | 497.088329 | 1.187259 | 0.118 |
| Error | 55.333333 | 8.57% | 31.946715 | 1.732051 | 0.042 |
| Total | 645.555556 | | | | |

##### Test of Fixed Effects

| Term | DF Num | DF Den | F-Value | P-Value |
| --- | --- | --- | --- | --- |
| Region | 2.00 | 6.00 | 7.88 | 0.021 |
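
We can see the non-independence directly in the data table earlier in this section. Correlating the three regions’ time 1 values with their time 2 values, a plain-Python sketch of the covariance idea the repeated measures ANOVA exploits, gives a correlation near 1:

```python
from math import sqrt

# Percent land for regions 0, 1, and 2 at time 1 and time 2 (from the data table).
time1 = [11, 26, 20]
time2 = [56, 83, 71]

n = len(time1)
m1, m2 = sum(time1) / n, sum(time2) / n
# Sample covariance of the paired measurements...
cov = sum((a - m1) * (b - m2) for a, b in zip(time1, time2)) / (n - 1)
# ...scaled to a correlation by both sample standard deviations.
sd1 = sqrt(sum((a - m1) ** 2 for a in time1) / (n - 1))
sd2 = sqrt(sum((b - m2) ** 2 for b in time2) / (n - 1))
r = cov / (sd1 * sd2)
print(round(r, 2))  # near 1: a region high at time 1 is also high at time 2
```

A correlation this strong is exactly the violation of independence a one-way ANOVA cannot handle and a repeated measures ANOVA models explicitly.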

While we have not gone into depth about repeated measures ANOVA, Jin now has an understanding of why it is slightly different from the one-way ANOVA, and he can proceed to interpret the significance of region on land set aside for recreational use over time.

# 11.4 - Non-Parametric

#### Case-Study: Fourth Article

Jin is surprised when reading his final article. The article compares two segments of town residents, senior citizens and teenagers, measuring the number of times they violate mandated recycling requirements. Jin has learned enough statistics to understand that comparing two independent groups on a quantitative response variable calls for a two-sample t test for independent groups; however, this article reports using a Mann-Whitney U test. Can we help Jin understand why the authors use a Mann-Whitney U?

Foundational Concepts: The key foundational concepts this article builds upon are:

• Two independent sample t test
• Normality assumption for the t test
• Normality and non-normality
• Median and mean

Reading further into Jin’s article, we see that there were only 10 responses in each group. When the researchers looked at a histogram of the responses, they found that the responses were not normally distributed.

Given the small number of responses and the non-normal distribution, Jin realizes that the normality assumption for the t test is not met. We can explain to Jin that the Mann-Whitney is a type of nonparametric test that does not require the assumption of normality. We also let Jin know that the Mann-Whitney is one of many nonparametric alternatives to the techniques in our introductory statistics class (such as Fisher’s exact test for the chi-square test, the Kruskal-Wallis test for ANOVA, and the one-sample Wilcoxon test for one-sample t tests).

The good news is that the interpretation of most of the nonparametric tests is very similar to that of the parametric tests. Significant differences exist when test statistics are large and p values are small. Jin just has to be careful in noting that the nonparametric tests typically refer to the median, not the mean as the measure of central tendency being compared.

##### Mann-Whitney: Teenager, Senior
###### Method

$$\eta_1 \colon$$ Median of Teenager

$$\eta_2 \colon$$ Median of Senior

Difference: $$\eta_1 - \eta_2$$

| Sample | N | Median |
| --- | --- | --- |
| Teenager | 10 | 5.5 |
| Senior | 10 | 11.5 |
##### Estimation for Difference

| Difference | CI for Difference | Achieved Confidence |
| --- | --- | --- |
| -5 | (-7, -2) | 95.48% |
##### Test

Null Hypothesis: $$H_0\colon \eta_1 -\eta_2 = 0$$

Alternate Hypothesis: $$H_1\colon \eta_1 -\eta_2 \ne 0$$

| Method | W-Value | P-Value |
| --- | --- | --- |
| Not adjusted for ties | 63.00 | 0.002 |
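
Although the article relies on software output like the table above, the Mann-Whitney statistic itself is simple enough to sketch by hand. The function below is plain Python, and the violation counts are invented for illustration (they are not the article’s raw data): it counts, over every cross-group pair, how often one group’s value exceeds the other’s.

```python
def mann_whitney_u(group_a, group_b):
    """U = number of (a, b) pairs with a > b, counting ties as half.
    If the two distributions are similar, U lands near len(a)*len(b)/2;
    a U far from that midpoint signals the groups are shifted apart."""
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1
            elif a == b:
                u += 0.5
    return u

# Hypothetical recycling-violation counts for 10 residents per group:
teenagers = [2, 3, 5, 5, 6, 7, 8, 9, 10, 12]
seniors   = [8, 9, 11, 11, 12, 13, 14, 15, 16, 18]

u = mann_whitney_u(teenagers, seniors)
print(u)  # far below the "no difference" midpoint of 10 * 10 / 2 = 50
```

Because the statistic compares ranks rather than means, it needs no normality assumption, which is exactly why the authors reached for it with small, skewed samples.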