Statistical Inference and Estimation

Review of Introductory Inference

Key Concepts:

Sampling distribution & Central Limit Theorem

Basic concepts of estimation:

Statistical model
Point estimation
Confidence intervals
Hypothesis testing

Review of Introductory Inference

t-test
Nonparametric alternative - sign test

Statistical Inference, Model & Estimation

Recall that statistical inference aims at learning characteristics of the population from a sample; the population characteristics are parameters and the sample characteristics are statistics.

A statistical model is a representation of the complex phenomenon that generated the data.

It has mathematical formulations that describe relationships between random variables and parameters.
It makes assumptions about the random variables, and sometimes parameters.
A general form: data = model + residuals
The model should explain most of the variation in the data.
Residuals represent lack of fit, that is, the portion of the data unexplained by the model.
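To make the decomposition data = model + residuals concrete, here is a minimal sketch using the simplest possible model, a single common mean. The five data values are made up for illustration and are not from the notes.

```python
import statistics

# Hypothetical measurements (illustrative values only, not from the notes).
data = [64.0, 67.5, 66.0, 70.5, 62.0]

# Simplest model: every observation is described by one common mean.
model_fit = statistics.mean(data)

# Residuals are the part of each observation the model leaves unexplained.
residuals = [y - model_fit for y in data]

# Adding the fitted value back to each residual recovers the data.
reconstructed = [model_fit + r for r in residuals]

print("fitted mean:", model_fit)
print("residuals:  ", residuals)
```

For a mean-only model the residuals always sum to zero, which is one quick check that the fit was computed correctly.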

Estimation is the process of learning about and determining the population parameter based on the model fitted to the data.

Point estimation, interval estimation, and hypothesis testing are the three main ways of learning about the population parameter from the sample statistic.

An estimator is a particular example of a statistic; it becomes an estimate when the formula is evaluated with the actual observed sample values.
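The distinction between an estimator and an estimate can be sketched in code: the estimator is the rule (a function), and the estimate is the number it returns for one observed sample. The function name `sample_mean` and the data values are hypothetical.

```python
def sample_mean(sample):
    """Estimator: a formula that maps any sample to a single number."""
    return sum(sample) / len(sample)

# Hypothetical observed sample (illustrative values only).
observed = [64.0, 67.5, 66.0, 70.5, 62.0]

# Estimate: the value obtained by applying the estimator to observed data.
estimate = sample_mean(observed)
print(estimate)
```

A different sample fed to the same estimator would generally give a different estimate.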

Point estimation = a single value that estimates the parameter. Point estimates are single values calculated from the sample.

Confidence intervals = a range of values for the parameter. Interval estimates are intervals within which the parameter is expected to fall, with a certain degree of confidence.

Hypothesis tests = tests of specific value(s) of the parameter.

In order to perform these inferential tasks, i.e., to make inferences about the unknown population parameter from the sample statistic, we need to know the likely values of the sample statistic. What would happen if we sampled many times?

We need the sampling distribution of the statistic.

It depends on the model assumptions about the population distribution, and/or on the sample size.
Standard error refers to the standard deviation of a sampling distribution.
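The question "what would happen if we sampled many times?" can be explored by simulation. The sketch below assumes a normal population with made-up parameters (`MU`, `SIGMA`) and an arbitrary sample size; it draws many samples, records each sample mean, and compares the standard deviation of those means with σ/√n.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

# Assumed population: normal with these hypothetical parameters.
MU, SIGMA = 66.0, 4.0
N = 54        # size of each sample (arbitrary choice)
REPS = 2000   # number of repeated samples

# Draw many samples and record the sample mean of each one.
sample_means = [
    statistics.fmean(random.gauss(MU, SIGMA) for _ in range(N))
    for _ in range(REPS)
]

# The spread of the sample means is the simulated standard error.
simulated_se = statistics.stdev(sample_means)
theoretical_se = SIGMA / N ** 0.5

print("simulated SE:  ", round(simulated_se, 3))
print("theoretical SE:", round(theoretical_se, 3))
```

The two values should be close, and they get closer as `REPS` grows.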

Height Example

We are interested in estimating the true average height of the student population at Penn State. We collect a simple random sample of 54 students. Here is a graphical summary of that sample.

Height example plot

The parameter of interest is the population mean height, μ.
The sample statistic, or point estimator, is \(\bar{X}\); the estimate in this example is 66.432.
What is the sampling distribution of \(\bar{X}\)?

Central Limit Theorem

Sampling distribution of the sample mean:

If numerous samples of size n are taken, the frequency curve of the sample means (the \(\bar{X}\)'s) from those samples is approximately bell-shaped, with mean μ and standard deviation (i.e., standard error) \(\sigma/\sqrt{n}\); that is, approximately \(\bar{X} \sim N(\mu , \sigma^2 / n)\).

Holds if:

X is normally distributed (the result is then exact for any n)
X is NOT normal, but n is large (e.g. n > 30) and μ and σ² are finite
The conditions above are stated for continuous variables.

For categorical data, the CLT holds for the sampling distribution of the sample proportion.
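As a rough check of the CLT for a non-normal population, the sketch below samples from an exponential distribution (strongly skewed, with mean 1 and standard deviation 1) and counts how often the sample mean falls within 2 standard errors of μ. The choice of distribution, sample size, and number of repetitions are all assumptions made for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is reproducible

N = 50             # sample size, "large" in the n > 30 sense
REPS = 4000        # number of repeated samples
MU = SIGMA = 1.0   # an exponential with rate 1 has mean 1 and SD 1

# Sample means from a skewed (non-normal) population.
means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(N))
    for _ in range(REPS)
]

# Under the normal approximation, about 95% of sample means
# should land within 2 standard errors of the population mean.
se = SIGMA / N ** 0.5
within_2se = sum(abs(m - MU) <= 2 * se for m in means) / REPS

print("share within 2 SE:", round(within_2se, 3))
```

Rerunning with a much smaller `N` (say 5) shows the approximation getting worse, which is why the large-n condition matters when X is not normal.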

Proportions in Newspapers

As reported by CNN in June 2006:

CNN survey

The parameter of interest in the population is the proportion of U.S. adults who disapprove of how Bush is handling Iraq, p.

The sample statistic, or point estimator, is \(\hat{p}\), and the estimate based on this sample is \(\hat{p}=0.62\).

Next question ...

CNN survey

If we take another poll, we are likely to get a different sample proportion, e.g. 60%, 59%, 67%, etc.

So, what is the 95% confidence interval? Based on the CLT, the approximate 95% CI is \(\hat{p}\pm 2 \ast \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\), where 2 is a rounding of the normal quantile 1.96.
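A minimal sketch of this interval in code follows. The notes give \(\hat{p}=0.62\) but do not state the poll's sample size, so `n = 1000` below is an assumed value used only to make the arithmetic concrete; `proportion_ci` is a hypothetical helper name.

```python
import math

def proportion_ci(p_hat, n, z=2.0):
    """Approximate CLT interval: p_hat ± z * sqrt(p_hat * (1 - p_hat) / n)."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

p_hat = 0.62  # estimate reported in the text
n = 1000      # ASSUMED sample size; the poll's actual n is not given here

low, high = proportion_ci(p_hat, n)
print(f"approximate 95% CI: ({low:.3f}, {high:.3f})")
```

Passing `z=1.96` instead of the rounded value 2 gives a slightly narrower interval.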

We often assume p = 1/2, a conservative choice because p(1 − p) is largest at p = 1/2, so \(\hat{p}\pm 2 \ast \sqrt{\frac{\frac{1}{2}\ast\frac{1}{2} }{n}}=\hat{p}\pm\frac{1}{\sqrt{n}}=\hat{p}\pm\text{MOE}\).

The margin of error (MOE) is 2 × the standard error of \(\hat{p}\), which equals \(1/\sqrt{n}\) when p = 1/2 is assumed.
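The sketch below compares the conservative margin of error \(1/\sqrt{n}\) with the plug-in version \(2\sqrt{\hat{p}(1-\hat{p})/n}\) for a few sample sizes. The sample sizes are arbitrary illustrations and the function names are hypothetical.

```python
import math

def conservative_moe(n):
    """Margin of error under the p = 1/2 assumption: 1 / sqrt(n)."""
    return 1 / math.sqrt(n)

def plug_in_moe(p_hat, n, z=2.0):
    """Margin of error using the observed proportion."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Arbitrary sample sizes, chosen only to show how the MOE shrinks with n.
for n in (100, 400, 1000, 2500):
    print(n, round(conservative_moe(n), 3), round(plug_in_moe(0.62, n), 3))
```

With the multiplier 2, the conservative value is never smaller than the plug-in value, and quadrupling n halves the margin of error.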