Statistics can be thought of as a whole subject or discipline. It can be thought of as the methods used to collect, process, and interpret data. It can be thought of as the collections of data gathered by those methods. It can also be thought of as specially calculated figures (e.g., averages) that characterize a collection of data.
Consider how the word statistics is used in the following paragraph to imply these different meanings of the word:
"A student in a class offered by the PSU Statistics department uses statistics (statistical methods) to interpret statistics (data) about the cost of a 1 bedroom apartment in State College, and he or she may summarize the findings by quoting a statistic such as the 'average price per 10 apartments' in various locations of State College."
Statistics is the science and art of making decisions based on quantitative evidence.
Almost all fields of study collect and interpret data. In statistics, variability is a key concept: statistics (and statisticians) recognize that not all things, people, units, etc. are exactly alike.
Most statisticians (and you) are involved in both descriptive and inferential statistics. The objective of descriptive statistical methods is to summarize a set of observations. The objective of inferential statistical methods is to make inferences (predictions, decisions) about a population based on information contained in a sample, and to quantify the level of uncertainty in those decisions.
Example
Question: What is the cost of a 1 bedroom apartment in State College?
Approach the problem: At the end of the spring semester you randomly picked 15 people who told you how much rent they pay for a 1 bedroom apartment. You also record the location of the apartment. Is this a survey, experiment, or ….?
Here is the data that was gathered:
\$280, \$320, \$320, \$330, \$340, \$370, \$370, \$375, \$380, \$380, \$380, \$390, \$420, \$420, \$430
Some descriptive statistics:
Mean: \$5505 / 15 = \$367, Standard deviation = 42.07986
Median: \$375, seven values above and below
Mode: \$380, it appears three times
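These summary statistics can be verified with a few lines of Python (a sketch using only the standard library; the rents are the 15 values listed above):

```python
import statistics

# Monthly rents for the 15 sampled 1 bedroom apartments
rents = [280, 320, 320, 330, 340, 370, 370, 375,
         380, 380, 380, 390, 420, 420, 430]

mean = statistics.mean(rents)      # 5505 / 15 = 367
median = statistics.median(rents)  # 8th ordered value = 375
mode = statistics.mode(rents)      # most frequent value = 380
sd = statistics.stdev(rents)       # sample standard deviation, about 42.08
```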
All important, yet often confused definitions:
A valid measurement actually measures what it claims to measure. A reliable measurement gives every experimenter the same result in successive trials. A biased measurement will be wrong in the same direction nearly every time. Variability is the difference in successive measurements of the same thing. Natural variability means that we are all different.
Observational units are entities whose characteristics we measure.

Population: the entire collection of units about which we would like information (e.g., all of the 1 bedroom apartments in State College). Sample: the collection of units we actually measure (e.g., 15 of the 1 bedroom apartments in State College). Parameter: the true value we hope to obtain (e.g., the true average cost of a 1 bedroom apartment in State College). Statistic: an estimate of the parameter based on observed information in the sample (e.g., the average price of the 15 sampled 1 bedroom apartments).
Parameters are generally unknown so we estimate them with sample statistics.
Random variables are characteristics of the observational units which can have different possible values (this is the practical, not the statistical definition) 
There are two different types of random variables:
Quantitative (numerical, measurement) variables represent an amount or quantity of something (e.g., time spent waiting for the bus). Qualitative (categorical) variables represent things that can be categorized (e.g., the colors of the cars that pass while you wait for the bus).
Letters like X or Y represent random variables when their values are not known before the experiment is run.
How about our example? The price of the apartment? Or, location of the apartment?
Discrete random variables can only take on values from a countable set of numbers, such as the integers or some subset of the integers (usually, they can't be fractions). Continuous random variables can take on any real number in some interval (they can be fractions).
Note: We consider variables like height to be continuous even though we can only measure them in discrete units (e.g. millimeters).
Nominal (unordered) random variables have categories where order doesn't matter. Ordinal (ordered) random variables have ordered categories (e.g., grade levels, income levels, school levels, ...).
The explanatory variable attempts to explain (or is purported to cause) differences in a response variable (or outcome variable), (e.g. homework scores and exam scores can be explanatory variables for the final grade).
But in order to make inferences from a sample to a population, the sample needs to be representative. How do we ensure that? RANDOMIZATION!!
Key concepts:
Randomization
Sampling 
Observational study
Randomized experiment 
When we want to learn the characteristics of a population, we can:
To ensure the sample is representative, that is, that we can learn the desired characteristics of the population from the sample, we need to use a random mechanism to select the sample.
So how can this be done?
Randomized Experiment: here we create differences in the explanatory variable and then examine the results:
 The investigator applies one or more manipulations (i.e., treatments) to the experimental subjects
 Subjects are randomly assigned to treatments
Observational Study: here we observe differences in the explanatory variables
 e.g. survey data
The KEY for both is Randomization! (In the 1 bedroom data example we did a kind of a survey.)
Simple random sampling
 Sample of size n from a population of size N
 Equal probability of selection
Stratified random sampling
 Select a random sample from each stratum
 e.g. proportional allocation
 Reduces error
Cluster random sampling
 Select a random sample from each cluster
 Reduces cost, but increases error
Systematic random sampling
 Select every kth unit after a random start
 Simple design and administration
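Three of these designs can be illustrated with a toy Python sketch (the population, strata labels, and sizes below are all hypothetical, chosen only for illustration):

```python
import random

random.seed(1)                         # for reproducibility
population = list(range(100))          # hypothetical population of 100 unit IDs

# Simple random sampling: every unit has equal probability of selection
srs = random.sample(population, 10)

# Stratified sampling: split into strata, then sample proportionally from each
strata = {"downtown": population[:40], "campus": population[40:]}
stratified = [u for units in strata.values()
              for u in random.sample(units, len(units) // 10)]

# Systematic sampling: random start, then every kth unit
k = 10
start = random.randrange(k)
systematic = population[start::k]
```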
You can make statements of causal inference from randomized experiments. Nowadays new statistical methods are being developed for making causal inference statements from observational studies too!
Major problem: Confounding
Right Now Exercise! Telephone Telepathy Yahoo news article [4]: Read the 1-page article above, then define the population, sample, observational unit, parameter, and statistic. Is this an observational study or an experiment? Why? What is the major finding?
Understand the problem
Identify the question
For more information on this research topic, visit the researchers' web site [5].
"A picture is worth a 1000 words!"
Key Concepts: Displaying data. Just as with Non-Graphical EDA, Graphical EDA focuses on the same four points. These are:

The distribution of a variable tells us what values the variable takes and how often each value occurs.
Quantitative: line graph across time. 1 variable: histograms, boxplots, stem-and-leaf plots, quantile-normal plot. 2 variables: scatterplots.
Categorical: 1 variable: pie charts, bar graphs. 2 or more variables: bar graphs, pictograms, contingency tables.
Categorical & quantitative: boxplot.

In well-designed displays, the data should clearly stand out. Graphs should show clear labeling indicating:
Every display should state the source of the data, and include as little extraneous material as possible.
The first boxplot looks like that for a normal distribution. The second shows skew to the left. The third has some outliers (unusual observations). 
The edges correspond to Q1 and Q3. The line in the middle represents the median. The ends of the "whiskers" indicate the MIN and MAX values, unless there are outliers. Stars represent the outliers (values more than 1.5 × IQR below Q1 or above Q3).
Stem-and-Leaf Plot (the decimal point is 2 digit(s) to the right of the |)


Histogram  
Boxplot


Questions:
Suppose we observed the height of twenty students to be:
60,68,69,64,68,67,68,69,77,69,69,72,69,65,65,68,64,71,74,74
The variable is height. The sample could be the 20 people or the 20 numbers, depending on your point of view. The sample size is n = 20. The mean and the median are both about 68.5 inches. The standard deviation is about 3.9 inches.
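These summaries can be reproduced with the standard library (a sketch using the twenty heights above):

```python
import statistics

heights = [60, 68, 69, 64, 68, 67, 68, 69, 77, 69,
           69, 72, 69, 65, 65, 68, 64, 71, 74, 74]

n = len(heights)                     # sample size, 20
mean = statistics.mean(heights)      # 1370 / 20 = 68.5
median = statistics.median(heights)  # average of 10th and 11th ordered values
sd = statistics.stdev(heights)       # about 3.9 inches
```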
A boxplot is another good way to look at the shape of a distribution.
The following is a frequency histogram for the height data.
A relative frequency histogram is similar but uses proportions instead of counts.
What do we get if we draw a smooth curve over our histogram? If your sample is large enough, a relative frequency histogram gives a rough indication of the characteristics of the whole population. The height of the curve is no longer a proportion or a frequency; instead, the curve must satisfy: the area under the curve equals 1.
Non-graphical exploratory data analysis is the first step when beginning to analyze your data as part of the general data analysis approach.
This preliminary data analysis step focuses on four points, i.e., four mechanisms that you will want to examine. These include:
The following pages in this overview review these topics.
Non-graphical exploratory data analysis may be followed by, or be engaged in concurrently with, graphical exploratory data analysis [10]. Graphical EDA involves visual exploratory analysis of the data.
Measures of central tendency are some of the most basic and useful statistical functions. They summarize a sample or population by a single typical value.
The two most commonly used measures of central tendency for numerical data are the mean and the median.
Mean: the average of all data points. Median: the data point where half of the data lies above it and half below it. Mode: the most common value in the data.
Synonyms: average, arithmetic mean.
Gives an “expected value” (not literally)
The sample mean, written as \( \bar{X}\), equals the sum of observations divided by the size of the sample.
\[\bar{X}=\frac{\sum_{i=1}^{n}X_i}{n}\]
The population mean, written as μ, is analogous to the sample mean, but for the whole population.
So for our 1 bedroom apartment data: \( \bar{X}\) = \$5505 / 15 = \$367
Also known as 50th percentile, the sample median is the middle number (or the arithmetic mean of the two middle numbers in the case of an even number of observations) when the observations are written out in order.
The population median is the 50th percentile in the whole population.
Order the values from the smallest to the largest, and find the median. For our 1 bedroom apartment example: \$375 is the median; there are seven values above and seven below this value.
DIFFERENCES The mean is somewhat more “mathematically tractable” (works better with some statistical procedures). The median is more resistant to outliers. The median and mean have slightly different interpretations. 
SIMILARITIES Both tell about where the “typical” or “central” value in a distribution is found. For a symmetric distribution such as the normal distribution, the mean and the median are the same number. 
Example of Resistance to Outliers
The mean of 3, 4, 6, 7, 8, 10, 15 is about 7.57.
The mean of 3, 4, 6, 7, 8, 10, 150 is about 26.86.
The median of either data set is 7.
Most statisticians would say that in this situation the median is the better measure of central tendency to use. (Of course, it is best to report both.)
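A quick check of this example in code:

```python
import statistics

a = [3, 4, 6, 7, 8, 10, 15]
b = [3, 4, 6, 7, 8, 10, 150]   # same data with one extreme outlier

# The mean shifts dramatically; the median does not move at all
mean_a, mean_b = statistics.mean(a), statistics.mean(b)
med_a, med_b = statistics.median(a), statistics.median(b)
```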
So, let's say that I tell you the average rent is \$367. Does that mean you should expect all 1 bedroom apartments to rent for close to \$367?
Synonyms: Dispersion, Spread
One of the basic themes of statistics. In general, not every member of a population or every outcome of a process has the same score on a variable of interest. Measures of central tendency like the mean are incomplete in themselves because they don’t tell how spread out the data is around the center. All statistical inference must take variability into account.
The (signed) deviation for an observation is the difference between that observation and the mean. That is, the deviation for the ith observation is: \( X_i - \bar{X} \).
The arithmetic mean of all of the deviations must be zero because they sum to zero. However, the arithmetic mean of the absolute values of the deviations (the Mean Absolute Deviation) or their median (the Median Absolute Deviation) can serve as a measure of variability.
For mathematical and historical reasons, however, we usually use the average of the squares of the deviations rather than their absolute values. Recall that a square is never negative.
For instance, in our 1 bedroom apartment example, the deviation of the 2nd value from the mean is: \$320 - \$367 = -\$47.
The sample variance, abbreviated s^{2}, is a commonly used measure of variability. It is approximately the mean of the squares of the deviations: \[ s^2= \text{Variance }=\frac{\sum_{i=1}^{n}(X_i-\bar{X})^2}{n}\text{ or } \frac{\sum_{i=1}^{n}(X_i-\bar{X})^2}{n-1}\] (The n - 1 divisor is the usual choice for a sample.)
The population variance, abbreviated σ^{2}, is similar but measures the variability of the whole population. So, σ^{2} would be the mean of the squared deviations of all the members of the whole population.
The sample variance for our 1 bedroom apartment example is 1770.714 (in squared dollars).
The standard deviation is the square root of the variance (to recover the original scale of measurement). The standard deviation of a sample is: \[ s=\sqrt{\text{Variance}}= \sqrt{\frac{\sum_{i=1}^{n}(X_i-\bar{X})^2}{n-1}} \]
The standard deviation of a population is written as σ. This can be thought of as roughly the average distance of the observed values from the mean.
The standard deviation for our 1 bedroom apartment example is \$42.07986
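The recipe (square the deviations, average with n - 1, take the square root) can be written out directly; a sketch that reproduces the values quoted above:

```python
import math

rents = [280, 320, 320, 330, 340, 370, 370, 375,
         380, 380, 380, 390, 420, 420, 430]

n = len(rents)
mean = sum(rents) / n                        # 367
sq_devs = [(x - mean) ** 2 for x in rents]   # squared deviations from the mean
variance = sum(sq_devs) / (n - 1)            # sample variance, about 1770.714
sd = math.sqrt(variance)                     # sample standard deviation
```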
Range:
The lowest and highest values
1 bedroom apartment example: \$280, \$430
IQR: InterQuartile Range
Interquartile range is the distance between the 75th and 25th percentile. It’s essentially the middle 50% of the data.
1 bedroom apartment example:
 0.25 × (15 + 1) = 4, so Q1 = the 4th ordered value = \$330
 0.75 × (15 + 1) = 12, so Q3 = the 12th ordered value = \$390
 IQR: \$390 - \$330 = \$60
Shape of distribution: Populations with the same mean and standard deviation can still have distributions with very different shapes.
Symmetry
 bell shaped or normal
 uniform
Skewness
 skewed to the right (skewed positively)
 skewed to the left (skewed negatively)
Modality  # of prominent peaks
 unimodal
 bimodal
Outliers
 they affect the mean!
A histogram provides a picture of the pattern of scores in a certain sample and gives us a good way to estimate the distribution of scores in the population.
Which distribution do you know that is symmetric?
An outlier is an observation that lies "far away" from other values in a random sample from a population. What constitutes "far away" is often up to the analyst, given what the "normal" observations look like and the context of the problem. Outliers may be an indicator of data errors or of rare events, and they should be investigated carefully to understand why they appear in our sample and whether or not they are influential. The analysis should be run with and without them, and if no reasonable explanation for their existence can be found before elimination, the results of both analyses should be reported.
While doing the EDA, a quick visual way to check for outliers in continuous data is via scatterplots and boxplots [10]. In particular, calculating the interquartile range (IQR) and using its multiples can help us define the outliers. The IQR = Q3 - Q1, where Q1 is the first quartile and Q3 the third quartile. The potential outliers lie outside the range of:
[Q1 - (1.5 × IQR), Q3 + (1.5 × IQR)],
and problematic outliers lie outside of:
[Q1 - (3 × IQR), Q3 + (3 × IQR)].
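For the apartment data, with Q1 = \$330 and Q3 = \$390 from the (n + 1) position rule used earlier, the fences can be computed as follows (a sketch):

```python
rents = sorted([280, 320, 320, 330, 340, 370, 370, 375,
                380, 380, 380, 390, 420, 420, 430])

n = len(rents)
q1 = rents[int(0.25 * (n + 1)) - 1]   # 4th ordered value = 330
q3 = rents[int(0.75 * (n + 1)) - 1]   # 12th ordered value = 390
iqr = q3 - q1                         # 60

potential = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # fences for potential outliers
problematic = (q1 - 3 * iqr, q3 + 3 * iqr)     # fences for problematic outliers

outliers = [x for x in rents if x < potential[0] or x > potential[1]]
```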
While fitting models such as linear regression, we also use the residuals plots and analysis to address potential influential points, e.g. leverage plots, Cook's distance, etc...
The following topics are reviewed in this section. This is a bit more technical review. You can go through this now, or proceed to Lesson 1 [11], and come back to these items as needed.

Use the links to online notes below to review these topics.
The basic problem we study in probability:
Given a data generating process, what are the properties of the outcomes?
The basic problem of statistical inference:
Given the outcomes, what can we say about the process that generated the data?
(ref: Wasserman(2004))
For example:
1) Given that we know (assume) that the IQ test is scored so that it has a normal distribution with the mean score of 100 and a standard deviation of 15, what percent of the population has a score higher than 130?
2) Given that we collect IQ score data on 1000 randomly chosen individuals, what is our best guess (estimate) of the mean and the variance of the normal distribution in the underlying population that we believe generated the data?
The sample space Ω is the set of possible outcomes of an experiment. Points ω in Ω are called sample outcomes, realizations, or elements. Subsets of Ω are called events.
An event is denoted by a capital letter near the beginning of the alphabet (A, B, . . .). The probability that A occurs is denoted by P(A).
If we toss a coin twice then Ω = {HH, HT, TH, TT}. The event that the first toss is heads is A = {HH, HT}.
Probability satisfies the following elementary properties, called axioms; all other properties can be derived from these.
More generally, if A and B are any events then
P(A or B) = P(A) + P(B) − P(A and B). (1)
If A and B are mutually exclusive, then P(A and B) = 0 and (1) reduces to axiom 3.
If B is known to have occurred, then this knowledge may affect the probability of another event A. The probability of A once B is known to have occurred is written P(A|B) and called "the conditional probability of A given B," or, more simply, "the probability of A given B." It is defined as
P(A|B) = P(A and B)/P(B) (2)
provided that P(B) ≠ 0.
The events A and B are said to be independent if
P(A and B) = P(A) P(B). (3)
By (2), this implies P(A|B) = P(A) and P(B|A) = P(B). Intuitively, independence means that knowing A has occurred provides no information about whether or not B has occurred, and vice versa.
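The coin-tossing example can be checked by enumerating the sample space (a sketch; outcomes assumed equally likely, and event B chosen here as "second toss is heads" for illustration):

```python
from fractions import Fraction

omega = {"HH", "HT", "TH", "TT"}   # sample space for two tosses

def prob(event):
    """P(event) under equally likely outcomes."""
    return Fraction(len(event & omega), len(omega))

A = {"HH", "HT"}   # first toss is heads
B = {"HH", "TH"}   # second toss is heads (hypothetical event for illustration)

p_a_and_b = prob(A & B)             # P(A and B) = 1/4
p_a_given_b = p_a_and_b / prob(B)   # definition (2): P(A|B) = 1/2

# Independence check, definition (3): P(A and B) == P(A) P(B)
independent = p_a_and_b == prob(A) * prob(B)
```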
A random variable is the outcome of an experiment (i.e. a random process) expressed as a number. We use capital letters near the end of the alphabet (X, Y , Z, etc.) to denote random variables. Random variables are of two types: discrete and continuous.
Continuous random variables are described by probability density functions (PDF). For example, a normally distributed random variable has a bell-shaped density function like this:
The probability that X falls between any two particular numbers, say a and b, is given by the area under the density curve f(x) between a and b,
\[ P (a \le X \le b) = \int_{a}^{b}f(x)dx\].
The two continuous random variables that we will use most will have either the Normal distribution or the χ^{2} (chi-squared) distribution. Areas under the normal and χ^{2} density functions for calculations of p-values are tabulated and widely available in textbooks. They can also be computed with statistical computer packages.
The Chi-Squared Distribution
The "degrees of freedom" (df) completely specify a chi-squared distribution. Here are the properties of a chi-squared random variable:
Here is a plot of different chi-squared distributions.
The plot was created in R using the code at [12]; in general, I find using R much simpler for creating plots like these. You can view a related image in Section 2.3.1 of Agresti (2007), and check out the Wiki [13] page on the chi-square distribution.
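The linked R code is not reproduced here; as an alternative sketch, the chi-squared density can be computed directly from its formula \( f(x; k) = x^{k/2-1}e^{-x/2}/(2^{k/2}\Gamma(k/2)) \) and then plotted with any graphics package (Python used here purely as an illustration):

```python
import math

def chi2_pdf(x, df):
    """Density of the chi-squared distribution with df degrees of freedom."""
    if x <= 0:
        return 0.0
    k = df / 2
    return x ** (k - 1) * math.exp(-x / 2) / (2 ** k * math.gamma(k))

# Density values at x = 2 for several degrees of freedom (one curve per df)
values = {df: chi2_pdf(2.0, df) for df in (1, 2, 4, 8)}
```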
Discrete random variables are described by probability mass functions (PMF), which we will also call “distributions.” For a random variable X, we will write the distribution as f(x) and define it to be:
\( f(x) = P(X = x)\).
In other words, f(x) is the probability that the random variable X takes the specific value x. For example, suppose that X takes the values 1, 2, and 5 with probabilities 1/4, 1/4, and 1/2 respectively. Then we would say that f(1) = 1/4, f(2) = 1/4, f(5) = 1/2, and f(x) = 0 for any x other than 1, 2, or 5:
\(f(x)= \begin{cases}
.25 & x=1, 2 \\
.50 & x=5 \\
0 & \text{otherwise}
\end{cases}\)
A graph of f(x) has spikes at the possible values of X, with the height of a spike indicating the probability associated with that particular value:
Note that Σ_{x} f(x) = 1 if the sum is taken over all values of x having nonzero probability. In other words, the sum of the heights of all the spikes must equal one.
Suppose that X_{1}, X_{2}, . . . , X_{n} are n random variables, and let X be the entire vector
X = (X_{1}, X_{2}, . . . , X_{n}).
Let x = (x_{1}, x_{2}, . . . , x_{n}) denote a particular value that X can take. The joint distribution of X is
f(x) = P(X = x) = P(X_{1} = x_{1}, X_{2} = x_{2}, . . . , X_{n} = x_{n}).
In particular, suppose that the random variables X_{1}, X_{2}, . . . , X_{n} are independent and identically distributed (iid). Then X_{1} = x_{1}, X_{2} = x_{2}, . . . , X_{n} = x_{n} are independent events, and the joint distribution is
\begin{align}
f(x) &= P(X_1=x_1, X_2=x_2, \ldots, X_n=x_n)\\
&= P(X_1=x_1) P(X_2=x_2) \ldots P(X_n=x_n)\\
&= f(x_1)f(x_2)\ldots f(x_n)\\
&= \prod\limits^n_{i=1}f(x_i)\\
\end{align}
where f(x_{i}) refers to the distribution of X_{i}.
The expectation (mean or the first moment) of a discrete random variable X is defined to be:
\[E(X)=\sum_{x}xf(x)\]
where the sum is taken over all possible values of X. E(X) is also called the mean of X or the average of X, because it represents the longrun average value if the experiment were repeated infinitely many times.
In the trivial example where X takes the values 1, 2, and 5 with probabilities 1/4, 1/4, and 1/2 respectively, the mean of X is
\(E(X) = 1(.25) + 2(.25) + 5(.5) = 3.25\).
In calculating expectations, it helps to visualize a table with two columns. The first column lists the possible values x of the random variable X, and the second column lists the probabilities f(x) associated with these values:
x    f(x)
1    .25
2    .25
5    .50
To calculate E(X) we merely multiply the two columns together, row by row, and add up the products: 1(.25) + 2(.25) + 5(.5) = 3.25.
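The same two-column, multiply-and-add recipe in code (using the pmf from the example):

```python
pmf = {1: 0.25, 2: 0.25, 5: 0.50}   # f(x) for the example

# E(X) = sum of x * f(x) over all values with nonzero probability
expectation = sum(x * p for x, p in pmf.items())   # 3.25
```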
If g(X) is a function of X (e.g. g(X) = log X, g(X) = X^{2}, etc.) then g(X) is also a random variable. Its expectation is
\(E(g(X))=\sum_{x}g(x)f(x)\) (4)
Visually, in the table containing x and f(x), we can simply insert a third column for g(x) and add up the products g(x)f(x). In our example, if Y = g(X) = X^{3}, the table becomes

x    f(x)    g(x) = x^{3}
1    .25     1^{3} = 1
2    .25     2^{3} = 8
5    .50     5^{3} = 125
and
\(E( Y ) = E(X^3) = 1(.25) + 8(.25) + 125(.5) = 64.75\).
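Expression (4) is the same computation with a g(x) column added; a sketch:

```python
pmf = {1: 0.25, 2: 0.25, 5: 0.50}

def expect(g, pmf):
    """E(g(X)) = sum over x of g(x) f(x), expression (4)."""
    return sum(g(x) * p for x, p in pmf.items())

e_x = expect(lambda x: x, pmf)        # E(X) = 3.25
e_x3 = expect(lambda x: x ** 3, pmf)  # E(X^3) = 64.75
```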
If Y = g(X) = a + bX where a and b are constants, then Y is said to be a linear function of X, and E( Y ) = a + bE(X). An algebraic proof is
\begin{align}
E(Y)&=\sum\limits_y yf(y)\\
&= \sum\limits_x (a+bx)f(x)\\
&= \sum\limits_x af(x)+\sum\limits_x bxf(x)\\
&= a\sum\limits_x f(x)+b\sum\limits_x xf(x)\\
&= a\cdot1+bE(X)\\
\end{align}
That is, if g(X) is linear, then E(g(X)) = g(E(X)). Note, however, that this does not work if the function g is nonlinear. For example, E(X^{2}) is not equal to E(X)^{2}, and E(logX) is not equal to logE(X). To calculate E(X^{2}) or E(logX), we need to use expression (4).
The variance of a discrete random variable, denoted by V (X), is defined to be
\begin{align}
V(X)&= E((XE(X))^2)\\
&= \sum\limits_x (xE(X))^2 f(x)\\
\end{align}
That is, V (X) is the average squared distance between X and its mean. Variance is a measure of dispersion, telling us how “spread out” a distribution is. For our simple random variable, the variance is
\(V (X) = (1− 3.25)^2 (.25) + (2 − 3.25)^2 (.25) + (5 − 3.25)^2 (.50) = 3.1875\).
A slightly easier way to calculate the variance is to use the wellknown identity
\(V (X) = E(X^2) − (E(X) )^2\).
Visually, this method requires a table with three columns: x, f(x), and x^{2}.
x    f(x)    x^{2}
1    .25     1^{2} = 1
2    .25     2^{2} = 4
5    .50     5^{2} = 25

First we calculate
E(X) = 1(.25) + 2(.25) + 5(.50) = 3.25 and
E(X^{2}) = 1(.25) + 4(.25) + 25(.50) = 13.75. Then
V (X) = 13.75 − (3.25)^{2} = 3.1875.
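Both routes to the variance can be checked numerically (same pmf as above):

```python
pmf = {1: 0.25, 2: 0.25, 5: 0.50}

e_x = sum(x * p for x, p in pmf.items())        # E(X) = 3.25
e_x2 = sum(x ** 2 * p for x, p in pmf.items())  # E(X^2) = 13.75

# Definition: average squared distance from the mean
var_def = sum((x - e_x) ** 2 * p for x, p in pmf.items())
# Identity: V(X) = E(X^2) - (E(X))^2
var_identity = e_x2 - e_x ** 2
```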
It can be shown that if a and b are constants, then
\(V (a + bX) = b^2V (X)\).
In other words, adding a constant a to a random variable does not change its variance, and multiplying a random variable by a constant b causes the variance to be multiplied by b^{2}.
Another common measure of dispersion is the standard deviation, which is merely the positive square root of the variance,
\(SD(X) = \sqrt{V(X)}\)
Expectation is always additive; that is, if X and Y are any random variables, then
\(E(X + Y ) = E(X) + E( Y )\).
If X and Y are independent random variables, then their variances will also add:
\(V (X + Y) = V (X) + V ( Y )\) if X, Y independent.
More generally, if X and Y are any random variables, then
\(V (X + Y) = V (X) + V ( Y ) + 2Cov(X, Y )\)
where Cov(X, Y ) is the covariance between X and Y,
\(Cov(X, Y ) = E( (X− E(X)) ( Y − E( Y )) )\).
If X and Y are independent (or merely uncorrelated) then Cov(X, Y ) = 0. This additive rule for variances extends to three or more random variables; e.g.,
\(V (X + Y + Z) = V (X) + V ( Y ) + V (Z) +2Cov(X, Y ) + 2Cov(X, Z) + 2Cov(Y, Z)\)
with all covariances equal to zero if X, Y , and Z are mutually uncorrelated.
Key Concepts: Sampling distribution & Central Limit Theorem Basic concepts of estimation:
Review of Introductory Inference

Recall, a statistical inference aims at learning characteristics of the population from a sample; the population characteristics are parameters and sample characteristics are statistics.
A statistical model is a representation of the complex phenomenon that generated the data.
Estimation represents ways or a process of learning and determining the population parameter based on the model fitted to the data.
Point estimation, interval estimation, and hypothesis testing are the three main ways of learning about the population parameter from the sample statistic.
An estimator is a particular example of a statistic; it becomes an estimate when the formula is evaluated with actual observed sample values.
Point estimation = a single value that estimates the parameter. Point estimates are single values calculated from the sample.
Confidence intervals = a range of values for the parameter. Interval estimates are intervals within which the parameter is expected to fall, with a certain degree of confidence.
Hypothesis tests = tests for specific value(s) of the parameter.
In order to perform these inferential tasks, i.e., make inference about the unknown population parameter from the sample statistic, we need to know the likely values of the sample statistic. What would happen if we do sampling many times?
We need the sampling distribution of the statistic
Height Example: We are interested in estimating the true average height of the student population at Penn State. We collect a simple random sample of 54 students. Here is a graphical summary of that sample.

Sampling distribution of the sample mean:
If numerous samples of size n are taken, the frequency curve of the sample means (\(\bar{X}\)'s) from those various samples is approximately bell shaped, with mean μ and standard deviation \(\sigma/\sqrt{n}\) (the standard error); i.e., \(\bar{X} \sim N(\mu , \sigma^2 / n)\).
Holds if:
For categorical data, the CLT holds for the sampling distribution of the sample proportion.
As found in CNN in June, 2006:
The parameter of interest in the population is the proportion of U.S. adults who disapprove of how well Bush is handling Iraq, p.
The sample statistic, or point estimator is \(\hat{p}\), and an estimate, based on this sample is \(\hat{p}=0.62\).
Next question ...
If we take another poll, we are likely to get a different sample proportion, e.g. 60%, 59%,67%, etc..
So, what is the 95% confidence interval? Based on the CLT, the 95% CI is \(\hat{p}\pm 2 \ast \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\).
We often assume p = 1/2 so \(\hat{p}\pm 2 \ast \sqrt{\frac{\frac{1}{2}\ast\frac{1}{2} }{n}}=\hat{p}\pm\frac{1}{\sqrt{n}}=\hat{p}\pm\text{MOE}\).
The margin of error (MOE) is 2 × the standard error, or \(1/\sqrt{n}\) when p = 1/2 is assumed.
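To make the arithmetic concrete, here is a sketch with a hypothetical poll size of n = 1111 (chosen for illustration because 1/√1111 ≈ 0.03):

```python
import math

p_hat = 0.62   # sample proportion from the poll
n = 1111       # hypothetical sample size, for illustration only

moe = 1 / math.sqrt(n)            # conservative margin of error, about 0.03
ci = (p_hat - moe, p_hat + moe)   # approximate 95% CI: p_hat +/- MOE
```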
A statistical model is a representation of the complex phenomenon that generated the data.
In models, the focus is on estimating the model parameters. The basic inference tools (e.g., point estimation, hypothesis testing, and confidence intervals) will be applied to these parameters. When discussing models, we will keep in mind the following parts:
State what the objective is for this model. For instance, "Estimate the probability that a characteristic is present given the value of the explanatory values are ... "
State the important variables in the model. What is the response variable Y? What is included in the set of explanatory variables?
State the Equation for the Model.
Model Assumptions
State the assumptions for the model that you are using. Are the data independently distributed? Do linear relationships exist between the dependent and independent variables? Is the variance homogeneous? Are errors independent and normally distributed?
What are the odds that the objective characteristic is present? What evidence will you use to establish the estimate?
How is goodness of fit determined? Pearson chisquare statistic? Deviance? Likelihood ratio test? What does the analysis of the residuals show?
Choosing a single "best" model when more than one reasonable model is present involves some subjective judgment. We seek a parsimonious model that is as simple as possible and adequately explains the phenomena of interest.
How can we fit a particular family of models in SAS and evaluate different parts of the output?
Key Concepts:
A confidence interval is an interval estimate: an interval within which the parameter is expected to fall, with a certain degree of confidence.
The general form:
For example:
The CIs differ based on:
In most general terms, for a 95% CI, we say “we are 95% confident that the true population parameter is between the lower and upper calculated values”.
A 95% CI for a population parameter DOES NOT mean that the interval has a probability of 0.95 that the true value of the parameter falls in the interval.
The CI either contains the parameter or it does not contain it.
The probability is associated with the process that generated the interval. And if we repeat this process many times, 95% of all intervals should in fact contain the true value of the parameter.
What does a 99% CI say?
Would you choose a 99% or 95% CI, and why?
Tradeoffs
We want confidence coefficient to be closer to 1.
We want the sample size to be as small as possible (but not too small). This is a practical issue.
We want the CI to be as narrow as possible
For an unknown population mean, and a known variance:
The critical value \(z_{\alpha/2}\) is the multiplier for a (1 − α) × 100% confidence interval.
For a 95% CI, α = 0.05, so we need the z-value of the standard normal that leaves 0.025 in the upper tail, that is z = 1.96. For any probability value (1 − α) there is a number \(z_{\alpha/2}\) such that any normal distribution has probability (1 − α) within \(z_{\alpha/2}\) standard deviations of the mean. Assuming that σ is known, the multiplier for a (1 − α) × 100% confidence interval is the (1 − ½α) × 100th percentile of the standard normal distribution.
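Assuming σ is known, the multiplier is just a standard normal percentile; a sketch using Python's statistics.NormalDist (available in Python 3.8+):

```python
from statistics import NormalDist

def z_multiplier(conf_level):
    """Critical value z_{alpha/2} for a conf_level * 100% confidence interval."""
    alpha = 1 - conf_level
    return NormalDist().inv_cdf(1 - alpha / 2)

z95 = z_multiplier(0.95)   # about 1.96
z99 = z_multiplier(0.99)   # about 2.58, giving a wider interval
```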
Height Example

As found in CNN in June, 2006 (here) [14]:
Next question ...
The stated margin of error: ±3%
Therefore, this would be the confidence interval: 62% ± 3%
We can be really confident that between 59% and 65% of all U.S. adults disapprove of how President Bush is handling the situation in Iraq.
Key Topics:
One-sample z-test

It is either likely or unlikely that we would collect the evidence we did given the initial assumption. (Note: “likely” or “unlikely” is measured by calculating a probability!)
If it is likely, then we “do not reject” our initial assumption. There is not enough evidence to do otherwise.
If it is unlikely, then:
In statistics, if it is unlikely, we decide to “reject” our initial assumption.
First, state two hypotheses: the null hypothesis ("H_{0}") and the alternative hypothesis ("H_{A}").
Usually H_{0} is a statement of "no effect", "no change", or "chance only" about a population parameter.
H_{A}, depending on the situation, states that there is a difference, trend, effect, or relationship with respect to a population parameter.
Then, collect evidence, such as finger prints, blood spots, hair samples, carpet fibers, shoe prints, ransom notes, handwriting samples, etc. (In statistics, the data are the evidence.)
Next, you make your initial assumption.
In statistics, we always assume the null hypothesis is true.
Then, make a decision based on the available evidence.
If the observed outcome, e.g., a sample statistic, is surprising under the assumption that the null hypothesis is true, but more probable if the alternative is true, then this outcome is evidence against H_{0} and in favor of H_{A}.
An observed effect so large that it would rarely occur by chance is called statistically significant (i.e., not likely to happen by chance).
The p-value represents how likely we would be to observe such an extreme sample if the null hypothesis were true. The p-value is a probability, computed assuming the null hypothesis is true, that the test statistic would take a value as extreme or more extreme than that actually observed. Since it is a probability, it is a number between 0 and 1; the closer it is to 0, the more "unlikely" the observed event. So if the p-value is "small" (typically, less than 0.05), we can reject the null hypothesis.
The significance level, α, is a decisive value for the p-value. In this context, significant does not mean "important"; it means "not likely to have happened just by chance".
α is the maximum probability of rejecting the null hypothesis when the null hypothesis is true. If α = 1 we always reject the null hypothesis; if α = 0 we never reject it. In articles, journals, etc., you may read: "The results were significant (p < 0.05)." So if p = 0.03, it is significant at the level α = 0.05 but not at the level α = 0.01. If we reject H_{0} at the level α = 0.05 (which corresponds to a 95% CI), we are saying that if H_{0} is true, the observed phenomenon would happen no more than 5% of the time (that is, 1 in 20). If we choose to compare the p-value to α = 0.01, we are insisting on stronger evidence!
Very Important Point! Neither rejecting nor failing to reject H_{0} proves the null or the alternative hypothesis. We merely state that there is enough evidence to behave one way or the other. This is always true in statistics!
So, what kind of error could we make? No matter what decision we make, there is always a chance we made an error.
Errors in Criminal Trial:
                    Truth
Jury Decision    Not Guilty    Guilty
Not Guilty       OK            ERROR
Guilty           ERROR         OK
Errors in Hypothesis Testing
Type I error (False positive): The null hypothesis is rejected when it is true.
Type II error (False negative): The null hypothesis is not rejected when it is false.
There is always a chance of making one of these errors. But, a good scientific study will minimize the chance of doing so!
                            Truth
Decision                 Null Hypothesis    Alternative Hypothesis
Null Hypothesis          OK                 TYPE II ERROR
Alternative Hypothesis   TYPE I ERROR       OK
The power of a statistical test is its probability of rejecting the null hypothesis if the null hypothesis is false. That is, power is the ability to correctly reject H_{0} and detect a significant effect. In other words, power is one minus the type II error risk.
\(\text{Power} = 1 - \beta = P\left(\text{reject } H_0 \mid H_0 \text{ is false}\right)\)
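For the two-sided one-sample z-test with σ known, power can be computed directly from the normal CDF. The sketch below uses the height-example setup (μ_{0} = 65, σ = 3, n = 54); the true mean of 66 inches is a hypothetical value chosen for illustration.

```python
from statistics import NormalDist

norm = NormalDist()

def ztest_power(mu0, mu_true, sigma, n, alpha=0.05):
    """Power of a two-sided one-sample z-test (sigma known)."""
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # standardized distance between the true mean and the hypothesized mean
    shift = abs(mu_true - mu0) / (sigma / n ** 0.5)
    # P(reject H0 | true mean = mu_true): probability the z-statistic
    # lands in either rejection tail
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

# hypothetical true mean of 66 inches, echoing the height example otherwise
print(round(ztest_power(65, 66, 3, 54), 3))
```

Raising n or α increases the power; a smaller true effect lowers it.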
Which error is worse?
Type I = you are innocent, yet accused of cheating on the test.
Type II = you cheated on the test, but you are found innocent.
This depends on the context of the problem, too. But in most cases scientists try to be "conservative": it is worse to make a spurious discovery than to fail to make a good one. Our goal is to increase the power of the test, that is, to minimize the length of the CI.
We need to keep in mind:
(see the handout). To study the tradeoffs between the sample size, α, and Type II error we can use power and operating characteristic curves.
Height Example: One-sample z-test
Assume the data are independently sampled from a normal distribution with unknown mean μ and known variance σ^{2} = 9. Make an initial assumption that μ = 65.
Specify the hypotheses: H_{0}: μ = 65 vs. H_{A}: μ ≠ 65.
z-statistic: 3.58; the z-statistic follows the N(0,1) distribution.
The p-value, < 0.0001, indicates that, if the average height in the population is 65 inches, it is unlikely that a sample of 54 students would have an average height of 66.4630.
α = 0.05. Decision: p-value < α, thus reject the null hypothesis. Conclude that the average height is not equal to 65 inches.
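The z-statistic in this example can be reproduced with a few lines of Python (a sketch using only the standard library and the numbers given above):

```python
from math import sqrt
from statistics import NormalDist

# Height example: n = 54, sample mean 66.4630, known sigma = 3 (sigma^2 = 9)
n, xbar, sigma, mu0 = 54, 66.4630, 3.0, 65.0

# one-sample z-statistic: (xbar - mu0) / (sigma / sqrt(n))
z = (xbar - mu0) / (sigma / sqrt(n))
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided p-value

print(round(z, 2))      # about 3.58, matching the notes
print(p_value < 0.05)   # reject H0 at alpha = 0.05
```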
What type of error might we have made?
Type I error is claiming that average student height is not 65 inches, when it really is.
Type II error is failing to claim that the average student height is not 65 inches, when it really is not.
We rejected the null hypothesis, i.e., claimed that the height is not 65, thus potentially making a Type I error. But sometimes the p-value is very low simply because of a large sample size, and we may have statistical significance but not practical significance! That is why most statisticians are much more comfortable using CIs than tests.
Height Example: Graphical summary of the z-test
Based on the CI only, how do you know that you should reject the null hypothesis? The 95% CI is (65.6628, 67.2631) ... What about practical and statistical significance now? Is there another reason to suspect this test, and the p-value calculations?
There is a need for a further generalization. What if we can't assume that σ is known? In this case we would use s (the sample standard deviation) to estimate σ.
If the sample is very large, we can treat σ as known by assuming that σ = s. According to the law of large numbers, this is not too bad a thing to do. But if the sample is small, the fact that we have to estimate both the standard deviation and the mean adds extra uncertainty to our inference. In practice this means that we need a larger multiplier for the standard error.
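The effect of that larger multiplier can be seen by comparing the 95% z multiplier with tabulated t critical values (a sketch; the t values are standard table entries for the 97.5th percentile, not computed from these notes):

```python
from statistics import NormalDist

z_mult = NormalDist().inv_cdf(0.975)   # 1.96, the 95% z multiplier

# standard tabulated t_{0.975, df} critical values
t_mult = {4: 2.776, 9: 2.262, 14: 2.145, 29: 2.045}

for df, t in sorted(t_mult.items()):
    # small samples demand a larger multiplier; the gap shrinks as n grows
    print(f"n = {df + 1:2d}: t = {t:.3f}  vs  z = {z_mult:.3f}")
```

As n grows, the t multiplier approaches the z multiplier, which is why treating σ = s is reasonable for very large samples.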
We need a one-sample t-test.
We test one of the following pairs of hypotheses:
H_{0}: μ = μ_{0} vs. H_{A}: μ ≠ μ_{0}
H_{0}: μ ≤ μ_{0} vs. H_{A}: μ > μ_{0}
H_{0}: μ ≥ μ_{0} vs. H_{A}: μ < μ_{0}
Let's go back to our CNN poll. Assume we have a SRS of 1,017 adults.
We are interested in testing the following hypotheses: H_{0}: p = 0.50 vs. H_{A}: p > 0.50
What is the test statistic?
If alpha = 0.05, what do we conclude?
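One way to check your answers to these questions (a sketch, assuming the usual large-sample z-test for a proportion, with the standard error evaluated under H_{0}):

```python
from math import sqrt
from statistics import NormalDist

# CNN poll: 62% of an SRS of 1,017 adults; test H0: p = 0.50 vs HA: p > 0.50
phat, p0, n = 0.62, 0.50, 1017

# z-statistic for a proportion; the standard error uses p0 under H0
z = (phat - p0) / sqrt(p0 * (1 - p0) / n)
p_value = 1 - NormalDist().cdf(z)   # one-sided (upper-tail) p-value

print(round(z, 2))
print(p_value < 0.05)   # reject H0 at alpha = 0.05
```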
We will see more details in the next lesson on proportions, then distributions, and possible tests.
Links:
[1] https://onlinecourses.science.psu.edu/stat504/../node/7
[2] https://onlinecourses.science.psu.edu/stat504/../node/8
[3] https://onlinecourses.science.psu.edu/stat504/../node/6
[4] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson01/telephone_telepathy.pdf
[5] https://www.sheldrake.org/homepage.html
[6] https://onlinecourses.science.psu.edu/stat504/../node/12
[7] https://onlinecourses.science.psu.edu/stat504/../node/13
[8] https://onlinecourses.science.psu.edu/stat504/../node/14
[9] https://onlinecourses.science.psu.edu/stat504/../node/15
[10] https://onlinecourses.science.psu.edu/stat504/../node/10
[11] https://onlinecourses.science.psu.edu/stat504/node/4
[12] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson01/chisqdistributions.R
[13] https://en.wikipedia.org/wiki/Chisquare_distribution
[14] https://www.cnn.com/interactive/allpolitics/0606/poll.iraq2/frameset.exclude.html