# 1: Introduction to Discrete Data

## Overview

This lesson serves as an introduction to discrete data and many of the popular distributions used to describe it. Usually, we associate discrete data with qualitative characteristics, but as we'll see, ordered or even numerically meaningful categories can also be considered discrete. Numerical summaries and visual displays can likewise be constructed to reflect these properties.

Among the many distributions used for describing discrete data, we focus here mainly on the binomial distribution, which applies to data with exactly two outcomes, and introduce other discrete distributions in relation to the binomial. If the binomial doesn't apply for a particular reason, what might a suitable alternative be?

Finally, we introduce the likelihood function and show how it can be used to estimate a population parameter. Discrete data lends itself particularly well to this concept because the likelihood can be interpreted as the probability of observing our data for a given value of the parameter. The intuitive estimate for the parameter is then the value that maximizes this probability.


# 1.1 - Types of Discrete Data

Discrete data is often referred to as categorical data because of the way observations can be collected into categories. Variables producing such data can be of any of the following types:

- **Nominal** (e.g., gender, ethnic background, religious or political affiliation)
- **Ordinal** (e.g., extent of agreement, school letter grades)
- **Quantitative** variables with relatively few values (e.g., number of times married)

Technically, a quantitative variable may take on any number of values and still be considered discrete, but it needs to be "countable". So, for example, the number of traffic accidents in a given time period may be considered discrete, but the amount of time between two consecutive accidents would be considered continuous. However, even a continuous variable may be used to produce discrete data if its range is divided or "coarsened" into intervals.
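As a small R sketch of this coarsening (using hypothetical waiting times), the `cut()` function divides a continuous range into intervals:

```
# hypothetical waiting times (days) between consecutive accidents
times = c(0.5, 2.1, 3.7, 0.2, 8.4, 1.1, 5.6, 12.3, 0.9, 4.2)
# coarsen the continuous variable into three ordered intervals
groups = cut(times, breaks=c(0, 1, 5, Inf),
    labels=c("1 day or less", "1 to 5 days", "over 5 days"))
table(groups)  # discrete frequency counts: 3, 4, 3
```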

Note that many variables can be considered as either nominal or ordinal, depending on the purpose of the analysis. Consider majors in English, psychology, and computer science. This classification may be considered nominal or ordinal, depending on whether there is an intrinsic belief that it is "better" to have a major in computer science than in psychology or in English. Generally speaking, for a binary variable like pass/fail, the nominal/ordinal distinction does not matter.

It should also be noted that numerically meaningful variables can be associated with any of the data types above, even the nominal type. For example, the gender categories of "man" and "woman" would themselves not be numerically meaningful, but if we let \(X\) be the number of men in a random sample, that would be considered a quantitative (random) variable.

Context is important! The context of the study and the relevant questions of interest are important in specifying what kind of variable we will analyze.

#### Examples

- Did you get the flu? (Yes or No): a binary nominal categorical variable
- What was the severity of your flu? (Low, Medium, or High): an ordinal categorical variable

## Measurement Hierarchy

The main distinction between nominal and ordinal data is that the latter has a natural ordering (least to greatest, best to worst, etc.), whereas the former does not. If the ordered characteristic is ignored, however, ordinal data could be considered a special case of nominal data. Similarly, discrete quantitative data could be considered a special case of ordinal data, with the additional characteristic that values have numerical meaning. So, computations like differences and averages make sense. Thus, the hierarchy is

nominal < ordinal < quantitative

In terms of analyses, methods applicable for one type of variable can be used for the variables at higher levels too (but not at lower levels). For example, methods designed for nominal data can be used for ordinal data but not vice versa. However, keep in mind that an analysis method may not be optimal if it ignores information available in the data.

One final note on the organization of these types: quantitative variables may be further divided into "interval" and "ratio" types, depending on whether operations of subtraction and division make sense, but we will rarely need to make such a distinction in this course.

## Frequency Counts

Though often not numerically meaningful in its original form, discrete data can be summarized with the frequency counts of individuals falling in the categories. If more than one variable is involved, counts can be measured either jointly or marginally for one variable by summing over categories of the other variable. Here are some examples.

## Example: Eye Color

This is a typical frequency table for a single categorical variable. A sample of *n* = 96 persons is obtained, and the eye color of each person is recorded. The table then summarizes the responses by their frequencies.

Eye color | Count |
---|---|
Brown | 46 |
Blue | 22 |
Green | 26 |
Other | 2 |
Total | 96 |

Eye color here is a **nominal variable**.
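A table like this is easy to reproduce in R; here the 96 individual responses are reconstructed from the counts above:

```
# rebuild the individual responses from the tabulated counts
eye = rep(c("Brown", "Blue", "Green", "Other"), times=c(46, 22, 26, 2))
eye.tab = table(eye)
eye.tab              # frequency counts (alphabetical by default)
prop.table(eye.tab)  # relative frequencies
```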

## Example: Admissions Data

A university offers only two degree programs: English and computer science. Admission is competitive, and there is suspicion of discrimination against women in the admission process. Here is a two-way table of counts of all applicants by sex and admission status. These data can be used to measure the association between the sex of the applicants and their success in obtaining admission.

 | Admit | Deny | Total |
---|---|---|---|
Male | 35 | 45 | 80 |
Female | 20 | 40 | 60 |
Total | 55 | 85 | 140 |
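In R, the same two-way table can be entered directly and summarized with marginal totals and row proportions:

```
# admissions counts as a two-way table
adm = matrix(c(35, 45, 20, 40), nrow=2, byrow=TRUE,
    dimnames=list(Sex=c("Male", "Female"), Status=c("Admit", "Deny")))
addmargins(adm)            # joint counts with marginal totals
prop.table(adm, margin=1)  # admission rates within each sex
```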

## Example: Attitudes Towards War

Hypothetical attitudes of *n* = 116 people towards war. They were asked to state their opinion on a four-point scale regarding the statement: "This is a necessary war".

Attitude | Count |
---|---|
Strongly disagree | 35 |
Disagree | 27 |
Agree | 23 |
Strongly agree | 31 |
Total | 116 |

Attitude here is an **ordinal** variable.

## Example: Attitudes Towards War (cont.)

Working from the example above, suppose now that in addition to the four ordered categories, outcomes where the person wasn't sure or refused to answer were also recorded, giving *n* = 130 total counts divided up as follows.

Attitude | Count |
---|---|
Strongly disagree | 35 |
Disagree | 27 |
Agree | 23 |
Strongly agree | 31 |
Not sure | 6 |
Refusal | 8 |
Total | 130 |

With these additional categories, the attitude variable is only **partially ordered**.

## Example: Dice Rolls

Suppose a six-sided die is rolled 30 times, and the die face that comes up is recorded. One possible set of outcomes is tabulated below.

Face | Count |
---|---|
1 | 3 |
2 | 7 |
3 | 5 |
4 | 10 |
5 | 2 |
6 | 3 |
Total | 30 |

The die face may be considered **nominal** or **ordinal**, depending on the context.

## Example: Number of Children in Families

Here's an example where the response categories are numerically meaningful: the number of children in *n* = 100 randomly selected families.

Number of children | Count |
---|---|
0 | 19 |
1 | 26 |
2 | 29 |
3 | 13 |
4-5 | 11 |
6+ | 2 |
Total | 100 |

The grouped categories "4-5" and "6+" make this an example of **coarsened numeric data**.

## Example: Household Incomes

The variable in this example is total gross income, recorded for a sample of *n* = 100 households.

Income | Count |
---|---|
below \$10,000 | 11 |
\$10,000–\$24,999 | 23 |
\$25,000–\$39,999 | 30 |
\$40,000–\$59,999 | 24 |
\$60,000 and above | 12 |
Total | 100 |

The original data (raw incomes) were essentially continuous, but any type of data, continuous or discrete, can be grouped or coarsened into categories.

Grouping data will typically result in some loss of information. How much information is lost depends on

- the number of categories and
- the question being addressed.

In this example, grouping has somewhat diminished our ability to estimate the mean or median household income. Our ability to estimate the proportion of households with incomes below \$10,000 has not been affected, but estimating the proportion of households with incomes above \$75,000 is now virtually impossible.

# 1.2 - Graphical Displays for Discrete Data

In the examples below, political party, sex, and general happiness are selected variables from the 2018 General Social Survey. Some of the original response categories were omitted or combined to simplify the interpretations; details are in the R code below.

## Bar Plots

Not to be confused with a histogram, a **bar plot** is used for discrete or categorical data. Because the data are not continuous, bar plots are typically displayed with gaps between columns, unless certain groupings are to be emphasized. The height of each column can represent either a frequency count or a proportion for the corresponding category.

For nominal variables, such as *Party ID* and *Sex*, a simple bar plot is an effective way to illustrate the relative sizes of categories.

When plotting two variables together, one can be displayed in more of an explanatory role. Notice the difference in the way the following two plots present the same data. The first illustrates the distribution of *Sex* for each *Party ID* category, which puts *Party ID* in more of the explanatory role; the second reverses these roles.

In most software packages, the default ordering for bar plot categories is alphabetical, which is usually fine for nominal data, but we can (and should) change the order to better represent ordinal data. In the plot below, categories for *Happy* are sorted from least happiness to greatest happiness.

## Mosaic Plots

A visual display particularly well-suited for illustrating joint distributions for two (or more) discrete variables is the **mosaic plot**. Compared with the bar plot, category sizes in the mosaic plot more directly represent proportions of a whole. Compare the figure below to the bar plot for *Happy* above. This can potentially be misleading, however, if some categories are omitted. For this particular example, it should be understood that the additional responses of "No answer" and "Don't know" were possible but omitted for convenience.

In the case of two variables, the mosaic plot can illustrate their association. As with the bar plot above, one variable can play more of an explanatory role, depending on how the details are arranged. In the figure below, notice the vertical division by sex is slightly off-center. This gives the marginal information for *Sex* (the proportion of females was greater in this sample). *Sex* also plays the role of the explanatory variable in this plot in that the distribution of *Party ID* is viewed within each sex category. Thus, we see that the proportion of Democrats among females is slightly higher than the proportion of Democrats among males.

## R

The R code to recreate the plots above:

```
library(dplyr)

gss = read.csv(file.choose(), header=T)  # "GSS.csv"
str(gss)  # structure of the data

# omit outlying responses
gss = gss[gss$partyid!="No answer",]
gss = gss[(gss$happy!="Don't know") & (gss$happy!="No answer"),]

# combine categories of partyid
gss$partyid = recode(gss$partyid,
    "Ind,near dem" = "Independent",
    "Ind,near rep" = "Independent",
    "Not str democrat" = "Democrat",
    "Strong democrat" = "Democrat",
    "Not str republican" = "Republican",
    "Strong republican" = "Republican")

# bar charts
party.tab = table(gss$partyid)
party.tab
prop.table(party.tab)
barplot(party.tab, main="Party ID")

two.tab = table(gss$sex, gss$partyid)
two.tab
prop.table(two.tab, margin=1)  # row proportions
barplot(two.tab, legend=T, main="Party ID vs Sex")
barplot(two.tab, legend=T, main="Party ID vs Sex", beside=T)
barplot(table(gss$partyid, gss$sex), legend=T, main="Party ID vs Sex")

# order the happiness categories before tabulating
gss$happy = factor(gss$happy,
    levels = c("Not too happy", "Pretty happy", "Very happy"))
happy.tab = table(gss$happy)
happy.tab
prop.table(happy.tab)
barplot(happy.tab, main="General happiness")

# mosaic plots
mosaicplot(happy.tab, main="General happiness")
dimnames(two.tab)
dimnames(two.tab)[[2]] = c("Dem","Ind","Other","Rep")
mosaicplot(two.tab, main="Party ID vs Sex", color=T)
```

# 1.3 - Discrete Distributions

Statistical inference requires assumptions about the probability distribution (i.e., the random mechanism or sampling model) that generated the data. For example, for a t-test, we assume that the sample mean follows a normal distribution. Some common distributions used for discrete data are introduced in this section.

Recall, a *random variable* is the outcome of an experiment (i.e., a random process) expressed as a number. We tend to use capital letters near the end of the alphabet (X, Y, Z, etc.) to denote random variables. Random variables are of two types: discrete and continuous. Here we are interested in distributions of discrete random variables.

The distribution of a **discrete random variable** \(X\) is described by its **probability mass function (PMF)**, which we will also call its **distribution**, \(f(x)=P(X=x)\). The set of \(x\)-values for which \(f(x) > 0\) is called the **support**. The support can be **finite**, e.g., \(X\) takes values in \(\{0,1,2,\dots,n\}\), or countably infinite, e.g., \(X\) takes values in \(\{0,1,\dots\}\). Note that if the distribution depends on an unknown parameter \(\theta\), we can write it as \(f(x;\theta)\) or \(f(x|\theta)\).

Here are some distributions that you may encounter when analyzing discrete data.

## Bernoulli distribution

The most basic of all discrete random variables is the Bernoulli.

*X* is said to have a Bernoulli distribution if \(X = 1\) occurs with probability \(\pi\) and \(X = 0\) occurs with probability \(1-\pi\):

\(f(x)=\left\{\begin{array} {cl} \pi & x=1 \\ 1-\pi & x=0 \\ 0 & \text{otherwise} \end{array} \right. \)

Another common way to write it is...

\(f(x)=\pi^x (1-\pi)^{1-x}\text{ for }x=0,1\)

Suppose an experiment has only two possible outcomes, "success" and "failure," and let \(\pi\) be the probability of a success. If we let *X* denote the number of successes (either zero or one), then *X* will be Bernoulli. The mean (or expected value) of a Bernoulli random variable is

\(E(X)=1(\pi)+0(1-\pi)=\pi\),

and the variance is...

\(V(X)=E(X^2)-[E(X)]^2=1^2\pi+0^2(1-\pi)-\pi^2=\pi(1-\pi)\).
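These two formulas can be verified numerically from the definition of expectation; a minimal R check, with \(\pi=0.3\) chosen arbitrarily:

```
# check the Bernoulli mean and variance formulas for pi = 0.3
prob = 0.3
x = c(0, 1)
px = c(1 - prob, prob)      # the Bernoulli PMF
EX = sum(x * px)            # E(X) = pi
VX = sum(x^2 * px) - EX^2   # V(X) = pi*(1 - pi)
c(EX, VX)                   # 0.3 and 0.21
```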

## Binomial distribution

Suppose that \(X_1,X_2,\ldots,X_n\) are independent and identically distributed (iid) Bernoulli random variables, each having the distribution

\(f(x_i)=\pi^{x_i}(1-\pi)^{1-x_i}\text{ for }x_i=0,1 \text{ and } 0\le \pi \le 1\)

Let \(X=X_1+X_2+\ldots+X_n\). Then *X* is said to have a binomial distribution with parameters \(n\) and \(\pi\),

\(X\sim Bin(n,\pi)\).

For example, if a fair coin is tossed 100 times, the number of times heads is observed will have a binomial distribution (with \(n=100\) and \(\pi=.5\)). The binomial distribution has PMF

\(f(x)=\dfrac{n!}{x!(n-x)!} \pi^x (1-\pi)^{n-x} \text{ for }x=0,1,2,\ldots,n, \text{ and } 0\le \pi \le 1.\)

And by the independence assumption, we can show that

\(E(X)=E(X_1)+E(X_2)+\cdots+E(X_n)=n\pi\)

and

\(V(X)=V(X_1)+V(X_2)+\cdots+V(X_n)=n\pi(1-\pi)\).
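For the coin-tossing example above (\(n=100\), \(\pi=0.5\)), R's built-in `dbinom` gives the PMF directly, and the mean formula can be checked by summing over the support:

```
# binomial PMF for the coin-toss example: n = 100, pi = 0.5
n = 100; prob = 0.5
dbinom(50, size=n, prob=prob)      # P(X = 50), the most likely count
sum(dbinom(0:n, n, prob) * (0:n))  # E(X) from the PMF: equals n*pi = 50
```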

Note that *X* will not have an exact binomial distribution if the probability of success \(\pi\) is not constant from trial to trial or if the trials are not independent (i.e., the outcome on one trial alters the probability of an outcome on another trial). However, the binomial distribution can still serve as an effective approximation if these violations are negligible.

#### Example: Smartphone users

For example, consider sampling 20 smartphone users in the U.S. and recording *X* = the number that use Android. If the nationwide percentage of Android users is \(\pi\), then *X* is approximately binomial with 20 trials and success probability \(\pi\), even though technically \(\pi\) would change slightly each time a user is pulled out of the population for sampling. As long as the population (all U.S. smartphone users) is large relative to the sample, this issue is negligible. If this is not the case, however, then we should account for this, which is what the hypergeometric distribution does.

## Hypergeometric distribution

Suppose there's a population of \(n\) objects with \(n_1\) of type 1 (success) and \(n_2 = n - n_1\) of type 2 (failure), and *m* (less than *n*) objects are sampled without replacement from this population. Then the number of successes *X* among the sample is a hypergeometric random variable with PMF

\(\displaystyle f(x) = \dfrac{\binom{n_1}{x}\binom{n_2}{m-x}}{\binom{n}{m}}, \quad \max(0,\, m-n_2) \le x \le \min(n_1,\, m) \)

The restrictions are needed in the support because we cannot draw more successes or failures in the sample than what exist in the population. The expectation and variance of *X* are given by

\(E(X) =\dfrac{n_1m}{n}\) and \(V(X)=\dfrac{n_1n_2m(n-m)}{n^2(n-1)}\).
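A small numerical check with a hypothetical population of \(n=20\) objects, \(n_1=8\) successes, and a sample of \(m=5\); note that R's `dhyper` uses its own argument names (`m`, `n`, `k`) for these quantities:

```
# hypergeometric check: n1 = 8 successes, n2 = 12 failures, sample m = 5
n1 = 8; n2 = 12; m = 5
dhyper(3, m=n1, n=n2, k=m)           # P(X = 3)
sum((0:5) * dhyper(0:5, n1, n2, m))  # E(X), matches n1*m/n = 2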

## Poisson distribution

The Poisson distribution is another important one for modeling discrete events occurring in time or in space.

\(f(x)= P(X=x)= \dfrac{\lambda^x e^{-\lambda}}{x!}, \quad x=0,1,2,\ldots, \text{ and } \lambda>0.\)

For example, let \(X\) be the number of emails arriving at a server in one hour. Suppose that in the long run, the average number of emails arriving per hour is \(\lambda\). Then it may be reasonable to assume \(X \sim Poisson(\lambda)\). For the Poisson model to hold, however, the average arrival rate \(\lambda\) must be fairly constant over time; i.e., there should be no systematic or predictable changes in the arrival rate. Moreover, the arrivals should be independent of one another; i.e., the arrival of one email should not make the arrival of another email more or less likely.

The Poisson is also the limiting case of the binomial. Suppose that \(X\sim Bin(n,\pi)\) and let \(n\rightarrow\infty\) and \(\pi\rightarrow 0\) in such a way that \(n\pi\rightarrow\lambda\) where \(\lambda\) is a constant. Then, in the limit, \(X\sim Poisson(\lambda)\). Because of this, it is useful as an approximation to the binomial when \(n\) is large and \(\pi\) is small. That is, if \(n\) is large and \(\pi\) is small, then

\(\dfrac{n!}{x!(n-x)!}\pi^x(1-\pi)^{n-x} \approx \dfrac{\lambda^x e^{-\lambda}}{x!}\)

where \(\lambda = n\pi\). The right-hand side above is typically easier to calculate than the left-hand side.
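A quick check of the approximation in R, with the arbitrary choices \(n=1000\) and \(\pi=0.002\), so \(\lambda=2\):

```
# Poisson approximation to the binomial: n large, pi small
n = 1000; prob = 0.002
dbinom(3, n, prob)       # exact binomial probability
dpois(3, lambda=n*prob)  # Poisson approximation with lambda = 2
```

The two values agree to about three decimal places here, and the agreement improves as \(n\) grows with \(n\pi\) held fixed.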

Another interesting property of the Poisson distribution is that \(E(X) = V(X) = \lambda\), and this may be too restrictive for some data, where the variance exceeds the mean. This is known as *overdispersion *and may require an adjustment to the Poisson assumption or a different distribution altogether. One such option is the negative binomial.

## Negative-Binomial distribution

Whereas the binomial distribution describes the random number of successes in a fixed number of trials, the negative binomial distribution describes the random number of failures before observing a fixed number *r* of successes.

\(\displaystyle f(x)={r+x-1\choose x}\pi^r(1-\pi)^{x},\quad\mbox{for }x=0,1,\ldots\)

Like the Poisson, the negative binomial distribution can also be used to model counts of phenomena, but unlike the Poisson, the negative binomial has an additional parameter that allows the mean and variance to be estimated separately, which is often a better fit to the data. Specifically, we have for the negative binomial distribution

\(E(X)=\dfrac{r(1-\pi)}{\pi}=\mu\mbox{ and } V(X)=\mu+\dfrac{1}{r}\mu^2\)
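These moment formulas can be checked against R's `dnbinom`, which uses the same parameterization (number of failures before the \(r\)th success); \(r=4\) and \(\pi=0.3\) below are arbitrary choices:

```
# negative binomial moments for r = 4, pi = 0.3
r = 4; prob = 0.3
mu = r * (1 - prob) / prob              # E(X) from the formula
x = 0:2000                              # support truncated far into the tail
sum(x * dnbinom(x, size=r, prob=prob))  # numerical E(X), matches mu
mu + mu^2 / r                           # V(X) from the formula
sum(x^2 * dnbinom(x, size=r, prob=prob)) - mu^2  # numerical V(X)
```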

## Multinomial distribution

The multinomial distribution generalizes the binomial to cases involving \(k\) outcomes with probabilities \(\pi_1,\ldots,\pi_k\). We still need a fixed number of independent trials *n*, but instead of counting only the number of one particular "success" outcome, we let \(X_j\) count the number of times the \(j\)th outcome occurs, resulting in the multivariate random vector \(X_1,\ldots,X_k\).

\(f(x_1,\ldots,x_k)=\dfrac{n!}{x_1!x_2!\cdots x_k!} \pi_1^{x_1}\pi_2^{x_2}\cdots \pi_k^{x_k}\), where each \(x_j\ge 0\) and \(x_1+\cdots+x_k=n\)

In addition to the mean and variance of \(X_j\), given by

\(E(X_j)=n\pi_j\) and \(V(X_j)=n\pi_j(1-\pi_j)\),

there is also a covariance between different outcome counts \(X_i\) and \(X_j\):

\(cov(X_i,X_j)=-n\pi_i\pi_j\)

Intuitively, this negative relationship makes sense, given the fixed total *n*. In other words, the more often one outcome occurs, the less often the other outcomes must occur if \(X_1+\cdots+X_k=n\) is to be preserved. Finally, note that if the other outcomes are lumped together as "failure", each marginal count \(X_j\) has a binomial distribution with *n* trials and success probability \(\pi_j\).
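R's `dmultinom` evaluates this PMF; with \(k=2\) categories the multinomial reduces to the binomial, which provides a quick check (the probabilities below are arbitrary):

```
# multinomial PMF with k = 3 outcomes and n = 10 trials
dmultinom(c(2, 3, 5), prob=c(0.2, 0.3, 0.5))
# with k = 2 categories the multinomial reduces to the binomial
dmultinom(c(4, 6), prob=c(0.4, 0.6))  # same as dbinom(4, 10, 0.4)
dbinom(4, size=10, prob=0.4)
```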

## Note on Technology

There are built-in R and SAS functions to compute various quantities for these distributions or to generate random samples.

In R, at the prompt type **help(Binomial)**, **help(NegBinomial)**, **help(Poisson)**, etc. to learn more.

See the SAS User's Guide for examples.

# 1.4 - Sampling Schemes

#### Stop and Think!

What are some ways of generating these one-way tables of counts?

Why do you think we care about the random mechanism that generated the data?

Any data analysis requires some assumptions about the data generation process. For continuous data and linear regression, for example, we assume that the response variable has been randomly generated from a normal distribution. For categorical data, we will often assume that data have been generated from a Poisson, binomial, or multinomial distribution. Statistical analysis depends on the data generation mechanism, although depending on the objective, we may be able to ignore that mechanism and simplify our analysis.

The following sampling methods correspond to the distributions considered:

- Unrestricted sampling (corresponds to the **Poisson distribution**)
- Sampling with a fixed total sample size (corresponds to the **binomial or multinomial distributions**)

## Poisson Sampling

Poisson sampling assumes that the random mechanism to generate the data can be described by a Poisson distribution. It is useful for modeling counts or events that occur randomly over a fixed period of time or in a fixed space. It can also be used as an approximation to the binomial distribution when the success probability of a trial is very small, but the number of trials is very large. For example, consider the number of emails you receive between 4 p.m. and 5 p.m. on a Friday.

Or, let *X* be the number of goals scored in a professional soccer game. We may model this as *X* ∼ *Poisson*(\(\lambda\)):

\(P(X=x)=\dfrac{\lambda^x e^{-\lambda}}{x!}\qquad x=0,1,2,\ldots\)

The parameter \(\lambda\) represents the expected number of goals in the game, or the long-run average among all possible such games. The expression \(x!\) stands for \(x\) **factorial**, i.e., \(x!=1\cdot 2\cdot 3\cdots x\). \(P(X=x)\), or \(P(x)\), is the probability that \(X\) (the random variable representing the unknown number of goals in the game) takes on the particular value \(x\). That is, \(X\) is random, but \(x\) is not.

#### The Poisson Model (distribution) Assumptions

- **Independence**: Events must be independent (e.g., the number of goals scored by one team should not make the number of goals scored by another team more or less likely).
- **Homogeneity**: The mean number of goals scored is assumed to be the same for all teams.
- **Fixed time period (or space)**: The time period (or space) in which events are counted must be fixed.

Recall that the mean and variance of a Poisson distribution are the same: \(E(X) = Var(X) = \lambda\). In practice, however, the observed variance is often larger than the theoretical variance, and in the Poisson case, larger than the mean. This is known as overdispersion, an important concept that occurs with discrete data. We assumed that each team has the same probability of scoring goals in each match, but it's more realistic to expect these probabilities to vary with the teams' skill, the day the matches are played (because of the weather), perhaps even the order of the matches, etc. Then we may observe more variation in the scoring than the Poisson model predicts. Analyses assuming binomial, Poisson, or multinomial distributions are sometimes invalid because of overdispersion. We will see more on this later when we study logistic regression and Poisson regression models.
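A small simulation sketches how overdispersion can arise when the rate itself varies from observation to observation; a gamma-distributed \(\lambda\) is one common assumption (it is, in fact, what yields the negative binomial):

```
# simulate counts whose underlying rate varies across observations
set.seed(1)
lam = rgamma(10000, shape=2, rate=1)  # heterogeneous rates, mean 2
x = rpois(10000, lam)                 # Poisson counts given each rate
mean(x)  # close to 2
var(x)   # close to 4: the variance exceeds the mean
```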

## Binomial Sampling

When data are collected on a pre-determined number of units and then classified according to two levels of a categorical variable, binomial sampling emerges. Consider the sample of 20 smartphone users, where each individual either uses Android or not. In this study, there was a fixed number of trials (the number of smartphone users surveyed, \(n=20\)), and the researcher counted the number \(X\) of "successes". We can then use the binomial probability distribution (i.e., the binomial model) to describe \(X\).

Binomial distributions are characterized by two parameters: \(n\), which is fixed (this could be the number of trials, or the total sample size if we think in terms of sampling), and \(\pi\), which usually denotes a probability of "success". In our example, this would be the probability that a smartphone user uses Android. Please note that some textbooks use \(\pi\) to denote the population parameter and \(p\) to denote the sample estimate, whereas others use \(p\) for the population parameter as well. The context should make it clear whether we're referring to a population or a sample value. Once we know \(n\) and \(\pi\), we know everything about that binomial distribution, including its mean and variance.

#### Binomial Model (distribution) Assumptions

- **Fixed \(n\)**: the total number of trials/events (or total sample size) is fixed.
- **Two possible outcomes per trial**, referred to as "success" and "failure".
- **Independent and identical trials**:
  - Identical means that the probability of success is the same for each trial.
  - Independent means that the outcome of one trial does not affect the outcome of another.

## Multinomial Sampling

Multinomial sampling may be considered as a generalization of Binomial sampling. Data are collected on a pre-determined number of individuals or trials and classified into one of \(k\) categorical outcomes.

#### Multinomial Model (distribution) Assumptions

- (a) the \(n\) trials are independent, and
- (b) the parameter vector \((\pi_1,\ldots,\pi_k)\) remains constant from trial to trial.

The most common violation of these assumptions occurs when clustering is present in the data. Clustering means that some of the trials occur in groups or clusters, and that trials within a cluster tend to have outcomes that are more similar than trials from different clusters. Clustering can be thought of as a violation of either (a) or (b).

#### Example: Eye Color

In this example, eye color was recorded for *n* = 96 persons.

Eye color | Count |
---|---|
Brown | 46 |
Blue | 22 |
Green | 26 |
Other | 2 |
Total | 96 |

Suppose that the sample included members from the same family as well as unrelated individuals. Persons from the same family are more likely to have similar eye color than unrelated persons, so the assumptions of the multinomial model would be violated. If both parents have brown eye color, it is very likely that their offspring will also have brown eye color. Whereas the eye colors of family members related by marriage will not violate the multinomial assumption, those of blood relatives will.

Now suppose that the sample consisted of "unrelated" persons randomly selected within Pennsylvania. In other words, persons are randomly selected from a list of Pennsylvania residents. If two members of the same family happen to be selected into the sample purely by chance, that's okay; the important thing is that each person on the list has an equal chance of being selected, regardless of who else is selected.

#### Stop and Think!

Based on what we've seen so far with the multinomial distribution and multinomial sampling, can you answer the following questions?

- If we *fuse* the "other" eye color with "brown", how does the distribution change; i.e., what is the multinomial distribution now?
- It turns out that we can *partition* the "brown" eyes as 20 with "hazel" color and 26 with "dark brown". How would you characterize these distributions now?

# 1.5 - Maximum Likelihood Estimation

1.5 - Maximum Likelihood EstimationOne of the most fundamental concepts of modern statistics is that of likelihood. In each of the discrete random variables we have considered thus far, the distribution depends on one or more parameters that are, in most statistical applications, unknown. In the Poisson distribution, the parameter is \(\lambda\). In the binomial, the parameter of interest is \(\pi\) (since *n* is typically fixed and known).

The likelihood function is essentially the distribution of a random variable (or joint distribution of all values if a sample of the random variable is obtained) viewed as a function of the parameter(s). The reason for viewing it this way is that the data values will be observed and can be substituted in, and the value of the unknown parameter that maximizes this likelihood function can then be found. The intuition is that this maximizing value is the one that makes our observed data most likely.

## Bernoulli and Binomial Likelihoods

Consider a random sample of \(n\) Bernoulli random variables, \(X_1,\ldots,X_n\), each with PMF

\(f(x)=\pi^x(1-\pi)^{1-x}\qquad x=0,1\)

The likelihood function is the joint distribution of these sample values, which, by independence, we can write as

\(\ell(\pi)=f(x_1,\ldots,x_n;\pi)=\pi^{\sum_i x_i}(1-\pi)^{n-\sum_i x_i}\)

We interpret \(\ell(\pi)\) as the probability of observing \(X_1,\ldots,X_n\) as a function of \(\pi\), and the maximum likelihood estimate (MLE) of \(\pi\) is the value of \(\pi\) that maximizes this probability function. Equivalently, \(L(\pi)=\log\ell(\pi)\) is maximized at the same value and can be used interchangeably; more often than not, the loglikelihood function is easier to work with.

You may have noticed that the likelihood function for the sample of Bernoulli random variables depends only on their sum, which we can write as \(Y=\sum_i X_i\). Since \(Y\) has a binomial distribution with \(n\) trials and success probability \(\pi\), we can write its log likelihood function as

\(\displaystyle L(\pi) = \log {n\choose y} \pi^y(1 - \pi)^{n-y}\)

The only difference between this log likelihood function and that for the Bernoulli sample is the presence of the binomial coefficient \({n\choose y}\). But since that doesn't depend on \(\pi\), it has no influence on the MLE and may be neglected.

With a little calculus (taking the derivative with respect to \(\pi\)), we can show that the value of \(\pi\) that maximizes the likelihood (and log likelihood) function is \(Y/n\), which we denote as the MLE \(\hat{\pi}\). Not surprisingly, this is the familiar sample proportion of successes that intuitively makes sense as a good estimate for the population proportion.
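The calculus can be sketched explicitly. Dropping the constant binomial coefficient, the loglikelihood is

\(L(\pi) = y\log\pi + (n-y)\log(1-\pi)\)

and setting its derivative to zero,

\(\dfrac{dL}{d\pi} = \dfrac{y}{\pi} - \dfrac{n-y}{1-\pi} = 0 \quad\Longrightarrow\quad y(1-\pi) = (n-y)\pi \quad\Longrightarrow\quad \hat{\pi} = \dfrac{y}{n}\)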

## Example: Binomial Example 1

If in our earlier binomial sample of 20 smartphone users, we observe 8 that use Android, the MLE for \(\pi\) is then \(8/20=.4\). The plot below illustrates this maximizing value for both the likelihood and log likelihood functions. The `dbinom` function computes the PMF for the binomial distribution.

```
likeli.plot = function(y, n) {
    # likelihood as a function of p, using the binomial PMF
    L = function(p) dbinom(y, n, p)
    # numerically locate the maximizing value of p
    mle = optimize(L, interval=c(0,1), maximum=TRUE)$maximum
    p = (1:100)/100
    par(mfrow=c(2,1))
    plot(p, L(p), type='l')       # likelihood
    abline(v=mle)
    plot(p, log(L(p)), type='l')  # log likelihood
    abline(v=mle)
    mle
}
likeli.plot(8, 20)
```

## Example: Binomial Example 2

We know that the likelihood function achieves its maximum value at the MLE, but how is the sample size related to the shape? Suppose that we observe \(X = 1\) from a binomial distribution with \(n = 4\) and success probability \(\pi\). The MLE is then \(\hat{\pi}=1/4=0.25\), and the graph of the loglikelihood function looks like this.

Here is the program for creating this plot in SAS.

```
data for_plot;
    do x=0.01 to 0.8 by 0.01;
        y=log(x)+3*log(1-x); *the log-likelihood function;
        output;
    end;
run;

/* plot options */
goptions reset=all colors=(black);
symbol1 i=spline line=1;
axis1 order=(0 to 1.0 by 0.2);

proc gplot data=for_plot;
    plot y*x / haxis=axis1;
run;
quit;
```

Now suppose that we observe \(X = 10\) from a binomial distribution with \(n = 40\). The MLE is again \(\hat{\pi}=10/40=0.25\), but the loglikelihood is now noticeably narrower:

Finally, suppose that we observe \(X = 100\) from a binomial with \(n = 400\). The MLE is still \(\hat{\pi}=100/400=0.25\), but the loglikelihood is now narrower still:

As \(n\) gets larger, \(L(\pi)\) becomes more sharply peaked around the MLE \(\hat{\pi}\), reflecting stronger evidence that the true parameter lies close to \(\hat{\pi}\). If the loglikelihood is highly peaked, that is, if it drops sharply as we move away from the MLE, then the evidence is strong that \(\pi\) is near the MLE. A flatter loglikelihood, on the other hand, means that more values are plausible.

## Poisson Likelihood

Suppose that \(X = (X_1, X_2, \dots, X_n)\) are iid observations from a Poisson distribution with unknown parameter \(\lambda\). The likelihood function is

\(\ell(\lambda) = \prod\limits_{i=1}^{n} f(x_{i};\lambda) = \prod\limits_{i=1}^{n} \dfrac{\lambda^{x_{i}} e^{-\lambda}}{x_{i}!} = \dfrac{\lambda^{\sum_i x_{i}} e^{-n\lambda}}{x_{1}! x_{2}! \cdots x_{n}!}\)

The corresponding loglikelihood function is

\(\sum\limits_{i=1}^{n} x_i\log\lambda-n\lambda-\sum\limits_{i=1}^{n} \log x_i!\)
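Taking the derivative of the loglikelihood with respect to \(\lambda\) (the factorial term does not involve \(\lambda\) and drops out) gives

\(\dfrac{d}{d\lambda}\left[\sum\limits_{i=1}^{n} x_i\log\lambda-n\lambda\right]=\dfrac{\sum_i x_i}{\lambda}-n\)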

And the MLE for \(\lambda\) can then be found by maximizing either of these with respect to \(\lambda\). Setting the first derivative equal to 0 gives the solution:

\(\hat{\lambda}=\sum\limits_{i=1}^{n} \dfrac{x_i}{n}\).

Thus, for a Poisson sample, the MLE for \(\lambda\) is just the sample mean.

# 1.6 - Lesson 1 Summary

In this lesson, we introduced discrete data and some common visual displays for illustrating its relevant characteristics. We briefly introduced several of the popular distributions used to model discrete data, focusing in particular on the binomial. And we saw how a random variable's distribution or likelihood function can be used to provide an intuitive estimate for an unknown population parameter.

In the next lesson, we'll continue the idea of parameter estimation to include margins of error in order to make inferences about the population in the form of confidence intervals and hypothesis tests. We'll also extend the binomial focus to cases involving more than two outcome categories.