10.1 - Bayes Rule and Classification Problem

Bayes’ Rule

Consider any two events A and B. To find \(P(B|A)\), the probability that B occurs given that A has occurred, Bayes’ Rule states the following:

\(P(B|A) = \dfrac{P(A \text{ and } B)}{P(A)}\)

This says that the conditional probability is the probability that both A and B occur divided by the unconditional probability that A occurs. This is a simple algebraic restatement of the rule for finding the probability that two events occur together, which is \(P(A \text{ and } B) = P(A)P(B|A)\).
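As a quick check of the formula with made-up numbers (an illustration only, not taken from any data set): suppose \(P(A) = 0.4\) and \(P(A \text{ and } B) = 0.1\). Then

\(P(B|A) = \dfrac{0.1}{0.4} = 0.25\)

and multiplying back recovers the joint probability, \(P(A)P(B|A) = 0.4 \times 0.25 = 0.1\).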

Bayes’ Rule Applied to the Classification Problem

We are interested in \(P(\pi_{i} | \boldsymbol{x})\), the conditional probability that an observation came from population \(\pi_{i}\) given the observed values of the multivariate vector of variables \(\boldsymbol{x}\). We will classify an observation to the population for which the value of \(P(\pi_{i} | \boldsymbol{x})\) is greatest. This is the most probable group given the observed values of \(\boldsymbol{x}\).

Suppose that we have g populations (groups) and that the \(i^{th}\) population is denoted as \(\pi_{i}\).
Let \(p_{i}=P(\pi_{i})\) be the probability that a randomly selected observation is in population \(\pi_{i}\).
Let \(f(\boldsymbol{x} | \pi_{i})\) be the conditional probability density function of the multivariate set of variables \(\boldsymbol{x}\), given that the observation came from population \(\pi_{i}\).

Note! We have to be careful about the word probability in conjunction with our observed vector \(\mathbf{x}\). A probability density function for continuous variables does not give a probability, but instead gives a measure of “likelihood.”

Using the notation of Bayes’ Rule above, event A = observing the vector \(\boldsymbol{x}\) and event B = observation came from population \(\pi_{i}\). Thus our probability of interest can be found as...

\(P(\text{member of } \pi_i | \text{ we observed } \mathbf{x}) = \dfrac{P(\text{member of } \pi_i \text{ and we observe } \mathbf{x})}{P(\text{we observe } \mathbf{x})}\)

The numerator of the expression just given is the likelihood that a randomly selected observation is both from population \(\pi_{i}\) and has the value \(\boldsymbol{x}\). This likelihood = \(p_{i}f(\boldsymbol{x}| \pi_{i})\).
The denominator is the unconditional likelihood (over all populations) that we could observe \(\boldsymbol{x}\). This likelihood = \(\sum_{j=1}^{g} p_j f(\mathbf{x}|\pi_j)\).

Thus the posterior probability that an observation is a member of population \(\pi_{i}\) is

\(P(\pi_i|\mathbf{x}) = \dfrac{p_i f(\mathbf{x}|\pi_i)}{\sum_{j=1}^{g}p_j f(\mathbf{x}|\pi_j)}\)

The classification rule is to assign observation \(\boldsymbol{x}\) to the population for which the posterior probability is the greatest.

The denominator is the same for all posterior probabilities (for the various populations), so it is equivalent to say that we will classify an observation to the population for which \(p_{i}f(\boldsymbol{x} | \pi_{i})\) is greatest.
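The posterior calculation and the classification rule can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the univariate normal densities, the three sets of priors and parameters, the observed value `x`, and the function names `normal_pdf` and `posterior_probabilities` are all assumptions made for the example. In practice \(\boldsymbol{x}\) is a vector and \(f(\boldsymbol{x} | \pi_{i})\) would be a multivariate density.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x (assumed model)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_probabilities(priors, densities):
    """Return P(pi_i | x) for each population from p_i and f(x | pi_i)."""
    joint = [p * f for p, f in zip(priors, densities)]  # numerators p_i f(x | pi_i)
    total = sum(joint)                                   # shared denominator
    return [j / total for j in joint]

# Hypothetical setup: g = 3 populations with illustrative priors and parameters.
priors = [0.5, 0.3, 0.2]
params = [(0.0, 1.0), (2.0, 1.0), (4.0, 1.5)]  # (mean, sd) for each population
x = 1.8                                         # made-up observed value

densities = [normal_pdf(x, m, s) for m, s in params]
posteriors = posterior_probabilities(priors, densities)

# Assign x to the population with the largest posterior (0-based index).
assigned = max(range(len(posteriors)), key=lambda i: posteriors[i])
print("posteriors:", posteriors)
print("assigned population:", assigned + 1)
```

Because every posterior shares the same denominator, taking the largest numerator \(p_{i}f(\boldsymbol{x} | \pi_{i})\) selects the same population as taking the largest posterior.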

Two Populations

With only two populations we can express a classification rule in terms of the ratio of the two posterior probabilities. Specifically, we would classify to population 1 when

\(\dfrac{p_1 f(\mathbf{x}|\pi_1)}{p_2 f(\mathbf{x}|\pi_2)} > 1\)

This can be rewritten to say that we classify to population 1 when

\(\dfrac{ f(\mathbf{x}|\pi_1)}{ f(\mathbf{x}|\pi_2)} > \dfrac{p_2}{p_1}\)
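The two-population rule compares a likelihood ratio to the ratio of prior probabilities. The sketch below shows how the priors move the decision for the same observation. The normal densities, their parameters, the observed value, and the helper name `classify_two` are assumptions chosen for illustration.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x (assumed model)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def classify_two(x, p1, p2, f1, f2):
    """Return 1 if f1(x) / f2(x) > p2 / p1, otherwise return 2."""
    return 1 if f1(x) / f2(x) > p2 / p1 else 2

# Hypothetical densities: population 1 is N(0, 1), population 2 is N(2, 1).
f1 = lambda x: normal_pdf(x, 0.0, 1.0)
f2 = lambda x: normal_pdf(x, 2.0, 1.0)

x = 1.5  # made-up observed value, slightly closer to the mean of population 2

unequal_priors = classify_two(x, 0.8, 0.2, f1, f2)  # threshold p2/p1 = 0.25
equal_priors = classify_two(x, 0.5, 0.5, f1, f2)    # threshold p2/p1 = 1
print(unequal_priors, equal_priors)
```

With these assumed numbers the likelihood ratio at `x = 1.5` is below 1 but above 0.25, so the observation goes to population 2 under equal priors and to population 1 when population 1 is much more common a priori.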

Decision Rule

We are going to classify the sample unit or subject into the population \(\pi_{i}\) that maximizes the posterior probability \(P(\pi_{i} | \mathbf{x})\), that is, the population that maximizes

\(f(\mathbf{x}|\pi_i)p_i\)

We are going to calculate the posterior probability for each of the populations. Then we are going to assign the subject or sample unit to the population that has the highest posterior probability. Ideally, that posterior probability is going to be greater than one half; the closer to 100% the better!

Equivalently, we are going to assign it to the population that maximizes the log of this product:

\(\log \left[ f(\mathbf{x}|\pi_i)p_i \right] = \log f(\mathbf{x}|\pi_i) + \log p_i\)

The denominator that appears above does not depend on the population because it involves summing over all the populations. So all we really need to do is assign the observation to the population for which this product is largest. Because the logarithm is an increasing function, we can equivalently maximize the log of that product. It is often easier to work with the log.
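One practical reason the log form is convenient is numerical: far from every population mean, the densities themselves can underflow to zero in floating point, while their logs remain well behaved. The following sketch illustrates this under assumed univariate normal densities. The parameters, the extreme observed value, and the function names are hypothetical, and the max-subtraction step used to recover posteriors is a standard numerical device rather than something stated in the text above.

```python
import math

def normal_logpdf(x, mean, sd):
    """Log density of a univariate normal distribution at x (assumed model)."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)

def log_scores(x, priors, params):
    """Return log p_i + log f(x | pi_i) for each population."""
    return [math.log(p) + normal_logpdf(x, m, s)
            for p, (m, s) in zip(priors, params)]

def posteriors_from_log_scores(scores):
    """Recover posteriors from log scores, subtracting the max before exponentiating."""
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical setup: two populations, N(0, 1) and N(2, 1), with equal priors.
priors = [0.5, 0.5]
params = [(0.0, 1.0), (2.0, 1.0)]
x = 60.0  # made-up extreme observation

# The raw density for population 1 underflows to 0.0 in double precision.
raw_density_1 = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

scores = log_scores(x, priors, params)
assigned = max(range(len(scores)), key=lambda i: scores[i]) + 1  # 1-based label
posteriors = posteriors_from_log_scores(scores)
print(raw_density_1, scores, assigned)
```

Working directly with the products \(f(\mathbf{x}|\pi_i)p_i\) here would give zero for both populations, leaving the posterior undefined, whereas the log scores still rank the populations.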
