# 12.1 - K-Means

12.1 - K-MeansIn K-means let's assume there are *M* prototypes denoted by

\(Z = {z_1, z_2, \cdots , z_M}\)

This set is usually smaller than the original data set. If the data points reside in a *p*-dimensional Euclidean space, the prototypes reside in the same space. They will also be* p-*dimensional vectors. They may not be samples from the training data set, however, they should well represent the training dataset.

Each training sample is assigned to one of the prototypes. This is denoted using the assignment function by \(A(\cdot)\). Then \(A(x_i) = j\) means the \(i^{th}\) training sample is assigned to the \(j^{th}\) prototype, that is, the\(j^{th}\) prototype is used to approximate or represent point \(x_i\).

In k-means, we need to solve two unknowns. The first is to select a set of prototypes; the second is the assignment function.

### The Objective Function in K-Means

In K-means, the optimization criterion is to minimize the total squared error between the training samples and their representative prototypes. This is equivalent to minimizing the trace of the pooled within covariance matrix. The objective function is:

\(\text{arg } \underset{\mathcal{Z}, A}{\text{ min }}\sum_{i=1}^{N}||x_i-z_{A(x_i)}||^2 \)

For every point \(x_i\), first we find which *z _{j}* will be used to represent \(x_i\). As we mentioned above, if \(A(x_i) = j\) then the prototype \(z_j\) will be used. Hence, we compute the distance between \(x_i\) with \(z_{A(x_i)}\) . The norm in the above formula denotes the Euclidean distance.

Ideally, we want every data point to be perfectly represented by a prototype. Since this is generally impossible, we try to minimize the total discrepancy.

Denote the objective function by:

\(L(\mathcal{Z},A)=\sum_{i=1}^{N}||x_i-z_{A(x_i)}||^2\)

From the perspective of data compression, this seems obvious. But from the clustering perspective, one may ask why we use such an optimization criterion.

Here is the intuition for this...

Evaluating clustering results is inevitably subjective. For classification, we have the true labels. Therefore, we can compute the error rate to assess the result. However, in (unsupervised) clustering, we don’t have the class labels given for the training data.

One heuristic generally accepted is that points in the same cluster should be tight and points in different groups should be as far apart as possible. The k-means algorithm reflects the heuristic by attempting to minimize the total within-cluster distances between each data point and its corresponding prototype.

### Necessary Conditions

How do we optimize the objective function of k-means?

What follows are necessary conditions for an optimal solution. Necessary means that if a solution is indeed optimal, then it has to satisfy these conditions, but not vice versa. This means we might have a solution satisfying the conditions that are not optimal. We call such solutions locally optimal, in contrast to globally optimal.

**Condition 1**: If the prototypes, *Z*, are fixed, the optimal assignment function \(A(\cdot)\) (using Euclidean distance) should follow the *nearest neighbor rule*, that is,

\(A(x_i) = \text{ arg }\underset{j \in {1,2,...,M}}{ min } || x_i − z_j || \)

This means that we should always find among those prototypes the one closest to the new point. Intuitively this should be clear because we want to minimize the summed distance between every point and its prototype. We have the freedom to choose the prototype for each data point. Obviously, for a single point, the best prototype to choose is the closest one to it.

**Condition 2 **: Given a fixed assignment function, \(A(\cdot)\), the prototype, \(z_j\), should be the average (centroid) of all the samples assigned to the \(j^{th}\) prototype:

\(z_j=\frac{\Sigma_{i:A(x_i)=j}x_i}{N_j}\)

where \(N_j\) is the number of samples assigned to prototype* j*.

This is also easy to prove. For a given group of data points associated with a single prototype, the total sum of squared distances is a quadratic function in terms of the prototype. This function is convex and its minimum is unique and is given by setting the first order derivatives to zero, which leads to the arithmetic mean of the data points.

Sometimes we call the assignment function "the partition" and the prototypes "the centroids."