12.2 - The Algorithm

The two necessary conditions on the previous page inspired the k-means algorithm.

It is interesting to note that in data compression for electrical engineers, k-means is better known as the Lloyd algorithm.

The k-means algorithm alternates the two steps:

  • For a fixed set of centroids (prototypes), optimize A(•) by assigning each sample to its closest centroid using Euclidean distance.
  • Update the centroids by computing the average of all the samples assigned to it.

The algorithm converges since, after each iteration, the objective function is non-increasing.

Based on the necessary conditions, the algorithm improves the prototypes and the assignment function iteratively. It is possible that the steps taken might not lead to any change in either the prototypes or the partition. When either becomes static, the algorithm has converged and there is no need to continue.

How do we decide when to stop?

One criterion for stopping is if we observe the assignment functions in the two iterations are exactly the same. If the assignment function doesn't change anymore, then the prototypes won't change either (and vice versa).

In practice, we often stop when the decrease in the objective function becomes small. We can compute the ratio between the decrease and the value of the objective function in the previous iteration. If the ratio is below a threshold, e.g., 0.0001, the iteration is stopped.

So, how do we initiate this procedure? How does it get started?