This final method that we would like to examine is a non-hierarchical approach. This method was presented by MacQueen (1967) in the Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability.
One advantage of this method is that we do not have to calculate the distance measures between all pairs of subjects. Therefore, this procedure seems much more efficient or practical when you have very large datasets.
Under this procedure, you need to pre-specify how many clusters to consider. The clusters in this procedure do not form a tree. There are two approaches to carrying out the K-Means procedure. The approaches vary as to how the procedure begins the partitioning. The first approach is to do this randomly, which is to start out with a random partitioning of subjects into groups and go from there. The alternative is to start with an additional set of starting points to form the centers of the clusters. The random nature of the first approach avoids bias.
Once this decision has been made, here is an overview of the process:
- Step 1: Partition the items into K initial clusters.
- Step 2: Scan through the list of n items, assigning each item to the cluster whose centroid (mean) is closest. Each time an item is reassigned, we recalculate the cluster mean or centroid for the cluster receiving that item and the cluster losing that item.
- Step 3: Repeat Step 2 over and over again until no more reassignments are made.
Let's look at a simple example in order to see how this works. Here is an example where we have four items and two variables:
Item | \(X_{1}\) | \(X_{2}\) |
---|---|---|
A | 7 | 9 |
B | 3 | 3 |
C | 4 | 1 |
D | 3 | 8 |
Suppose that we initially decide to partition the items into two clusters (A, B) and (C, D). The cluster centroids, or the mean of all the variables within the cluster, are as follows:
Cluster | \(\overline { x } _ { 1 }\) | \(\overline { x } _ { 2 }\) |
---|---|---|
(A, B) | \(\dfrac { 7 + 3 } { 2 } = 5\) | \(\dfrac { 9 + 3 } { 2 } = 6\) |
(C, D) | \(\dfrac{4+3}{2} = 3.5\) | \(\dfrac { 1 + 8 } { 2 } = 4.5\) |
For example, the mean of the first variable for cluster (A, B) is 5.
Next, we calculate the distances between item A and the centroids of clusters (A, B) and (C, D).
Cluster | Distance to A |
---|---|
(A, B) | \(\sqrt { ( 7 - 5 ) ^ { 2 } + ( 9 - 6 ) ^ { 2 } } = \sqrt { 13 }\) |
(C, D) | \(\sqrt { ( 7 - 3.5 ) ^ { 2 } + ( 9 - 4.5 ) ^ { 2 } } = \sqrt { 32.5 }\) |
This is the Euclidean distance between A and each of the cluster centroids. We see that item A is closer to cluster (A, B) than cluster (C, D). Therefore, we are going to leave item A in cluster (A, B) and no change is made at this point.
Next, we will look at the distance between item B and the centroids of clusters (A, B) and (C, D).
Cluster | Distance to B |
---|---|
(A, B) | \(\sqrt { ( 3 - 5 ) ^ { 2 } + ( 3 - 6 ) ^ { 2 } } = \sqrt { 13 }\) |
(C, D) | \(\sqrt { ( 3 - 3.5 ) ^ { 2 } + ( 3 - 4.5 ) ^ { 2 } } = \sqrt { 2.5 }\) |
Here, we see that item B is closer to cluster (C, D) than cluster (A, B). Therefore, item B will be reassigned, resulting in the new clusters (A) and (B, C, D).
The centroids of the new clusters are calculated as:
Cluster | \(\overline { x } _ { 1 }\) | \(\overline { x } _ { 2 }\) |
---|---|---|
(A) | 7 | 9 |
(B, C, D) | \(\frac { 3 + 4 + 3 } { 3 } = 3 . \overline { 3 }\) | \(\frac { 3 + 1 + 8 } { 3 } = 4\) |
Next, we will calculate the distance between the items and each of the clusters (A) and (B, C, D).
Cluster | C | D | A | B |
---|---|---|---|---|
(A) | \(\sqrt {73 }\) | \(\sqrt { 17 }\) | 0 | \(\sqrt { 52 }\) |
(B, C, D) | \(\sqrt { 9 . \overline { 4 } }\) | \(\sqrt { 16 . \overline { 1 } }\) | \(\sqrt { 38 . \overline { 4 } }\) | \(\sqrt { 1 . \overline { 1 } }\) |
It turns out that since all four items are closer to their current cluster centroids, no further reassignments are required.
We must note, however, that the results of the K-means procedure can be sensitive to the initial assignment of clusters.
For example, suppose the items had initially been assigned to the clusters (A, C) and (B, D). Then the cluster centroids would be calculated as follows:
Cluster | \(\overline { x } _ { 1 }\) | \(\overline { x } _ { 2 }\) |
---|---|---|
(A, C) | \(\dfrac { 7 + 4 } { 2 } = 5.5\) | \(\dfrac { 9 + 1 } { 2 } = 5\) |
(B, D) | \(\dfrac{3+3}{2} = 3\) | \(\dfrac { 3 + 8 } { 2 } = 5.5\) |
From here we can find that the distances between the items and the cluster centroids are:
Cluster | A | B | C | D |
---|---|---|---|---|
(A, C) | \(\sqrt {18.25}\) | \(\sqrt {10.25 }\) | \(\sqrt {18.25}\) | \(\sqrt {15.25 }\) |
(B, D) | \(\sqrt {28.25}\) | \(\sqrt {6.25 }\) | \(\sqrt {21.25 }\) | \(\sqrt {6.25}\) |
Question
If this is the case, then which result should be used as our summary?
We can compute the sum of squared distances between the items and their cluster centroid. For our first clustering scheme for clusters (A) and (B, C, D), we had the following distances to cluster centroids:
Cluster | C | D | A | B |
---|---|---|---|---|
(A) | \(\sqrt {73 }\) | \(\sqrt { 17 }\) | 0 | \(\sqrt { 52 }\) |
(B, C, D) | \(\sqrt { 9 . \overline { 4 } }\) | \(\sqrt { 16 . \overline { 1 } }\) | \(\sqrt { 38 . \overline { 4 } }\) | \(\sqrt { 1 . \overline { 1 } }\) |
So, the sum of squared distances is:
\(9.\bar{4} + 16.\bar{1} + 0 + 1.\bar{1} = 26.\bar{6}\)
For clusters (A, C) and (B, D), we had the following distances to cluster centroids:
Cluster | A | B | C | D |
---|---|---|---|---|
(A, C) | \(\sqrt {18.25 }\) | \(\sqrt {10.25 }\) | \(\sqrt {18.25 }\) | \(\sqrt {15.25 }\) |
(B, D) | \(\sqrt {28.25 }\) | \(\sqrt {6.25 }\) | \(\sqrt {21.25 }\) | \(\sqrt {6.25 }\) |
So, the sum of squared distances is:
\(18.25 + 6.25 + 18.25 + 6.25 = 49. 0\)
We would conclude that since \(26.\bar{6} < 49.0\), this would suggest that the first clustering scheme is better and partition the items into the clusters (A) and (B, C, D).
In practice, several initial clusters should be tried and compared to find the best results. A question arises, however, how should we define the initial clusters?