Lesson 12: Cluster Analysis


Overview

Textbook reading: Section 10.3.

Objectives

Upon successful completion of this lesson, you should be able to:

  • Understand the k-means clustering algorithm and know how to program it.
  • Understand the basic idea of agglomerative clustering and the various schemes of updating between-cluster distances.

12.1 - K-Means


In k-means, let's assume there are M prototypes denoted by

\(Z = \{z_1, z_2, \cdots, z_M\}\)

This set is usually smaller than the original data set. If the data points reside in a p-dimensional Euclidean space, the prototypes reside in the same space; they will also be p-dimensional vectors. They need not be samples from the training data set; however, they should represent the training data set well.

Each training sample is assigned to one of the prototypes. This is denoted by the assignment function \(A(\cdot)\): \(A(x_i) = j\) means the \(i^{th}\) training sample is assigned to the \(j^{th}\) prototype, that is, the \(j^{th}\) prototype is used to approximate or represent point \(x_i\).

In k-means, we need to solve for two unknowns. The first is the set of prototypes; the second is the assignment function.

The Objective Function in K-Means

In K-means, the optimization criterion is to minimize the total squared error between the training samples and their representative prototypes. This is equivalent to minimizing the trace of the pooled within-cluster covariance matrix. The objective function is:

\(\underset{\mathcal{Z}, A}{\text{arg min}}\;\sum_{i=1}^{N}||x_i-z_{A(x_i)}||^2 \)

For every point \(x_i\), first we find which \(z_j\) will be used to represent \(x_i\). As we mentioned above, if \(A(x_i) = j\) then the prototype \(z_j\) will be used. Hence, we compute the distance between \(x_i\) and \(z_{A(x_i)}\). The norm in the above formula denotes the Euclidean distance.

Ideally, we want every data point to be perfectly represented by a prototype.  Since this is generally impossible, we try to minimize the total discrepancy.

Denote the objective function by:

\(L(\mathcal{Z},A)=\sum_{i=1}^{N}||x_i-z_{A(x_i)}||^2\)
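As a concrete illustration, a minimal R sketch (argument names ours) that evaluates this objective for a data matrix x, a prototype matrix z, and an assignment vector A might look like this:

# Total squared error L(Z, A): x is an N x p data matrix, z is an M x p matrix
# of prototypes, and A is an integer vector with A[i] the prototype index of x[i, ].
kmeans_objective = function(x, z, A) {
  sum(rowSums((x - z[A, , drop = FALSE])^2))
}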

From the perspective of data compression, this seems obvious.   But from the clustering perspective, one may ask why we use such an optimization criterion.

Here is the intuition for this...

Evaluating clustering results is inevitably subjective.  For classification, we have the true labels. Therefore, we can compute the error rate to assess the result.  However, in (unsupervised) clustering, we don’t have the class labels given for the training data. 

One heuristic generally accepted is that points in the same cluster should be tight and points in different groups should be as far apart as possible. The k-means algorithm reflects the heuristic by attempting to minimize the total within-cluster distances between each data point and its corresponding prototype.

Necessary Conditions

How do we optimize the objective function of k-means?

What follows are necessary conditions for an optimal solution. Necessary means that if a solution is indeed optimal, then it has to satisfy these conditions, but not vice versa. This means we might have a solution satisfying the conditions that is not optimal. We call such solutions locally optimal, in contrast to globally optimal.

Condition 1: If the prototypes, Z, are fixed, the optimal assignment function \(A(\cdot)\) (using Euclidean distance) should follow the nearest neighbor rule, that is,

\(A(x_i) = \underset{j \in \{1,2,\ldots,M\}}{\text{arg min}} \, || x_i - z_j || \)

This means that among the prototypes we should always find the one closest to the given data point. Intuitively this should be clear because we want to minimize the summed distance between every point and its prototype. We have the freedom to choose a prototype for each data point, and obviously, for a single point, the best prototype to choose is the one closest to it.

Condition 2 : Given a fixed assignment function, \(A(\cdot)\), the prototype, \(z_j\), should be the average (centroid) of all the samples assigned to the \(j^{th}\) prototype:

\(z_j=\frac{\sum_{i:A(x_i)=j}x_i}{N_j}\)

where \(N_j\) is the number of samples assigned to prototype j.

This is also easy to prove. For a given group of data points associated with a single prototype, the total sum of squared distances is a quadratic function in terms of the prototype. This function is convex, and its unique minimum is obtained by setting the first-order derivatives to zero, which leads to the arithmetic mean of the data points.
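In symbols, setting the derivative of the within-group sum of squared distances with respect to \(z_j\) to zero gives

\[\frac{\partial}{\partial z_j}\sum_{i:A(x_i)=j}||x_i-z_j||^2 = -2\sum_{i:A(x_i)=j}(x_i-z_j)=0 \quad \Longrightarrow \quad z_j=\frac{1}{N_j}\sum_{i:A(x_i)=j}x_i\]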

Sometimes we call the assignment function "the partition" and the prototypes "the centroids."


12.2 - The Algorithm


The two necessary conditions on the previous page inspired the k-means algorithm.

It is interesting to note that in data compression for electrical engineers, k-means is better known as the Lloyd algorithm.

The k-means algorithm alternates the two steps:

  • For a fixed set of centroids (prototypes), optimize A(•) by assigning each sample to its closest centroid using Euclidean distance.
  • For a fixed assignment A(•), update each centroid to be the average of all the samples assigned to it.

The algorithm converges since, after each iteration, the objective function is non-increasing.

Based on the necessary conditions, the algorithm improves the prototypes and the assignment function iteratively. It is possible that the steps taken might not lead to any change in either the prototypes or the partition. When either becomes static, the algorithm has converged and there is no need to continue.

How do we decide when to stop?

One criterion for stopping is that the assignment functions in two consecutive iterations are exactly the same. If the assignment function doesn't change anymore, then the prototypes won't change either (and vice versa).

In practice, we often stop when the decrease in the objective function becomes small. We can compute the ratio between the decrease and the value of the objective function in the previous iteration. If the ratio is below a threshold, e.g., 0.0001, the iteration is stopped.
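The two alternating steps and this stopping rule can be written directly in R. The sketch below is a bare-bones illustration (the function name and interface are ours, not a standard library routine); in practice one would use the built-in kmeans function shown later in this lesson.

# A minimal Lloyd-style k-means sketch, for illustration only.
# x: N x p data matrix; z0: M x p matrix of initial prototypes.
simple_kmeans = function(x, z0, tol = 1e-4, max_iter = 100) {
  z = z0
  obj_old = Inf
  for (iter in 1:max_iter) {
    # Step 1: assign each sample to its closest prototype (Euclidean distance).
    d2 = sapply(1:nrow(z), function(j) rowSums(sweep(x, 2, z[j, ])^2))
    A = apply(d2, 1, which.min)
    # Step 2: update each prototype as the mean of the samples assigned to it.
    for (j in unique(A)) {
      z[j, ] = colMeans(x[A == j, , drop = FALSE])
    }
    # Stop when the relative decrease of the objective falls below tol.
    obj = sum(rowSums((x - z[A, , drop = FALSE])^2))
    if (is.finite(obj_old) && (obj_old - obj) / obj_old < tol) break
    obj_old = obj
  }
  list(prototypes = z, assignment = A, objective = obj)
}

Applied to the one-dimensional example on the next page, simple_kmeans(matrix(c(1.2, 5.6, 3.7, 0.6, 0.1, 2.6)), matrix(c(2, 5))) reproduces the hand-computed result.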

So, how do we initiate this procedure? How does it get started?


12.3 - Examples


Example 1

To make things simple, in this example we work with one-dimensional data. Our training data set contains six data points as follows:

Training set: {1.2, 5.6, 3.7, 0.6, 0.1, 2.6}.

Here we want to apply the k-means algorithm to get 2 prototypes, or centroids, \(\{z_1, z_2\}\).

Initialization: I will start by randomly picking \(z_1 = 2\) for the first prototype and \(z_2 = 5\) for the second prototype. Let's see how k-means works ...

In the first step, we have the two prototypes initialized to 2 and 5. Now, with these two prototypes, the nearest neighbor partition of the sample points is performed. For every one of the six points, I compute its distance to 2 and 5, find whichever is closer, and assign the data point to that prototype. For instance, 1.2, 0.6, 0.1 and 2.6 are closer to 2 than to 5. Therefore, they are put together in the first group corresponding to the first prototype. The other data points, 5.6 and 3.7, are closer to 5 than to 2. They are put into the second group, corresponding to the second prototype. The table below records our progress:

Prototype (initial value)   Assigned points        Updated mean
z1 = 2                      1.2, 0.6, 0.1, 2.6     1.125
z2 = 5                      5.6, 3.7               4.65

In the next step, I fix the partition. This means the grouping of the points will not be changed; instead, we compute the mean of the first group and the mean of the second group. We can see that the means are not guaranteed to be the same as the values we started with. For the first group, the mean is 1.125, and for the second group, the mean is 4.65.

Next, I repeat the iteration. We fix the two prototypes 1.125 and 4.65, then update the partition by the nearest neighbor rule. We will check every point to see whether it is closer to the first prototype or the second. This determines whether it goes into the first group or the second.

We find that the same four values are closer to 1.125 and the other two values are closer to 4.65. The assignment from the previous iteration is identical to the one just computed. This means that we no longer need to continue the iteration, because if we did it again, we would get the same result for the prototypes. The algorithm has converged.

The final result is that the two prototypes are: \(z_1 = 1.125\), \(z_2 = 4.65\). The objective function (the Euclidean distance squared added over all the points) is \(L(Z, A) = 5.3125\).
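One quick way to reproduce this result with base R is to call kmeans with the Lloyd algorithm (the variant described in this lesson) and the same starting prototypes:

# Reproduce Example 1: six one-dimensional points, starting prototypes 2 and 5.
x = c(1.2, 5.6, 3.7, 0.6, 0.1, 2.6)
cl = kmeans(matrix(x), centers = matrix(c(2, 5)), algorithm = "Lloyd")
cl$centers        # 1.125 and 4.65
cl$tot.withinss   # 5.3125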

Example 2

Next, we show that starting from a different initialization, k-means can lead to a different result.

This time let's say we randomly pick \(z_1 = 0.8\), \(z_2 = 3.8\) as our prototypes and go through the same procedure. This time the table shows that the data are divided into two groups: 1.2, 0.6, and 0.1 are closer to 0.8; and 5.6, 3.7, and 2.6 are closer to 3.8. Then we compute the mean for each group. This results in 0.633 for the first group and 3.967 for the second group, as shown below.

Prototype (initial value)   Assigned points      Updated mean
z1 = 0.8                    1.2, 0.6, 0.1        0.633
z2 = 3.8                    5.6, 3.7, 2.6        3.967

When the assignment function from the previous iteration is identical to the one just computed, we can stop. The final result this time is: \(z_1 = 0.633\), \(z_2 = 3.967\). The objective function is \(L(Z, A) = 5.2133\). The objective function value obtained in Example 1 was 5.3125. Therefore, this second result is better.

It can be shown that \(\{z_1 = 0.633, z_2 = 3.967\}\) is the globally optimal solution for this example. Unfortunately, there is no way to know at the outset which initialization is going to be better; in practice, we do not have an efficient way to find the global optimum.


12.4 - Initialization


We bypassed this issue earlier. We had an example where we showed that k-means can result in different solutions for different initializations. The problem is that we do not have any theoretical guidance on how to pick the initial prototypes. Many times researchers don't know whether the initialization is a good one until they have completed the k-means algorithm and have the results in hand. Different ways have been proposed to initialize the k-means algorithm.

The simplest scheme is just to randomly pick some data points from the training data set and treat these as the initial prototypes. There are some other more sophisticated ways for initialization. Sometimes, people try different initializations and pick one that results in the minimum value for the objective function.
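Base R's kmeans supports this multi-start strategy directly through its nstart argument, which runs the algorithm from several random initializations and returns the solution with the smallest total within-cluster sum of squares. A short sketch (X stands for a generic data matrix and M for the desired number of clusters; both are placeholders here):

# Run k-means from 25 random starts and keep the run with the smallest
# total within-cluster sum of squares (the k-means objective).
cl = kmeans(X, centers = M, nstart = 25)
cl$tot.withinss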

There are also methods that generate the initial prototypes by some other random mechanism before starting the k-means iteration. For example, one may draw the initial prototypes as random samples from a certain distribution.

Initialization in the above simulation:

  • Generate M random vectors with independent dimensions. For each dimension, the feature is uniformly distributed in [-1, 1].

For instance, suppose the dimension is p and the number of prototypes is M. Randomly pick a point in [-1,1] and do this p times to obtain a p-dimensional vector. If the random number generator generates a value in the range [0,1] you can simply multiply the value by 2 and subtract 1 to achieve a uniform distribution on [-1,1].

  • Linearly transform the jth variable, \(Z_j, j = 1, 2, \cdots, p\), in each prototype (a vector) by \(Z_j s_j + m_j\), where \(s_j\) is the sample standard deviation of dimension j and \(m_j\) is the sample mean of dimension j, both computed using the training data.
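A minimal R sketch of this initialization scheme (the function name is ours; x is assumed to be the n x p training data matrix) could be:

# Initialize M prototypes: draw uniform random vectors on [-1, 1] for each
# dimension, then scale by the sample standard deviation and shift by the
# sample mean of that dimension.  Note runif can generate on [-1, 1] directly.
init_prototypes = function(x, M) {
  p = ncol(x)
  s = apply(x, 2, sd)       # sample standard deviation of each dimension
  m = colMeans(x)           # sample mean of each dimension
  z = matrix(runif(M * p, min = -1, max = 1), nrow = M, ncol = p)
  sweep(sweep(z, 2, s, "*"), 2, m, "+")   # z[, j] * s[j] + m[j]
}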

During a k-means iteration, it may happen that a prototype has no data point assigned to it. We call this an “empty” prototype. When this happens, we can remove this prototype to reduce the number of prototypes by 1. If we don’t want to reduce the number of prototypes, we can further divide points in a non-empty prototype arbitrarily into two groups and compute the means for the two groups. The two means will serve as two prototypes to replace the previous prototype.


12.5 - R Scripts (K-means clustering)



1. Acquire Data

Diabetes data

The diabetes data set is taken from the UCI machine learning database on Kaggle: Pima Indians Diabetes Database

  • 768 samples in the dataset
  • 8 quantitative variables
  • 2 classes; with or without signs of diabetes

Load data into R as follows:

# set the working directory
setwd("C:/STAT 897D data mining")
# comma delimited data and no header for each variable
RawData = read.table("diabetes.data",sep = ",",header=FALSE)

In RawData, the response variable is its last column; and the remaining columns are the predictor variables.

responseY = RawData[,dim(RawData)[2]]
predictorX = RawData[,1:(dim(RawData)[2]-1)]

For the convenience of visualization, we take the first two principal components as the new feature variables and conduct k-means only on these two-dimensional data.

pca = princomp(predictorX, cor=T) # principal components analysis using correlation matrix
pc.comp = pca$scores
pc.comp1 = -1*pc.comp[,1] # principal component 1 scores (negated for convenience)
pc.comp2 = -1*pc.comp[,2] # principal component 2 scores (negated for convenience)

2. K-Means

In R, kmeans performs the K-means clustering analysis; ()$cluster provides the clustering results and ()$centers provides the centroid vector (i.e., the mean) for each cluster.

X = cbind(pc.comp1, pc.comp2)               # data matrix of the two PC scores
cl = kmeans(X, 13)                          # k-means with 13 clusters
cl$cluster                                  # cluster assignment of each sample
plot(pc.comp1, pc.comp2, col=cl$cluster)    # color the points by cluster
points(cl$centers, pch=16)                  # mark the cluster centroids as solid circles

Take k = 13 (as in the lecture notes) as the number of clusters for the K-means analysis. Figure 1 shows the resulting scatter plot with different clusters in different colors. The solid black circles are the centers of the clusters.

Figure 1: K-means clustering result

12.6 - Agglomerative Clustering


Agglomerative clustering can be used as long as we have pairwise distances between any two objects. The mathematical representation of the objects is irrelevant when the pairwise distances are given. Hence agglomerative clustering readily applies for non-vector data.

Let's denote the data set as \(A = \{x_1, \cdots, x_n\}\).

The agglomerative clustering method is also called a bottom-up method as opposed to k-means or k-center methods that are top-down. In a top-down method, a data set is divided into more and more clusters. In a bottom-up approach, all the data points are treated as individual clusters to start with and gradually merged into bigger and bigger clusters.

In agglomerative clustering, clusters are generated hierarchically. We start by taking every data point as a cluster. Then we merge two clusters at a time. In order to decide which two clusters to merge, we compare the pairwise distances between any two clusters and pick a pair with the minimum distance. Once we merge two clusters into a bigger one, a new cluster is created. The distances between this new cluster and the existing ones are not given. Some scheme has to be used to obtain these distances based on the two merged clusters. We call this the update of distances. Various schemes of updating distances will be described shortly.

We can keep merging clusters until all the points are merged into one cluster. A tree can be used to visualize the merging process. This tree is called a dendrogram.

The key technical detail here is how we define the between-cluster distance. At the bottom level of the dendrogram where every cluster only contains a single point, the between-cluster distance is simply the Euclidean distance between the points. (In some applications, the between-point distances are pre-given.) However, once we merge two points into a cluster, how do we get the distance between the new cluster and the existing clusters?

The idea is to somehow aggregate the distances between the objects contained in the clusters.

  • For clusters containing only one data point, the between-cluster distance is the between-object distance.
  • For clusters containing multiple data points, the between-cluster distance is an agglomerative version of the between-object distances. There are a number of ways to accomplish this. Examples of these aggregated distances include the minimum or maximum between-object distances for pairs of objects across the two clusters.

For example, suppose we have two clusters and each one has three points. One thing we could do is to find distances between these points. In this case, there would be nine distances between pairs of points across the two clusters. The minimum (or maximum, or average) of the nine distances can be used as the distance between the two clusters.
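To make this concrete, here is a small R illustration with two made-up clusters of three two-dimensional points each:

# Two made-up clusters of three points each.
cluster1 = matrix(c(0, 0,  1, 0,  0, 1), ncol = 2, byrow = TRUE)
cluster2 = matrix(c(4, 4,  5, 4,  4, 5), ncol = 2, byrow = TRUE)
# The 3 x 3 matrix of the nine between-object distances across the two clusters.
d = as.matrix(dist(rbind(cluster1, cluster2)))[1:3, 4:6]
min(d)    # single-link (minimum) between-cluster distance
max(d)    # complete-link (maximum) between-cluster distance
mean(d)   # average between-cluster distance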


How do we decide which aggregation scheme to use? Depending on how we update the distances, dramatically different results may come up. Therefore, it is always good practice to look at the results using scatterplots or other visualization methods instead of blindly taking the output of any algorithm. Clustering is inevitably subjective since there is no gold standard.

The aggregation as explained above sounds computationally intensive and seemingly impractical: if we have two very large clusters, we would have to check the between-object distances for all pairs across the two clusters. The good news is that many of these aggregated distances can be updated recursively without checking all pairs of objects. The update approach is described below.

Example Distances

Let's see how we would update between-cluster distances. Suppose we have two clusters r and s and these two clusters, not necessarily single point clusters, are merged into a new cluster t. Let k be any other existing cluster. We want the distance between the new cluster, t, and the existing cluster, k.

We will get this distance based on the distance between k and the two-component clusters, r and s.

Because r and s already existed, the distance between r and k and the distance between s and k have already been computed. Denote these distances by \(D(r, k)\) and \(D(s, k)\).

We list below various ways to get \(D(t, k)\) from \(D(r, k)\) and \(D(s, k)\).

Single-link clustering:

\(D(t, k) = \min(D(r, k), D(s, k))\)

\(D(t, k)\) is the minimum distance between two objects in cluster t and k respectively. It can be shown that the above way of updating distance is equivalent to defining the between-cluster distance as the minimum distance between two objects across the two clusters.

 

Complete-link clustering:

\(D(t, k) = \max(D(r, k), D(s, k))\)

Here, \(D(t, k)\) is the maximum distance between two objects in cluster t and k.

 

Average linkage clustering:

There are two cases here, the unweighted case and the weighted case.

Unweighted case:

\[D(t,k) =\frac{n_r}{n_r + n_s}D(r,k) +\frac{n_s}{n_r + n_s}D(s,k)\]

Here we need to use the number of points in cluster r and the number of points in cluster s (the two clusters that are being merged together into a bigger cluster), and compute the percentage of points in the two component clusters with respect to the merged cluster. The two distances, \(D(r, k)\) and \(D(s, k)\), are aggregated by a  weighted sum.

Weighted case:

\[D(t,k)=\frac{1}{2}D(r,k) +\frac{1}{2}D(s,k)\]

 

Instead of using the weight proportional to the cluster size, we use the arithmetic mean. While this might look more like an unweighted case, it is actually weighted in terms of the contribution from individual points in the two clusters.  When the two clusters are weighted half and half, any point (i.e., object) in the smaller cluster individually contributes more to the aggregated distance than a point in the larger cluster.  In contrast, if the larger cluster is given proportionally higher weight, any point in either cluster contributes equally to the aggregated distance.

Centroid clustering:

A centroid is computed for each cluster and the distance between clusters is defined as the distance between their respective centroids.

Unweighted case:

\[D(t,k) =\frac{n_r}{n_r + n_s}D(r,k) +\frac{n_s}{n_r + n_s}D(s,k) -\frac{n_r n_s}{(n_r + n_s)^2}D(r,s)\]

Weighted case:

\[D(t,k)=\frac{1}{2}D(r,k) +\frac{1}{2}D(s,k) - \frac{1}{4}D(r,s)\]

 

Ward's clustering:

We update the distance using the following formula:

\[D(t,k) =\frac{n_r + n_k}{n_r + n_s+ n_k}D(r,k) +\frac{n_s+ n_k}{n_r + n_s+ n_k}D(s,k) -\frac{n_k}{n_r + n_s+ n_k}D(r,s)\]

This approach attempts to merge the two clusters for which the change in the total variation is minimized. The total variation of a clustering result is defined as the sum of squared-errors between every object and the centroid of the cluster it belongs to.
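The update rules above follow a common pattern and can be collected into one small helper. The R sketch below (function and argument names ours) covers the single-link, complete-link, unweighted average, and Ward updates:

# Recursive update of D(t, k) when clusters r and s, of sizes nr and ns,
# are merged into t; nk is the size of the existing cluster k.
update_distance = function(Drk, Dsk, Drs, nr, ns, nk,
                           method = c("single", "complete", "average", "ward")) {
  method = match.arg(method)
  switch(method,
    single   = min(Drk, Dsk),
    complete = max(Drk, Dsk),
    average  = (nr * Drk + ns * Dsk) / (nr + ns),
    ward     = ((nr + nk) * Drk + (ns + nk) * Dsk - nk * Drs) / (nr + ns + nk))
}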

The dendrogram generated by single-linkage clustering tends to look like a chain. Clusters generated by complete-linkage may not be well separated. Other methods are often intermediate between the two.


12.7 - Pseudo Code


Pseudocode Steps:

  1. Begin with n clusters, each containing one object, and number the clusters 1 through n.
  2. Compute the between-cluster distance D(r, s) as the between-object distance of the two objects in r and s respectively, for r, s = 1, 2, ..., n. Let D = (D(r, s)) denote the resulting square matrix of distances. If the objects are represented by quantitative vectors, we can use Euclidean distance.
  3. Next, find the most similar pair of clusters r and s, such that the distance D(r, s) is minimum among all the pairwise distances.
  4. Merge r and s into a new cluster t and compute the between-cluster distance D(t, k) for any existing cluster k ≠ r, s. Once the distances are obtained, delete the rows and columns corresponding to the old clusters r and s in the matrix D, because r and s no longer exist. Then add a new row and column in D corresponding to cluster t.
  5. Repeat Steps 3 and 4 a total of n − 1 times until there is only one cluster left.
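In practice, base R's hclust function carries out this procedure given a distance matrix. A minimal sketch, reusing the six one-dimensional values from the k-means examples:

# Agglomerative clustering of six one-dimensional points.
x = c(1.2, 5.6, 3.7, 0.6, 0.1, 2.6)      # the objects
D = dist(x)                              # between-object distances (Step 2)
hc = hclust(D, method = "single")        # repeated merging (Steps 3 and 4)
plot(hc)                                 # dendrogram of the merge sequence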

Example

Here is an example to show you how this algorithm works. In this case, we have 11 points consisting of single numbers. Every point is assigned a label. Visually this is what these different approaches might look like…

[Figure: clustering results for the 11-point example under the different linkage methods]

Another Example

Agglomerative clustering of a data set containing 100 points into 9 clusters.

With single linkage, below, the result is often not appealing. For instance, the purple square in the lower left area is a cluster containing a single point, and there are other clusters comprising single points as well. Then all of the red circle points are grouped into one big cluster. Why does this happen?

[Figure: single linkage clustering result on the 100-point data set]

Single linkage is also sometimes called 'friends of friends' linkage method. The single linkage method always picks the minimum distance when it updates. Therefore, two points might be considered close under the single linkage scheme if they can be connected by a chain of points with small distances between any two consecutive points down the chain. The distance between the two endpoints of the chain is only as big as the longest link along the chain. The number of links along the chain does not matter.

Complete linkage, below, gives us a very different result. It tends to generate clusters that are not very well separated.

[Figure: complete linkage clustering result on the 100-point data set]

Average linkage results:

[Figure: average linkage clustering result on the 100-point data set]

Ward's clustering tends to generate results similar to k-means. It is, in a sense, a greedy, bottom-up version of k-means, because the criterion used for picking which clusters to merge in Ward's clustering is the same as the optimization criterion of k-means. K-means operates in a top-down fashion, whereas Ward's clustering operates bottom-up.


12.8 - R Scripts (Agglomerative Clustering)



1. Acquire Data

Diabetes data

The diabetes data set is taken from the UCI machine learning database on Kaggle: Pima Indians Diabetes Database

  • 768 samples in the dataset
  • 8 quantitative variables
  • 2 classes; with or without signs of diabetes

Load data into R as follows:

# set the working directory
setwd("C:/STAT 897D data mining")
# comma-delimited data and no header for each variable
RawData = read.table("diabetes.data",sep = ",",header=FALSE)

In RawData, the response variable is its last column; and the remaining columns are the predictor variables.

responseY = RawData[,dim(RawData)[2]]
predictorX = RawData[,1:(dim(RawData)[2]-1)]

2. Agglomerative Clustering

In R, library cluster implements hierarchical clustering using the agglomerative nesting algorithm (agnes). The first argument x in agnes specifies the input data matrix or the dissimilarity matrix, depending on the value of the diss argument. If diss=TRUE, x is assumed to be a dissimilarity matrix. If diss=FALSE, x is treated as a matrix of observations. The argument stand = TRUE indicates that the data matrix is standardized before calculating the dissimilarities.

Each variable (a column in the data matrix) is standardized by first subtracting the mean value of the variable and then dividing the result by the mean absolute deviation of the variable. If x is already a dissimilarity matrix, this argument will be ignored.

To merge two clusters into a new cluster, the argument method specifies the measurement of between-cluster distance. method="single" is for single linkage clustering, method="complete" for complete linkage clustering, and method="average" for average linkage clustering. The default is method="average".

For clarity of illustration, we use only the first 25 observations to run the agglomerative nesting algorithm (agnes). The function as.dendrogram generates a dendrogram using as input the agglomerative clustering result obtained by agnes.

library(cluster)
agn = agnes(x=predictorX[1:25,], diss = FALSE, stand = TRUE,
      method = "average")
DendAgn =as.dendrogram(agn)
plot(DendAgn)

Figure 1 shows the clustering result by average linkage clustering.

Figure 1: Result of average linkage clustering. The numbers at the bottom are the indices for the data points, ranging from 1 to 25.

Figure 2 shows the clustering result by single linkage, produced by the code below.

agn = agnes(x=predictorX[1:25,], diss = FALSE, stand = TRUE,
      method = "single")
DendAgn =as.dendrogram(agn)
plot(DendAgn)
Figure 2: Result of single linkage clustering

Figure 3 shows the result by complete linkage, produced by the code below.

agn = agnes(x=predictorX[1:25,], diss = FALSE, stand = TRUE,
      method = "complete")
DendAgn =as.dendrogram(agn)
plot(DendAgn)
Figure 3: Result of complete linkage clustering
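If a flat partition into a chosen number of clusters is needed, the agnes result can be converted to an hclust object and cut with cutree. A short sketch (the choice of 3 clusters is arbitrary here):

# Convert the agnes tree to an hclust object and cut it into 3 flat clusters.
hc = as.hclust(agn)
groups = cutree(hc, k = 3)
table(groups)   # number of observations in each cluster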
