10.1 - When Data is Linearly Separable

Let us start with a simple two-class problem when data is clearly linearly separable as shown in the diagram below.

image where data is linearly separable

Let the i-th data point be represented by (\(X_i\), \(y_i\)) where \(X_i\) represents the feature vector and \(y_i\) is the associated class label, taking two possible values +1 or -1. In the diagram above the balls having red color has class label +1 and the blue balls have a class label -1, say. A straight line can be drawn to separate all the members belonging to class +1 from all the members belonging to the class -1. The two-dimensional data above are clearly linearly separable.

In fact, an infinite number of straight lines can be drawn to separate the blue balls from the red balls.

The problem, therefore, is which among the infinite straight lines is optimal, in the sense that it is expected to have minimum classification error on a new observation. The straight line is based on the training sample and is expected to classify one or more test samples correctly.

As an illustration, if we consider the black, red and green lines in the diagram above, is any one of them better than the other two? Or are all three of them equally well suited to classify? How is optimality defined here? Intuitively it is clear that if a line passes too close to any of the points, that line will be more sensitive to small changes in one or more points. The green line is close to a red ball. The red line is close to a blue ball. If the red ball changes its position slightly, it may fall on the other side of the green line. Similarly, if the blue ball changes its position slightly, it may be misclassified. Both the green and red lines are more sensitive to small changes in the observations. The black line on the other hand is less sensitive and less susceptible to model variance.

In an n-dimensional space, a hyperplane is a flat subspace of dimension n – 1. For example, in two dimensions a straight line is a one-dimensional hyperplane, as shown in the diagram. In three dimensions, a hyperplane is a flat two-dimensional subspace, i.e. a plane. Mathematically in n dimensions a separating hyperplane is a linear combination of all dimensions equated to 0; i.e.,

\(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + … + \theta_n x_n = 0\)

The scalar \(\theta_0\) is often referred to as a bias. If \(\theta_0 = 0\), then the hyperplane goes through the origin.

A hyperplane acts as a separator. The points lying on two different sides of the hyperplane will make up two different groups.

Basic idea of support vector machines is to find out the optimal hyperplane for linearly separable patterns. A natural choice of separating hyperplane is optimal margin hyperplane (also known as optimal separating hyperplane) which is farthest from the observations. The perpendicular distance from each observation to a given separating hyperplane is computed. The smallest of all those distances is a measure of how close the hyperplane is to the group of observations. This minimum distance is known as the margin. The operation of the SVM algorithm is based on finding the hyperplane that gives the largest minimum distance to the training examples, i.e. to find the maximum margin. This is known as the maximal margin classifier.

A separating hyperplane in two dimension can be expressed as

\(\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0\)

Hence, any point that lies above the hyperplane, satisfies

\(\theta_0 + \theta_1 x_1 + \theta_2 x_2 > 0\)

and any point that lies below the hyperplane, satisfies

\(\theta_0 + \theta_1 x_1 + \theta_2 x_2 < 0\)

The coefficients or weights \(θ_1\) and \(θ_2\) can be adjusted so that the boundaries of the margin can be written as

\(H_1: \theta_0 + \theta_1 x_{1i} + \theta_2 x_{2i} \ge 1, \text{for} y_i = +1\)

\(H_2: \theta_0 + θ\theta_1 x_{1i} + \theta_2 x_{2i} \le -1, \text{for} y_i = -1\)

This is to ascertain that any observation that falls on or above \(H_1\) belongs to class +1 and any observation that falls on or below \(H_2\), belongs to class -1. Alternatively, we may write

\(y_i (\theta_0 + \theta_1 x_{1i} + \theta_2 x_{2i}) \le \text{for every observation}\)

The boundaries of the margins, \(H_1\) and \(H_2\), are themselves hyperplanes too. The training data that falls exactly on the boundaries of the margin are called the support vectors as they support the maximal margin hyperplane in the sense that if these points are shifted slightly, then the maximal margin hyperplane will also shift.

Note that the maximal margin hyperplane depends directly only on these support vectors.

If any of the other points change, the maximal margin hyperplane does not change until the movement affects the boundary conditions or the support vectors. The support vectors are the most difficult to classify and give the most information regarding classification. Since the support vectors lie on or closest to the decision boundary, they are the most essential or critical data points in the training set.

plot with support vectors

For a general n-dimensional feature space, the defining equation becomes

\(y_i (\theta_0 + \theta_1 x_{2i} + \theta_2 x_{2i} + … + θn x_ni)\ge  1, \text{for every observation}\)

If the vector of the weights is denoted by \(\Theta\) and \(|\Theta|\) is the norm of this vector, then it is easy to see that the size of the maximal margin is \(\dfrac{2}{|\Theta|}\). Finding the maximal margin hyperplanes and support vectors is a problem of convex quadratic optimization. It is important to note that the complexity of SVM is characterized by the number of support vectors, rather than the dimension of the feature space. That is the reason SVM has a comparatively less tendency to overfit. If all data points other than the support vectors are removed from the training data set, and the training algorithm is repeated, the same separating hyperplane would be found. The number of support vectors provides an upper bound to the expected error rate of the SVM classifier, which happens to be independent of data dimensionality. An SVM with a small number of support vectors has good generalization, even when the data has high dimensionality.