1(b).2.1: Measures of Similarity and Dissimilarity

Similarity and Dissimilarity

Distance and similarity measures are essential in solving many pattern recognition problems, such as classification and clustering. Various distance/similarity measures are available in the literature for comparing two data distributions. As the names suggest, a similarity measure quantifies how alike two data objects are, while a dissimilarity measure quantifies how different they are. For multivariate data, more elaborate summary measures have been developed to answer this question.

Similarity Measure
A numerical measure of how alike two data objects are; it often falls between 0 (no similarity) and 1 (complete similarity).
Dissimilarity Measure
A numerical measure of how different two data objects are; it ranges from 0 (objects are alike) to \(\infty\) (objects are completely different).
Proximity
Refers to either a similarity or a dissimilarity.

Similarity/Dissimilarity for Simple Attributes

Here, p and q are the attribute values for two data objects.

Attribute Type Similarity Dissimilarity
Nominal \(s=\begin{cases}
1 & \text{ if } p=q \\
0 & \text{ if } p\neq q
\end{cases}\)
\(d=\begin{cases}
0 & \text{ if } p=q \\
1 & \text{ if } p\neq q
\end{cases}\)
Ordinal

\(s=1-\dfrac{\left \| p-q \right \|}{n-1}\)

(values mapped to integer 0 to n-1, where n is the number of values)

\(d=\dfrac{\left \| p-q \right \|}{n-1}\)
Interval or Ratio \(s=1-\left \| p-q \right \|,  s=\frac{1}{1+\left \| p-q \right \|}\) \(d=\left \| p-q \right \|\)
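
As a minimal sketch (the helper functions below are our own, written only for illustration), these single-attribute measures translate directly into R:

# Similarity/dissimilarity for a single attribute, following the table above
nominal_sim  <- function(p, q) as.numeric(p == q)
ordinal_dis  <- function(p, q, n) abs(p - q) / (n - 1)   # p, q coded as 0 .. n-1
interval_dis <- function(p, q) abs(p - q)

nominal_sim("red", "blue")     # 0
1 - ordinal_dis(0, 2, n = 3)   # similarity of "low" vs "high" on a 3-level scale: 0
interval_dis(35.5, 38.2)       # 2.7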


Distance, such as the Euclidean distance, is a dissimilarity measure and has some well-known properties.

Common Properties of Dissimilarity Measures

  1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 if and only if p = q (positivity),
  2. d(p, q) = d(q, p) for all p and q (symmetry),
  3. d(p, r) ≤ d(p, q) + d(q, r) for all p, q, and r (triangle inequality),

where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.

A distance that satisfies these properties is called a metric. The following is a list of several common distance measures for comparing multivariate data. We will assume that the attributes are all continuous.

 

Euclidean Distance

Assume that we have measurements \(x_{ik}\), \(i = 1 , \ldots , N\), on variables \(k = 1 , \dots , p\) (also called attributes).

The Euclidean distance between the ith and jth objects is

\(d_E(i, j)=\left(\sum_{k=1}^{p}\left(x_{ik}-x_{jk}  \right) ^2\right)^\frac{1}{2}\)

for every pair (i, j) of observations.

The weighted Euclidean distance is:

\(d_{WE}(i, j)=\left(\sum_{k=1}^{p}W_k\left(x_{ik}-x_{jk}  \right) ^2\right)^\frac{1}{2}\)

If scales of the attributes differ substantially, standardization is necessary.
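
As a minimal sketch of these formulas in R (the weight values below are arbitrary, chosen only for illustration):

# Euclidean distance between two objects, directly from the formula
xi <- c(1, 3, 1, 2, 4)
xj <- c(1, 2, 1, 2, 1)
sqrt(sum((xi - xj)^2))       # d_E(i, j) = 3.162

# Weighted Euclidean distance with illustrative weights w_k
w <- c(1, 2, 1, 1, 0.5)
sqrt(sum(w * (xi - xj)^2))

# If attribute scales differ, standardize first, e.g. dist(scale(X))
# computes Euclidean distances on the standardized columns of X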

 

Minkowski Distance

The Minkowski distance is a generalization of the Euclidean distance.

With measurements \(x_{ik}\), \(i = 1, \dots, N\), \(k = 1, \dots, p\), the Minkowski distance between the ith and jth objects is

\(d_M(i, j)=\left(\sum_{k=1}^{p}\left | x_{ik}-x_{jk}  \right | ^ \lambda \right)^\frac{1}{\lambda} \)

where \(\lambda \geq 1\). It is also called the \(L_{\lambda}\) metric.

  • \(\lambda = 1 : L _ { 1 }\) metric, Manhattan or City-block distance.
  • \(\lambda = 2 : L _ { 2 }\) metric, Euclidean distance.
  • \(\lambda \rightarrow \infty : L _ { \infty }\) metric, Supremum distance.

\(\lim_{\lambda \to \infty}\left( \sum_{k=1}^{p}\left | x_{ik}-x_{jk}  \right | ^ \lambda \right) ^\frac{1}{\lambda}  =\max\left( \left | x_{i1}-x_{j1}\right|  , \ldots ,  \left | x_{ip}-x_{jp}\right| \right) \)

Note that \(\lambda\) and p are two different parameters: \(\lambda\) is the order of the metric, while p is the number of attributes, so the dimension of the data matrix stays finite as \(\lambda\) grows.
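
To make the limit concrete, here is a small R sketch (the helper function minkowski() is our own, not from a package), using the points (2, 3) and (10, 7) that also appear in the second Try it! exercise below:

# Minkowski distance of order lambda, straight from the formula
minkowski <- function(xi, xj, lambda) sum(abs(xi - xj)^lambda)^(1 / lambda)

xi <- c(2, 3)
xj <- c(10, 7)

# As lambda grows, the distance approaches the supremum (L_infinity) value
sapply(c(1, 2, 10, 100), function(l) minkowski(xi, xj, l))
# approximately 12, 8.944, 8.0008, 8
max(abs(xi - xj))   # 8

Base R's dist() also supports this family directly: method = "manhattan", "euclidean", "maximum", or "minkowski" (with the order supplied as the argument p, which here means the order of the metric, not the number of attributes).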

 

Mahalanobis Distance

Let X be an N × p matrix. Then the \(i^{th}\) row of X is

\(x_{i}^{T}=\left( x_{i1}, ... , x_{ip} \right)\)

The Mahalanobis distance is

\(d_{MH}(i, j)=\left( \left( x_i - x_j\right)^T \Sigma^{-1} \left( x_i - x_j\right)\right)^\frac{1}{2}\)

where \(\Sigma\) is the \(p \times p\) sample covariance matrix. When \(\Sigma\) is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance.

Try it!

Calculate the answers to these questions yourself and then check them against the solutions below.

Given three objects measured on five attributes:

  • Object 1: (1, 3, 1, 2, 4)
  • Object 2: (1, 2, 1, 2, 1)
  • Object 3: (2, 2, 2, 2, 2)

  1. Calculate the Euclidean distances.
  2. Calculate the Minkowski distances (\(\lambda = 1 \text { and } \lambda \rightarrow \infty\) cases).

Solutions:
  1. Euclidean distances are:

    \(d _ { E } ( 1,2 ) = \left( ( 1 - 1 ) ^ { 2 } + ( 3 - 2 ) ^ { 2 } + ( 1 - 1 ) ^ { 2 } + ( 2 - 2 ) ^ { 2 } + ( 4 - 1 ) ^ { 2 } \right) ^ { 1 / 2 } = 3.162\)

    \(d _ { E } ( 1,3 ) = \left( ( 1 - 2 ) ^ { 2 } + ( 3 - 2 ) ^ { 2 } + ( 1 - 2 ) ^ { 2 } + ( 2 - 2 ) ^ { 2 } + ( 4 - 2 ) ^ { 2 } \right) ^ { 1 / 2 } = 2.646\)

    \(d _ { E } ( 2,3 ) = \left( ( 1 - 2 ) ^ { 2 } + ( 2 - 2 ) ^ { 2 } + ( 1 - 2 ) ^ { 2 } + ( 2 - 2 ) ^ { 2 } + ( 1 - 2 ) ^ { 2 } \right) ^ { 1 / 2 } = 1.732\)

  2. Minkowski distances (when \(\lambda = 1\) ) are:

    \(d _ { M } ( 1,2 ) = | 1 - 1 | + | 3 - 2 | + | 1 - 1 | + | 2 - 2 | + | 4 - 1 | = 4\)

    \(d _ { M } ( 1,3 ) = | 1 - 2 | + | 3 - 2 | + | 1 - 2 | + | 2 - 2 | + | 4 - 2 | = 5\)

    \(d _ { M } ( 2,3 ) = | 1 - 2 | + | 2 - 2 | + | 1 - 2 | + | 2 - 2 | + | 1 - 2 | = 3\)

    Minkowski distances \(( \text { when } \lambda \rightarrow \infty )\) are:

    \(d _ { M } ( 1,2 ) = \max ( | 1 - 1 | , | 3 - 2 | , | 1 - 1 | , | 2 - 2 | , | 4 - 1 | ) = 3\)

    \(d _ { M } ( 1,3 ) = 2 \text { and } d _ { M } ( 2,3 ) = 1\) 
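
These answers can be verified with base R's dist(), which computes all pairwise distances at once:

x <- rbind(c(1, 3, 1, 2, 4),
           c(1, 2, 1, 2, 1),
           c(2, 2, 2, 2, 2))

dist(x)                         # Euclidean: 3.162, 2.646, 1.732
dist(x, method = "manhattan")   # lambda = 1: 4, 5, 3
dist(x, method = "maximum")     # lambda -> infinity: 3, 2, 1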

 

Given three objects measured on two attributes: \(x_1 = (2, 3)\), \(x_2 = (10, 7)\), and \(x_3 = (3, 2)\).

  1. Calculate the Minkowski distance \(( \lambda = 1 , \lambda = 2 , \text { and } \lambda \rightarrow \infty \text { cases) }\) between the first and second objects.
  2. Calculate the Mahalanobis distance between the first and second objects.

Solutions:
  1. Minkowski distances are:

    \(\lambda = 1 : d _ { M } ( 1,2 ) = | 2 - 10 | + | 3 - 7 | = 12\)

    \(\lambda = 2 : d _ { M } ( 1,2 ) = d _ { E } ( 1,2 ) = \left( ( 2 - 10 ) ^ { 2 } + ( 3 - 7 ) ^ { 2 } \right) ^ { 1 / 2 } = 8.944\)

    \(\lambda \rightarrow \infty : d _ { M } ( 1,2 ) = \max ( | 2 - 10 | , | 3 - 7 | ) = 8\)

  2. Since \(\Sigma = \left( \begin{array} { l l } { 19 } & { 11 } \\ { 11 } & { 7 } \end{array} \right)\), we have \(\Sigma ^ { - 1 } = \left( \begin{array} { c c } { 7 / 12 } & { - 11 / 12 } \\ { - 11 / 12 } & { 19 / 12 } \end{array} \right)\). With \(x_1 - x_2 = (-8, -4)^T\), the Mahalanobis distance is:

    \(d _ { M H } ( 1,2 ) = \left( \left( x_1 - x_2 \right)^T \Sigma^{-1} \left( x_1 - x_2 \right) \right)^{1/2} = \sqrt{4} = 2\)

R code for Mahalanobis distance


# Mahalanobis distance calculation
d1 <- c(2, 3)    # object 1
d2 <- c(10, 7)   # object 2
d3 <- c(3, 2)    # object 3

# Sample covariance matrix computed from ALL observations
cov_all <- cov(rbind(d1, d2, d3))
cov_all

# Inverse of the covariance matrix
solve(cov_all)

# Mahalanobis distance between objects 1 and 2
Mahalanobis_dist <- sqrt(t(d1 - d2) %*% solve(cov_all) %*% (d1 - d2))
Mahalanobis_dist
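
Equivalently, base R's mahalanobis() returns the squared distance, so the same value can be obtained with:

sqrt(mahalanobis(d1, center = d2, cov = cov_all))   # 2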

Common Properties of Similarity Measures

Similarities have some well-known properties:

  1. s(p, q) = 1 (or maximum similarity) only if p = q,
  2. s(p, q) = s(q, p) for all p and q,

where s(p, q) is the similarity between data objects p and q.

 

Similarity Between Two Binary Variables

The similarity and distance measures above are appropriate for continuous variables. For binary variables, however, a different approach is necessary. Suppose p and q are two data objects measured on a set of binary attributes, and cross-tabulate how often each combination of values occurs:

        q = 1          q = 0
p = 1   \(n_{1,1}\)    \(n_{1,0}\)
p = 0   \(n_{0,1}\)    \(n_{0,0}\)

Here \(n_{i,j}\) denotes the number of attributes for which p takes the value i and q takes the value j.

 

Simple Matching and Jaccard Coefficients

  • Simple matching coefficient \(= \left( n _ { 1,1 } + n _ { 0,0 } \right) / \left( n _ { 1,1 } + n _ { 1,0 } + n _ { 0,1 } + n _ { 0,0 } \right)\).
  • Jaccard coefficient \(= n _ { 1,1 } / \left( n _ { 1,1 } + n _ { 1,0 } + n _ { 0,1 } \right)\).

Unlike the simple matching coefficient, the Jaccard coefficient ignores 0-0 matches, which makes it appropriate for asymmetric binary attributes where a value of 1 is the rarer, more informative outcome.

Try it!

Calculate the answers to the question and then check them against the solution below.

Given data:

  • p = 1 0 0 0 0 0 0 0 0 0
  • q = 0 0 0 0 0 0 1 0 0 1

The frequency table is:

        q = 1   q = 0
p = 1     0       1
p = 0     2       7

Calculate the Simple matching coefficient and the Jaccard coefficient.

  • Simple matching coefficient = (0 + 7) / (0 + 1 + 2 + 7) = 0.7.
  • Jaccard coefficient = 0 / (0 + 1 + 2) = 0.
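
A short R check of this example, counting each cell of the frequency table directly:

p <- c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
q <- c(0, 0, 0, 0, 0, 0, 1, 0, 0, 1)

n11 <- sum(p == 1 & q == 1)
n10 <- sum(p == 1 & q == 0)
n01 <- sum(p == 0 & q == 1)
n00 <- sum(p == 0 & q == 0)

(n11 + n00) / (n11 + n10 + n01 + n00)   # simple matching coefficient: 0.7
n11 / (n11 + n10 + n01)                 # Jaccard coefficient: 0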
