# 1(b).2.1: Measures of Similarity and Dissimilarity

### Similarity and Dissimilarity

Distance or similarity measures are essential in solving many pattern recognition problems such as classification and clustering. Various distance/similarity measures are available in the literature to compare two data distributions. As the name suggests, a similarity measure quantifies how alike two distributions are. For multivariate data, more complex summary methods have been developed to answer this question.

**Similarity Measure**

- Numerical measure of how alike two data objects are.
- Often falls between 0 (no similarity) and 1 (complete similarity).

**Dissimilarity Measure**

- Numerical measure of how different two data objects are.
- Range from 0 (objects are alike) to ∞ (objects are different).

**Proximity** refers to either a similarity or a dissimilarity.

#### Similarity/Dissimilarity for Simple Attributes

Here, *p* and *q* are the attribute values for two data objects.

| Attribute Type | Similarity | Dissimilarity |
| --- | --- | --- |
| Nominal | \(s=\begin{cases} 1 & \text{ if } p=q \\ 0 & \text{ if } p\neq q \end{cases}\) | \(d=\begin{cases} 0 & \text{ if } p=q \\ 1 & \text{ if } p\neq q \end{cases}\) |
| Ordinal | \(s=1-\frac{\lvert p-q \rvert}{n-1}\) (values mapped to integers 0 to \(n-1\), where \(n\) is the number of values) | \(d=\frac{\lvert p-q \rvert}{n-1}\) |
| Interval or Ratio | \(s=1-\lvert p-q \rvert\) or \(s=\frac{1}{1+\lvert p-q \rvert}\) | \(d=\lvert p-q \rvert\) |
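As a minimal sketch, the per-attribute formulas in the table above can be written directly in Python (the function names and example values here are illustrative):

```python
def nominal_sim(p, q):
    """Nominal: 1 if the values match, 0 otherwise."""
    return 1 if p == q else 0

def ordinal_sim(p, q, n):
    """Ordinal: values assumed mapped to integers 0..n-1."""
    return 1 - abs(p - q) / (n - 1)

def interval_sim(p, q):
    """Interval/ratio: the bounded form s = 1 / (1 + |p - q|)."""
    return 1 / (1 + abs(p - q))

print(nominal_sim("red", "blue"))  # 0
print(ordinal_sim(0, 2, 3))        # 0.0 (maximally far apart among 3 levels)
print(interval_sim(5, 5))          # 1.0 (identical values)
```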

### Common Properties of Dissimilarity Measures

**Distance**, such as the Euclidean distance, is a dissimilarity measure and has some well known properties:

- *d*(*p*, *q*) ≥ 0 for all *p* and *q*, and *d*(*p*, *q*) = 0 if and only if *p* = *q*,
- *d*(*p*, *q*) = *d*(*q*, *p*) for all *p* and *q*,
- *d*(*p*, *r*) ≤ *d*(*p*, *q*) + *d*(*q*, *r*) for all *p*, *q*, and *r*,

where *d*(*p*, *q*) is the distance (dissimilarity) between points (data objects) *p* and *q*.

A distance that satisfies these properties is called a **metric**. Following is a list of several common distance measures used to compare multivariate data. We will assume that the attributes are all continuous.

#### Euclidean Distance

Assume that we have measurements \(x_{ik}\), *i* = 1, … , *N*, on variables *k* = 1, … , *p* (also called attributes).

The Euclidean distance between the *i*th and *j*th objects is

\[d_E(i, j)=\left(\sum_{k=1}^{p}\left(x_{ik}-x_{jk} \right) ^2\right)^\frac{1}{2}\]

for every pair (i, j) of observations.

The weighted Euclidean distance is

\[d_{WE}(i, j)=\left(\sum_{k=1}^{p}W_k\left(x_{ik}-x_{jk} \right) ^2\right)^\frac{1}{2}\]

If scales of the attributes differ substantially, standardization is necessary.
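The two formulas above can be sketched directly with NumPy (the data points and weights here are illustrative):

```python
import numpy as np

x_i = np.array([1.0, 2.0, 3.0])
x_j = np.array([4.0, 6.0, 3.0])

# Plain Euclidean distance: square root of the sum of squared differences.
d_E = np.sqrt(np.sum((x_i - x_j) ** 2))

# Weighted Euclidean distance with attribute weights W_k.
w = np.array([1.0, 1.0, 2.0])
d_WE = np.sqrt(np.sum(w * (x_i - x_j) ** 2))

print(d_E)   # 5.0
print(d_WE)  # 5.0 (the third coordinates are equal, so its weight has no effect)
```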

#### Minkowski Distance

The Minkowski distance is a generalization of the Euclidean distance.

With the measurements \(x_{ik}\), *i* = 1, … , *N*, *k* = 1, … , *p*, the Minkowski distance is

\[d_M(i, j)=\left(\sum_{k=1}^{p}\left | x_{ik}-x_{jk} \right | ^ \lambda \right)^\frac{1}{\lambda}, \]

where λ ≥ 1. It is also called the \(L_\lambda\) metric.

- λ = 1 : \(L_1\) metric, Manhattan or City-block distance.
- λ = 2 : \(L_2\) metric, Euclidean distance.
- λ → ∞ : \(L_\infty\) metric, Supremum distance.

\[ \lim_{\lambda \to \infty} d_M(i, j)=\lim_{\lambda \to \infty}\left( \sum_{k=1}^{p}\left | x_{ik}-x_{jk} \right | ^ \lambda \right) ^\frac{1}{\lambda} =\max\left( \left | x_{i1}-x_{j1}\right| , \ldots , \left | x_{ip}-x_{jp}\right| \right) \]

Note that λ and p are two different parameters: p is the dimension (number of attributes) of the data matrix, which remains finite, while λ is the order of the metric.
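The three special cases above can be sketched in a few lines of NumPy (the two data points are illustrative):

```python
import numpy as np

x_i = np.array([1.0, 2.0])
x_j = np.array([4.0, 6.0])
diff = np.abs(x_i - x_j)  # componentwise |x_ik - x_jk|, here [3, 4]

d_L1 = np.sum(diff)                 # λ = 1: Manhattan / City-block
d_L2 = np.sqrt(np.sum(diff ** 2))   # λ = 2: Euclidean
d_Linf = np.max(diff)               # λ → ∞: supremum

print(d_L1, d_L2, d_Linf)  # 7.0 5.0 4.0
```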

#### Mahalanobis Distance

Let **X** be an N × p matrix. Then the *i*th row of **X** is

\[x_{i}^{T}=\left( x_{i1}, ... , x_{ip} \right)\]

The Mahalanobis distance is

\[d_{MH}(i, j)=\left( \left( x_i - x_j\right)^T \Sigma^{-1} \left( x_i - x_j\right)\right)^\frac{1}{2}\]

where Σ is the p × p sample covariance matrix.
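A minimal NumPy sketch of the formula above, using an illustrative 5 × 2 data matrix: the sample covariance is estimated from the data, inverted, and used in the quadratic form.

```python
import numpy as np

# Illustrative N x p data matrix (N = 5 objects, p = 2 attributes).
X = np.array([[2.0, 2.0],
              [6.0, 5.0],
              [7.0, 3.0],
              [4.0, 7.0],
              [6.0, 4.0]])

S = np.cov(X, rowvar=False)   # p x p sample covariance matrix
S_inv = np.linalg.inv(S)

# Mahalanobis distance between the first and second objects.
diff = X[0] - X[1]
d_MH = np.sqrt(diff @ S_inv @ diff)
print(d_MH)
```

Unlike the Euclidean distance, this accounts for the scale of and correlation between the attributes.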

### Self-check

#### Think About It!

*Calculate the answers to these questions by yourself before checking them.*

1. We have \(X= \begin{pmatrix} 1 & 3 & 1 & 2 & 4\\ 1 & 2 & 1 & 2 & 1\\ 2 & 2 & 2 & 2 & 2 \end{pmatrix}\).

- Calculate the Euclidean distances.
- Calculate the Minkowski distances (λ=1 and λ→∞ cases).

2. We have \(X= \begin{pmatrix} 2 & 3 \\ 10 & 7 \\ 3 & 2 \end{pmatrix}\).

- Calculate the Minkowski distance (λ = 1, λ = 2, and λ → ∞ cases) between the first and second objects.
- Calculate the Mahalanobis distance between the first and second objects.
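Once you have worked the answers by hand, a short NumPy sketch can check the pairwise distances; the matrix below is the one from the first question, and the `minkowski` helper is illustrative.

```python
import numpy as np

X1 = np.array([[1, 3, 1, 2, 4],
               [1, 2, 1, 2, 1],
               [2, 2, 2, 2, 2]], dtype=float)

def minkowski(a, b, lam):
    """Minkowski distance of order lam between vectors a and b."""
    return np.sum(np.abs(a - b) ** lam) ** (1 / lam)

for i in range(3):
    for j in range(i + 1, 3):
        print(i, j,
              minkowski(X1[i], X1[j], 2),        # Euclidean (λ = 2)
              minkowski(X1[i], X1[j], 1),        # Manhattan (λ = 1)
              np.max(np.abs(X1[i] - X1[j])))     # supremum (λ → ∞)
```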

### Common Properties of Similarity Measures

Similarities have some well known properties:

- *s*(*p*, *q*) = 1 (or maximum similarity) only if *p* = *q*,
- *s*(*p*, *q*) = *s*(*q*, *p*) for all *p* and *q*,

where *s*(*p*, *q*) is the similarity between data objects *p* and *q*.

#### Similarity Between Two Binary Variables

The above similarity or distance measures are appropriate for continuous variables. However, for binary variables a different approach is necessary.

**Simple Matching and Jaccard Coefficients**

Let \(n_{a,b}\) be the number of attributes for which the first object takes value *a* and the second takes value *b*.

- Simple matching coefficient = \( (n_{1,1}+n_{0,0}) / (n_{1,1}+n_{1,0}+n_{0,1}+n_{0,0}) \).
- Jaccard coefficient = \( n_{1,1} / (n_{1,1}+n_{1,0}+n_{0,1}) \).
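Both coefficients can be sketched from the four counts, as below (the vectors and helper names are illustrative):

```python
def binary_counts(p, q):
    """Counts n11, n10, n01, n00 over the attribute positions."""
    n11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    return n11, n10, n01, n00

def smc(p, q):
    """Simple matching coefficient: matches over all attributes."""
    n11, n10, n01, n00 = binary_counts(p, q)
    return (n11 + n00) / (n11 + n10 + n01 + n00)

def jaccard(p, q):
    """Jaccard coefficient: 0-0 matches are ignored."""
    n11, n10, n01, _ = binary_counts(p, q)
    return n11 / (n11 + n10 + n01)

p = [1, 0, 0, 0, 0, 0, 1, 0, 0, 1]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc(p, q))      # 0.9 (9 of 10 positions match)
print(jaccard(p, q))  # 0.666... (two 1-1 matches among three positions with a 1)
```

Note how the Jaccard coefficient penalizes the one mismatch much more heavily, because the many shared 0s are ignored.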