1(b).2 - Numerical Summarization

Summary Statistics

A large volume of values on a large number of variables needs to be properly organized before information can be extracted from it. Broadly speaking, there are two ways to summarize data: visual summarization and numerical summarization. Both have their advantages and disadvantages, and applied jointly they extract the maximum information from the raw data.

Summary statistics are numbers computed from the sample that present a summary of the attributes.

Measures of Location

Measures of location are single numbers that represent a set of observations. They include measures of central tendency, which can be regarded as the most representative values of the set of observations. The most common measures of location are the Mean, the Median, the Mode, and the Quartiles.

Mean
the arithmetic average of all the observations. The mean equals the sum of all observations divided by the sample size
Median
the middle-most value of the ranked set of observations, so that half the observations are greater than the median and the other half are less. The median is a robust measure of central tendency
Mode
the most frequently occurring value in the data set. This makes more sense when attributes are not continuous
Quartiles
division points which split data into four equal parts after rank-ordering them.
Division points are called Q1 (the first quartile), Q2 (the second quartile or median), and Q3 (the third quartile)

Similarly, Deciles and Percentiles are defined as division points that divide the rank-ordered data into 10 and 100 equal segments. 

Measures of Spread

Measures of location are not enough to capture all aspects of the attributes. Measures of dispersion are necessary to understand the variability of the data. The most common measures of dispersion are the Variance, the Standard Deviation, the Interquartile Range, and the Range.

Variance
measures how far data values lie from the mean. It is defined as the average of the squared differences between the mean and the individual data values
Standard Deviation
is the square root of the variance. It can be interpreted as a typical distance between the individual data values and the mean
Interquartile range (IQR)
is the difference between Q3 and Q1. IQR contains the middle 50% of data
Range
is the difference between the maximum and minimum values in the sample

Measures of Skewness

In addition to the measures of location and dispersion, the arrangement of the data, or the shape of the data distribution, is also of considerable interest. The most 'well-behaved' distribution is a symmetric distribution, where the mean and the median coincide. The symmetry is lost if there exists a tail in either direction. Skewness measures the direction and extent of such a long tail.

Skewness is measured as:

\( \dfrac{\sqrt{n} \left( \Sigma \left(x_{i} - \bar{x} \right)^{3} \right)}{\left(\Sigma \left(x_{i} - \bar{x} \right)^{2}\right)^{\frac{3}{2}}} \)
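
As a quick check, this formula can be computed directly in base R. The sketch below is minimal; skew_stat is just an illustrative name, not a built-in R function.

# Sample skewness: sqrt(n) * sum((x - xbar)^3) / (sum((x - xbar)^2))^(3/2)
skew_stat = function(x) {
  d = x - mean(x)
  sqrt(length(x)) * sum(d^3) / sum(d^2)^(3/2)
}

skew_stat(c(1, 2, 2, 3, 10))   # positive for this right-skewed made-up sample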

The figure below gives examples of symmetric and skewed distributions. Note that these diagrams are generated from theoretical distributions and in practice one is likely to see only approximations.

example of a symmetric distribution

example of a right-skewed distribution

example of a left-skewed distribution

Try it!

Calculate the answers to these questions, then compare them with the answers given below.

Suppose we have the data: 3, 5, 6, 9, 0, 10, 1, 3, 7, 4, 8. Calculate the following summary statistics:
  • Mean
  • Median
  • Mode
  • Q1 and Q3
  • Variance and Standard Deviation
  • IQR
  • Range
  • Skewness
  • Mean: (3+5+6+9+0+10+1+3+7+4+8)/11= 5.091.
  • Median: The ordered data is 0, 1, 3, 3, 4, 5, 6, 7, 8, 9, 10. Thus, 5 is the median.
  • Mode: 3.
  • Q1 and Q3: Q1 is 3 and Q3 is 8.
  • Variance and Standard Deviation: Variance is 10.491 \(\left(= \left((3-5.091)^2 + \cdots + (8-5.091)^2\right)/10\right)\). Thus, the standard deviation is the square root of 10.491, i.e. 3.239.
  • IQR: Q3-Q1=8-3=5.
  • Range: max-min=10-0=10.
  • Skewness: -0.03.
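
These hand computations can be reproduced in R. The sketch below is minimal; note that quantile() implements several quartile conventions and type = 6 matches the rule used for Q1 and Q3 above, and that R has no built-in mode function, so the mode is read off a frequency table.

# Summary statistics for the example data
x = c(3, 5, 6, 9, 0, 10, 1, 3, 7, 4, 8)

mean(x)                                      # 5.091
median(x)                                    # 5
names(which.max(table(x)))                   # mode: "3"
quantile(x, c(0.25, 0.75), type = 6)         # Q1 = 3, Q3 = 8
var(x)                                       # 10.491 (denominator n - 1)
sd(x)                                        # 3.239
diff(quantile(x, c(0.25, 0.75), type = 6))   # IQR = 5
diff(range(x))                               # range = 10
d = x - mean(x)
sqrt(length(x)) * sum(d^3) / sum(d^2)^(3/2)  # skewness = -0.03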

Measures of Correlation

All the above summary statistics are applicable only for univariate data where information on a single attribute is of interest. Correlation describes the degree of the linear relationship between two attributes, X and Y.

With X taking the values x(1), … , x(n) and Y taking the values y(1), … , y(n), the sample correlation coefficient is defined as:

\(\rho (X,Y)=\dfrac{\sum_{i=1}^{n}\left ( x(i)-\bar{x} \right )\left ( y(i)-\bar{y} \right )}{\left( \sum_{i=1}^{n}\left ( x(i)-\bar{x} \right )^2\sum_{i=1}^{n}\left ( y(i)-\bar{y} \right )^2\right)^\frac{1}{2}}\)

The correlation coefficient is always between -1 (perfect negative linear relationship) and +1 (perfect positive linear relationship). If the correlation coefficient is 0, then there is no linear relationship between X and Y.
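
In R, the sample correlation coefficient is computed by cor(). The sketch below uses made-up data and also evaluates the defining formula directly.

# Correlation between two attributes (made-up data)
x = c(1, 2, 3, 4, 5)
y = c(2, 4, 5, 4, 6)

cor(x, y)   # Pearson correlation, always between -1 and +1

# Same value from the definition above
sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))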

In the figure below, a set of representative plots is shown for various values of the population correlation coefficient ρ, ranging from -1 to +1. At the two extreme values the relationship is a perfectly straight line. As the value of ρ approaches 0, the elliptical cloud of points becomes round, and then it elongates again into an ellipse whose principal axis points in the opposite direction.

example correlation coefficients

Try It!

Try the applet "CorrelationPicture" and "CorrelationPoints" from the University of Colorado at Boulder.

Try the applet "Guess the Correlation" from the Rossman/Chance Applet Collection.


1(b).2.1: Measures of Similarity and Dissimilarity

Similarity and Dissimilarity

Distance or similarity measures are essential in solving many pattern recognition problems such as classification and clustering. Various distance/similarity measures are available in the literature to compare two data distributions. As the names suggest, a similarity measure quantifies how close two objects are, while a dissimilarity measure quantifies how different they are. For multivariate data, more elaborate summary measures have been developed to answer this question.

Similarity Measure
Numerical measure of how alike two data objects are; it often falls between 0 (no similarity) and 1 (complete similarity)
Dissimilarity Measure
Numerical measure of how different two data objects are; it typically ranges from 0 (objects are alike) to \(\infty\) (objects are completely different)
Proximity
refers to either a similarity or a dissimilarity

Similarity/Dissimilarity for Simple Attributes

Here, p and q are the attribute values for two data objects.

Nominal
Similarity: \(s=\begin{cases} 1 & \text{ if } p=q \\ 0 & \text{ if } p\neq q \end{cases}\)
Dissimilarity: \(d=\begin{cases} 0 & \text{ if } p=q \\ 1 & \text{ if } p\neq q \end{cases}\)

Ordinal
Similarity: \(s=1-\dfrac{\left| p-q \right|}{n-1}\)  (values mapped to integers 0 to n-1, where n is the number of values)
Dissimilarity: \(d=\dfrac{\left| p-q \right|}{n-1}\)

Interval or Ratio
Similarity: \(s=1-\left| p-q \right|\) or \(s=\dfrac{1}{1+\left| p-q \right|}\)
Dissimilarity: \(d=\left| p-q \right|\)
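
As a small illustration of the ordinal case, the sketch below assumes an ordinal attribute with n = 5 possible values that have been mapped to the integers 0 through 4.

# Ordinal similarity/dissimilarity with n = 5 levels mapped to 0..4
n = 5
p = 1   # e.g. second-lowest level
q = 3   # e.g. second-highest level

d = abs(p - q) / (n - 1)   # dissimilarity = 0.5
s = 1 - d                  # similarity   = 0.5
c(similarity = s, dissimilarity = d)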


Common Properties of Dissimilarity Measures

Distance, such as the Euclidean distance, is a dissimilarity measure and has some well-known properties:

  1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 if and only if p = q,
  2. d(p, q) = d(q,p) for all p and q,
  3. d(p, r) ≤ d(p, q) + d(q, r) for all p, q, and r, where d(p, q) is the distance (dissimilarity) between points (data objects), p and q.

A distance that satisfies these properties is called a metric. Following is a list of several common distance measures to compare multivariate data. We will assume that the attributes are all continuous.

 

Euclidean Distance

Assume that we have measurements \(x_{ik}\), \(i = 1 , \ldots , N\), on variables \(k = 1 , \dots , p\) (also called attributes).

The Euclidean distance between the ith and jth objects is

\(d_E(i, j)=\left(\sum_{k=1}^{p}\left(x_{ik}-x_{jk}  \right) ^2\right)^\frac{1}{2}\)

for every pair (i, j) of observations.

The weighted Euclidean distance is:

\(d_{WE}(i, j)=\left(\sum_{k=1}^{p}W_k\left(x_{ik}-x_{jk}  \right) ^2\right)^\frac{1}{2}\)

If scales of the attributes differ substantially, standardization is necessary.
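
In R, pairwise Euclidean distances can be obtained with dist(), and scale() standardizes the attributes when their scales differ. The sketch below uses made-up data in which the second attribute has a much larger scale than the first.

# Euclidean distances between rows of a data matrix (made-up data)
X = rbind(c(180, 7500),
          c(160, 6000),
          c(170, 6800))

dist(X)          # raw Euclidean distances, dominated by attribute 2
dist(scale(X))   # distances after standardizing each attribute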

 

Minkowski Distance

The Minkowski distance is a generalization of the Euclidean distance.

With measurements \(x _ { i k }\), \(i = 1 , \dots , N\), \(k = 1 , \dots , p\), the Minkowski distance is

\(d_M(i, j)=\left(\sum_{k=1}^{p}\left | x_{ik}-x_{jk}  \right | ^ \lambda \right)^\frac{1}{\lambda} \)

where \(\lambda \geq 1\). It is also called the \(L_{\lambda}\) metric.

  • \(\lambda = 1 : L _ { 1 }\) metric, Manhattan or City-block distance.
  • \(\lambda = 2 : L _ { 2 }\) metric, Euclidean distance.
  • \(\lambda \rightarrow \infty : L _ { \infty }\) metric, Supremum distance.

\( \lim_{\lambda \to \infty}\left( \sum_{k=1}^{p}\left | x_{ik}-x_{jk}  \right | ^ \lambda \right) ^\frac{1}{\lambda}  =\max\left( \left | x_{i1}-x_{j1}\right|  , \ldots ,  \left | x_{ip}-x_{jp}\right| \right) \)

Note that λ and p are two different parameters: λ is the order of the metric, while p is the dimension of the data matrix (the number of attributes), which remains finite even as λ → ∞.
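
A direct implementation of the Minkowski formula illustrates these special cases. This is a minimal sketch with made-up data; minkowski() is just an illustrative name.

# Minkowski distance of order lambda between two numeric vectors
minkowski = function(xi, xj, lambda) {
  sum(abs(xi - xj)^lambda)^(1/lambda)
}

xi = c(0, 3, 4)   # made-up observations
xj = c(7, 6, 3)

minkowski(xi, xj, 1)     # L1 (Manhattan) distance = 11
minkowski(xi, xj, 2)     # L2 (Euclidean) distance
minkowski(xi, xj, 100)   # approaches the supremum distance max|xi - xj| = 7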

 

Mahalanobis Distance

Let X be an N × p matrix. Then the \(i^{th}\) row of X is

\(x_{i}^{T}=\left( x_{i1}, ... , x_{ip} \right)\)

The Mahalanobis distance is

\(d_{MH}(i, j)=\left( \left( x_i - x_j\right)^T \Sigma^{-1} \left( x_i - x_j\right)\right)^\frac{1}{2}\)

where \(\Sigma\) is the \(p \times p\) sample covariance matrix.

Try it!

Calculate the answers to these questions by yourself and then compare them with the answers given below.

Suppose we have three observations measured on five attributes: \(x_1 = (1, 3, 1, 2, 4)\), \(x_2 = (1, 2, 1, 2, 1)\), and \(x_3 = (2, 2, 2, 2, 2)\).

  1. Calculate the Euclidean distances between each pair of objects.
  2. Calculate the Minkowski distances (\(\lambda = 1 \text { and } \lambda \rightarrow \infty\) cases).
  1. Euclidean distances are:

    \(d _ { E } ( 1,2 ) = \left( ( 1 - 1 ) ^ { 2 } + ( 3 - 2 ) ^ { 2 } + ( 1 - 1 ) ^ { 2 } + ( 2 - 2 ) ^ { 2 } + ( 4 - 1 ) ^ { 2 } \right) ^ { 1 / 2 } = 3.162\)

    \(d _ { E } ( 1,3 ) = \left( ( 1 - 2 ) ^ { 2 } + ( 3 - 2 ) ^ { 2 } + ( 1 - 2 ) ^ { 2 } + ( 2 - 2 ) ^ { 2 } + ( 4 - 2 ) ^ { 2 } \right) ^ { 1 / 2 } = 2.646\)

    \(d _ { E } ( 2,3 ) = \left( ( 1 - 2 ) ^ { 2 } + ( 2 - 2 ) ^ { 2 } + ( 1 - 2 ) ^ { 2 } + ( 2 - 2 ) ^ { 2 } + ( 1 - 2 ) ^ { 2 } \right) ^ { 1 / 2 } = 1.732\)

  2. Minkowski distances (when \(\lambda = 1\) ) are:

    \(d _ { M } ( 1,2 ) = | 1 - 1 | + | 3 - 2 | + | 1 - 1 | + | 2 - 2 | + | 4 - 1 | = 4\)

    \(d _ { M } ( 1,3 ) = | 1 - 2 | + | 3 - 2 | + | 1 - 2 | + | 2 - 2 | + | 4 - 2 | = 5\)

    \(d _ { M } ( 2,3 ) = | 1 - 2 | + | 2 - 2 | + | 1 - 2 | + | 2 - 2 | + | 1 - 2 | = 3\)

    Minkowski distances \(( \text { when } \lambda \rightarrow \infty )\) are:

    \(d _ { M } ( 1,2 ) = \max ( | 1 - 1 | , | 3 - 2 | , | 1 - 1 | , | 2 - 2 | , | 4 - 1 | ) = 3\)

    \(d _ { M } ( 1,3 ) = 2 \text { and } d _ { M } ( 2,3 ) = 1\) 
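
These answers can be checked with R's dist() function, which computes all pairwise distances at once (a minimal sketch):

# Verify the pairwise distances above
X = rbind(c(1, 3, 1, 2, 4),
          c(1, 2, 1, 2, 1),
          c(2, 2, 2, 2, 2))

dist(X, method = "euclidean")   # 3.162, 2.646, 1.732
dist(X, method = "manhattan")   # 4, 5, 3
dist(X, method = "maximum")     # 3, 2, 1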

 

Suppose we have three two-dimensional observations: \(x_1 = (2, 3)\), \(x_2 = (10, 7)\), and \(x_3 = (3, 2)\).

  1. Calculate the Minkowski distance \(( \lambda = 1 , \lambda = 2 , \text { and } \lambda \rightarrow \infty \text { cases) }\) between the first and second objects.
  2. Calculate the Mahalanobis distance between the first and second objects.
  1. Minkowski distance is:

    \(\lambda = 1\): \(d_M ( 1,2 ) = | 2 - 10 | + | 3 - 7 | = 12\)

    \(\lambda = 2\): \(d_M ( 1,2 ) = d_E ( 1,2 ) = \left( ( 2 - 10 ) ^ { 2 } + ( 3 - 7 ) ^ { 2 } \right) ^ { 1 / 2 } = 8.944\)

    \(\lambda \rightarrow \infty\): \(d_M ( 1,2 ) = \max ( | 2 - 10 | , | 3 - 7 | ) = 8\)

  2. Since the sample covariance matrix of the three observations is \(\Sigma = \left( \begin{array} { l l } { 19 } & { 11 } \\ { 11 } & { 7 } \end{array} \right)\), we have \(\Sigma ^ { - 1 } = \left( \begin{array} { c c } { 7 / 12 } & { - 11 / 12 } \\ { - 11 / 12 } & { 19 / 12 } \end{array} \right)\). The Mahalanobis distance is \(d _ { M H } ( 1,2 ) = \left( \left( x_1 - x_2 \right)^T \Sigma^{-1} \left( x_1 - x_2 \right) \right)^{1/2} = 2\).

R code for Mahalanobis distance


# Mahalanobis distance calculation 
d1 = c(2, 3) # each observation 
d2 = c(10, 7) 
d3 = c(3, 2) 

# Get covariance matrix using "ALL" observations 
cov_all = cov(rbind(d1, d2, d3)) 
cov_all 

# Inverse covariance matrix is given by: 
solve(cov_all) 

# Mahalanobis distance is given by: 
Mahalanobis_dist = sqrt( matrix(d1-d2,1,2)%*%solve(cov_all)%*%matrix(d1-d2,2,1) ) 
Mahalanobis_dist 

Common Properties of Similarity Measures

Similarities have some well-known properties:

  1. s(p, q) = 1 (or maximum similarity) only if p = q,
  2. s(p, q) = s(q, p) for all p and q, where s(p, q) is the similarity between data objects, p and q.

 

Similarity Between Two Binary Variables

 

The above similarity or distance measures are appropriate for continuous variables. However, for binary variables a different approach is necessary.

         q = 1          q = 0
p = 1    \(n_{1,1}\)    \(n_{1,0}\)
p = 0    \(n_{0,1}\)    \(n_{0,0}\)

where, for example, \(n_{1,1}\) is the number of attributes for which p and q both equal 1, and \(n_{1,0}\) is the number for which p = 1 and q = 0.

 

 Simple Matching and Jaccard Coefficients

  • Simple matching coefficient \(= \left( n _ { 1,1 } + n _ { 0,0 } \right) / \left( n _ { 1,1 } + n _ { 1,0 } + n _ { 0,1 } + n _ { 0,0 } \right)\).
  • Jaccard coefficient \(= n _ { 1,1 } / \left( n _ { 1,1 } + n _ { 1,0 } + n _ { 0,1 } \right)\).

Try it!

Calculate the answer to the question and then compare it with the answer given below.

   Given data:

  • p = 1 0 0 0 0 0 0 0 0 0
  • q = 0 0 0 0 0 0 1 0 0 1

The frequency table is: 

         q = 1    q = 0
p = 1      0        1
p = 0      2        7

Calculate the Simple matching coefficient and the Jaccard coefficient.

  • Simple matching coefficient = (0 + 7) / (0 + 1 + 2 + 7) = 0.7.
  • Jaccard coefficient = 0 / (0 + 1 + 2) = 0.
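
The same computation can be done in R by counting the four kinds of matches directly (a minimal sketch):

# Simple matching and Jaccard coefficients for the binary vectors above
p = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
q = c(0, 0, 0, 0, 0, 0, 1, 0, 0, 1)

n11 = sum(p == 1 & q == 1)   # 0
n10 = sum(p == 1 & q == 0)   # 1
n01 = sum(p == 0 & q == 1)   # 2
n00 = sum(p == 0 & q == 0)   # 7

(n11 + n00) / (n11 + n10 + n01 + n00)   # simple matching coefficient = 0.7
n11 / (n11 + n10 + n01)                 # Jaccard coefficient = 0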
