Topic 1: Robust Regression

Topic 1: Robust Regression

Overview

So far we have utilized ordinary least squares for estimating the regression line. However, aspects of the data (such as nonconstant variance or outliers) may require a different method for estimating the regression line. This lesson provides an introduction to some of the other available methods for estimating regression lines. To help with the discussions in this lesson, recall that the ordinary least squares estimate is

\(\begin{align*} \hat{\beta}_{\textrm{OLS}}&=\arg\min_{\beta}\sum_{i=1}^{n}\epsilon_{i}^{2} \\ &=(\textbf{X}^{\textrm{T}}\textbf{X})^{-1}\textbf{X}^{\textrm{T}}\textbf{Y} \end{align*}\)

Because of the alternative estimates to be introduced, the ordinary least squares estimate is written here as \(\hat{\beta}_{\textrm{OLS}}\) instead of b.

Objectives

Upon completion of this lesson, you should be able to:

  • Explain the idea behind weighted least squares.
  • Apply weighted least squares to regression examples with nonconstant variance.
  • Understand the purpose behind robust regression methods.

Lesson 13 Code Files

Below is a zip file that contains all the data sets used in this lesson:

STAT501_Lesson13.zip

  • ca_learning.txt
  • home_price.txt
  • market_share.txt
  • quality_measure.txt

T.1.1 - Robust Regression Methods

T.1.1 - Robust Regression Methods
Note! that the material in Sections 13.3-13.6 is considerably more technical than the preceding Lessons. It is offered as an introduction to this advanced topic and, given the technical nature of the material, it could be considered optional in the context of this course

The ordinary least squares estimates for linear regression are optimal when all of the regression assumptions are valid. When some of these assumptions are invalid, least squares regression can perform poorly. Residual diagnostics can help guide you to where the breakdown in assumptions occurs but can be time-consuming and sometimes difficult for the untrained eye. Robust regression methods provide an alternative to least squares regression by requiring less restrictive assumptions. These methods attempt to dampen the influence of outlying cases in order to provide a better fit to the majority of the data.

Outliers have a tendency to pull the least squares fit too far in their direction by receiving much more "weight" than they deserve. Typically, you would expect that the weight attached to each observation would be on average 1/n in a data set with n observations. However, outliers may receive considerably more weight, leading to distorted estimates of the regression coefficients. This distortion results in outliers which are difficult to identify since their residuals are much smaller than they would otherwise be (if the distortion wasn't present). As we have seen, scatterplots may be used to assess outliers when a small number of predictors are present. However, the complexity added by additional predictor variables can hide the outliers from view in these scatterplots. Robust regression down-weights the influence of outliers, which makes their residuals larger and easier to identify.

For our first robust regression method, suppose we have a data set of size n such that

\(\begin{align*} y_{i}&=\textbf{x}_{i}^{\textrm{T}}\beta+\epsilon_{i} \\ \Rightarrow\epsilon_{i}(\beta)&=y_{i}-\textbf{x}_{i}^{\textrm{T}}\beta, \end{align*}\)

where \(i=1,\ldots,n\). Here we have rewritten the error term as \(\epsilon_{i}(\beta)\) to reflect the error term's dependency on the regression coefficients. Ordinary least squares are sometimes known as \(L_{2}\)-norm regression since it is minimizing the \(L_{2}\)-norm of the residuals (i.e., the squares of the residuals). Thus, observations with high residuals (and high squared residuals) will pull the least squares to fit more in that direction. An alternative is to use what is sometimes known as least absolute deviation (or \(L_{1}\)-norm regression), which minimizes the \(L_{1}\)-norm of the residuals (i.e., the absolute value of the residuals). Formally defined, the least absolute deviation estimator is

\(\begin{equation*} \hat{\beta}_{\textrm{LAD}}=\arg\min_{\beta}\sum_{i=1}^{n}|\epsilon_{i}(\beta)|, \end{equation*}\)

which in turn minimizes the absolute value of the residuals (i.e., \(|r_{i}|\)).

Another quite common robust regression method falls into a class of estimators called M-estimators (and there are also other related classes such as R-estimators and S-estimators, whose properties we will not explore). M-estimators attempt to minimize the sum of a chosen function \(\rho(\cdot)\) which is acting on the residuals. Formally defined, M-estimators are given by

\(\begin{equation*} \hat{\beta}_{\textrm{M}}=\arg\min _{\beta}\sum_{i=1}^{n}\rho(\epsilon_{i}(\beta)). \end{equation*}\)

The M stands for "maximum likelihood" since \(\rho(\cdot)\) is related to the likelihood function for a suitable assumed residual distribution. Notice that, if assuming normality, then \(\rho(z)=\frac{1}{2}z^{2}\) results in the ordinary least squares estimate.

Some M-estimators are influenced by the scale of the residuals, so a scale-invariant version of the M-estimator is used:

\(\begin{equation*} \hat{\beta}_{\textrm{M}}=\arg\min_{\beta}\sum_{i=1}^{n}\rho\biggl(\frac{\epsilon_{i}(\beta)}{\tau}\biggr), \end{equation*}\)

where \(\tau\) is a measure of the scale. An estimate of \(\tau\) is given by

\(\begin{equation*} \hat{\tau}=\frac{\textrm{med}_{i}|r_{i}-\tilde{r}|}{0.6745}, \end{equation*}\)

where \(\tilde{r}\) is the median of the residuals. Minimization of the above is accomplished primarily in two steps:

  1.  Set \(\frac{\partial\rho}{\partial\beta_{j}}=0\) for each \(j=0,1,\ldots,p-1\), resulting in a set of p nonlinear equations\(\begin{equation*} \sum_{i=1}^{n}x_{i,j}\psi\biggl(\dfrac{\epsilon_{i}}{\tau}\biggr)=0, \end{equation*}\) where \(\psi(\cdot)=\rho'(\cdot)\). \(\psi(\cdot)\) is called the influence function.
  2. A numerical method called iteratively reweighted least squares (IRLS) (mentioned in Section 13.1) is used to iteratively estimate the weighted least squares estimate until a stopping criterion is met. Specifically, for iterations \(t=0,1,\ldots\)

    \(\begin{equation*} \hat{\beta}^{(t+1)}=(\textbf{X}^{\textrm{T}}(\textbf{W}^{-1})^{(t)}\textbf{X})^{-1}\textbf{X}^{\textrm{T}}(\textbf{W}^{-1})^{(t)}\textbf{y}, \end{equation*}\)

    where \((\textbf{W}^{-1})^{(t)}=\textrm{diag}(w_{1}^{(t)},\ldots,w_{n}^{(t)})\) such that

    \( w_{i}^{(t)}=\begin{cases}\dfrac{\psi((y_{i}-\textbf{x}_{i}^{\textrm{t}}\beta^{(t)})/\hat{\tau}^{(t)})}{(y_{i}\textbf{x}_{i}^{\textrm{t}}\beta^{(t)})/\hat{\tau}^{(t)}}, & \hbox{if \(y_{i}\neq\textbf{x}_{i}^{\textrm{T}}\beta^{(t)}\);} \\ 1, & \hbox{if \(y_{i}=\textbf{x}_{i}^{\textrm{T}}\beta^{(t)}\).} \end{cases} \)

Three common functions chosen in M-estimation are given below:

Andrews' Sine

\(\begin{align*}\rho(z)&=\begin{cases}\ c[1-\cos(z/c)], & \hbox{if \(|z|<\pi c\);}\\ 2c, & \hbox{if \(|z|\geq\pi c\)} \end{cases}  \\ \psi(z)&=\begin{cases} \sin(z/c), & \hbox{if \(|z|<\pi c\);} \\  0, & \hbox{if \(|z|\geq\pi c\)}  \end{cases} \\ w(z)&=\begin{cases} \frac{\sin(z/c)}{z/c}, & \hbox{if \(|z|<\pi c\);} \\ 0, & \hbox{if \(|z|\geq\pi c\),} \end{cases}  \end{align*}\) where \(c\approx1.339\).

Huber's Method

\(\begin{align*} \rho(z)&=\begin{cases} z^{2}, & \hbox{if \(|z|<c\);} \\   |2z|c-c^{2}, & \hbox{if \(|z|\geq c\)} \end{cases}   \\ \psi(z)&=\begin{cases} z, & \hbox{if \(|z|<c\);} \\   c[\textrm{sgn}(z)], & \hbox{if \(|z|\geq c\)} \end{cases}   \\ w(z)&=\begin{cases} 1, & \hbox{if \(|z|< c\);} \\  c/|z|. & \hbox{if \(|z|\geq c\),} \end{cases}  \end{align*}\) where \(c\approx 1.345\).

Tukey's Biweight

\(\begin{align*} \rho(z)&= \begin{cases} \frac{c^{2}}{3}\biggl\{1-(1-(\frac{z}{c})^{2})^{3}\biggr\}, & \hbox{if \(|z|<c\);} \\ 2c, & \hbox{if \(|z|\geq c\)} \end{cases}  \\ \psi(z)&= \begin{cases} z[1-(\frac{z}{c})^{2}]^{2}, & \hbox{if \(|z|<c\);} \\   0, & \hbox{if \(|z|\geq c\)} \end{cases}   \\ w(z)&= \begin{cases} [1-(\frac{z}{c})^{2}]^{2}, & \hbox{if \(|z|<c\);} \\   0, & \hbox{if \(|z|\geq c\),} \end{cases} \end{align*}\) where \(c\approx 4.685\).


T1.1.1 - Robust Regression Examples

T1.1.1 - Robust Regression Examples

Quality Measurements Dataset

Let us look at the three robust procedures discussed earlier for the Quality Measure data set. These estimates are provided in the table below for comparison with the ordinary least squares estimate.

A comparison of M-estimators with the ordinary least squares estimator for the quality measurements data set (analysis done in R since Minitab does not include these procedures):

Method \(\boldsymbol{\beta_{0}}\) p-value \(\boldsymbol{\beta_{1}}\) p-value
OLS 29.2342 0.0276 0.5235 0.0045
Andrew's Sine 29.0080 0.0177 0.5269 0.0024
Huber's Method 29.5428 0.0204 0.5183 0.0036
Tukey's Biweight 29.1399 0.0203 0.5248 0.0030

While there is not much of a difference here, it appears that Andrew's Sine method is producing the most significant values for the regression estimates. One may wish to then proceed with residual diagnostics and weigh the pros and cons of using this method over ordinary least squares (e.g., interpretability, assumptions, etc.).


T.1.2 - Resistant Regression Methods

T.1.2 - Resistant Regression Methods

The next method we discuss is often used interchangeably with robust regression methods. However, there is a subtle difference between the two methods that are not usually outlined in the literature. Whereas robust regression methods attempt to only dampen the influence of outlying cases, resistant regression methods use estimates that are not influenced by any outliers (this comes from the definition of resistant statistics, which are measures of the data that are not influenced by outliers, such as the median). This is best accomplished by trimming the data, which "trims" extreme values from either end (or both ends) of the range of data values.

There is also one other relevant term when discussing resistant regression methods. Suppose we have a data set \(x_{1},x_{2},\ldots,x_{n}\). The order statistics are simply defined to be the data values arranged in increasing order and are written as \(x_{(1)},x_{(2)},\ldots,x_{(n)}\). Therefore, the minimum and maximum of this data set are \(x_{(1)}\) and \(x_{(n)}\), respectively. As we will see, the resistant regression estimators provided here are all based on the ordered residuals.

We present three commonly used resistant regression methods:

Least Quantile of Squares

The least quantile of squares method minimizes the squared order residual (presumably selected as it is most representative of where the data is expected to lie) and is formally defined by \(\begin{equation*} \hat{\beta}_{\textrm{LQS}}=\arg\min_{\beta}\epsilon_{(\nu)}^{2}(\beta), \end{equation*}\) where \(\nu=P*n\) is the \(P^{\textrm{th}}\) percentile (i.e., \(0<P\leq 1\)) of the empirical data (if \(\nu\) is not an integer, then specify \(\nu\) to be either the next greatest or lowest integer value). A specific case of the least quantile of squares method where p = 0.5 (i.e., the median) and is called the least median of squares method (and the estimate is often written as \(\hat{\beta}_{\textrm{LMS}}\)).

Least Trimmed Sum of Squares

The least trimmed sum of squares method minimizes the sum of the \(h\) smallest squared residuals and is formally defined by \(\begin{equation*} \hat{\beta}_{\textrm{LTS}}=\arg\min_{\beta}\sum_{i=1}^{h}\epsilon_{(i)}^{2}(\beta), \end{equation*}\) where \(h\leq n\). If h = n, then you just obtain \(\hat{\beta}_{\textrm{OLS}}\).

Least Trimmed Sum of Absolute Deviations

The least trimmed sum of absolute deviations method minimizes the sum of the h smallest absolute residuals and is formally defined by \(\begin{equation*} \hat{\beta}_{\textrm{LTA}}=\arg\min_{\beta}\sum_{i=1}^{h}|\epsilon(\beta)|_{(i)}, \end{equation*}\) where again \(h\leq n\). If h = n, then you just obtain \(\hat{\beta}_{\textrm{LAD}}\).

So, which method from robust or resistant regressions do we use? In order to guide you in the decision-making process, you will want to consider both the theoretical benefits of a certain method as well as the type of data you have. The theoretical aspects of these methods that are often cited include their breakdown values and overall efficiency. Breakdown values are a measure of the proportion of contamination (due to outlying observations) that an estimation method can withstand and still maintain to be robust against the outliers. Efficiency is a measure of an estimator's variance relative to another estimator (when it is the smallest it can possibly be, then the estimator is said to be "best"). For example, the least quantile of squares method and least trimmed sum of squares method both have the same maximal breakdown value for certain P, the least median of squares method is of low efficiency, and the least trimmed sum of squares method has the same efficiency (asymptotically) as certain M-estimators. As for your data, if there appear to be many outliers, then a method with a high breakdown value should be used. A preferred solution is to calculate many of these estimates for your data and compare their overall fits, but this will likely be computationally expensive.


T.1.3 - Regression Depth

T.1.3 - Regression Depth

We have discussed the notion of ordering data (e.g., ordering the residuals). The applications we have presented with ordered data have all concerned univariate data sets. However, there are also techniques for ordering multivariate data sets. Statistical depth functions provide a center-outward ordering of multivariate observations, which allows one to define reasonable analogues of univariate order statistics. There are numerous depth functions, which we do not discuss here. However, the notion of statistical depth is also used in the regression setting. Specifically, there is the notion of regression depth, which is a quality measure for robust linear regression.

Statistically speaking, the regression depth of a hyperplane \(\mathcal{H}\) is the smallest number of residuals that need to change the sign to make \(\mathcal{H}\) a nonfit. This definition also has convenient statistical properties, such as invariance under affine transformations, which we do not discuss in greater detail. A regression hyperplane is called a nonfit if it can be rotated horizontally (i.e., parallel to the axis of any of the predictor variables) without passing through any data points. (We count the points exactly on the hyperplane as "passed through".) A nonfit is a very poor regression hyperplane because it is combinatorially equivalent to a horizontal hyperplane, which posits no relationship between predictor and response variables. The regression depth of a hyperplane (say, \(\mathcal{L}\)) is the minimum number of points whose removal makes \(\mathcal{H}\) into a nonfit. For example, consider the data in the figure below.

xY246810468101214

Removing the red circles and rotating the regression line until horizontal (i.e., the dashed blue line) demonstrates that the black line has regression depth 3. Hyperplanes with high regression depth behave well in general error models, including skewed or distributions with heteroscedastic errors.

The regression depth of n points in p dimensions is upper bounded by \(\lceil n/(p+1)\rceil\), where p is the number of variables (i.e., the number of responses plus the number of predictors). In other words, there exist point sets for which no hyperplane has regression depth larger than this bound. For the simple linear regression example in the plot above, this means there is always a line with a regression depth of at least \(\lceil n/3\rceil\).

When confronted with outliers, then you may be confronted with the choice of other regression lines or hyperplanes to consider for your data. Some of these regressions may be biased or altered from the traditional ordinary least squares line. In such cases, regression depth can help provide a measure of a fitted line that best captures the effects due to outliers.


Legend
[1]Link
Has Tooltip/Popover
 Toggleable Visibility