The ordinary least squares estimates for linear regression are optimal when all of the regression assumptions are valid. When some of these assumptions are invalid, least squares regression can perform poorly. Residual diagnostics can help guide you to where the breakdown in assumptions occurs but can be time-consuming and sometimes difficult for the untrained eye. Robust regression methods provide an alternative to least squares regression by requiring less restrictive assumptions. These methods attempt to dampen the influence of outlying cases in order to provide a better fit to the majority of the data.
Outliers have a tendency to pull the least squares fit too far in their direction by receiving much more "weight" than they deserve. Typically, you would expect that the weight attached to each observation would be on average 1/n in a data set with n observations. However, outliers may receive considerably more weight, leading to distorted estimates of the regression coefficients. This distortion also makes the outliers themselves harder to identify, since their residuals are much smaller than they would be if the fit had not been pulled toward them. As we have seen, scatterplots may be used to assess outliers when only a small number of predictors are present, but additional predictor variables can hide the outliers from view in these scatterplots. Robust regression down-weights the influence of outliers, which makes their residuals larger and the outliers easier to identify.
For our first robust regression method, suppose we have a data set of size n such that
\(\begin{align*} y_{i}&=\textbf{x}_{i}^{\textrm{T}}\beta+\epsilon_{i} \\ \Rightarrow\epsilon_{i}(\beta)&=y_{i}-\textbf{x}_{i}^{\textrm{T}}\beta, \end{align*}\)
where \(i=1,\ldots,n\). Here we have rewritten the error term as \(\epsilon_{i}(\beta)\) to reflect the error term's dependency on the regression coefficients. Ordinary least squares is sometimes known as \(L_{2}\)-norm regression since it minimizes the \(L_{2}\)-norm of the residuals (i.e., the sum of the squared residuals). Thus, observations with large residuals (and hence large squared residuals) pull the least squares fit disproportionately in their direction. An alternative is what is sometimes known as least absolute deviation (or \(L_{1}\)-norm) regression, which minimizes the \(L_{1}\)-norm of the residuals (i.e., the sum of the absolute values of the residuals). Formally defined, the least absolute deviation estimator is
\(\begin{equation*} \hat{\beta}_{\textrm{LAD}}=\arg\min_{\beta}\sum_{i=1}^{n}|\epsilon_{i}(\beta)|, \end{equation*}\)
which in turn minimizes the sum of the absolute values of the residuals \(|r_{i}|\).
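As an illustration, here is a minimal sketch of computing the LAD estimate numerically. The simulated data, the OLS starting values, and the use of scipy.optimize.minimize with the Nelder-Mead method are assumptions made for the example; the estimator itself does not prescribe a particular algorithm.

```python
import numpy as np
from scipy.optimize import minimize

# Simulated data with a few gross outliers (illustrative only).
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one predictor
y = X @ np.array([2.0, 3.0]) + rng.normal(size=n)
y[:3] += 15.0                                           # contaminate three responses

def lad_objective(beta, X, y):
    """L1-norm of the residuals: sum of |y_i - x_i^T beta|."""
    return np.sum(np.abs(y - X @ beta))

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]          # OLS as a starting value
beta_lad = minimize(lad_objective, beta_ols, args=(X, y),
                    method="Nelder-Mead").x              # derivative-free search
print("OLS:", beta_ols, "LAD:", beta_lad)
```

Nelder-Mead is used here only because the \(L_{1}\) objective is not differentiable at zero residuals; with this sort of contamination one would typically see the OLS fit pulled toward the outliers while the LAD estimate stays closer to the coefficients that generated the bulk of the data. In practice, LAD regression is usually fit by linear programming or a specialized routine rather than a general-purpose optimizer.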
Another quite common robust regression method falls into a class of estimators called M-estimators (and there are also other related classes such as R-estimators and S-estimators, whose properties we will not explore). M-estimators attempt to minimize the sum of a chosen function \(\rho(\cdot)\) which is acting on the residuals. Formally defined, M-estimators are given by
\(\begin{equation*} \hat{\beta}_{\textrm{M}}=\arg\min _{\beta}\sum_{i=1}^{n}\rho(\epsilon_{i}(\beta)). \end{equation*}\)
The M stands for "maximum likelihood" since \(\rho(\cdot)\) is related to the likelihood function for a suitable assumed residual distribution. Notice that, if normality is assumed, then \(\rho(z)=\frac{1}{2}z^{2}\) and the M-estimator reduces to the ordinary least squares estimate.
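To see this, note that with \(\rho(z)=\frac{1}{2}z^{2}\) the M-estimation criterion is just half the residual sum of squares, \(\begin{equation*} \sum_{i=1}^{n}\rho(\epsilon_{i}(\beta))=\frac{1}{2}\sum_{i=1}^{n}(y_{i}-\textbf{x}_{i}^{\textrm{T}}\beta)^{2}, \end{equation*}\) so minimizing it over \(\beta\) gives the ordinary least squares estimate.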
Some M-estimators are influenced by the scale of the residuals, so a scale-invariant version of the M-estimator is used:
\(\begin{equation*} \hat{\beta}_{\textrm{M}}=\arg\min_{\beta}\sum_{i=1}^{n}\rho\biggl(\frac{\epsilon_{i}(\beta)}{\tau}\biggr), \end{equation*}\)
where \(\tau\) is a measure of the scale. An estimate of \(\tau\) is given by
\(\begin{equation*} \hat{\tau}=\frac{\textrm{med}_{i}|r_{i}-\tilde{r}|}{0.6745}, \end{equation*}\)
where \(\tilde{r}\) is the median of the residuals. Minimization of the above is accomplished primarily in two steps:
- Set \(\frac{\partial}{\partial\beta_{j}}\sum_{i=1}^{n}\rho\bigl(\frac{\epsilon_{i}(\beta)}{\tau}\bigr)=0\) for each \(j=0,1,\ldots,p-1\), resulting in a set of p nonlinear equations \(\begin{equation*} \sum_{i=1}^{n}x_{i,j}\psi\biggl(\dfrac{\epsilon_{i}(\beta)}{\tau}\biggr)=0, \end{equation*}\) where \(\psi(\cdot)=\rho'(\cdot)\) is called the influence function.
- A numerical method called iteratively reweighted least squares (IRLS) (mentioned in Section 13.1) is used to solve these equations: a weighted least squares estimate is recomputed iteratively until a stopping criterion is met. Specifically, for iterations \(t=0,1,\ldots\),
\(\begin{equation*} \hat{\beta}^{(t+1)}=(\textbf{X}^{\textrm{T}}(\textbf{W}^{-1})^{(t)}\textbf{X})^{-1}\textbf{X}^{\textrm{T}}(\textbf{W}^{-1})^{(t)}\textbf{y}, \end{equation*}\)
where \((\textbf{W}^{-1})^{(t)}=\textrm{diag}(w_{1}^{(t)},\ldots,w_{n}^{(t)})\) such that
\( w_{i}^{(t)}=\begin{cases}\dfrac{\psi((y_{i}-\textbf{x}_{i}^{\textrm{T}}\beta^{(t)})/\hat{\tau}^{(t)})}{(y_{i}-\textbf{x}_{i}^{\textrm{T}}\beta^{(t)})/\hat{\tau}^{(t)}}, & \hbox{if \(y_{i}\neq\textbf{x}_{i}^{\textrm{T}}\beta^{(t)}\);} \\ 1, & \hbox{if \(y_{i}=\textbf{x}_{i}^{\textrm{T}}\beta^{(t)}\).} \end{cases} \)
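To make the iteration concrete, here is a minimal sketch of IRLS in Python using the Huber weight function (one of the three common choices given below). The function names, the OLS starting values, the tuning constant, the tolerance, and the iteration cap are all illustrative assumptions rather than part of the development above.

```python
import numpy as np

def mad_scale(r):
    """Scale estimate tau-hat: median absolute deviation of the residuals
    from their median, divided by 0.6745."""
    return np.median(np.abs(r - np.median(r))) / 0.6745

def huber_weight(z, c=1.345):
    """Huber weight function: 1 for |z| < c, c/|z| otherwise (assumed choice)."""
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) < c, 1.0, c / np.maximum(np.abs(z), c))

def irls(X, y, weight_fn=huber_weight, max_iter=50, tol=1e-8):
    """Iteratively reweighted least squares for an M-estimator: alternate
    between re-estimating the scale from the current residuals and solving
    a weighted least squares problem with the current weights w_i."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]              # OLS starting values
    for _ in range(max_iter):
        r = y - X @ beta
        tau = mad_scale(r)
        if tau == 0:                                          # essentially a perfect fit
            return beta
        W = np.diag(weight_fn(r / tau))                       # weights w_i on the diagonal
        beta_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # weighted LS step
        if np.max(np.abs(beta_new - beta)) < tol:             # stopping criterion
            return beta_new
        beta = beta_new
    return beta
```

For a design matrix X (with a leading column of ones for the intercept) and response vector y, irls(X, y) would return the Huber M-estimate of the coefficients; any of the weight functions listed below could be passed as weight_fn instead.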
Three common choices of \(\rho(\cdot)\), along with the corresponding influence function \(\psi(\cdot)\) and weight function \(w(\cdot)\), are given below:
Andrews' method: \(\begin{align*}\rho(z)&=\begin{cases} c[1-\cos(z/c)], & \hbox{if \(|z|<\pi c\);}\\ 2c, & \hbox{if \(|z|\geq\pi c\)} \end{cases} \\ \psi(z)&=\begin{cases} \sin(z/c), & \hbox{if \(|z|<\pi c\);} \\ 0, & \hbox{if \(|z|\geq\pi c\)} \end{cases} \\ w(z)&=\begin{cases} \frac{\sin(z/c)}{z/c}, & \hbox{if \(|z|<\pi c\);} \\ 0, & \hbox{if \(|z|\geq\pi c\),} \end{cases} \end{align*}\) where \(c\approx1.339\).
Huber's method: \(\begin{align*} \rho(z)&=\begin{cases} \frac{1}{2}z^{2}, & \hbox{if \(|z|<c\);} \\ c|z|-\frac{1}{2}c^{2}, & \hbox{if \(|z|\geq c\)} \end{cases} \\ \psi(z)&=\begin{cases} z, & \hbox{if \(|z|<c\);} \\ c\,\textrm{sgn}(z), & \hbox{if \(|z|\geq c\)} \end{cases} \\ w(z)&=\begin{cases} 1, & \hbox{if \(|z|< c\);} \\ c/|z|, & \hbox{if \(|z|\geq c\),} \end{cases} \end{align*}\) where \(c\approx 1.345\).
Tukey's biweight: \(\begin{align*} \rho(z)&= \begin{cases} \frac{c^{2}}{6}\biggl\{1-\bigl[1-(\frac{z}{c})^{2}\bigr]^{3}\biggr\}, & \hbox{if \(|z|<c\);} \\ \frac{c^{2}}{6}, & \hbox{if \(|z|\geq c\)} \end{cases} \\ \psi(z)&= \begin{cases} z\bigl[1-(\frac{z}{c})^{2}\bigr]^{2}, & \hbox{if \(|z|<c\);} \\ 0, & \hbox{if \(|z|\geq c\)} \end{cases} \\ w(z)&= \begin{cases} \bigl[1-(\frac{z}{c})^{2}\bigr]^{2}, & \hbox{if \(|z|<c\);} \\ 0, & \hbox{if \(|z|\geq c\),} \end{cases} \end{align*}\) where \(c\approx 4.685\).
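For reference, the three weight functions above could be coded as follows (a sketch with the tuning constants quoted above as defaults; the function names are illustrative). Any of these could be passed to an IRLS routine such as the one sketched earlier.

```python
import numpy as np

def andrews_weight(z, c=1.339):
    """Andrews' weight: sin(z/c)/(z/c) for |z| < pi*c, 0 otherwise."""
    z = np.asarray(z, dtype=float)
    w = np.zeros_like(z)
    inside = np.abs(z) < np.pi * c
    u = z[inside] / c
    w[inside] = np.sinc(u / np.pi)   # np.sinc(x) = sin(pi*x)/(pi*x), so this is sin(u)/u
    return w

def huber_weight(z, c=1.345):
    """Huber weight: 1 for |z| < c, c/|z| otherwise."""
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) < c, 1.0, c / np.maximum(np.abs(z), c))

def tukey_weight(z, c=4.685):
    """Tukey biweight: [1 - (z/c)^2]^2 for |z| < c, 0 otherwise."""
    z = np.asarray(z, dtype=float)
    w = np.zeros_like(z)
    inside = np.abs(z) < c
    w[inside] = (1.0 - (z[inside] / c) ** 2) ** 2
    return w
```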