Lesson 13: Weighted Least Squares & Robust Regression

Overview

So far we have utilized ordinary least squares for estimating the regression line. However, aspects of the data (such as nonconstant variance or outliers) may require a different method for estimating the regression line. This lesson provides an introduction to some of the other available methods for estimating regression lines. To help with the discussions in this lesson, recall that the ordinary least squares estimate is

\(\begin{align*} \hat{\beta}_{\textrm{OLS}}&=\arg\min_{\beta}\sum_{i=1}^{n}\epsilon_{i}^{2} \\ &=(\textbf{X}^{\textrm{T}}\textbf{X})^{-1}\textbf{X}^{\textrm{T}}\textbf{Y} \end{align*}\)

Because of the alternative estimates to be introduced, the ordinary least squares estimate is written here as \(\hat{\beta}_{\textrm{OLS}}\) instead of b.

Objectives

Upon completion of this lesson, you should be able to:

  • Explain the idea behind weighted least squares.
  • Apply weighted least squares to regression examples with nonconstant variance.
  • Understand the purpose behind robust regression methods.

 Lesson 13 Code Files

Below is a zip file that contains all the data sets used in this lesson:

STAT501_Lesson13.zip

  • ca_learning.txt
  • home_price.txt
  • market_share.txt
  • quality_measure.txt

13.1 - Weighted Least Squares

The method of ordinary least squares assumes that there is constant variance in the errors (which is called homoscedasticity). The method of weighted least squares can be used when the ordinary least squares assumption of constant variance in the errors is violated (which is called heteroscedasticity). The model under consideration is

\(\begin{equation*} \textbf{Y}=\textbf{X}\beta+\epsilon^{*}, \end{equation*}\)

where \(\epsilon^{*}\) is assumed to be (multivariate) normally distributed with mean vector 0 and nonconstant variance-covariance matrix

\(\begin{equation*} \left(\begin{array}{cccc} \sigma^{2}_{1} & 0 & \ldots & 0 \\ 0 & \sigma^{2}_{2} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \sigma^{2}_{n} \\ \end{array} \right) \end{equation*}\)

If we define the reciprocal of each variance, \(\sigma^{2}_{i}\), as the weight, \(w_i = 1/\sigma^{2}_{i}\), then let matrix W be a diagonal matrix containing these weights:

\(\begin{equation*}\textbf{W}=\left( \begin{array}{cccc} w_{1} & 0 & \ldots & 0 \\ 0& w_{2} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0& 0 & \ldots & w_{n} \\ \end{array} \right) \end{equation*}\)

The weighted least squares estimate is then

\(\begin{align*} \hat{\beta}_{WLS}&=\arg\min_{\beta}\sum_{i=1}^{n}\epsilon_{i}^{*2}\\ &=(\textbf{X}^{T}\textbf{W}\textbf{X})^{-1}\textbf{X}^{T}\textbf{W}\textbf{Y} \end{align*}\)
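To make this formula concrete, here is a minimal R sketch (with simulated data, so the numbers are purely illustrative) that computes \(\hat{\beta}_{WLS}\) from the matrix expression above and checks it against R's built-in lm() function with a weights argument:

# Minimal sketch: WLS via the matrix formula vs. lm(..., weights = ...)
set.seed(1)
n <- 20
x <- runif(n, 1, 10)
sigma2 <- x^2                       # suppose the error variance grows with x
y <- 2 + 3*x + rnorm(n, sd = sqrt(sigma2))
w <- 1/sigma2                       # weights = reciprocals of the variances

X <- cbind(1, x)                    # design matrix with an intercept column
W <- diag(w)
solve(t(X) %*% W %*% X, t(X) %*% W %*% y)  # beta-hat from the matrix formula
coef(lm(y ~ x, weights = w))               # same estimates from lm()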

With this setting, we can make a few observations:

  • Since each weight is inversely proportional to the error variance, it reflects the information in that observation. So, an observation with small error variance has a large weight since it contains relatively more information than an observation with large error variance (small weight).
  • The weights have to be known (or more usually estimated) up to a proportionality constant.

To illustrate, consider the famous 1877 Galton data set, consisting of 7 measurements each of X = Parent (pea diameter in inches of parent plant) and Y = Progeny (average pea diameter in inches of up to 10 plants grown from seeds of the parent plant). Also included in the dataset are standard deviations, SD, of the offspring peas grown from each parent. These standard deviations reflect the information in the response Y values (remember these are averages), so in estimating a regression model we should downweight the observations with a large standard deviation and upweight the observations with a small standard deviation. In other words, we should use weighted least squares with weights equal to \(1/SD^{2}\). The resulting fitted equation from Minitab for this model is:

Progeny = 0.12796 + 0.2048 Parent

Compare this with the fitted equation for the ordinary least squares model:

Progeny = 0.12703 + 0.2100 Parent

The equations aren't very different but we can gain some intuition into the effects of using weighted least squares by looking at a scatterplot of the data with the two regression lines superimposed:

Scatterplot of Progeny vs Parent

The black line represents the OLS fit, while the red line represents the WLS fit. The standard deviations tend to increase as the value of Parent increases, so the weights tend to decrease as the value of Parent increases. Thus, on the left of the graph where the observations are upweighted the red fitted line is pulled slightly closer to the data points, whereas on the right of the graph where the observations are downweighted the red fitted line is slightly further from the data points.

For this example, the weights were known. There are other circumstances where the weights are known in advance (the first case below is illustrated with a short R sketch after the list):

  • If the i-th response is an average of \(n_i\) equally variable observations, then \(Var\left(y_i \right)\) = \(\sigma^2/n_i\) and \(w_i\) = \(n_i\).
  • If the i-th response is a total of \(n_i\) observations, then \(Var\left(y_i \right)\) = \(n_i\sigma^2\) and \(w_i\) =1/ \(n_i\).
  • If variance is proportional to some predictor \(x_i\), then \(Var\left(y_i \right)\) = \(x_i\sigma^2\) and \(w_i\) =1/ \(x_i\).
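As a minimal illustration of the first case (simulated data, so the numbers are purely illustrative), each response below is the mean of \(n_i\) equally variable raw observations, so passing the group sizes as the weights argument to lm() gives the appropriate weighted fit:

# Responses are group means of n_i observations, so Var(y_i) = sigma^2/n_i and w_i = n_i
set.seed(2)
n_i <- c(5, 10, 20, 40, 80)                  # number of raw observations per group
x   <- 1:5
y   <- sapply(seq_along(x), function(i)
         mean(2 + 3*x[i] + rnorm(n_i[i])))   # observed group means
coef(lm(y ~ x, weights = n_i))               # WLS fit with weights w_i = n_i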

In practice, for other types of datasets, the structure of W is usually unknown, so we have to perform an ordinary least squares (OLS) regression first. Provided the regression function is appropriate, the i-th squared residual from the OLS fit is an estimate of \(\sigma_i^2\) and the i-th absolute residual is an estimate of \(\sigma_i\) (which tends to be a more useful estimator in the presence of outliers). The residuals are much too variable to be used directly in estimating the weights, \(w_i\), so instead we use either the squared residuals to estimate a variance function or the absolute residuals to estimate a standard deviation function. We then use this variance or standard deviation function to estimate the weights.


Some possible variance and standard deviation function estimates include:
  • If a residual plot against a predictor exhibits a megaphone shape, then regress the absolute values of the residuals against that predictor. The resulting fitted values of this regression are estimates of \(\sigma_{i}\). (And remember \(w_i = 1/\sigma^{2}_{i}\)).
  • If a residual plot against the fitted values exhibits a megaphone shape, then regress the absolute values of the residuals against the fitted values. The resulting fitted values of this regression are estimates of \(\sigma_{i}\).
  • If a residual plot of the squared residuals against a predictor exhibits an upward trend, then regress the squared residuals against that predictor. The resulting fitted values of this regression are estimates of \(\sigma_{i}^2\).
  • If a residual plot of the squared residuals against the fitted values exhibits an upward trend, then regress the squared residuals against the fitted values. The resulting fitted values of this regression are estimates of \(\sigma_{i}^2\).

After using one of these methods to estimate the weights, \(w_i\), we then use these weights in estimating a weighted least squares regression model. We consider some examples of this approach in the next section.
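As a minimal sketch of the variance-function version of this procedure (the standard deviation function version, applied to the actual course data sets, appears in R Help 13 below; the data frame dat and variables y and x here are placeholders):

ols    <- lm(y ~ x, data = dat)                 # ordinary least squares fit
varfit <- lm(residuals(ols)^2 ~ x, data = dat)  # regress squared residuals on the predictor
w      <- 1/fitted(varfit)                      # weights = 1 / estimated variances
                                                # (check that the fitted variances are positive)
wls    <- lm(y ~ x, data = dat, weights = w)    # weighted least squares fit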


Some key points regarding weighted least squares are:
  1. The difficulty, in practice, is determining estimates of the error variances (or standard deviations).
  2. Weighted least squares estimates of the coefficients will usually be nearly the same as the "ordinary" unweighted estimates. In cases where they differ substantially, the procedure can be iterated until estimated coefficients stabilize (often in no more than one or two iterations); this is called iteratively reweighted least squares.
  3. In some cases, the values of the weights may be based on theory or prior research.
  4. In designed experiments with large numbers of replicates, weights can be estimated directly from sample variances of the response variable at each combination of predictor variables.
  5. Use of weights will (legitimately) impact the widths of statistical intervals.

13.2 - Weighted Least Squares Examples

Example 13-1: Computer-Assisted Learning Dataset

The Computer Assisted Learning New dataset was collected from a study of computer-assisted learning by n = 12 students.

i Responses Cost
1 16 77
2 14 70
3 22 85
4 10 50
5 14 62
6 17 70
7 10 55
8 13 63
9 19 88
10 12 57
11 18 81
12 11 51

The response is the cost of the computer time (Y) and the predictor is the total number of responses in completing a lesson (X). A scatterplot of the data is given below.

scatterplot of cost vs num

From this scatterplot, a simple linear regression seems appropriate for explaining this relationship.

First, an ordinary least squares line is fit to the data. Below is the summary of the simple linear regression fit:

Model Summary
S R-sq R-sq(adj) R-sq(pred)
4.59830 88.91% 87.80% 81.27%
Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant 19.47 5.52 3.53 0.005  
num.responses 3.269 0.365 8.95 0.000 1.00
Regression Equation

cost = 19.47 + 3.269 num.responses

A plot of the residuals versus the predictor values indicates possible nonconstant variance since there is a very slight "megaphone" pattern:

residual plot

We will turn to weighted least squares to address this possibility. The weights we will use will be based on regressing the absolute residuals versus the predictor. In Minitab we can use the Storage button in the Regression Dialog to store the residuals. Then we can use Calc > Calculator to calculate the absolute residuals. A plot of the absolute residuals versus the predictor values is as follows:

scatterplot of absres vs num

Specifically, we fit the regression of the absolute residuals on the predictor, use the Storage button to store its fitted values, and then use Calc > Calculator to define the weights as 1 over the squared fitted values. Then we fit a weighted least squares regression model by fitting a linear regression model in the usual way but clicking "Options" in the Regression Dialog and selecting the just-created weights as "Weights."

The summary of this weighted least squares fit is as follows:

Model Summary
S R-sq R-sq(adj) R-sq(pred)
1.15935 89.51% 88.46% 83.87%
Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant 17.30 4.83 3.58 0.005  
num.responses 3.421 0.370 9.24 0.000 1.00

Notice that the regression estimates have not changed much from the ordinary least squares method. The following plot shows both the OLS fitted line (black) and WLS fitted line (red) overlaid on the same scatterplot.

scatterplot of cost vs num

A plot of the studentized residuals (remember Minitab calls these "standardized" residuals) versus the predictor values when using the weighted least squares method shows how we have corrected for the megaphone shape since the studentized residuals appear to be more randomly scattered about 0:

residual plot

With weighted least squares, it is crucial that we use studentized residuals to evaluate the aptness of the model, since these take into account the weights that are used to model the changing variance. The usual residuals don't do this and will maintain the same non-constant variance pattern no matter what weights have been used in the analysis.

Example 13-2: Market Share Data

Here we have market share data for n = 36 consecutive months (Market Share data). Let Y = market share of the product; \(X_1\) = price; \(X_2\) = 1 if a discount promotion was in effect and 0 otherwise; \(X_3\) = 1 if a package promotion was in effect and 0 otherwise, so the interaction \(X_2 X_3\) = 1 if both discount and package promotions were in effect and 0 otherwise. The regression results below are for a useful model in this situation (with predictors Price, \(X_2\), and \(X_2 X_3\)):

Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant 3.196 0.356 8.97 0.000  
Price -0.334 0.152 -2.19 0.036 1.01
Discount 0.3081 0.0641 4.80 0.000 1.68
DiscountPromotion 0.1762 0.0660 2.67 0.012 1.69

This model represents three different scenarios:  

  1. Months in which there was no discount (and either a package promotion or not): X2 = 0 (and X3 = 0 or 1);
  2. Months in which there was a discount but no package promotion: X2 = 1 and X3 = 0;
  3. Months in which there was both a discount and a package promotion: X2 = 1 and X3 = 1.

So, it is fine for this model to break hierarchy if there is no significant difference between the months in which there was no discount and no package promotion and months in which there was no discount but there was a package promotion.

A residual plot suggests nonconstant variance related to the value of \(X_2\):

Scatterplot of RESI1 vs FITS1

From this plot, it is apparent that the values coded as 0 have a smaller variance than the values coded as 1. The residual variances for the two separate groups defined by the discount pricing variable are:

Discount N StDev Variance 95% CI for StDevs
0 15 0.103 0.011 (0.077, 0.158)
1 21 0.164 0.027 (0.136, 0.217)

Because of this nonconstant variance, we will perform a weighted least squares analysis. For the weights, we use \(w_i=1/\hat{\sigma}^2_{j}\), where \(\hat{\sigma}^2_{j}\) is the estimated residual variance for the group (Discount = 0 or 1) that observation i belongs to (in Minitab, use Calc > Calculator and define "weight" as 'Discount'/0.027 + (1-'Discount')/0.011). The results of the weighted least squares analysis (set the just-defined "weight" variable as "Weights" under Options in the Regression dialog) are as follows:

Analysis of Variance
Source DF Adj SS Adj MS F-Value P-Value
Regression 3 96.109 32.036 30.84 0.000
Price 1 4.688 4.688 4.51 0.041
Discount 1 23.039 23.039 22.18 0.000
DiscountPromotion 1 5.634 5.634 5.42 0.026
Error 32 33.246 1.039    
Total 35 129.354      
Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant 3.175 0.357 8.90 0.000  
Price -0.325 0.153 -2.12 0.041 1.01
Discount 0.3083 0.0655 4.71 0.000 2.04
DiscountPromotion 0.1759 0.0755 2.33 0.026 2.05
 

An important note is that Minitab’s ANOVA will be in terms of the weighted SS. When doing a weighted least squares analysis, you should note how different the SS values of the weighted case are from the SS values for the unweighted case.

Also, note how the regression coefficients of the weighted case are not much different from those in the unweighted case. Thus, there may not be much of an obvious benefit to using the weighted analysis (although intervals are going to be more reflective of the data).

If you proceed with a weighted least squares analysis, you should check a plot of the residuals again. Remember to use the studentized residuals when doing so! For this example, the plot of studentized residuals after doing a weighted least squares analysis is given below and the residuals look okay (remember Minitab calls these standardized residuals).

Residuals Versus the Fitted Values plot

Example 13-3: Home Price Dataset

The Home Price data set has the following variables:

Y = sale price of a home
\(X_1\) = square footage of the home
\(X_2\) = square footage of the lot

Since all the variables are highly skewed we first transform each variable to its natural logarithm. Then when we perform a regression analysis and look at a plot of the residuals versus the fitted values (see below), we note a slight “megaphone” or “conic” shape of the residuals.

Term Coef SE Coef T-Value P-Value VIF
Constant 1.964 0.313 6.28 0.000  
logX1 1.2198 0.0340 35.87 0.000 1.05
logX2 0.1103 0.0241 4.57 0.000 1.05
residuals vs fitted values plot

We interpret this plot as showing a mild pattern of nonconstant variance in which the amount of variation is related to the size of the mean (represented here by the fitted values).

So, we use the following procedure to determine appropriate weights:

  • Store the residuals and the fitted values from the ordinary least squares (OLS) regression.
  • Calculate the absolute values of the OLS residuals.
  • Regress the absolute values of the OLS residuals versus the OLS fitted values and store the fitted values from this regression. These fitted values are estimates of the error standard deviations.
  • Calculate weights equal to \(1/fits^{2}\), where "fits" are the fitted values from the regression in the last step.

We then refit the original regression model but using these weights this time in a weighted least squares (WLS) regression.

Results and a residual plot for this WLS model:

Term Coef SE Coef T-Value P-Value VIF
Constant 2.377 0.284 8.38 0.000  
logX1 1.2014 0.0336 35.72 0.000 1.08
logX2 0.0831 0.0217 3.83 0.000 1.08
residuals vs fitted values plot

13.3 - Robust Regression Methods
Note! The material in Sections 13.3-13.6 is considerably more technical than the preceding lessons. It is offered as an introduction to this advanced topic and, given the technical nature of the material, it could be considered optional in the context of this course.

The ordinary least squares estimates for linear regression are optimal when all of the regression assumptions are valid. When some of these assumptions are invalid, least squares regression can perform poorly. Residual diagnostics can help guide you to where the breakdown in assumptions occurs, but they can be time-consuming and sometimes difficult for the untrained eye to interpret. Robust regression methods provide an alternative to least squares regression by requiring less restrictive assumptions. These methods attempt to dampen the influence of outlying cases in order to provide a better fit to the majority of the data.

Outliers have a tendency to pull the least squares fit too far in their direction by receiving much more "weight" than they deserve. Typically, you would expect that the weight attached to each observation would be on average 1/n in a data set with n observations. However, outliers may receive considerably more weight, leading to distorted estimates of the regression coefficients. This distortion results in outliers which are difficult to identify since their residuals are much smaller than they would otherwise be (if the distortion wasn't present). As we have seen, scatterplots may be used to assess outliers when a small number of predictors are present. However, the complexity added by additional predictor variables can hide the outliers from view in these scatterplots. Robust regression down-weights the influence of outliers, which makes their residuals larger and easier to identify.

For our first robust regression method, suppose we have a data set of size n such that

\(\begin{align*} y_{i}&=\textbf{x}_{i}^{\textrm{T}}\beta+\epsilon_{i} \\ \Rightarrow\epsilon_{i}(\beta)&=y_{i}-\textbf{x}_{i}^{\textrm{T}}\beta, \end{align*}\)

where \(i=1,\ldots,n\). Here we have rewritten the error term as \(\epsilon_{i}(\beta)\) to reflect the error term's dependency on the regression coefficients. Ordinary least squares is sometimes known as \(L_{2}\)-norm regression since it is minimizing the \(L_{2}\)-norm of the residuals (i.e., the squares of the residuals). Thus, observations with high residuals (and high squared residuals) will pull the least squares fit more in that direction. An alternative is to use what is sometimes known as least absolute deviation (or \(L_{1}\)-norm regression), which minimizes the \(L_{1}\)-norm of the residuals (i.e., the absolute value of the residuals). Formally defined, the least absolute deviation estimator is

\(\begin{equation*} \hat{\beta}_{\textrm{LAD}}=\arg\min_{\beta}\sum_{i=1}^{n}|\epsilon_{i}(\beta)|, \end{equation*}\)

which in turn minimizes the absolute value of the residuals (i.e., \(|r_{i}|\)).
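Base R's lm() does not minimize this criterion, but least absolute deviation regression is equivalent to median (0.5-quantile) regression, so one option (assuming the quantreg package is installed; y and x are placeholder variable names) is:

library(quantreg)
lad_fit <- rq(y ~ x, tau = 0.5)   # tau = 0.5 gives median regression, i.e., the LAD fit
summary(lad_fit)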

Another quite common robust regression method falls into a class of estimators called M-estimators (and there are also other related classes such as R-estimators and S-estimators, whose properties we will not explore). M-estimators attempt to minimize the sum of a chosen function \(\rho(\cdot)\) which is acting on the residuals. Formally defined, M-estimators are given by

\(\begin{equation*} \hat{\beta}_{\textrm{M}}=\arg\min _{\beta}\sum_{i=1}^{n}\rho(\epsilon_{i}(\beta)). \end{equation*}\)

The M stands for "maximum likelihood" since \(\rho(\cdot)\) is related to the likelihood function for a suitable assumed residual distribution. Notice that, if assuming normality, then \(\rho(z)=\frac{1}{2}z^{2}\) results in the ordinary least squares estimate.
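To see this connection, note that if the errors are independent \(N(0,\sigma^{2})\), the log-likelihood is

\(\begin{equation*} \log L(\beta,\sigma^{2})=-\frac{n}{2}\log(2\pi\sigma^{2})-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}\epsilon_{i}^{2}(\beta), \end{equation*}\)

so, for fixed \(\sigma^{2}\), maximizing the likelihood over \(\beta\) is the same as minimizing \(\sum_{i=1}^{n}\frac{1}{2}\epsilon_{i}^{2}(\beta)\), i.e., the M-estimation criterion with \(\rho(z)=\frac{1}{2}z^{2}\).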

Some M-estimators are influenced by the scale of the residuals, so a scale-invariant version of the M-estimator is used:

\(\begin{equation*} \hat{\beta}_{\textrm{M}}=\arg\min_{\beta}\sum_{i=1}^{n}\rho\biggl(\frac{\epsilon_{i}(\beta)}{\tau}\biggr), \end{equation*}\)

where \(\tau\) is a measure of the scale. An estimate of \(\tau\) is given by

\(\begin{equation*} \hat{\tau}=\frac{\textrm{med}_{i}|r_{i}-\tilde{r}|}{0.6745}, \end{equation*}\)

where \(\tilde{r}\) is the median of the residuals. Minimization of the above is accomplished primarily in two steps:

  1. Set \(\frac{\partial\rho}{\partial\beta_{j}}=0\) for each \(j=0,1,\ldots,p-1\), resulting in a set of p nonlinear equations \(\begin{equation*} \sum_{i=1}^{n}x_{i,j}\psi\biggl(\dfrac{\epsilon_{i}}{\tau}\biggr)=0, \end{equation*}\) where \(\psi(\cdot)=\rho'(\cdot)\) is called the influence function.
  2. A numerical method called iteratively reweighted least squares (IRLS) (mentioned in Section 13.1) is used to iteratively estimate the weighted least squares estimate until a stopping criterion is met. Specifically, for iterations \(t=0,1,\ldots\)

    \(\begin{equation*} \hat{\beta}^{(t+1)}=(\textbf{X}^{\textrm{T}}(\textbf{W}^{-1})^{(t)}\textbf{X})^{-1}\textbf{X}^{\textrm{T}}(\textbf{W}^{-1})^{(t)}\textbf{y}, \end{equation*}\)

    where \((\textbf{W}^{-1})^{(t)}=\textrm{diag}(w_{1}^{(t)},\ldots,w_{n}^{(t)})\) such that

    \( w_{i}^{(t)}=\begin{cases}\dfrac{\psi\bigl((y_{i}-\textbf{x}_{i}^{\textrm{T}}\beta^{(t)})/\hat{\tau}^{(t)}\bigr)}{(y_{i}-\textbf{x}_{i}^{\textrm{T}}\beta^{(t)})/\hat{\tau}^{(t)}}, & \hbox{if \(y_{i}\neq\textbf{x}_{i}^{\textrm{T}}\beta^{(t)}\);} \\ 1, & \hbox{if \(y_{i}=\textbf{x}_{i}^{\textrm{T}}\beta^{(t)}\).} \end{cases} \)

Three common functions chosen in M-estimation are given below:

Andrews' Sine

\(\begin{align*}\rho(z)&=\begin{cases}\ c[1-\cos(z/c)], & \hbox{if \(|z|<\pi c\);}\\ 2c, & \hbox{if \(|z|\geq\pi c\)} \end{cases}  \\ \psi(z)&=\begin{cases} \sin(z/c), & \hbox{if \(|z|<\pi c\);} \\  0, & \hbox{if \(|z|\geq\pi c\)}  \end{cases} \\ w(z)&=\begin{cases} \frac{\sin(z/c)}{z/c}, & \hbox{if \(|z|<\pi c\);} \\ 0, & \hbox{if \(|z|\geq\pi c\),} \end{cases}  \end{align*}\) where \(c\approx1.339\).

Huber's Method

\(\begin{align*} \rho(z)&=\begin{cases} z^{2}, & \hbox{if \(|z|<c\);} \\ 2c|z|-c^{2}, & \hbox{if \(|z|\geq c\)} \end{cases} \\ \psi(z)&=\begin{cases} z, & \hbox{if \(|z|<c\);} \\ c\,\textrm{sgn}(z), & \hbox{if \(|z|\geq c\)} \end{cases} \\ w(z)&=\begin{cases} 1, & \hbox{if \(|z|< c\);} \\ c/|z|, & \hbox{if \(|z|\geq c\),} \end{cases} \end{align*}\) where \(c\approx 1.345\).

Tukey's Biweight

\(\begin{align*} \rho(z)&= \begin{cases} \frac{c^{2}}{3}\biggl\{1-(1-(\frac{z}{c})^{2})^{3}\biggr\}, & \hbox{if \(|z|<c\);} \\ 2c, & \hbox{if \(|z|\geq c\)} \end{cases}  \\ \psi(z)&= \begin{cases} z[1-(\frac{z}{c})^{2}]^{2}, & \hbox{if \(|z|<c\);} \\   0, & \hbox{if \(|z|\geq c\)} \end{cases}   \\ w(z)&= \begin{cases} [1-(\frac{z}{c})^{2}]^{2}, & \hbox{if \(|z|<c\);} \\   0, & \hbox{if \(|z|\geq c\),} \end{cases} \end{align*}\) where \(c\approx 4.685\).
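A minimal R sketch of the IRLS procedure from step 2 above, using Huber weights and the scale estimate \(\hat{\tau}\) from this section, is given below (y and x are placeholder variables; in practice the rlm() function in the MASS package implements M-estimation with these influence functions and handles the details more carefully):

# IRLS with Huber weights (illustrative sketch only)
huber_w <- function(z, c = 1.345) ifelse(abs(z) < c, 1, c/abs(z))

fit <- lm(y ~ x)                                # start from the OLS fit
for (t in 1:25) {
  r   <- residuals(fit)
  tau <- median(abs(r - median(r)))/0.6745      # scale estimate tau-hat
  w   <- huber_w(r/tau)                         # weights w_i = psi(u_i)/u_i
  old <- coef(fit)
  fit <- lm(y ~ x, weights = w)                 # weighted least squares step
  if (max(abs(coef(fit) - old)) < 1e-6) break   # stop when the coefficients stabilize
}
coef(fit)

# Largely equivalent, and preferable in practice (assuming the MASS package):
# library(MASS)
# rlm(y ~ x, psi = psi.huber)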


13.4 - Resistant Regression Methods

The next method we discuss is often used interchangeably with robust regression methods. However, there is a subtle difference between the two methods that is not usually outlined in the literature. Whereas robust regression methods attempt to only dampen the influence of outlying cases, resistant regression methods use estimates that are not influenced by any outliers (this comes from the definition of resistant statistics, which are measures of the data that are not influenced by outliers, such as the median). This is best accomplished by trimming the data, which "trims" extreme values from either end (or both ends) of the range of data values.

There is also one other relevant term when discussing resistant regression methods. Suppose we have a data set \(x_{1},x_{2},\ldots,x_{n}\). The order statistics are simply defined to be the data values arranged in increasing order and are written as \(x_{(1)},x_{(2)},\ldots,x_{(n)}\). Therefore, the minimum and maximum of this data set are \(x_{(1)}\) and \(x_{(n)}\), respectively. As we will see, the resistant regression estimators provided here are all based on the ordered residuals.

We present three commonly used resistant regression methods:

Least Quantile of Squares

The least quantile of squares method minimizes the \(\nu^{\textrm{th}}\) squared ordered residual (presumably selected because it is most representative of where the data are expected to lie) and is formally defined by \(\begin{equation*} \hat{\beta}_{\textrm{LQS}}=\arg\min_{\beta}\epsilon_{(\nu)}^{2}(\beta), \end{equation*}\) where \(\nu=P*n\) corresponds to the \(P^{\textrm{th}}\) percentile (i.e., \(0<P\leq 1\)) of the empirical data (if \(\nu\) is not an integer, then round it to the next greatest or lowest integer value). The special case with \(P = 0.5\) (i.e., the median) is called the least median of squares method (and the estimate is often written as \(\hat{\beta}_{\textrm{LMS}}\)).

Least Trimmed Sum of Squares

The least trimmed sum of squares method minimizes the sum of the \(h\) smallest squared residuals and is formally defined by \(\begin{equation*} \hat{\beta}_{\textrm{LTS}}=\arg\min_{\beta}\sum_{i=1}^{h}\epsilon_{(i)}^{2}(\beta), \end{equation*}\) where \(h\leq n\). If h = n, then you just obtain \(\hat{\beta}_{\textrm{OLS}}\).

Least Trimmed Sum of Absolute Deviations

The least trimmed sum of absolute deviations method minimizes the sum of the h smallest absolute residuals and is formally defined by \(\begin{equation*} \hat{\beta}_{\textrm{LTA}}=\arg\min_{\beta}\sum_{i=1}^{h}|\epsilon(\beta)|_{(i)}, \end{equation*}\) where again \(h\leq n\). If h = n, then you just obtain \(\hat{\beta}_{\textrm{LAD}}\).
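In R, least median of squares and least trimmed squares fits are available through the lqs() function in the MASS package (least trimmed absolute deviations is not included there); a minimal sketch with placeholder variable names:

library(MASS)
lqs(y ~ x, method = "lms")   # least median of squares
lqs(y ~ x, method = "lts")   # least trimmed sum of squares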

So, which method from robust or resistant regressions do we use? In order to guide you in the decision-making process, you will want to consider both the theoretical benefits of a certain method as well as the type of data you have. The theoretical aspects of these methods that are often cited include their breakdown values and overall efficiency. Breakdown values are a measure of the proportion of contamination (due to outlying observations) that an estimation method can withstand and still maintain being robust against the outliers. Efficiency is a measure of an estimator's variance relative to another estimator (when it is the smallest it can possibly be, then the estimator is said to be "best"). For example, the least quantile of squares method and least trimmed sum of squares method both have the same maximal breakdown value for certain P, the least median of squares method is of low efficiency, and the least trimmed sum of squares method has the same efficiency (asymptotically) as certain M-estimators. As for your data, if there appear to be many outliers, then a method with a high breakdown value should be used. A preferred solution is to calculate many of these estimates for your data and compare their overall fits, but this will likely be computationally expensive.


13.5 - Regression Depth

We have discussed the notion of ordering data (e.g., ordering the residuals). The applications we have presented with ordered data have all concerned univariate data sets. However, there are also techniques for ordering multivariate data sets. Statistical depth functions provide a center-outward ordering of multivariate observations, which allows one to define reasonable analogues of univariate order statistics. There are numerous depth functions, which we do not discuss here. However, the notion of statistical depth is also used in the regression setting. Specifically, there is the notion of regression depth, which is a quality measure for robust linear regression.

Statistically speaking, the regression depth of a hyperplane \(\mathcal{H}\) is the smallest number of residuals that need to change sign to make \(\mathcal{H}\) a nonfit. This definition also has convenient statistical properties, such as invariance under affine transformations, which we do not discuss in greater detail. A regression hyperplane is called a nonfit if it can be rotated to horizontal (i.e., parallel to the axis of any of the predictor variables) without passing through any data points. (We count the points exactly on the hyperplane as "passed through".) A nonfit is a very poor regression hyperplane, because it is combinatorially equivalent to a horizontal hyperplane, which posits no relationship between predictor and response variables. Equivalently, the regression depth of a hyperplane \(\mathcal{H}\) is the minimum number of points whose removal makes \(\mathcal{H}\) into a nonfit. For example, consider the data in the figure below.

Scatterplot of y vs x illustrating regression depth

Removing the red circles and rotating the regression line until horizontal (i.e., the dashed blue line) demonstrates that the black line has regression depth 3. Hyperplanes with high regression depth behave well in general error models, including skewed error distributions and distributions with heteroscedastic errors.

The regression depth of n points in p dimensions is upper bounded by \(\lceil n/(p+1)\rceil\), where p is the number of variables (i.e., the number of responses plus the number of predictors). In other words, there exist point sets for which no hyperplane has regression depth larger than this bound. For the simple linear regression example in the plot above, this means there is always a line with regression depth of at least \(\lceil n/3\rceil\).

When confronted with outliers, you may face a choice among several regression lines or hyperplanes to consider for your data. Some of these regressions may be biased or altered from the traditional ordinary least squares line. In such cases, regression depth can help provide a measure of a fitted line that best captures the effects due to outliers.


13.6 - Robust Regression Examples

Quality Measurements Dataset

Let us look at the three robust procedures discussed earlier for the Quality Measure data set. These estimates are provided in the table below for comparison with the ordinary least squares estimate.

A comparison of M-estimators with the ordinary least squares estimator for the quality measurements data set (analysis done in R since Minitab does not include these procedures):

Method \(\boldsymbol{\beta_{0}}\) p-value \(\boldsymbol{\beta_{1}}\) p-value
OLS 29.2342 0.0276 0.5235 0.0045
Andrews' Sine 29.0080 0.0177 0.5269 0.0024
Huber's Method 29.5428 0.0204 0.5183 0.0036
Tukey's Biweight 29.1399 0.0203 0.5248 0.0030

While there is not much of a difference here, it appears that Andrews' Sine method produces the most significant values for the regression estimates. One may wish to then proceed with residual diagnostics and weigh the pros and cons of using this method over ordinary least squares (e.g., interpretability, assumptions, etc.).
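A minimal sketch of how a comparable comparison could be run in R (assuming the MASS package; y and x stand in for the actual column names in quality_measure.txt, and the p-values in the table above require additional steps beyond these summaries):

library(MASS)
summary(lm(y ~ x))                       # ordinary least squares
summary(rlm(y ~ x, psi = psi.huber))     # Huber's method
summary(rlm(y ~ x, psi = psi.bisquare))  # Tukey's biweight
# Andrews' sine is not one of rlm()'s built-in psi functions, but a
# user-written psi function can be supplied through the same argument.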


Software Help 13

The next two pages cover the Minitab and R commands for the procedures in this lesson.

Below is a zip file that contains all the data sets used in this lesson:

STAT501_Lesson13.zip

  • ca_learning.txt
  • home_price.txt
  • market_share.txt
  • quality_measure.txt

Minitab Help 13: Weighted Least Squares

Minitab®

Galton peas (nonconstant variance and weighted least squares)

  • Perform a linear regression analysis to fit an ordinary least squares (OLS) simple linear regression model of Progeny vs Parent (click "Storage" in the regression dialog to store fitted values).
  • Select Calc > Calculator to calculate the weights variable = \(1/SD^{2}\), then perform a linear regression analysis to fit a weighted least squares (WLS) model (click "Options" in the regression dialog to set the weights variable and click "Storage" to store fitted values).
  • Create a basic scatterplot of the data and click Editor > Add > Calculated Line to add a regression line for each model using the stored fitted values.

Computer-assisted learning (nonconstant variance and weighted least squares)

Market share (nonconstant variance and weighted least squares)

  • Perform a linear regression analysis to fit an OLS model (click "Storage" to store the residuals and fitted values).
  • Create a basic scatterplot of the OLS residuals vs fitted values but select "With Groups" to mark the points by Discount.
  • Select Stat > Basic Statistics > Display Descriptive Statistics to calculate the residual variance for Discount=0 and Discount=1.
  • Select Calc > Calculator to calculate the weights variable = 1/variance for Discount=0 and Discount=1, then perform a linear regression analysis to fit a WLS model (click "Options" to set the weights variable and click "Storage" to store standardized residuals and fitted values).
  • Create a basic scatterplot of the WLS standardized residuals vs fitted values.

Home price (nonconstant variance and weighted least squares)


R Help 13: Weighted Least Squares

R

Galton peas (nonconstant variance and weighted least squares)

  • Load the galton data.
  • Fit an ordinary least squares (OLS) simple linear regression model of Progeny vs Parent.
  • Fit a weighted least squares (WLS) model using weights = \(1/{SD^2}\).
  • Create a scatterplot of the data with a regression line for each model.
galton <- read.table("~/path-to-data/galton.txt", header=T)
attach(galton)

model.1 <- lm(Progeny ~ Parent)
summary(model.1)
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 0.127029   0.006993  18.164 9.29e-06 ***
# Parent      0.210000   0.038614   5.438  0.00285 ** 

model.2 <- lm(Progeny ~ Parent, weights=1/SD^2)
summary(model.2)
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 0.127964   0.006811  18.787 7.87e-06 ***
# Parent      0.204801   0.038155   5.368  0.00302 ** 

plot(x=Parent, y=Progeny, ylim=c(0.158,0.174),
     panel.last = c(lines(sort(Parent), fitted(model.1)[order(Parent)], col="blue"),
                    lines(sort(Parent), fitted(model.2)[order(Parent)], col="red")))
legend("topleft", col=c("blue","red"), lty=1,
       inset=0.02, legend=c("OLS", "WLS"))

detach(galton)

Computer-assisted learning (nonconstant variance and weighted least squares)

  • Load the ca_learning data.
  • Create a scatterplot of the data.
  • Fit an OLS model.
  • Plot the OLS residuals vs num.responses.
  • Plot the absolute OLS residuals vs num.responses.
  • Calculate fitted values from a regression of absolute residuals vs num.responses.
  • Fit a WLS model using weights = \(1/{(\text{fitted values})^2}\).
  • Create a scatterplot of the data with a regression line for each model.
  • Plot the WLS standardized residuals vs num.responses.
ca_learning <- read.table("~/path-to-data/ca_learning_new.txt", header=T)
attach(ca_learning)

plot(x=num.responses, y=cost)

model.1 <- lm(cost ~ num.responses)
summary(model.1)
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)    19.4727     5.5162   3.530  0.00545 ** 
# num.responses   3.2689     0.3651   8.955 4.33e-06 ***
# ---
# Residual standard error: 4.598 on 10 degrees of freedom
# Multiple R-squared:  0.8891,  Adjusted R-squared:  0.878 
# F-statistic: 80.19 on 1 and 10 DF,  p-value: 4.33e-06

plot(num.responses, residuals(model.1))
plot(num.responses, abs(residuals(model.1)))

wts <- 1/fitted(lm(abs(residuals(model.1)) ~ num.responses))^2

model.2 <- lm(cost ~ num.responses, weights=wts)
summary(model.2)
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)    17.3006     4.8277   3.584  0.00498 ** 
# num.responses   3.4211     0.3703   9.238 3.27e-06 ***
# ---
# Residual standard error: 1.159 on 10 degrees of freedom
# Multiple R-squared:  0.8951,  Adjusted R-squared:  0.8846 
# F-statistic: 85.35 on 1 and 10 DF,  p-value: 3.269e-06

plot(x=num.responses, y=cost, ylim=c(50,95),
     panel.last = c(lines(sort(num.responses), fitted(model.1)[order(num.responses)], col="blue"),
                    lines(sort(num.responses), fitted(model.2)[order(num.responses)], col="red")))
legend("topleft", col=c("blue","red"), lty=1,
       inset=0.02, legend=c("OLS", "WLS"))

plot(num.responses, rstandard(model.2))

detach(ca_learning)

Market share (nonconstant variance and weighted least squares)

  • Load the marketshare data.
  • Fit an OLS model.
  • Plot the OLS residuals vs fitted values with points marked by Discount.
  • Use the tapply function to calculate the residual variance for Discount=0 and Discount=1.
  • Fit a WLS model using weights = 1/variance for Discount=0 and Discount=1.
  • Plot the WLS standardized residuals vs fitted values.
marketshare <- read.table("~/path-to-data/marketshare.txt", header=T)
attach(marketshare)

model.1 <- lm(MarketShare ~ Price + P1 + P2)
summary(model.1)
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  3.19592    0.35616   8.973 3.00e-10 ***
# Price       -0.33358    0.15226  -2.191   0.0359 *  
# P1           0.30808    0.06412   4.804 3.51e-05 ***
# P2           0.48431    0.05541   8.740 5.49e-10 ***

plot(fitted(model.1), residuals(model.1), col=Discount+1)
vars <- tapply(residuals(model.1), Discount, var)
#          0          1 
# 0.01052324 0.02680546 

wts <- Discount/vars[2] + (1-Discount)/vars[1]

model.2 <- lm(MarketShare ~ Price + P1 + P2, weights=wts)
summary(model.2)
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  3.17437    0.35671   8.899 3.63e-10 ***
# Price       -0.32432    0.15291  -2.121   0.0418 *  
# P1           0.30834    0.06575   4.689 4.89e-05 ***
# P2           0.48419    0.05422   8.930 3.35e-10 ***

plot(fitted(model.2), rstandard(model.2), col=Discount+1)

detach(marketshare)

Home price (nonconstant variance and weighted least squares)

  • Load the realestate data.
  • Calculate log transformations of the variables.
  • Fit an OLS model.
  • Plot the OLS residuals vs fitted values.
  • Calculate fitted values from a regression of absolute residuals vs fitted values.
  • Fit a WLS model using weights = \(1/{(\text{fitted values})^2}\).
  • Plot the WLS standardized residuals vs fitted values.
realestate <- read.table("~/path-to-data/realestate.txt", header=T)
attach(realestate)

logY <- log(SalePrice)
logX1 <- log(SqFeet)
logX2 <- log(Lot)

model.1 <- lm(logY ~ logX1 + logX2)
summary(model.1)
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  4.25485    0.07353  57.864  < 2e-16 ***
# logX1        1.22141    0.03371  36.234  < 2e-16 ***
# logX2        0.10595    0.02394   4.425 1.18e-05 ***

plot(fitted(model.1), residuals(model.1))

wts <- 1/fitted(lm(abs(residuals(model.1)) ~ fitted(model.1)))^2

model.2 <- lm(logY ~ logX1 + logX2, weights=wts)
summary(model.2)
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  4.35189    0.06330  68.755  < 2e-16 ***
# logX1        1.20150    0.03332  36.065  < 2e-16 ***
# logX2        0.07924    0.02152   3.682 0.000255 ***

plot(fitted(model.2), rstandard(model.2))

detach(realestate)
