12.3 - Addendum: Estimating Equations and the Sandwich
Quasi-likelihood
Suppose that we observe responses \(Y_1, Y_2, \dots, Y_N\) which we want to relate to covariates. The ordinary linear regression model is
\(Y_i \sim N(x_i^T\beta,\sigma^2)\)
where \(x_i\) is a vector of covariates, \(\beta\) is a vector of coefficients to be estimated, and \(\sigma^2\) is the error variance. Now let's generalize this model in two ways:
- Introduce a link function for the mean \(E ( Y_i ) = \mu_i\),
\(g(\mu_i)=x_i^T\beta\).
We could also write this as
\(E(Y_i)=\mu_i(\beta)\quad\) (2)
where the covariates and the link become part of the function \(\mu_i\). For example, in a loglinear model we would have \(\mu_i=\exp(x_i^T\beta)\).
- Allow for heteroscedasticity, so that
\(\text{Var}(Y_i)=V_i\quad\) (3)
In many cases \(V_i\) will depend on the mean \(\mu_i\) and therefore on \(\beta\), and possibly on additional unknown parameters (e.g., a scale factor). For example, in a traditional overdispersed loglinear model, we would have \(V_i=\sigma^2 \mu_i=\sigma^2 \exp(x_i^T\beta)\).
A maximum-likelihood estimate for \(\beta\) under this model could be computed by a Fisher scoring procedure. ML estimates have two nice theoretical properties: they are approximately unbiased and highly efficient. Interestingly, the asymptotic theory underlying these properties does not really depend on the normality of \(Y_i\), but only on the first two moments. That is, if the mean function (2) and the variance function (3) are correct, but the distribution of \(Y_i\) is not normal, the estimate of \(\beta\) obtained by maximizing the normal loglikelihood from
\(Y_i \sim N(\mu_i(\beta), V_i(\beta))\)
is still asymptotically unbiased and efficient.
Quasi-scoring
If we maximize the normality-based loglikelihood without assuming that the response is normally distributed, the resulting estimate of \(\beta\) is called a quasi-likelihood estimate. The iterative procedure for computing the quasi-likelihood estimate, called quasi-scoring, proceeds as follows.
First, let's collect the responses and their means into vectors of length \(N\),
\(Y=
\begin{bmatrix}
Y_1\\
Y_2\\
\vdots\\
Y_N
\end{bmatrix},
\mu=
\begin{bmatrix}
\mu_1\\
\mu_2\\
\vdots\\
\mu_N
\end{bmatrix} \)
Also, let \(V\) be the \(N \times N\) matrix with \(V_1 , \dots , V_N\) on the diagonal and zeros elsewhere. Finally, let
\(D=\dfrac{\partial \mu}{\partial \beta}\)
be the \(N \times p\) matrix containing the partial derivatives of \(\mu\) with respect to \(\beta\). That is, the \((i, j)^{th}\) element of \(D\) is \(\dfrac{\partial\mu_i}{\partial\beta_j}\). In a simple linear model with an identity link, \(\mu_i=x_i^T\beta\), \(D\) is simply the \(N \times p\) matrix of covariates
\(X=
\begin{bmatrix}
x_1^T\\
x_2^T\\
\vdots\\
x_N^T
\end{bmatrix}\)
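Similarly, for the loglinear link \(\mu_i=\exp(x_i^T\beta)\) the chain rule gives \(\dfrac{\partial\mu_i}{\partial\beta_j}=\mu_i x_{ij}\), so that \(D=\text{Diag}(\mu_1,\dots,\mu_N)\,X\).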
The iterative quasi-scoring procedure is
\(\beta_{\mbox{new}}-\beta_{\mbox{old}}= \left( D^T V^{-1} D\right)^{-1} D^T V^{-1}(Y-\mu)\quad\) (4)
where on the right-hand side of (4) the quantities \(D\), \(V\), and \(\mu\) are evaluated at \(\beta = \beta_{\mbox{old}}\).
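To make the update (4) concrete, here is a minimal numpy sketch of the quasi-scoring iteration; `quasi_scoring` and its arguments are names invented for this illustration, not part of any package, and the loglinear helpers at the end implement the example \(\mu_i=\exp(x_i^T\beta)\) with working variance \(V_i=\mu_i\) (a constant scale factor \(\sigma^2\) cancels in (4), so it is omitted here).

```python
import numpy as np

def quasi_scoring(X, y, mean_fn, dmean_fn, var_fn, beta0, tol=1e-8, max_iter=100):
    """Quasi-scoring iteration (4): beta_new = beta_old + (D'V^-1 D)^-1 D'V^-1 (y - mu)."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        mu = mean_fn(X, beta)               # N-vector of means mu_i(beta)
        D = dmean_fn(X, beta)               # N x p matrix with (i,j) entry d mu_i / d beta_j
        Vinv = 1.0 / var_fn(mu)             # diagonal of V^{-1}
        A = D.T @ (Vinv[:, None] * D)       # D' V^{-1} D
        U = D.T @ (Vinv * (y - mu))         # quasi-score U = D' V^{-1} (y - mu)
        step = np.linalg.solve(A, U)
        beta = beta + step
        if np.max(np.abs(step)) < tol:      # stop when the update is negligible
            break
    return beta

# Loglinear example: mu_i = exp(x_i' beta), working variance V_i = mu_i
loglin_mean = lambda X, b: np.exp(X @ b)
loglin_dmean = lambda X, b: np.exp(X @ b)[:, None] * X   # D = Diag(mu) X
loglin_var = lambda mu: mu
```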
It is easy to see that in the case of simple linear regression with homoscedastic response (\(\mu_i=x_i^T\beta\) and \(V_i=\sigma^2\) ), the quasi-scoring procedure converges to the OLS estimate
\(\hat{\beta}=\left(X^T X\right)^{-1}X^T Y\)
in a single step regardless of \(\beta_{\mbox{old}}\). In other cases (e.g. loglinear regression where \(\mu_i=\exp(x_i^T\beta)\), \(V_i=\sigma^2 \mu_i\) ) it produces the same estimate for \(\beta\) that we obtained earlier by fitting the generalized linear model.
Quasi-likelihood estimation is really the same thing as generalized linear modeling, except that we no longer have a full parametric model for \(Y_i\).
If we let \(U\) be the \(p \times 1\) vector
\(U=D^TV^{-1}(Y-\mu)\)
the final estimate from the quasi-scoring procedure satisfies the condition \(U = 0\). \(U\) can be regarded as the first derivative of the quasi-loglikelihood function. The first derivative of a loglikelihood is called a score, so \(U\) is called a quasi-score. \(U = 0\) is often called the set of estimating equations, and the final estimate for \(\beta\) is the solution to the estimating equations.
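Continuing the same sketch, one can check numerically that the quasi-score vanishes at the fitted value, i.e. that \(\hat{\beta}\) solves the estimating equations \(U=0\) (the simulated Poisson counts below are purely illustrative):

```python
# Simulated count data for the loglinear example above.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = rng.poisson(np.exp(X @ np.array([0.5, 0.3])))

beta_hat = quasi_scoring(X, y, loglin_mean, loglin_dmean, loglin_var, beta0=np.zeros(2))
mu_hat = loglin_mean(X, beta_hat)
U = loglin_dmean(X, beta_hat).T @ ((y - mu_hat) / loglin_var(mu_hat))
print(np.max(np.abs(U)))   # numerically zero: beta_hat solves U = 0
```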
Estimating equations with a working variance function
We'll suppose that the mean regression function \(\mu_i(\beta)\) has been correctly specified but the variance function has not. That is, the data analyst incorrectly supposes that the variance function for \(Y_i\) is \(\tilde{V}_i\) rather than \(V_i\), where \(\tilde{V}_i\) is another function of \(\beta\). The analyst then estimates \(\beta\) by solving the quasi-score equations
\(U=D^T \tilde{V}^{-1}(Y-\mu)=0\)
where \(\tilde{V}\) is the diagonal matrix with \(\tilde{V}_1, \dots, \tilde{V}_N\) on the diagonal. Obviously it would be better to perform the scoring procedure using the true variance function \(V = \text{Diag}\left(V_1 , \dots , V_N\right)\) rather than \(\tilde{V}\). What are the properties of this procedure when \(\tilde{V}\) has been used instead of \(V\)?
- The estimate \(\hat{\beta}\) is a consistent and asymptotically unbiased estimate of \(\beta\), even if \(\tilde{V} \neq V\). It's also asymptotically normal.
- If \(\tilde{V} \neq V\) then \(\hat{\beta}\) is not efficient; that is, the asymptotic variance of \(\hat{\beta}\) is lowest when \(\tilde{V} = V\).
- If \(\tilde{V}= V\) then the final value of the matrix \(\left(D^T V^{-1} D\right)^{-1}\) from the scoring procedure (4) (i.e. the value of this matrix with \(\hat{\beta}\) substituted for \(\beta\)) is a consistent estimate of \(\text{Var}(\hat{\beta})\).
- If \(\tilde{V} \neq V\) then the final value of the matrix \(\left(D^{T} \tilde{V}^{-1} D\right)^{-1}\) from (4) is not a consistent estimate of \(\text{Var}(\hat{\beta})\).
Because of these properties, \(\hat{\beta}\) may still be a reasonable estimate of \(\beta\) if \(\tilde{V} \neq V\), but the final value of \((D^T \tilde{V}^{-1}D)^{-1}\), often called the "model-based" or "naive" estimator, will not give accurate standard errors for the elements of \(\hat{\beta}\). However, this problem can be corrected by using the "robust" or "sandwich" estimator, defined as
\(\left(D^T \tilde{V}^{-1}D\right)^{-1} \left(D^T \tilde{V}^{-1} E \tilde{V}^{-1} D\right) \left(D^T \tilde{V}^{-1}D\right)^{-1}\quad\) (5)
where
\(E=\text{Diag}\left((Y_1-\mu_1)^2,\ldots, (Y_N-\mu_N)^2\right)\quad\) (6)
and all quantities in (5) and (6) are evaluated with \(\hat{\beta}\) substituted for \(\beta\). The sandwich estimator is a consistent estimate of \(\text{Var}(\hat{\beta})\) even when \(\tilde{V} \neq V\).
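As a minimal sketch of (5) and (6), again using numpy and the hypothetical helpers above (`sandwich_cov` is a name invented for this illustration, not a library function):

```python
def sandwich_cov(X, y, beta_hat, mean_fn, dmean_fn, workvar_fn):
    """Model-based ("naive") and sandwich covariance estimates at beta_hat, per (5)-(6)."""
    mu = mean_fn(X, beta_hat)
    D = dmean_fn(X, beta_hat)
    Vinv = 1.0 / workvar_fn(mu)                          # inverse of the working variances
    bread = np.linalg.inv(D.T @ (Vinv[:, None] * D))     # (D' Vt^{-1} D)^{-1}
    resid2 = (y - mu) ** 2                               # diagonal of E in (6)
    meat = D.T @ (((Vinv ** 2) * resid2)[:, None] * D)   # D' Vt^{-1} E Vt^{-1} D
    return bread, bread @ meat @ bread                   # (model-based, sandwich)

# For the Poisson-generated fit above, the working variance V_i = mu_i is correct,
# so the two sets of standard errors should be close; under overdispersion or other
# misspecification, only the sandwich standard errors remain trustworthy.
naive, robust = sandwich_cov(X, y, beta_hat, loglin_mean, loglin_dmean, loglin_var)
print(np.sqrt(np.diag(naive)))
print(np.sqrt(np.diag(robust)))
```

Standard errors are the square roots of the diagonal entries of these covariance matrices.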
The sandwich estimator was first proposed by Huber (1967) and White (1980), but it was popularized in the late 1980s when Liang and Zeger (1986) extended it to multivariate or longitudinal responses.