# 12.3 - Addendum: Estimating Equations and the Sandwich

### Quasi-likelihood

Suppose that we observe responses *y*_{1}, *y*_{2}, ... , *y*_{N} which we want to relate to covariates. The ordinary linear regression model is:

\(y_i \sim N(x_i^T\beta,\sigma^2)\)

where *x*_{i} is a vector of covariates, β is a vector of coefficients to be estimated, and σ^{2} is the error variance. Now let's generalize this model in two ways:

First, introduce a link function for the mean *E*(*y*_{i}) = μ_{i},

\(g(\mu_i)=x_i^T\beta\).

We could also write this as:

\(E(y_i)=\mu_i(\beta)\) (2)

where the covariates and the link become part of the function μ_{i}. For example, in a loglinear model we would have \(\mu_i=\text{exp}(x_i^T\beta)\).

Second, allow for heteroscedasticity, so that

\(\text{Var}(y_i)=V_i\). (3)

In many cases *V*_{i} will depend on the mean μ_{i} and therefore on β, and possibly on additional unknown parameters (e.g. a scale factor). For example, in a traditional overdispersed loglinear model, we would have \(V_i=\sigma^2 \mu_i=\sigma^2 \text{exp}(x_i^T\beta)\).

A maximum-likelihood estimate for β under this model could be computed by a Fisher scoring procedure. ML estimates have two nice theoretical properties: they are approximately unbiased and highly efficient. Interestingly, the asymptotic theory underlying these properties does not really depend on the normality of *y*_{i}, but only on the first two moments. That is, if the mean function (2) and the variance function (3) are correct, but the distribution of *y*_{i} is not normal, the estimate of β obtained by maximizing the normal loglikelihood from

\(y_i \sim N(\mu_i(\beta), V_i(\beta))\)

is still asymptotically unbiased and efficient.

### Quasi-scoring

If we maximize the normality-based loglikelihood without assuming that the response is normally distributed, the resulting estimate of β is called a quasi-likelihood estimate. The iterative procedure for computing the quasi-likelihood estimate, called quasi-scoring, proceeds as follows.

First, let's collect the responses and their means into vectors of length *N*,

\(y=\begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_N \end{bmatrix}, \qquad \mu=\begin{bmatrix} \mu_1\\ \mu_2\\ \vdots\\ \mu_N \end{bmatrix}\)

Also, let *V* be the *N* × *N* matrix with *V*_{1}, ... , *V*_{N} on the diagonal and zeros elsewhere. Finally, let

\(D=\dfrac{\partial \mu}{\partial \beta}\)

be the *N* × *p* matrix containing the partial derivatives of μ with respect to β. That is, the (*i*, *j*)th element of *D* is ∂μ_{i} / ∂β_{j}. In a simple linear model with an identity link, \(\mu_i=x_i^T\beta\), *D* is simply the *N* × *p* matrix of covariates

\(X=\begin{bmatrix} x_1^T\\ x_2^T\\ \vdots\\ x_N^T \end{bmatrix}\)
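As a quick sanity check on the form of *D*, here is a minimal NumPy sketch (not part of the original notes; the data and variable names are illustrative). For the log link, the (*i*, *j*)th element of *D* is μ_{i}·*x*_{ij}, i.e. *D* = Diag(μ)*X*, and we can confirm the analytic matrix against central finite differences:

```python
import numpy as np

# Illustrative check that D = d(mu)/d(beta) has (i, j)th element
# mu_i * x_ij under the log link mu_i = exp(x_i^T beta).
rng = np.random.default_rng(2)
N, p = 5, 2
X = rng.normal(size=(N, p))
beta = np.array([0.3, -0.2])

def mu(b):
    """Loglinear mean function mu_i = exp(x_i^T b)."""
    return np.exp(X @ b)

# analytic D: each row of X scaled by the corresponding mu_i
D = mu(beta)[:, None] * X

# central finite differences, one column of D per coefficient
eps = 1e-6
D_fd = np.column_stack([
    (mu(beta + eps * np.eye(p)[j]) - mu(beta - eps * np.eye(p)[j])) / (2 * eps)
    for j in range(p)
])
```

The same finite-difference check works for any mean function μ(β), which is handy when the link makes the analytic derivative less obvious.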

The iterative quasi-scoring procedure is

\(\beta_{new}-\beta_{old}=\left( D^T V^{-1} D\right)^{-1} D^T V^{-1}(y-\mu)\) (4)

where on the right-hand side of (4) the quantities *D*, *V*, and μ are evaluated at β = β_{old}.

It is easy to see that in the case of simple linear regression with homoscedastic response (\(\mu_i=x_i^T\beta\) and \(V_i=\sigma^2\)), the quasi-scoring procedure converges to the OLS estimate

\(\hat{\beta}=\left(X^T X\right)^{-1}X^T y\)

in a single step regardless of β_{old}. In other cases (e.g. loglinear regression where \(\mu_i=\text{exp}(x_i^T\beta)\), \(V_i=\sigma^2 \mu_i\)) it produces the same estimate for β that we obtained earlier by fitting the generalized linear model.
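The one-step claim is easy to verify numerically. The following NumPy sketch (illustrative, not from the original notes) uses the identity link, so *D* = *X*, and constant variance, so σ^{2} cancels from both sides of (4); a single update from a deliberately bad starting value lands exactly on the OLS estimate:

```python
import numpy as np

# Identity link (mu = X beta, D = X) and constant variance (V = sigma^2 I):
# sigma^2 cancels in update (4), so one quasi-scoring step from ANY
# beta_old should land exactly on the OLS estimate.
rng = np.random.default_rng(3)
N = 50
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([1.0, -2.0]) + rng.normal(size=N)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

beta_old = np.array([10.0, 10.0])            # deliberately far from the answer
mu = X @ beta_old
beta_new = beta_old + np.linalg.solve(X.T @ X, X.T @ (y - mu))
```

Algebraically, β_{old} + (X^{T}X)^{-1}X^{T}(y − Xβ_{old}) = (X^{T}X)^{-1}X^{T}y, so the starting value drops out entirely.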

Quasi-likelihood estimation is really the same thing as generalized linear modeling, except that we no longer have a full parametric model for *y*_{i}.

If we let *U* be the *p* × 1 vector

\(U=D^TV^{-1}(y-\mu)\)

the final estimate from the quasi-scoring procedure satisfies the condition *U* = 0. *U* can be regarded as the first derivative of the quasi-loglikelihood function. The first derivative of a loglikelihood is called a score, so *U* is called a quasi-score. *U* = 0 is often called the set of **estimating equations**, and the final estimate for β is the solution to the estimating equations.
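To make the estimating equations concrete, here is a small NumPy sketch (not part of the original notes; data and names are illustrative) that solves *U* = 0 by the quasi-scoring iteration (4) for a loglinear model with \(V_i=\sigma^2\mu_i\), and checks that the quasi-score vanishes at the final estimate:

```python
import numpy as np

# Solve the estimating equations U = D^T V^{-1} (y - mu) = 0 by
# quasi-scoring (4) for a loglinear model with V_i = sigma^2 * mu_i.
# (sigma^2 cancels from the update, so we can set it to 1.)
rng = np.random.default_rng(0)
N, p = 200, 2
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([1.0, 0.5])
y = rng.poisson(np.exp(X @ beta_true)).astype(float)

beta = np.zeros(p)
for _ in range(100):
    mu = np.exp(X @ beta)
    D = mu[:, None] * X                      # d(mu)/d(beta) for the log link
    Vinv = 1.0 / mu                          # working V_i = mu_i
    step = np.linalg.solve(D.T @ (Vinv[:, None] * D), D.T @ (Vinv * (y - mu)))
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:         # converged
        break

# At the solution the quasi-score should vanish. For this model it
# simplifies to U = X^T (y - mu), because D^T V^{-1} = X^T Diag(mu) Diag(1/mu).
U = X.T @ (y - np.exp(X @ beta))
```

Note that σ^{2} affects neither the update nor the solution of *U* = 0, which is why the scale factor can be estimated separately after β.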

**Estimating equations with a working variance function.** We'll suppose that the mean regression function μ_{i}(β) has been correctly specified but the variance function has not. That is, the data analyst incorrectly supposes that the variance function for *y*_{i} is \(\tilde{V}_i\) rather than \(V_i\), where \(\tilde{V}_i\) is another function of β. The analyst then estimates β by solving the quasi-score equations

\(U=D^T \tilde{V}^{-1}(y-\mu)=0\)

where \(\tilde{V}\) is the diagonal matrix of the \(\tilde{V}_i\)'s. Obviously it would be better to perform the scoring procedure using the true variance function *V* = Diag(*V*_{1}, ... , *V*_{N}) rather than \(\tilde{V}\). What are the properties of this procedure when \(\tilde{V}\) has been used instead of *V*?

- The estimate \(\hat{\beta}\) is a consistent and asymptotically unbiased estimate of β, even if \(\tilde{V} \neq V\). It's also asymptotically normal.
- If \(\tilde{V} \neq V\) then \(\hat{\beta}\) is not efficient; that is, the asymptotic variance of \(\hat{\beta}\) is lowest when \(\tilde{V} = V\).
- If \(\tilde{V} = V\) then the final value of the matrix \(\left(D^T V^{-1} D\right)^{-1}\) from the scoring procedure (4) (i.e. the value of this matrix with \(\hat{\beta}\) substituted for β) is a consistent estimate of \(\text{Var}(\hat{\beta})\).
- If \(\tilde{V} \neq V\) then the final value of the matrix \(\left(D^T \tilde{V}^{-1} D\right)^{-1}\) from (4) is not a consistent estimate of \(\text{Var}(\hat{\beta})\).

Because of these properties, \(\hat{\beta}\) may still be a reasonable estimate of β if \(\tilde{V} \neq V\), but the final value of \((D^T \tilde{V}^{-1}D)^{-1}\), often called the "model-based" or "naive" estimator, will not give accurate standard errors for the elements of \(\hat{\beta}\). However, this problem can be corrected by using the "robust" or "sandwich" estimator, defined as

\(\left(D^T \tilde{V}^{-1}D\right)^{-1} \left(D^T \tilde{V}^{-1} E \tilde{V}^{-1} D\right) \left(D^T \tilde{V}^{-1}D\right)^{-1}\), (5)

where,

\(E=\text{Diag}\left((y_1-\mu_1)^2,\ldots, (y_N-\mu_N)^2\right)\), (6)

and all quantities in (5) and (6) are evaluated with \(\hat{\beta}\) substituted for β. The sandwich estimator is a consistent estimate of \(\text{Var}(\hat{\beta})\) even when \(\tilde{V} \neq V\).
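A small NumPy sketch may help fix ideas (illustrative only; the data-generating choices are assumptions, not part of the original notes). We fit by OLS, i.e. with the working variance \(\tilde{V}_i \equiv 1\), on data whose true errors are heteroscedastic; with the identity link *D* = *X*, so (5) reduces to \((X^TX)^{-1}(X^T E X)(X^TX)^{-1}\):

```python
import numpy as np

# Illustrative sandwich calculation: fit by OLS (working variance
# Vtilde_i = 1) when the true errors are heteroscedastic. With the
# identity link D = X, formula (5) reduces to
# (X^T X)^{-1} (X^T E X) (X^T X)^{-1}, with E = Diag(residual^2).
rng = np.random.default_rng(1)
N = 500
X = np.column_stack([np.ones(N), rng.uniform(0.0, 2.0, size=N)])
beta_true = np.array([1.0, 2.0])
sd = 0.5 + X[:, 1]                            # true Var(y_i) grows with the covariate
y = X @ beta_true + sd * rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # OLS = solution of the quasi-score equations
resid = y - X @ beta_hat

bread = np.linalg.inv(X.T @ X)                # (D^T Vtilde^{-1} D)^{-1}
meat = (X * resid[:, None] ** 2).T @ X        # D^T Vtilde^{-1} E Vtilde^{-1} D
sandwich = bread @ meat @ bread               # robust estimate of Var(beta_hat)
naive = bread * np.mean(resid ** 2)           # model-based estimate, for contrast
```

Here the diagonal of `sandwich` gives standard errors that remain trustworthy despite the wrong working variance, while the diagonal of `naive` does not.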

The sandwich estimator was first proposed by Huber (1967) and White (1980), but was popularized in the late 1980s when Liang and Zeger extended it to multivariate or longitudinal responses.