12.3 - Addendum: Estimating Equations and the Sandwich

Quasi-likelihood

Suppose that we observe responses $$Y_1, \dots , Y_N$$, which we want to relate to covariates. The ordinary linear regression model is

$$Y_i \sim N(x_i^T\beta,\sigma^2)$$

where $$x_i$$ is a vector of covariates, $$\beta$$ is a vector of coefficients to be estimated, and $$\sigma^2$$ is the error variance. Now let's generalize this model in two ways:

First, introduce a link function for the mean $$E(Y_i)=\mu_i$$,

$$g(\mu_i)=x_i^T\beta$$.

We could also write this as

$$E(Y_i)=\mu_i(\beta)\quad$$ (2)

where the covariates and the link become part of the function $$\mu_i$$. For example, in a loglinear model we would have $$\mu_i=\exp(x_i^T\beta)$$.

Second, allow for heteroscedasticity, so that

$$\text{Var}(Y_i)=V_i\quad$$ (3)

In many cases $$V_i$$ will depend on the mean $$\mu_i$$, and therefore on $$\beta$$, and possibly on additional unknown parameters (e.g., a scale factor). For example, in a traditional overdispersed loglinear model we would have $$V_i=\sigma^2 \mu_i=\sigma^2 \exp(x_i^T\beta)$$.
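
A minimal numerical sketch of these two ingredients, using a hypothetical design matrix and coefficients with the overdispersed loglinear forms above:

```python
import numpy as np

# Hypothetical design matrix, coefficients, and scale factor for illustration.
X = np.array([[1.0, 0.5],
              [1.0, 1.5],
              [1.0, 2.5]])
beta = np.array([0.2, 0.4])
sigma2 = 1.3  # overdispersion scale factor

# Loglinear mean function (2): mu_i = exp(x_i^T beta)
mu = np.exp(X @ beta)

# Overdispersed loglinear variance function (3): V_i = sigma^2 * mu_i
V = sigma2 * mu
```

Here each $$V_i$$ depends on $$\beta$$ through $$\mu_i$$, as well as on the extra scale parameter $$\sigma^2$$.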

A maximum-likelihood estimate for $$\beta$$ under this model could be computed by a Fisher scoring procedure. ML estimates have two nice theoretical properties: they are approximately unbiased and highly efficient. Interestingly, the asymptotic theory underlying these properties does not really depend on the normality of $$Y_i$$, but only on the first two moments. That is, if the mean function (2) and the variance function (3) are correct, but the distribution of $$Y_i$$ is not normal, the estimate of $$\beta$$ obtained by maximizing the normal loglikelihood from

$$Y_i \sim N(\mu_i(\beta), V_i(\beta))$$

is still asymptotically unbiased and efficient.

Quasi-scoring

If we maximize the normality-based loglikelihood without assuming that the response is normally distributed, the resulting estimate of $$\beta$$ is called a quasi-likelihood estimate. The iterative procedure for computing the quasi-likelihood estimate, called quasi-scoring, proceeds as follows.

First, let's collect the responses and their means into vectors of length $$N$$,

$$Y= \begin{bmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N \end{bmatrix}, \mu= \begin{bmatrix} \mu_1\\ \mu_2\\ \vdots\\ \mu_N \end{bmatrix}$$

Also, let $$V$$ be the $$N \times N$$ matrix with $$V_1 , \dots , V_N$$ on the diagonal and zeros elsewhere. Finally, let

$$D=\dfrac{\partial \mu}{\partial \beta}$$

be the $$N \times p$$ matrix containing the partial derivatives of $$\mu$$ with respect to $$\beta$$. That is, the $$\left(i, j\right)^{th}$$ element of $$D$$ is $$\dfrac{\partial \mu_i }{\partial \beta_j}$$. In a simple linear model with an identity link, $$\mu_i=x_i^T\beta$$, so $$D$$ is simply the $$N \times p$$ matrix of covariates

$$X= \begin{bmatrix} x_1^T\\ x_2^T\\ \vdots\\ x_N^T \end{bmatrix}$$
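
For a non-identity link, $$D$$ depends on $$\beta$$. In the loglinear case $$\mu_i=\exp(x_i^T\beta)$$, the $$(i,j)$$ entry of $$D$$ is $$\mu_i x_{ij}$$, so $$D=\text{Diag}(\mu)X$$. A small sketch with made-up numbers, checked against a finite difference:

```python
import numpy as np

# Hypothetical design matrix and coefficients for illustration.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
beta = np.array([0.1, 0.3])

# Loglinear link: mu_i = exp(x_i^T beta), so d(mu_i)/d(beta_j) = mu_i * x_ij
mu = np.exp(X @ beta)
D = mu[:, None] * X   # N x p derivative matrix, D = Diag(mu) X

# Sanity check: column j of D should match a finite-difference derivative.
eps = 1e-6
b2 = beta.copy()
b2[1] += eps
fd = (np.exp(X @ b2) - mu) / eps
```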

The iterative quasi-scoring procedure is

$$\beta_{\mbox{new}}-\beta_{\mbox{old}}= \left( D^T V^{-1} D\right)^{-1} D^T V^{-1}(Y-\mu)\quad$$ (4)

where on the right-hand side of (4) the quantities $$D$$, $$V$$, and $$\mu$$ are evaluated at $$\beta = \beta_{\mbox{old}}$$.

It is easy to see that in the case of simple linear regression with homoscedastic response ($$\mu_i=x_i^T\beta$$ and $$V_i=\sigma^2$$ ), the quasi-scoring procedure converges to the OLS estimate

$$\hat{\beta}=\left(X^T X\right)^{-1}X^T Y$$

in a single step regardless of $$\beta_{\mbox{old}}$$. In other cases (e.g. loglinear regression where $$\mu_i=\exp(x_i^T\beta)$$, $$V_i=\sigma^2 \mu_i$$ ) it produces the same estimate for $$\beta$$ that we obtained earlier by fitting the generalized linear model.
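
This one-step behavior is easy to verify numerically. The sketch below (with simulated data and hypothetical helper names) implements the update (4) and confirms that, for the identity link with constant variance, a single step from any starting value lands on the OLS estimate:

```python
import numpy as np

def quasi_scoring_step(beta, Y, mu_fn, D_fn, V_fn):
    """One quasi-scoring update, equation (4):
    beta_new = beta_old + (D^T V^-1 D)^-1 D^T V^-1 (Y - mu),
    with D, V, and mu evaluated at the current beta."""
    mu = mu_fn(beta)
    D = D_fn(beta)
    Vinv = np.diag(1.0 / V_fn(beta))
    step = np.linalg.solve(D.T @ Vinv @ D, D.T @ Vinv @ (Y - mu))
    return beta + step

# Simulated data: identity link, homoscedastic errors.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=20)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=20)

# One step from beta_old = 0; the constant sigma^2 cancels in the update.
beta_hat = quasi_scoring_step(
    np.zeros(2), Y,
    mu_fn=lambda b: X @ b,
    D_fn=lambda b: X,
    V_fn=lambda b: np.ones(len(Y)),
)

# The OLS estimate, for comparison.
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
```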

Quasi-likelihood estimation is really the same thing as generalized linear modeling, except that we no longer have a full parametric model for $$Y_i$$.

If we let $$U$$ be the $$p \times 1$$ vector

$$U=D^TV^{-1}(Y-\mu)$$

the final estimate from the quasi-scoring procedure satisfies the condition $$U = 0$$. $$U$$ can be regarded as the first derivative of the quasi-loglikelihood function. The first derivative of a loglikelihood is called a score, so $$U$$ is called a quasi-score. $$U = 0$$ is often called the set of estimating equations, and the final estimate for $$\beta$$ is the solution to the estimating equations.
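
As a quick numerical check (simulated data, identity link, constant variance), the quasi-score $$U$$ does indeed vanish at the solution of the estimating equations:

```python
import numpy as np

# Simulated linear model for illustration.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(30), rng.normal(size=30)])
Y = X @ np.array([0.5, -1.0]) + rng.normal(size=30)

# The OLS estimate solves the estimating equations for this model.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
mu = X @ beta_hat
Vinv = np.eye(len(Y))            # V = sigma^2 I; the scale drops out of U = 0
U = X.T @ Vinv @ (Y - mu)        # quasi-score D^T V^-1 (Y - mu)
# U is numerically zero at the solution
```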

Estimating equations with a working variance function

We'll suppose that the mean regression function $$\mu_i \left(\beta\right)$$ has been correctly specified but the variance function has not. That is, the data analyst incorrectly supposes that the variance function for $$Y_i$$ is $$\tilde{V}_i$$ rather than $$V_i$$, where $$\tilde{V}_i$$ is another function of $$\beta$$. The analyst then estimates $$\beta$$ by solving the quasi-score equations

$$U=D^T \tilde{V}^{-1}(Y-\mu)=0$$

where $$\tilde{V}$$ is the diagonal matrix of the $$\tilde{V}_i$$'s. Obviously it would be better to perform the scoring procedure using the true variance function $$V = \text{Diag}\left(V_1 , \dots , V_N\right)$$ rather than $$\tilde{V}$$. What are the properties of this procedure when $$\tilde{V}$$ has been used instead of $$V$$?

1. The estimate $$\hat{\beta}$$ is a consistent and asymptotically unbiased estimate of $$\beta$$, even if $$\tilde{V} \neq V$$. It's also asymptotically normal.
2. If $$\tilde{V} \neq V$$ then $$\hat{\beta}$$ is not efficient; that is, its asymptotic variance is larger than it would be if $$\tilde{V} = V$$.
3. If $$\tilde{V}= V$$ then the final value of the matrix $$\left(D^T V^{-1} D\right)^{-1}$$ from the scoring procedure (4) (i.e. the value of this matrix with $$\hat{\beta}$$ substituted for $$\beta$$) is a consistent estimate of $$\text{Var}(\hat{\beta})$$.
4. If $$\tilde{V} \neq V$$ then the final value of the matrix $$\left(D^{T} \tilde{V}^{-1} D\right)^{-1}$$ from (4) is not a consistent estimate of $$\text{Var}(\hat{\beta})$$.

Because of these properties, $$\hat{\beta}$$ may still be a reasonable estimate of $$\beta$$ if $$\tilde{V} \neq V$$, but the final value of $$(D^T \tilde{V}^{-1}D)^{-1}$$, often called the "model-based" or "naive" estimator, will not give accurate standard errors for the elements of $$\hat{\beta}$$. However, this problem can be corrected by using the "robust" or "sandwich" estimator, defined as

$$\left(D^T \tilde{V}^{-1}D\right)^{-1} \left(D^T \tilde{V}^{-1} E \tilde{V}^{-1} D\right) \left(D^T \tilde{V}^{-1}D\right)^{-1}\quad$$ (5)

where

$$E=\text{Diag}\left((Y_1-\mu_1)^2,\ldots, (Y_N-\mu_N)^2\right)\quad$$ (6)

and all quantities in (5) and (6) are evaluated with $$\hat{\beta}$$ substituted for $$\beta$$. The sandwich estimator is a consistent estimate of $$\text{Var}(\hat{\beta})$$ even when $$\tilde{V} \neq V$$.
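
A sketch of the sandwich computation, assuming a simulated linear model in which the analyst uses the working variance $$\tilde{V} = I$$ while the true errors are heteroscedastic:

```python
import numpy as np

# Simulated data: the true error variance grows with |x|, but the analyst
# fits with the working variance V~ = I (ordinary least squares).
rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_sd = 0.5 + np.abs(X[:, 1])          # unknown to the analyst
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n) * true_sd

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
resid = Y - X @ beta_hat                  # Y_i - mu_i at beta_hat

D = X                                     # identity link: D = X
bread = np.linalg.inv(D.T @ D)            # (D^T V~^-1 D)^-1 with V~ = I
meat = D.T @ np.diag(resid**2) @ D        # D^T V~^-1 E V~^-1 D, E as in (6)
sandwich = bread @ meat @ bread           # robust variance estimate (5)

# For comparison: the "model-based" estimate, misleading here because V~ != V.
naive = bread * (resid @ resid) / (n - 2)
se_robust = np.sqrt(np.diag(sandwich))
```

The diagonal of `sandwich` gives consistent squared standard errors for $$\hat{\beta}$$ despite the misspecified working variance.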

The sandwich estimator was first proposed by Huber (1967) and White (1980), but was popularized in the late 1980s when Liang and Zeger (1986) extended it to multivariate and longitudinal responses.
