Scipy Stats Project: GLM Residuals and The Beauty of Stats with Python + SciPy

I just finished including the Anscombe residuals for the families in the generalized linear models. The Anscombe residuals for the Binomial family were particularly tricky. It took me a while to work through the math and then figure out the SciPy syntax for what I need (some docs clarification coming...), but if I had had to implement these functions myself (presumably not in Python), it would have taken me more than a week! I thought it provided a good opportunity introduce the residuals and to show off how easy things are in Python with NumPy/SciPy.

Note that there is not really a unified terminology for residual analysis in GLMs. I will try to use the most common names for these residuals in introducing the basics and point out any deviations from the norm both here and in the SciPy documentation.

In general, residuals should be defined such that their distribution is approximately normal. This is achieved through the use of linear equations, transformed linear equations, and deviance contributions. Below you will find the five most common types of residuals that rely mainly on transformations and one that relies on deviance contributions, though there are as many as 9+ different types of residuals in the literature.

Response Residuals

These are simply the observed response values minus the predicted values.

$r_{i}^{R}=y_{i}-\hat{\mu}_{i}$

In a classic linear model these are just the expected residuals

$Y-X\beta$

However, for GLM, these become

$Y-g^{-1}(X\beta)$

where $g(\cdot)$ is the link function that makes our model's linear predictor comparable to response vector.

It is, however, common in GLMs to produce residuals that deviate greatly from the classical assumptions needed in residual analysis, so in addition we have these other residuals which attempt to mitigate deviations from the needed assumptions.

Pearson Residuals

The Pearson residuals are a version of the response residuals, scaled by the standard deviation of the prediction. The name comes from the fact that the sum of the Pearson residuals for a Poisson GLM is equal to Pearson's $\chi^{2}$ statistic, a goodness of fit measure.

$r_{i}^{P}=\frac{(y_{i}-\hat{\mu_{i}})}{\sqrt{\text{VAR}[\hat{\mu}_{i}]}}$

The scaling allows plotting of these residuals versus an individual predictor or the outcome to identify outliers. The problem with these residuals though is that asymptotically they are normally distributed, but in practice they can be quite skewed leading to a misinterpretation of the actual dispersion.

Working Residuals

These are the difference between the working response and the linear predictor at convergence (ie., the last step in the iterative process).
$r_{i}^{W}=\left(y_{i}-\hat{\mu}_{i}\right)\left.\left(\frac{\partial\eta}{\partial\mu}\right)_{i}\right|_{\mu=\hat{\mu}}$

Anscombe Residuals

Anscombe (1960, 1961) describes a general transformation

in place of

so that they are as close to a normal distribution as possible. (Note: "There is a maddeningly great diversity of the forms that the Anscombe residuals take in the literature." I have included the simplest as described below. (Gill 54, emphasis added)) The function $A(\cdot)$ is given by

$A(\cdot)=\int\text{VAR}[\mu]^{-\frac{1}{3}}d\mu$

This is done for both the response and the predictions. This difference is then scaled by dividing by

$A^{\prime}\left(\hat{\mu}_{i} \right )\sqrt{\text{VAR}\left(\hat{\mu}_i \right )}$

so that the Anscombe Residuals are

$r_{i}^{A}=\frac{A\left(y_{i} \right )-A\left(\hat{\mu}_{i} \right )}{A^{\prime}\left(\hat{\mu}_{i}\right)\sqrt{\text{VAR}\left(\hat{\mu}_{i} \right )} \right )}$

The Poisson distribution has constant variance $\mu$ so that it's Anscombe Residuals are simply

$\frac{3\left(y_{i}^{\frac{2}{3}}-\hat{\mu}_i^{\frac{2}{3}} \right )}{2\hat{\mu}_{i}^{\frac{1}{6}}}$

Easy right? Sure was until I ran into the binomial distribution. The Anscombe residuals are built up in a different way for the binomial family. Indeed, the McCullagh and Nelder text does not even provide the general formula for the binomial Anscombe residuals and refers the reader to Anscombe (1953) and Cox & Snell (1968). The problem is that following this transformation for the binomial distribution leads to a rather nasty solution involving the hypergeometric 2F1 function or equivalently the incomplete beta function multiplied by the beta function as shown by Cox and Snell (1968).

Gill writes "A partial reason that Anscombe residuals are less popular than other forms is the difficulty in obtaining these tabular values." The tabular values to which he is referring are found in Cox and Snell (1968) p. 260. It is a table of the incomplete beta function that was tabulated numerically for an easy reference. How difficult would it be to get this table with NumPy and SciPy?

import numpy as np
from scipy import special
betainc = lambda x: special.betainc(2/3.,2/3.,x)

table = np.arange(0,.5,.001).reshape(50,10)
results = []
for i in table:
results.append(betainc(i))

results = np.asarray(results).reshape(50,10)

That's it!

Now the Anscombe residuals for the binomial distribution are
$r_{i}^{A}=\frac{\left(\phi\left(\frac{Y_i}{n_i} \right )-\phi{\left( \hat{\mu}_{i}\right )} \right )}{\hat{\mu}_{i}^{\frac{1}{6}}\left(1-\hat{\mu}_{i} \right )^{\frac{1}{6}}}\ \\ \text{ where } \phi\left(\mu\right)=\int_{0}^{\mu}t^{-\frac{1}{3}}\left(1-t \right )^{-\frac{1}{3}}=I_{\mu}\left(\frac{2}{3},\frac{2}{3} \right )B\left(\frac{2}{3},\frac{2}{3} \right )$

Where n is the number of trials for each observation (ie., 1,

, or $n_{i}$ )

To implement this in the GLM binomial family class, I defined an intermediate function cox_snell similar to the above. It now looks like

from scipy import special
import numpy as np

def resid_anscombe(self, Y, mu):
cox_snell = lambda x: special.betainc(2/3.,2/3.,x)*special.beta(2/3.,2/3.)
return np.sqrt(self.n)*(cox_snell(Y)-cox_snell(mu))/(mu**(1/6.)*(1-mu)**(1/6.))

We multiply the above formula by np.sqrt(self.n) the square root of the number of trials in order to correct for possible use of proportional outcomes Y and mu.

Also, see the ticket here for a note about the incomplete beta function.

Deviance Residuals

One other important type of residual in GLMs is the deviance residual. The deviance residual is the most general and also the most useful of the GLM residuals. The IRLS algorithm (as will be shown in a future post) depends on the convergence of the deviance function. The deviance residual then is just the increment to the overall deviance of each observation.

$r_{i}^{D}=\text{sign}\left(y_i-\hat{\mu}_{i} \right )\sqrt{\hat{d}_i^{2}}$

where $\hat{d}_{i}^{2}$ are defined for each family.

Note that most of these residuals also come in variations such as modified, standardized, studentized, and adjusted.

Selected References

Anscombe, FJ (1953) "Contribution to the discussion of H. Hotelling's paper."
   Journal of the Royal Statistical Society, B, 15, 229-30.

Anscombe, FJ (1960) "Rejection of outliers." Technometrics, 2,
  123-47.

Anscombe, FJ (1961) "Examination of residuals." Proceedings of
  the Fourth Berkeley Symposium on Mathematical Statistics and
  Probability. Berkeley: University of California Press.

Cox, DR and Snell, EJ (1968). "A generalized definition of
  residuals." Journal of the Royal Statistical Society B, 30, 248-65.

Scipy Stats Project

Saturday, July 11, 2009

GLM Residuals and The Beauty of Stats with Python + SciPy

Response Residuals

Pearson Residuals

Working Residuals

Anscombe Residuals

Deviance Residuals

Selected References

No comments:

Post a Comment

Blog Archive

My Google Reader Shared Items

Contributors

Followers