Scipy Stats Project: July 2009

Saturday, July 25, 2009

Iterated Reweighted Least Squares

I have spent the last two weeks putting the "finishing" touches on the generalized linear models and starting to go over the robust linear models (RLM). The test suite for GLM is not complete yet, but all of the exponential families are covered with at least their default link functions tested and are looking good. So in an effort to make a first pass over all of the existing code, I moved on to RLM. After the time spent with GLM theory, the RLM theory and code were much more manageable.

Before discussing the RLMs, their implementation, and the extensions I have made, I will describe the iterated reweighted least squares (IRLS) algorithm for the GLMs to demonstrate the theory and the solution method in the models code. A very similar iteration is done for the RLMs as well.

The main idea of GLM, as noted, is to relate a response variable to a linear model via a link function, which allows us to use least squares regression. Let us take as an example the binomial family (which is written to handle Bernoulli and binomial data). In this case, the response variable is Bernoulli: 1 indicates a "success" and 0 a "failure".

The default link for the binomial family is the logit link. So we have

$\eta=g(\mu)=\ln\frac{\mu}{1-\mu}$

η is our linear predictor and μ is our actual mean response. The first thing that we need for the algorithm is to compute a first guess on μ (IRLS as opposed to Newton-Raphson makes a guess on the mean response rather than the parameter vector β). The algorithm is fairly robust to this first guess; however, a common choice is

$\mu=\frac{y+\bar{y}}{2}$

For the binomial family, we specifically use

$\mu=\frac{y+.5}{2}$

where y is our given response variable. We then use this first guess to initialize our linear predictor η via the link function given above. With these estimates we are able to start the iteration. Our convergence criterion is based on the deviance function, which is simply twice the difference between the saturated log-likelihood and the log-likelihood of our current guess on the fitted model.

$D=2\phi\left[ \mathcal{L}\left(y;y \right ) - \mathcal{L}\left(y;\mu \right )\right ]$

where φ is a dispersion (scale) parameter. Note that the saturated log-likelihood is simply the log-likelihood of the perfectly fitted model where y = μ. For the binomial family the deviance function is

$D=\sum_{i}D_{i}\text{, where }D_{i}=\begin{cases} -2\ln\left(1-\mu_{i}\right) & \text{, if }y_{i}=0\\ -2\ln\mu_{i} & \text{, if }y_{i}=1\end{cases}$

The iteration continues while the deviance function evaluated at the updated μ differs from the previous value by more than the given convergence tolerance (default is 1e-08) and the number of iterations is less than the given maximum (default is 100).
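As a minimal sketch of the starting values and the deviance just described (plain NumPy rather than the models code itself; the names `logit` and `bernoulli_deviance` and the toy response vector are illustrative assumptions), the pieces might look like this:

```python
import numpy as np

def logit(mu):
    """Logit link: eta = g(mu) = ln(mu / (1 - mu))."""
    return np.log(mu / (1.0 - mu))

def bernoulli_deviance(y, mu):
    """Deviance for 0/1 data: -2 ln(1 - mu_i) if y_i = 0, -2 ln(mu_i) if y_i = 1."""
    return np.sum(np.where(y == 1, -2.0 * np.log(mu), -2.0 * np.log(1.0 - mu)))

# Toy Bernoulli response (illustrative values only)
y = np.array([0., 1., 1., 0., 1.])

mu0 = (y + 0.5) / 2.0              # starting guess used for the binomial family
eta0 = logit(mu0)                  # initial linear predictor via the link
dev0 = bernoulli_deviance(y, mu0)  # deviance at the starting guess
```

The starting guess keeps every μ strictly between 0 and 1, so both the link and the deviance are finite at the first step.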

The actual iterations, as the name of the algorithm suggests, run a weighted least squares fit of the actual regressors on the adjusted linear predictor (our transformed guess on the response variable). The adjusted linear predictor is given by

$z=\eta+\frac{\partial\eta}{\partial\mu}\left(y-\mu \right ) \\ \text{Recalling that }\eta=g(\mu)$

which moves us closer to the root of the estimating equation (see the previously mentioned Gill or Hardin and Hilbe for the details of root-finding using the Newton-Raphson method; IRLS is simply Newton-Raphson using some simplifications). The only remaining ingredient is the choice of weights for the weighted least squares. Similar to other methods such as generalized least squares (GLS) that correct for heteroskedasticity in the error terms, a diagonal matrix containing estimates of the variance of the error terms is used. However, in our case, the exact form of this variance is unknown and difficult to estimate, but the error of each estimate is assumed to vary with the mean response. Thus, we improve the weights at each step given our best guess for the mean response variable μ and the known variance function of each family. For the binomial family, as is well known, the variance function is

$V_{i}=\mu_{i}(1-\mu_{i})$

Thus we can obtain an estimate of the parameters using weighted least squares. Using these estimates we update η

$\eta=X\beta + \left(\text{offset}\right)$

correcting by an offset if one is specified (the code currently does not support the use of offsets, but this would be a simple extension). Using this linear predictor (remember it was originally given by the transformation η = g(μ)), we update our guess on the mean response variable via $\mu=g^{-1}(\eta)$ and use this to update our estimate of the deviance function. This is continued until convergence is reached.
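To make the loop concrete, here is a rough sketch of IRLS for Bernoulli data with the logit link, written in plain NumPy. It is not the models code: the function name `irls_logit`, the simulated data, and the choice of weights $1/(V(\mu)\,g'(\mu)^{2})$ (the usual textbook form) are assumptions made for illustration, and no offset is included.

```python
import numpy as np

def irls_logit(X, y, tol=1e-8, maxiter=100):
    """Sketch of IRLS for 0/1 responses with the logit link (no offset)."""
    mu = (y + 0.5) / 2.0                  # first guess on the mean response
    eta = np.log(mu / (1.0 - mu))         # eta = g(mu)
    dev = np.inf
    for n_iter in range(1, maxiter + 1):
        var = mu * (1.0 - mu)             # binomial variance function
        deta_dmu = 1.0 / var              # g'(mu) for the logit link
        z = eta + (y - mu) * deta_dmu     # adjusted linear predictor
        w = 1.0 / (var * deta_dmu ** 2)   # assumed weights: 1 / (V * g'(mu)^2)
        sw = np.sqrt(w)
        # weighted least squares of z on X
        beta = np.linalg.lstsq(X * sw[:, None], z * sw, rcond=None)[0]
        eta = X @ beta                    # update the linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))   # mu = g^{-1}(eta)
        dev_new = -2.0 * np.sum(y * np.log(mu) + (1.0 - y) * np.log(1.0 - mu))
        converged = abs(dev_new - dev) < tol
        dev = dev_new
        if converged:
            break
    return beta, mu, dev, n_iter

# Simulated data (illustrative only): one regressor plus a constant
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack((x, np.ones_like(x)))
p = 1.0 / (1.0 + np.exp(-(1.0 * x - 0.5)))
y = (rng.random(200) < p).astype(float)

beta, mu, dev, n_iter = irls_logit(X, y)
```

At convergence the score equations for the canonical link, `X.T @ (y - mu)`, should be numerically zero, which gives a quick sanity check on a fit like this.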

Once the fit is obtained, the covariance matrix is obtained as always, though in the case of GLM it is weighted by an estimate of the scale chosen when the fit method is called. The default is Pearson's Chi-Squared. The standard errors of the estimate are obtained from the square roots of the diagonal elements of this matrix.
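The scaling can be sketched as follows, under the assumption that the scale estimate is Pearson's chi-squared divided by the residual degrees of freedom and that the covariance is the scale times the inverse of $X^{T}WX$. The helper names `pearson_chi2_scale` and `cov_params` are hypothetical. For the Gaussian family with the identity link this should reduce to the classical OLS covariance, which is what the example checks.

```python
import numpy as np

def pearson_chi2_scale(y, mu, var, df_resid):
    """Assumed scale estimate: Pearson chi-squared / residual degrees of freedom."""
    return np.sum((y - mu) ** 2 / var) / df_resid

def cov_params(X, var, deta_dmu, scale):
    """Assumed covariance: scale * (X' W X)^{-1} with W = 1 / (V * g'(mu)^2)."""
    w = 1.0 / (var * deta_dmu ** 2)
    return scale * np.linalg.inv(X.T @ (X * w[:, None]))

# Gaussian family, identity link: V(mu) = 1 and g'(mu) = 1 (simulated data)
rng = np.random.default_rng(1)
x = rng.normal(size=50)
X = np.column_stack((x, np.ones_like(x)))
y = X @ np.array([2.0, 1.0]) + rng.normal(scale=0.5, size=50)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
mu = X @ beta
ones = np.ones_like(mu)

scale = pearson_chi2_scale(y, mu, var=ones, df_resid=len(y) - X.shape[1])
cov = cov_params(X, var=ones, deta_dmu=ones, scale=scale)
bse = np.sqrt(np.diag(cov))   # standard errors of the estimates
```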

And that's basically it. Much of my time with the GLM code was spent getting my head around the theory and then this algorithm. Then I had to obtain data (Jeff Gill was kind enough to give permission for SciPy to use and distribute the datasets from his nice monograph) and write tests to ensure that all of the intermediate and final results were correct for each family. This was no small feat considering there are 6 families after I added the negative binomial family and around 25 possible combinations of families and link functions. Figuring out the correct use of the estimated scale (or dispersion) parameters for each family was particularly challenging. As I mentioned, in the interest of time, I haven't written tests for the noncanonical links for each and every family, but the initial results look good and these tests will come.

GLM provided a good base for understanding the remaining code and allowed me to more or less plow through the robust linear models estimator. I had some mathematical difficulties in extending the models to include the correct covariance matrices, since there is no strong theoretical consensus on what they should actually be! More on RLM in my next post, but before then I'll just give a quick look at how the GLM estimator is used.

First, the algorithm described above has been made flexible enough to estimate truly binomial data. That is, we can have a vector with each row containing (# of successes, # of failures) as is the case in the star98 data from Dr. Gill, described and included in the models.datasets. It will be useful to have a look at the syntax for this type of data as it's slightly different than the other families.

In [1]: import models

In [2]: import numpy as np

In [3]: from models.datasets.star98.data import load

In [4]: data = load()

In [5]: data.exog = models.functions.add_constant(data.exog)

In [6]: trials = data.endog[:,:2].sum(axis=1)

In [7]: data.endog[0] # (successes, failures)
Out[7]: array([ 452., 355.])

In [8]: trials[0] # total trials
Out[8]: 807.0

In [9]: from models.glm import GLMtwo as GLM # the name will change

In [10]: bin_model = GLM(data.endog, data.exog, family=models.family.Binomial())

In [11]: bin_results = bin_model.fit(data_weights = trials)

In [12]: bin_results.params
Out[12]:
array([ -1.68150366e-02, 9.92547661e-03, -1.87242148e-02,
-1.42385609e-02, 2.54487173e-01, 2.40693664e-01,
8.04086739e-02, -1.95216050e+00, -3.34086475e-01,
-1.69022168e-01, 4.91670212e-03, -3.57996435e-03,
-1.40765648e-02, -4.00499176e-03, -3.90639579e-03,
9.17143006e-02, 4.89898381e-02, 8.04073890e-03,
2.22009503e-04, -2.24924861e-03, 2.95887793e+00])

The main difference between the above and the rest of the families is that you must specify the total number of "trials" (which as you can see is just the sum of successes and failures for each observation) as the data_weights argument to fit. This was done so that the current implementation of the Bernoulli family could be extended without rewriting its other derived functions. The code could easily be (and might be) extended to calculate these trials so that this argument doesn't need to be specified, but it's sometimes better to be explicit.

Saturday, July 11, 2009

GLM Residuals and The Beauty of Stats with Python + SciPy

I just finished including the Anscombe residuals for the families in the generalized linear models. The Anscombe residuals for the Binomial family were particularly tricky. It took me a while to work through the math and then figure out the SciPy syntax for what I need (some docs clarification coming...), but if I had had to implement these functions myself (presumably not in Python), it would have taken me more than a week! I thought it provided a good opportunity to introduce the residuals and to show off how easy things are in Python with NumPy/SciPy.

Note that there is not really a unified terminology for residual analysis in GLMs. I will try to use the most common names for these residuals in introducing the basics and point out any deviations from the norm both here and in the SciPy documentation.

In general, residuals should be defined such that their distribution is approximately normal. This is achieved through the use of linear equations, transformed linear equations, and deviance contributions. Below you will find the five most common types of residuals, four that rely mainly on transformations and one that relies on deviance contributions, though there are as many as 9+ different types of residuals in the literature.

Response Residuals

These are simply the observed response values minus the predicted values.

$r_{i}^{R}=y_{i}-\hat{\mu}_{i}$

In a classic linear model these are just the expected residuals

$Y-X\beta$

However, for GLM, these become

$Y-g^{-1}(X\beta)$

where $g(\cdot)$ is the link function that makes our model's linear predictor comparable to the response vector.

It is, however, common in GLMs to produce residuals that deviate greatly from the classical assumptions needed in residual analysis, so in addition we have these other residuals which attempt to mitigate deviations from the needed assumptions.

Pearson Residuals

The Pearson residuals are a version of the response residuals, scaled by the standard deviation of the prediction. The name comes from the fact that the sum of the squared Pearson residuals for a Poisson GLM is equal to Pearson's $\chi^{2}$ statistic, a goodness of fit measure.

$r_{i}^{P}=\frac{\left(y_{i}-\hat{\mu}_{i}\right)}{\sqrt{\text{VAR}[\hat{\mu}_{i}]}}$

The scaling allows plotting of these residuals versus an individual predictor or the outcome to identify outliers. The problem with these residuals though is that asymptotically they are normally distributed, but in practice they can be quite skewed leading to a misinterpretation of the actual dispersion.

Working Residuals

These are the difference between the working response and the linear predictor at convergence (i.e., the last step in the iterative process).

$r_{i}^{W}=\left(y_{i}-\hat{\mu}_{i}\right)\left.\left(\frac{\partial\eta}{\partial\mu}\right)_{i}\right|_{\mu=\hat{\mu}}$
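The three residuals defined so far can be computed in a few lines of NumPy. This is only a sketch for Bernoulli data with the logit link; the fitted means `mu_hat` are hypothetical values standing in for the output of a converged fit.

```python
import numpy as np

y = np.array([0., 1., 1., 0., 1.])
mu_hat = np.array([0.2, 0.7, 0.9, 0.4, 0.6])   # hypothetical fitted means

var = mu_hat * (1.0 - mu_hat)                  # binomial variance function
deta_dmu = 1.0 / (mu_hat * (1.0 - mu_hat))     # derivative of the logit link

resid_response = y - mu_hat                    # observed minus predicted
resid_pearson = resid_response / np.sqrt(var)  # scaled by the standard deviation
resid_working = resid_response * deta_dmu      # (y - mu) * d(eta)/d(mu)

# Sum of squared Pearson residuals, i.e. the Pearson chi-squared statistic
pearson_chi2 = np.sum(resid_pearson ** 2)
```

For this link the working residuals are just the Pearson residuals divided once more by the standard deviation, which shows how quickly they grow for fitted means near 0 or 1.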

Anscombe Residuals

Anscombe (1960, 1961) describes a general transformation $A(y)$ in place of $y$, chosen so that the residuals are as close to a normal distribution as possible. (Note: "There is a maddeningly great diversity of the forms that the Anscombe residuals take in the literature" (Gill 54, emphasis added). I have included the simplest as described below.) The function $A(\cdot)$ is given by

$A(\cdot)=\int\text{VAR}[\mu]^{-\frac{1}{3}}d\mu$

This is done for both the response and the predictions. This difference is then scaled by dividing by

$A^{\prime}\left(\hat{\mu}_{i} \right )\sqrt{\text{VAR}\left(\hat{\mu}_i \right )}$

so that the Anscombe Residuals are

$r_{i}^{A}=\frac{A\left(y_{i} \right )-A\left(\hat{\mu}_{i} \right )}{A^{\prime}\left(\hat{\mu}_{i}\right)\sqrt{\text{VAR}\left(\hat{\mu}_{i} \right )}}$

The Poisson distribution has variance function $\text{VAR}[\mu]=\mu$, so that its Anscombe residuals are simply

$\frac{3\left(y_{i}^{\frac{2}{3}}-\hat{\mu}_i^{\frac{2}{3}} \right )}{2\hat{\mu}_{i}^{\frac{1}{6}}}$
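As a quick check of this closed form, the sketch below compares it with the general definition using $A(\mu)=\frac{3}{2}\mu^{2/3}$, $A^{\prime}(\mu)=\mu^{-1/3}$ and $\text{VAR}[\mu]=\mu$. The function names and the fitted means are illustrative assumptions, not part of the models code.

```python
import numpy as np

def anscombe_general(y, mu, A, A_prime, var):
    """General Anscombe residual: (A(y) - A(mu)) / (A'(mu) * sqrt(VAR[mu]))."""
    return (A(y) - A(mu)) / (A_prime(mu) * np.sqrt(var(mu)))

def anscombe_poisson(y, mu):
    """Closed form for the Poisson family."""
    return 3.0 * (y ** (2 / 3.) - mu ** (2 / 3.)) / (2.0 * mu ** (1 / 6.))

y = np.array([2., 0., 5., 3., 1.])
mu_hat = np.array([1.5, 0.8, 4.2, 3.1, 1.4])   # hypothetical fitted means

general = anscombe_general(
    y, mu_hat,
    A=lambda m: 1.5 * m ** (2 / 3.),     # integral of mu^(-1/3)
    A_prime=lambda m: m ** (-1 / 3.),
    var=lambda m: m,                     # Poisson variance function
)
closed_form = anscombe_poisson(y, mu_hat)
```

The two arrays agree to floating point precision, since $A^{\prime}(\mu)\sqrt{\mu}=\mu^{1/6}$.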

Easy right? Sure was until I ran into the binomial distribution. The Anscombe residuals are built up in a different way for the binomial family. Indeed, the McCullagh and Nelder text does not even provide the general formula for the binomial Anscombe residuals and refers the reader to Anscombe (1953) and Cox & Snell (1968). The problem is that following this transformation for the binomial distribution leads to a rather nasty solution involving the hypergeometric 2F1 function or equivalently the incomplete beta function multiplied by the beta function as shown by Cox and Snell (1968).

Gill writes "A partial reason that Anscombe residuals are less popular than other forms is the difficulty in obtaining these tabular values." The tabular values to which he is referring are found in Cox and Snell (1968) p. 260. It is a table of the incomplete beta function that was tabulated numerically for an easy reference. How difficult would it be to get this table with NumPy and SciPy?

import numpy as np
from scipy import special

# regularized incomplete beta function I_x(2/3, 2/3)
betainc = lambda x: special.betainc(2/3.,2/3.,x)

# arguments 0.000, 0.001, ..., 0.499 laid out as a 50 x 10 table
table = np.arange(0,.5,.001).reshape(50,10)
results = []
for i in table:
    results.append(betainc(i))

results = np.asarray(results).reshape(50,10)

That's it!

Now the Anscombe residuals for the binomial distribution are

$r_{i}^{A}=\frac{\phi\left(\frac{Y_i}{n_i} \right )-\phi\left( \hat{\mu}_{i}\right )}{\hat{\mu}_{i}^{\frac{1}{6}}\left(1-\hat{\mu}_{i} \right )^{\frac{1}{6}}} \\ \text{ where } \phi\left(\mu\right)=\int_{0}^{\mu}t^{-\frac{1}{3}}\left(1-t \right )^{-\frac{1}{3}}dt=I_{\mu}\left(\frac{2}{3},\frac{2}{3} \right )B\left(\frac{2}{3},\frac{2}{3} \right )$

where n is the number of trials for each observation (i.e., 1, $n$, or $n_{i}$).

To implement this in the GLM binomial family class, I defined an intermediate function cox_snell similar to the above. It now looks like

from scipy import special
import numpy as np

def resid_anscombe(self, Y, mu):
    # Cox and Snell's phi: I_x(2/3, 2/3) * B(2/3, 2/3)
    cox_snell = lambda x: special.betainc(2/3.,2/3.,x)*special.beta(2/3.,2/3.)
    return np.sqrt(self.n)*(cox_snell(Y)-cox_snell(mu))/(mu**(1/6.)*(1-mu)**(1/6.))

We multiply the above formula by np.sqrt(self.n), the square root of the number of trials, in order to correct for possible use of proportional outcomes Y and mu.
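The identity behind cox_snell can be checked numerically. The sketch below compares the incomplete beta expression with direct quadrature of the integral and then evaluates a standalone version of the residual. The function names are hypothetical, and the fitted mean of 0.5 is an arbitrary illustrative value paired with the (452, 355) observation shown in the star98 example.

```python
import numpy as np
from scipy import integrate, special

def cox_snell(x):
    """phi(x) = I_x(2/3, 2/3) * B(2/3, 2/3), the unregularized incomplete beta."""
    return special.betainc(2 / 3., 2 / 3., x) * special.beta(2 / 3., 2 / 3.)

def phi_numeric(mu):
    """Direct quadrature of the integral of t^(-1/3) (1 - t)^(-1/3) from 0 to mu."""
    value, _ = integrate.quad(
        lambda t: t ** (-1 / 3.) * (1.0 - t) ** (-1 / 3.), 0.0, mu)
    return value

def resid_anscombe_binomial(y_prop, mu, n=1):
    """Standalone version of the residual, for proportions y_prop and n trials."""
    num = cox_snell(y_prop) - cox_snell(mu)
    return np.sqrt(n) * num / (mu ** (1 / 6.) * (1.0 - mu) ** (1 / 6.))

mu_grid = np.array([0.1, 0.3, 0.5])
closed = cox_snell(mu_grid)
numeric = np.array([phi_numeric(m) for m in mu_grid])

# One observation: 452 successes out of 807 trials, hypothetical fitted mean 0.5
resid = resid_anscombe_binomial(np.array([452 / 807.]), np.array([0.5]), n=807)
```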

Also, see the ticket here for a note about the incomplete beta function.

Deviance Residuals

One other important type of residual in GLMs is the deviance residual. The deviance residual is the most general and also the most useful of the GLM residuals. The IRLS algorithm (as will be shown in a future post) depends on the convergence of the deviance function. The deviance residual then is just the increment to the overall deviance of each observation.

$r_{i}^{D}=\text{sign}\left(y_i-\hat{\mu}_{i} \right )\sqrt{\hat{d}_i^{2}}$

where $\hat{d}_{i}^{2}$ are defined for each family.
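For 0/1 data, a sketch of the deviance residuals might look like the following, taking the per-observation deviance to be $-2\ln(1-\hat{\mu}_{i})$ when $y_{i}=0$ and $-2\ln\hat{\mu}_{i}$ when $y_{i}=1$. The function names and fitted means are illustrative assumptions.

```python
import numpy as np

def bernoulli_unit_deviance(y, mu):
    """Per-observation deviance contribution d_i^2 for 0/1 data."""
    return np.where(y == 1, -2.0 * np.log(mu), -2.0 * np.log(1.0 - mu))

def resid_deviance(y, mu):
    """Signed square root of each observation's deviance contribution."""
    return np.sign(y - mu) * np.sqrt(bernoulli_unit_deviance(y, mu))

y = np.array([0., 1., 1., 0., 1.])
mu_hat = np.array([0.2, 0.7, 0.9, 0.4, 0.6])   # hypothetical fitted means

r_dev = resid_deviance(y, mu_hat)
total_deviance = np.sum(r_dev ** 2)   # squares sum back to the overall deviance
```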

Note that most of these residuals also come in variations such as modified, standardized, studentized, and adjusted.

Selected References

Anscombe, FJ (1953). "Contribution to the discussion of H. Hotelling's paper." Journal of the Royal Statistical Society B, 15, 229-30.

Anscombe, FJ (1960). "Rejection of outliers." Technometrics, 2, 123-47.

Anscombe, FJ (1961). "Examination of residuals." Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press.

Cox, DR and Snell, EJ (1968). "A generalized definition of residuals." Journal of the Royal Statistical Society B, 30, 248-65.

Test Post

I am testing out my new ASCIIMathML script, so I can type LaTeX in blogger.

Note that viewing this page correctly requires Internet Explorer 6 + MathPlayer or Firefox/Mozilla/Netscape (with MathML fonts), so the equations won't work in a feed reader.

$A(\cdot)=\int\,$VAR$\left[\mu\right]^{-\frac{1}{3}}\, d\mu$

Well it appears to work pretty well, except the \text{} tag doesn't work. Hmm, though it should.

This works.
amath
` text(TEXT) `
endamath

This doesn't work.
$\text{TEXT}$

You can get the ASCIIMathML script here and follow these directions to add a javascript widget.

Another possible solution for typing equations in Blogger relies on typing LaTeX equations into a remote site. This didn't really appeal to me, as I don't foresee needing any overly complicated mathematical formulae, though it seems like a perfectly workable solution. You can read about it here.
[Edit: I might switch to this approach, since the equations don't display if you don't have the above mentioned requirements or are in a feed reader.]

Hat tip to Please Make a Note for the pointers.

Thursday, July 2, 2009

Generalized Linear Models

As I have mentioned, I have spent the last few weeks in stats books, finding my way around R, and cleaning up and refactoring the code for the generalized linear models in the NiPy models code. I have recently hit a wall in this code, so I am trying to clear out some unposted blog drafts. I intended for this post to introduce the generalized linear models approach to estimation; however, the full post will have to wait. For now, I will give an introduction to the theory and then explain where I am with the code.

Generalized linear models was a topic that was completely foreign to me a few weeks ago, but after a little (okay a lot of) reading the approach seems almost natural. I have found the following references useful:

Jeff Gill's Generalized Linear Models: A Unified Approach.

James Hardin and Joseph Hilbe's Generalized Linear Models and Extensions, 2nd edition.

P. McCullagh and John Nelder's Generalized Linear Models, 2nd edition.

The basic point of the generalized linear model is to extend the approach taken in classical linear regression to models that have more complex outcomes but ultimately share the linearity property. In this respect, GLM subsumes classical linear regression, probit and logit analysis, loglinear and multinomial response models, and some models that deal with survival data to name a few.

In my experience, I have found that econometrics is taught in a compartmentalized manner. This makes sense to a certain extent, as different estimators are tailored to particular problems and data. GLM on the other hand allows the use of a common technique for obtaining parameter estimates so that it can be studied as a single technique rather than as a collection of distinct approaches.

If interested in my ramblings, you can find a draft of my notes as an introduction to GLM here, as Blogger does not support LaTeX... Please note that this is a preliminary and incomplete draft (corrections and clarifications are very welcome). One thing it could definitely use is some clarification by example. However, as I noted, I have run into a bit of a wall trying to extend the binomial family to accept a vector of proportional data, and this is my intended example to walk through the theory and algorithm, so... a subsequent post will have to lay this out once I've got it sorted myself.

Generally speaking, there are two basic algorithms for GLM estimation: one is a maximum likelihood optimization based on Newton's method; the other is commonly referred to as iteratively (re)weighted least squares (IRLS or IWLS). Our implementation now only covers IRLS. As will be shown, the algorithm itself is pretty simple. It boils down to regressing the transformed (and updated) outcome variable on the untransformed design matrix weighted by the variance of the transformed observations. This is done until we have convergence of the deviance function (twice the log-likelihood ratio of the current and previous estimates). The problem that I am running into with updating the binomial family to accept proportional data (i.e., a vector of pairs (successes, total trials) instead of a vector of 1s and 0s for success or failure) is more mathematical than computational. I have either calculated the variance (and therefore the weights) incorrectly, or I am updating the outcome variable incorrectly. Of course, there's always the remote possibility that my data is not well behaved, but I don't think this is the case here.

More to come...

Project Status

I have been making slow but steady progress on the NiPy models code. Right now for the midterm review, we have been focusing on design issues including the user interface and refactoring, test coverage/bug fixing, and some extensions for postestimation statistics. Other than this, I have spent the last month or so with anywhere from ten to fifteen stats, econometrics, or numerical linear algebra and optimization texts open on my desk.

The main estimators currently included in the code are generalized least squares, ordinary least squares, weighted least squares, autoregressive AR(p), generalized linear models (with several available distribution families and corresponding link functions), robust linear models, generalized additive models, and mixed effects models. The test coverage is starting to look pretty good; after that, what remains is squashing the few remaining bugs and improving the postestimation statistics.

Some enhancements have also been made to the code. I have started to include some public domain or appropriately copyrighted datasets for testing purposes that could also be useful for examples and tutorials, so that every usage example doesn't have to start with generating your own random data. I have followed the datasets proposal in the Scikits Learn package pretty closely.

We have also decided to break from the formula framework that is used in NiPy. It was in flux (being changed to take advantage of SymPy the last I heard) and is intended to be somewhat similar to the formula framework in R. In its place for now, I have written some convenience functions to append a constant to a design matrix or to handle categorical variables for estimation. For the moment, a typical model/estimator is used as

In [1]: from models.regression import OLS

In [2]: from models.datasets.longley.data import load

In [3]: from models.functions import add_constant

In [4]: data = load()

In [5]: data.exog = add_constant(data.exog)

In [6]: model = OLS(data.endog, data.exog)

In [7]: results = model.fit()

In [8]: results.params
Out[8]:
array([ 1.50618723e+01, -3.58191793e-02, -2.02022980e+00,
-1.03322687e+00, -5.11041057e-02, 1.82915146e+03,
-3.48225863e+06])
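For readers without the models package at hand, the same workflow can be imitated in plain NumPy. The helper `append_constant` below is a hypothetical stand-in for add_constant, and placing the column of ones last is an inference from the output above, where the final coefficient looks like the intercept. The data are made up so that the fit is exact.

```python
import numpy as np

def append_constant(exog):
    """Hypothetical stand-in for add_constant: append a column of ones (last)."""
    exog = np.asarray(exog, dtype=float)
    if exog.ndim == 1:
        exog = exog[:, None]
    return np.column_stack((exog, np.ones(exog.shape[0])))

# Made-up data lying exactly on the line y = 3x + 2
x = np.array([0., 1., 2., 3., 4.])
endog = 3.0 * x + 2.0

exog = append_constant(x)
params = np.linalg.lstsq(exog, endog, rcond=None)[0]   # ordinary least squares
```

Here `params` holds the slope first and the intercept last, mirroring the ordering in the Longley results.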

Barring any unforeseen difficulties, the models code should be available as a standalone package shortly after the midterm evaluation, which is rapidly approaching in ten days. The second half of the summer will then be focused on optimizing the code, finalizing design issues, extending the models, and writing good documentation and tutorials so that the code can be included in SciPy!

Wednesday, July 1, 2009

Econometrics with Python

There is as yet no equivalent of R in applied econometrics. Therefore, the econometric community can still decide to go along the Python path.

That is Drs. Christine Choirat and Raffaello Seri writing in the April issue of the Journal of Applied Econometrics. They have been kind enough to provide me with an ungated copy of their review, "Econometrics with Python." Mentioning the, quite frankly, redundant general programming functions and tools that had to be implemented for R, the authors make a nice case for Python as the programming language of choice for applied econometrics. The article provides a quick overview of some of the advantages of using Python and its many built-in libraries, extensions, and tools, gives some speed comparisons, and also mentions a few of the many tools out there in the Python community for econometrics including RPy (RPy2 is now available), and of course NumPy and SciPy. Having spent the last week or more trying to master the basic syntax and usage of R, I very much sympathize with this position. The one complaint I hear most often from my fellow students is that Python is not an industry standard. I hope this can change and is changing, because it's much more of a pleasure to work with Python than the alternatives and that makes for increased productivity.
