## Thursday, July 2, 2009

### Generalized Linear Models

As I have mentioned, I have spent the last few weeks both in stats books, finding my way around R, and cleaning up and refactoring the code for the generalized linear models in the NiPy models code. I have recently hit a wall in this code, so I am trying to clear out some unposted blog drafts. I intended for this post to introduce the generalized linear models approach to estimation; however the full post will have to wait. For now, I will give an introduction to the theory and then explain where I am with the code.

Generalized linear models was a topic that was completely foreign to me a few weeks ago, but after a little (okay a lot of) reading the approach seems almost natural. I have found the following references useful:

• Jeff Gill's Generalized Linear Models: A Unified Approach.

• James Hardin and Joseph Hilbe's Generalized Linear Models and Extensions, 2nd edition.

• P. McCullagh and John Nelder's Generalized Linear Models, 2nd edition.

The basic point of the generalized linear model is to extend the approach taken in classical linear regression to models that have more complex outcomes but ultimately share the linearity property. In this respect, GLM subsumes classical linear regression, probit and logit analysis, loglinear and multinomial response models, and some models that deal with survival data to name a few.

In my experience, I have found that econometrics is taught in a compartmentalized manner. This makes sense to a certain extent, as different estimators are tailored to particular problems and data. GLM on the other hand allows the use of a common a technique for obtaining parameter estimates so that it can be studied as a single technique rather than as a collection of distinct approaches.

If interested in my ramblings, you can find a draft of my notes as an introduction to GLM here, as Blogger does not support LaTeX... Please note that this is a preliminary and incomplete draft (corrections and clarifications are very welcome). One thing it could definitely use is some clarification by example. However, as I noted, I have run into a bit of a wall trying to extend the binomial family to accept a vector of proportional data, and this is my intended example to walk through the theory and algorithm, so... a subsequent post will have to lay this out once I've got it sorted myself.

Generally speaking, there are two basic algorithms for GLM estimation: one is a maximum likelihood optimization based on Newton's method the other is commonly refered to as iteratively (re)weighted least squares (IRLS or IWLS). Our implementation now only covers IRLS. As will be shown, the algorithm itself is pretty simple. It boils down to regressing the transformed (and updated) outcome variable on the untransformed design matrix weighted by the variance of the transformed observations. This is done until we have convergence of the deviance function (twice the log-likelihood ratio of the current and previous estimates). The problem that I am running into with updating the binomial family to accept proportional data (ie., a vector of pairs (successes, total trials) instead of a vector of 1s and 0s for success or failure) is more mathematical than computational. I have either calculated the variance (and therefore the weights) incorrectly, or I am updating the outcome variable incorrectly. Of course, there's always the remote possibility that my data is not well behaved, but I don't think this is the case here.

More to come...