## Saturday, July 25, 2009

### Iterated Reweighted Least Squares

I have spent the last two weeks putting the "finishing" touches on the generalized linear models and starting to go over the robust linear models (RLM). The test suite for GLM is not complete yet, but all of the exponential families are covered with at least their default link functions tested and are looking good. So in an effort to make a first pass over all of the existing code, I moved on to RLM. After the time spent with GLM theory, the RLMs theory and code was much more manageable.

Before discussing the RLMs, their implementation, and the extensions I have made. I will describe the iterated reweighted least squares (IRLS) algorithm for the GLMs to demonstrate the theory and the solution method in the models code. A very similar iteration is done for the RLMs as well.

The main idea of GLM, as noted, is to relate a response variable to a linear model via a link function, which allows us to use least squares regression. Let us take as an example, the binomial family (which is written to handle Bernoulli and binomial data). In this case, the response variable is Bernoulli, 1 indicates a "success" and 0 a "failure".

The default link for the binomial family is the logit link. So we have

$\eta=g(u)=\ln\frac{\mu}{1-\mu}$

η is our linear predictor and μ is our actual mean response. The first thing that we need for the algorithm is to compute a first guess on μ (IRLS as opposed to Newton-Raphson makes a guess on the mean response rather than the parameter vector β). The algorithm is fairly robust to this first guess; however, a common choice is

$\mu=\frac{y+\bar{y}}{2}$

For the binomial family, we specifically use

$\mu=\frac{y+.5}{2}$

where y is our given response variable. We then use this first guess to initialize our linear predictor η via the link function given above. With these estimates we are able to start the iteration. Our convergence criteria is based on the deviance function, which is simply twice the log-likelihood ratio of our current guess on the fitted model versus the saturated log-likelihood.

$D=2\phi\left[ \mathcal{L}\left(y;y \right ) - \mathcal{L}\left(y;\mu \right )\right ]$

Where Φ is a dispersion (scale) parameter. Note that the saturated log-likelihood is simply the likelihood of the perfectly fitted model where y = μ. For the binomial family the deviance function is

$\sum_{i}D_{i}=\begin{cases} -2\ln\left(1-\mu_{i}\right) & \text{, if }y_{i}=0\\ -2\ln\mu_{i} & \text{, if }y_{i}=1\end{cases}$

The iteration continues while the deviance function evaluated at the updated μ differs from the previous by the given convergence tolerance (default is 1e-08) and the number of iterations is less than the given maximum (default is 100).

The actual iterations, as the name of the algorithm suggests, run a weighted least squares fit of the actual regressors on the adjusted linear predictor (our transformed guess on the response variable). The adjustment is given by

$\eta=\frac{\partial\eta}{\partial\mu}\left(y-\mu \right ) \\ \text{Recalling that }\eta=g(\mu)$

which moves us closer to the root of the estimating equation (see the previously mentioned Gill or Hardin and Hilbe for the details of root-finding using the Newton-Raphson method. IRLS is simply Newton-Raphson using some simplifications.). The only remaining ingredient is the choice of weights for the weighted least squares. Similar to other methods such as general least squares (GLS) that correct for heteroskedasticity in the error terms, a diagonal matrix containing estimates of the variance of the error terms is used. However, in our case, the exact form of this variance is unknown and difficult to estimate, but the error each estimate is assumed to vary with the mean response variable. Thus, we improve the weights at each step given our best guess for the mean response variable μ and the known variance function of each family. For the binomial family, as is well known, the variance function is

$V_{i}=\mu(1-\mu)$

Thus we can obtain an estimate of the parameters using weighted least squares. Using these estimates we update η

$\eta=X\beta + \left(\text{offset}\right)$

Correcting by an offset if one is specified (the code currently does not support the use of offsets, but this would be a simple extension). Using this linear predictor (remember it was originally given by the linear transformation η = g(μ)) we update our guess on the mean response variable μ and use this to update our estimate of the deviance function. This is continued until convergence is reached.

Once the fit is obtained, the covariance matrix is obtained as always though in the case of GLM it is weighted by an estimate of the scale chosen when the fit method is called. The default is Pearson's Chi-Squared. The standard errors of the estimate are obtained from the diagonal elements of this matrix.

And that's basically it. Much of my time with the GLM code was spent getting my head around the theory and then this algorithm. Then I had to obtain data (Jeff Gill was kind enough to give permission for SciPy to use and distribute the datasets from his nice monograph) and write tests to ensure that all of the intermediate and final results were correct for each family. This was no small feat considering there are 6 families after I added the negative binomial family and around 25 possible combinations of families and link functions. Figuring out the correct use of the estimated scale (or dispersion) parameters for each family was particularly challenging. As I mentioned, in the interest of time, I haven't written tests for the noncanonical links for each and every family, but the initial results look good and these tests will come.

GLM provided a good base for understanding the remaining code and allowed me to more or less plow through the robust linear models estimator. I had some mathematical difficulties in extending the models to include the correct covariance matrices, since there is no strong theoretical consensus on what they should actually be! More on RLM in my next post, but before then I'll just give a quick look at how the GLM estimator is used.

First, the algorithm described above has been made flexible enough to estimate truly binomial data. That is, we can have a vector with each row containing (# of successes, # of failures) as is the case in the star98 data from Dr. Gill, described and included in the models.datasets. It will be useful to have a look at the syntax for this type of data as it's slightly different than the other families.

In [1]: import models

In [2]: import numpy as np

In [3]: from models.datasets.star98.data import load

In [6]: trials = data.endog[:,:2].sum(axis=1)

In [7]: data.endog[0] # (successes, failures)
Out[7]: array([ 452., 355.])

In [8]: trials[0] # total trials
Out[8]: 807.0

In [9]: from models.glm import GLMtwo as GLM # the name will change

In [10]: bin_model = GLM(data.endog, data.exog, family=models.family.Binomial())

In [11]: bin_results = bin_model.fit(data_weights = trials)

In [12]: bin_results.params
Out[12]:
array([ -1.68150366e-02, 9.92547661e-03, -1.87242148e-02,
-1.42385609e-02, 2.54487173e-01, 2.40693664e-01,
8.04086739e-02, -1.95216050e+00, -3.34086475e-01,
-1.69022168e-01, 4.91670212e-03, -3.57996435e-03,
-1.40765648e-02, -4.00499176e-03, -3.90639579e-03,
9.17143006e-02, 4.89898381e-02, 8.04073890e-03,
2.22009503e-04, -2.24924861e-03, 2.95887793e+00])

The main difference between the above and the rest of the families is that you must specify the total number of "trials" (which as you can see is just the sum of success and failures for each observation) as the data_weights argument to fit. This was done so that the current implementation of the Bernouilli family could be extended without rewriting its other derived functions. The code could easily (and might be) extended to calculate these trials so that this argument doesn't need to be specified, but it's sometimes better to be explicit.