Thursday, July 2, 2009

Project Status

I have been making slow but steady progress on the NiPy models code. For the midterm review, we have been focusing on design issues (including the user interface and refactoring), test coverage and bug fixing, and some extensions for postestimation statistics. Beyond that, I have spent the last month or so with anywhere from ten to fifteen statistics, econometrics, numerical linear algebra, and optimization texts open on my desk.

The main estimators currently included in the code are generalized least squares, ordinary least squares, weighted least squares, autoregressive AR(p) models, generalized linear models (with several available distribution families and corresponding link functions), robust linear models, generalized additive models, and mixed effects models. The test coverage is starting to look pretty good; what remains is squashing the last few bugs and improving the postestimation statistics.

Some enhancements have also been made to the code. I have started to include some public domain or appropriately copyrighted datasets for testing purposes that could also be useful for examples and tutorials, so that every usage example doesn't have to start with generating your own random data. I have followed the datasets proposal in the Scikits Learn package pretty closely.

We have also decided to break from the formula framework that is used in NiPy. It was in flux (being changed to take advantage of SymPy, the last I heard) and is intended to be somewhat similar to the formula framework in R. In its place for now, I have written some convenience functions to append a constant to a design matrix or to handle categorical variables for estimation. For the moment, a typical model/estimator is used as follows:

In [1]: from models.regression import OLS

In [2]: from models.datasets.longley.data import load

In [3]: from models.functions import add_constant

In [4]: data = load()

In [5]: data.exog = add_constant(data.exog)

In [6]: model = OLS(data.endog, data.exog)

In [7]: results = model.fit()

In [8]: results.params
Out[8]:
array([ 1.50618723e+01, -3.58191793e-02, -2.02022980e+00,
       -1.03322687e+00, -5.11041057e-02,  1.82915146e+03,
       -3.48225863e+06])
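
To give a rough idea of what the convenience functions mentioned above might do, here is a minimal NumPy-only sketch. It is not the code in the models package: the names append_constant and dummy_columns are hypothetical, the data are made up, and the fit uses np.linalg.lstsq rather than the OLS class. The constant is appended as the last column, which matches the session above, where the intercept is the last entry of results.params.

```python
import numpy as np


def append_constant(exog):
    """Hypothetical helper: append a column of ones as the last column."""
    exog = np.asarray(exog, dtype=float)
    if exog.ndim == 1:
        exog = exog[:, None]
    ones = np.ones((exog.shape[0], 1))
    return np.hstack([exog, ones])


def dummy_columns(labels, drop_first=True):
    """Hypothetical helper: 0/1 indicator columns for a categorical variable."""
    labels = np.asarray(labels)
    levels = np.unique(labels)  # sorted unique categories
    dummies = (labels[:, None] == levels[None, :]).astype(float)
    if drop_first:
        # Drop one level so the dummies are not collinear with the constant.
        dummies = dummies[:, 1:]
        levels = levels[1:]
    return dummies, levels


# Made-up data: y depends on x, on membership in group "b", and on a constant.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
group = np.array(["a", "b", "a", "b", "a", "b"])
y = 2.0 * x + 3.0 * (group == "b") + 1.0

dummies, levels = dummy_columns(group)
design = append_constant(np.column_stack([x, dummies]))

# Ordinary least squares via NumPy's least-squares solver.
params, _, rank, _ = np.linalg.lstsq(design, y, rcond=None)
print(params)  # order: slope on x, effect of group "b", constant
```

Because the made-up y is an exact linear function of the design columns, the fitted parameters recover the coefficients used to build it, up to floating-point error.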

Barring any unforeseen difficulties, the models code should be available as a standalone package shortly after the midterm evaluation, which is rapidly approaching in ten days. The second half of the summer will then be focused on optimizing the code, finalizing design issues, extending the models, and writing good documentation and tutorials so that the code can be included in SciPy!