Monday, August 31, 2009

scikits.statsmodels Release Announcement

We have been working hard to get the statsmodels code ready for general consumption, and we're happy to announce that a (very) beta release is ready.


The statsmodels code was started by Jonathan Taylor and was formerly included as part of scipy. It was taken up to be tested, corrected, and extended as part of the Google Summer of Code 2009.

What it is

We are now releasing the efforts of the last few months under the scikits namespace as scikits.statsmodels. Statsmodels is a pure Python package that requires numpy and scipy. It offers a convenient interface for fitting parameterized statistical models with growing support for displaying univariate and multivariate summary statistics, regression summaries, and (postestimation) statistical tests.

Main Features

* regression: Generalized least squares (including weighted least squares and least squares with autoregressive errors), ordinary least squares.
* glm: Generalized linear models with support for all of the one-parameter exponential family distributions.
* rlm: Robust linear models with support for several M-estimators.
* datasets: Datasets to be distributed and used for examples and in testing.
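
To make the regression entry above more concrete, here is a minimal sketch of what ordinary and weighted least squares estimate for a single regressor. It uses only the Python standard library and is not the statsmodels API; the function name fit_line and its arguments are hypothetical and exist only for this illustration.

```python
def fit_line(x, y, weights=None):
    """Fit y = intercept + slope * x by (weighted) least squares.

    Hypothetical helper for illustration only; not a statsmodels function.
    With weights=None every observation counts equally, which is
    ordinary least squares.
    """
    if weights is None:
        weights = [1.0] * len(x)
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")

    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    # Weighted means of the regressor and the response.
    x_mean = sum(w * xi for w, xi in zip(weights, x)) / total
    y_mean = sum(w * yi for w, yi in zip(weights, y)) / total

    # Weighted sum of squares of x and weighted cross-product of x and y.
    sxx = sum(w * (xi - x_mean) ** 2 for w, xi in zip(weights, x))
    if sxx == 0:
        raise ValueError("x has no weighted variation; slope is undefined")
    sxy = sum(w * (xi - x_mean) * (yi - y_mean)
              for w, xi, yi in zip(weights, x, y))

    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Ordinary least squares: all observations weighted equally.
print(fit_line([0, 1, 2], [0, 1, 5]))

# Weighted least squares: a zero weight drops the last observation.
print(fit_line([0, 1, 2], [0, 1, 5], weights=[1, 1, 0]))
```

The package itself works with numpy arrays and handles several regressors at once; the closed-form version here only shows how the weights enter the estimate.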

There is also a sandbox, which contains code for generalized additive models (untested), mixed effects models and the Cox proportional hazards model (both untested and still dependent on the nipy formula framework), generating descriptive statistics, and printing table output to ASCII, LaTeX, and HTML. None of this code is considered "production ready".

Where to get it

Development branches will be on LaunchPad. This is where to go to get the most up-to-date code in the trunk branch. Experimental code will also be hosted here in different branches.

Source download of stable tags will be on SourceForge.




License

Simplified BSD

Documentation

The official documentation is hosted on SourceForge.

The sphinx docs are currently undergoing a lot of work. They are not yet comprehensive, but should get you started.

This blog will continue to be updated as we make progress on the code.

Discussion and Development

All chatter will take place on the SciPy mailing lists (such as scipy-user). We are very interested in receiving feedback about usability, suggestions for improvements, and bug reports via the mailing list or the bug tracker.


  1. As a social scientist, econometrician, and wannabe programmer, I'm hoping Python will eventually supplant R as the predominant language for econometric and statistical analysis. I detest R's syntax, but end up using it anyway since R has a package for almost any econometric issue I confront. So to me, the development of the statsmodels project and the other Python tools is very exciting. Stata is by far the predominant statistical tool in economics because it is oriented specifically toward econometrics and has simple Mata commands that cover the bulk of the routine stuff we do and its "do-file" feature makes work easily replicable. Yet Stata is expensive, its code is proprietary, and its graphical power remains depressingly low despite some improvement in recent versions. The consensus among many economists seems to be that our ideal workflow would retain much of Stata's simplicity but include enhanced integration with Python for graphics in matplotlib and for other tasks like data manipulation and parsing, which are clumsy to program in Stata's Mata language. Many economists currently use Matlab for these functions to compensate for Stata's weaknesses, but Matlab is prohibitively expensive and not demonstrably superior to what one can accomplish with Python. Python seems like a better language than the other tool social scientists turn to for these functions, R--the Python language is just as powerful but is simpler and cleaner--yet statisticians and economists continue to work in R rather than develop Python solutions. Strikes me as odd.

    Out of curiosity, where do you see Python--statsmodels/SciPy/NumPy/etc--heading over the next few years in terms of econometrics and statistics? Here's hoping that Python supplants R as the premier language for statistical, econometric, and general scientific analysis.

    Thanks again for your great work on this project.


  2. I just wrote a long comment that got eaten by blogger, so I'll try to recap briefly.

    I agree with you 100%. I use mainly Python, Stata, and Matlab. I haven't found anything in Matlab that I can't do in Python (except having bindings for Dynare, which I understand are on the way). I find Mata to be a poor man's C with some convenience functions.

    As for the future, I think we are poised to make great strides on the statistics capabilities of SciPy. For statsmodels specifically, after this summer's Google Summer of Code, we should have a good bit of the "standard" econometrician's library. There is also some work going on for a general symbolic formula framework, similar to R's, more for the statisticians and experimental guys to make designs, contrasts, factors, levels, etc. If you want to follow the discussion more closely, I encourage you to join our mailing list at the pystatsmodels Google group. There is also some chatter on the scipy-user and scipy-dev mailing lists about the future of SciPy and stats, and we hope to have some good conversations at the SciPy conference in a few weeks.

  3. Very cool--I'm looking forward to the project's development. I'll lurk around the pystatsmodels google group for updates. Thanks for the great work on this.