Tuesday, June 22, 2010

Statsmodels: GSoC Week 4 Update

I spent the last week finishing up the paper that I submitted to accompany my talk at the SciPy conference. I am really looking forward to going to Austin and hearing all the great talks (plus I hear the beer is cheap and the food and music are good, which doesn't hurt). In addition to finishing up the paper, I have started to clean up our time series code.

So far this has included finishing the augmented Dickey-Fuller (ADF) test for unit roots. The big time sink here is that the ADF test statistic has a non-standard distribution in most cases. The test statistic is obtained by running the following regression:

Δy_t = α + βt + γ y_{t-1} + δ_1 Δy_{t-1} + ... + δ_p Δy_{t-p} + ε_t

One approach to testing for a unit root is to test the t-statistic on γ, the coefficient on the lagged level of y.  The actual distribution of this statistic, however, is not Student's t.  Many software packages use the tables in Fuller (1976, or the updated 1996 edition) to get the critical values for the test statistic given the sample size, using linear interpolation for sample sizes not included in the table.  The p-values for the obtained test statistic are usually taken from MacKinnon's (1994) study, which estimated regression surfaces for these distributions via Monte Carlo simulation.

While we do use MacKinnon's approximate p-values from the 1994 paper, MacKinnon published a note in early 2010 that updates it with new regression surface results for obtaining the critical values, and we use these new results for the critical values.  Therefore, when using our ADF test, it is advised that if the p-value is close to the rejection threshold, the critical values be used in place of the p-value to make the final decision.

We can illustrate the use of the ADF test.  Note that this version is only in my branch and still lives in the sandbox, even though it is now tested, because the API and returned results may change.  We will demonstrate on a series that we can easily guess is non-stationary: real GDP.


In [1]: import scikits.statsmodels as sm

In [2]: from scikits.statsmodels.sandbox.tsa.stattools import adfuller

In [3]: data = sm.datasets.macrodata.load()

In [4]: realgdp = data.data['realgdp']

In [5]: adf = adfuller(realgdp, maxlag=4, autolag=None, regression="ct")

In [6]: adf
Out[6]:
(-1.8566384063254346,
 0.67682917510440099,
 4,
 198,
 {'1%': -4.0052351400496136,
  '10%': -3.1402115863254525,
  '5%': -3.4329000694218998})


The return values are the test statistic, its p-value (the null hypothesis here is that the series does contain a unit root), the number of lagged differences used, the number of observations used in the regression, and a dictionary containing the critical values at the respective significance levels.  The regression option controls the type of regression (i.e., whether to include a constant or a linear or quadratic time trend), and the autolag option offers three ways of choosing the lag length to help correct for serial correlation in the regression: 'AIC', 'BIC', and 't-stat'.  The former two choose the lag length that minimizes the information criterion; the latter chooses the lag length based on the significance of the last lag, starting with maxlag and working its way down.  The docstring has more detailed information.
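For instance, letting adfuller choose the lag length itself might look like the following minimal sketch (this is against the current sandbox signature, so the keyword names and the exact contents of the return tuple could still change):

# sketch: search lags 0 through maxlag and keep the one that minimizes the AIC
adf_aic = adfuller(realgdp, maxlag=4, autolag='AIC', regression='ct')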




Beyond this, I have been working on an autocorrelation function (acf), a partial autocorrelation function (pacf), and Q-statistics (the Box-Ljung test). Next up for this week is finishing my VAR class with identification schemes.  After that, I will work on integrating post-estimation tests into our results classes, most likely using some sort of mix-in classes, and on attaching containers for test results to the results objects.  Then it's off to the SciPy conference, where I will hopefully be participating in the stats sprint, helping out with the docs marathon, and discussing what we need for the future of statistics and Python.
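To give a rough idea of what the acf and the Q-statistic compute, here is a minimal numpy sketch (this is not the statsmodels implementation, which will also handle confidence intervals and other options): the sample autocorrelation at lag k is the lag-k autocovariance divided by the sample variance, and the Box-Ljung Q-statistic sums the squared autocorrelations with a small-sample correction.

import numpy as np

def sample_acf(x, nlags=10):
    # sample autocorrelations: lag-k autocovariance over the lag-0 autocovariance
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    acov0 = np.dot(x, x) / n
    return np.array([np.dot(x[:n-k], x[k:]) / n / acov0 for k in range(nlags + 1)])

def ljung_box_q(x, nlags=10):
    # Box-Ljung Q statistic: n*(n+2) * sum_k rho_k**2 / (n - k)
    n = len(x)
    rho = sample_acf(x, nlags)[1:]
    return n * (n + 2) * np.sum(rho**2 / (n - np.arange(1, nlags + 1)))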


Fuller, W.A. 1996. Introduction to Statistical Time Series. 2nd ed. Wiley.

MacKinnon, J.G. 1994. "Approximate Asymptotic Distribution Functions for Unit-Root and Cointegration Tests." Journal of Business and Economic Statistics 12, 167-76.

MacKinnon, J.G. 2010. "Critical Values for Cointegration Tests." Queen's University, Dept. of Economics, Working Papers. Available at http://ideas.repec.org/p/qed/wpaper/1227.html

Friday, June 11, 2010

Statsmodels: GSoC Week 3 Update

[Edit: Formatting should be fixed now. I will not be reformatting old posts though, so that they don't get reposted at Planet SciPy]



Last week was spent mainly ensuring that I pass my comps and remain a PhD student. This week was much more productive for coding. For now, all changes are in my branch and have not been merged to trunk, but I will describe the two big changes.



The first concerns the datasets package. This one is not all that exciting, but suffice it to say that the datasets are now streamlined and use the Bunch pattern to load the data. Thanks, Gaël, for pointing this out. I also rewrote a bit of David's datasets proposal from scikits-learn to reflect the current design of our datasets and our thinking about it. You can see it here (soon to be on the docs page). We are making an effort to ensure that our datasets are going to be similar to those of scikits-learn.



The second change was an improvement of the fitting of maximum likelihood models and the start of a GenericLikelihoodModel class. Maximum likelihood based models (mainly discrete choice models in the main code base right now) can now be fit using any of the unconstrained solvers from scipy.optimize (Nelder-Mead, BFGS, CG, Newton-CG, Powell) plus Newton-Raphson. To see how it works, we can fit a simple Probit model.



In [1]: import scikits.statsmodels as sm

In [2]: data = sm.datasets.spector.load()

In [3]: data.exog = sm.add_constant(data.exog)

In [4]: res_newton = sm.Probit(data.endog, data.exog).fit(method="newton")
Optimization terminated successfully.
Current function value: 12.818804
Iterations 6

In [5]: res_nm = sm.Probit(data.endog, data.exog).fit(method="nm", maxiter=500)
Optimization terminated successfully.
Current function value: 12.818804
Iterations: 439
Function evaluations: 735

In [6]: res_bfgs = sm.Probit(data.endog, data.exog).fit(method="bfgs")
Optimization terminated successfully.
Current function value: 12.818804
Iterations: 15
Function evaluations: 21
Gradient evaluations: 21

In [7]: res_cg = sm.Probit(data.endog, data.exog).fit(method="cg", maxiter=250)
Optimization terminated successfully.
Current function value: 12.818804
Iterations: 188
Function evaluations: 428
Gradient evaluations: 428

In [8]: res_ncg = sm.Probit(data.endog, data.exog).fit(method="ncg", avextol=1e-8)
Optimization terminated successfully.
Current function value: 12.818804
Iterations: 12
Function evaluations: 14
Gradient evaluations: 12
Hessian evaluations: 12

In [9]: res_powell = sm.Probit(data.endog, data.exog).fit(method="powell", ftol=1e-8)
Optimization terminated successfully.
Current function value: 12.818804
Iterations: 12
Function evaluations: 568




All of the options for the solvers are available and are documented in the fit method. As you can see, some of the default values need to be changed to ensure (accurate) convergence. The Results objects that are returned have two new attributes.




In [10]: res_powell.mle_retvals
Out[10]:
{'converged': True,
'direc': array([[ 7.06629660e-02, -3.07499922e-03, 5.38418734e-01,
-4.19910465e-01],
[ 0.00000000e+00, 1.00000000e+00, 0.00000000e+00,
0.00000000e+00],
[ 1.49194876e+00, -6.64992809e-02, -6.96792443e-03,
-3.22306873e+00],
[ -5.36227277e-02, 1.18544093e-01, -8.75205765e-02,
-2.42149981e+00]]),
'fcalls': 568,
'fopt': 12.818804069990534,
'iterations': 12,
'warnflag': 0}

In [11]: res_powell.mle_settings
Out[11]:
{'callback': None,
'disp': 1,
'fargs': (),
'ftol': 1e-08,
'full_output': 1,
'maxfun': None,
'maxiter': 35,
'optimizer': 'powell',
'retall': 0,
'start_direc': None,
'start_params': [0, 0, 0, 0],
'xtol': 0.0001}

The dict mle_retvals contains all of the values that are returned from the solver if the full_output keyword is True. The dict mle_settings contains all of the arguments passed to the solver, including the defaults so that these can be checked after the fit. Again, all settings and returned values are documented in the fit method and in the results class, respectively.
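One practical use of mle_retvals is a quick convergence check before looking at any results; a small sketch:

# sketch: verify the optimizer actually reports convergence before trusting the estimates
if not res_powell.mle_retvals['converged']:
    print("Warning: optimizer did not converge; try increasing maxiter or adjusting the tolerances")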



Lastly, I started a GenericLikelihoodModel class. This is currently unfinished, though the basic idea is laid out. Take again the Probit example above using Lee Spector's educational program data, and assume we didn't have the Probit model in statsmodels. We could use the new GenericLikelihoodModel class. There are two ways (probably more) to proceed. For those comfortable with object-oriented programming and inheritance in Python, we could subclass GenericLikelihoodModel and define our log-likelihood method.



from scikits.statsmodels import GenericLikelihoodModel as LLM
from scipy import stats
import numpy as np

class MyModel(LLM):
    def loglike(self, params):
        """
        Probit log-likelihood
        """
        q = 2*self.endog - 1
        X = self.exog
        return np.add.reduce(stats.norm.logcdf(q*np.dot(X, params)))




Now this model could be fit, using any of the methods that only require an objective function, i.e., Nelder-Mead or Powell.



In [43]: mod = MyModel(data.endog, data.exog)

In [44]: res = mod.fit(method="nm", maxiter=500)
Optimization terminated successfully.
Current function value: 12.818804
Iterations: 439
Function evaluations: 735

In [45]: res_nm.params
Out[45]: array([ 1.62580058, 0.05172931, 1.42632242, -7.45229725])

In [46]: res.params
Out[46]: array([ 1.62580058, 0.05172931, 1.42632242, -7.45229725])




The main drawback right now is that all statistics that rely on the covariance of the parameters, etc. will use numeric gradients and Hessians, which can lessen the accuracy of those statistics. This can be overcome by providing score and hessian methods, just as loglike was provided above. Of course, for more complicated likelihood functions this can soon become cumbersome. We are working towards more accurate numerical differentiation and discussing options for automatic or symbolic differentiation.
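For the probit, for example, the score and Hessian have simple closed forms, so the numerical derivatives can be avoided entirely. The following is only a sketch against the still-unfinished GenericLikelihoodModel interface, reusing the log-likelihood from MyModel above; the method names score and hessian mirror the existing likelihood models, but none of this is final:

from scikits.statsmodels import GenericLikelihoodModel as LLM
from scipy import stats
import numpy as np

class MyProbit(LLM):
    def loglike(self, params):
        # probit log-likelihood, as above
        q = 2*self.endog - 1
        return np.sum(stats.norm.logcdf(q*np.dot(self.exog, params)))

    def score(self, params):
        # analytic gradient: sum_i lambda_i * x_i, with
        # lambda_i = q_i * phi(q_i * x_i'b) / Phi(q_i * x_i'b)
        q = 2*self.endog - 1
        X = self.exog
        Xb = np.dot(X, params)
        lam = q * stats.norm.pdf(q*Xb) / stats.norm.cdf(q*Xb)
        return np.dot(lam, X)

    def hessian(self, params):
        # analytic Hessian: -X' diag(lambda_i * (lambda_i + x_i'b)) X
        q = 2*self.endog - 1
        X = self.exog
        Xb = np.dot(X, params)
        lam = q * stats.norm.pdf(q*Xb) / stats.norm.cdf(q*Xb)
        return -np.dot(X.T * (lam*(lam + Xb)), X)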



The main advantage, as opposed to just writing your likelihood and passing it to a solver, is that you have all of the (growing number of) statistics and tests available in statsmodels right there in the generic model.



I would also like to accommodate those who are less familiar with OOP and inheritance in Python. I haven't quite worked out the final design for this yet. Right now, you could do the following, though I don't think it quite meets the less-complicated goal.


In [4]: from scikits.statsmodels.model import GenericLikelihoodModel as LLM

In [5]: import scikits.statsmodels as sm

In [6]: from scipy import stats

In [7]: import numpy as np

In [8]:

In [9]: data = sm.datasets.spector.load()

In [10]: data.exog = sm.add_constant(data.exog)

In [11]:

In [12]: def probitloglike(params, endog, exog):
....: """
....: Log likelihood for the probit
....: """
....: q = 2*endog - 1
....: X = exog
....: return np.add.reduce(stats.norm.logcdf(q*np.dot(X,params)))
....:

In [13]: mod = LLM(data.endog, data.exog, loglike=probitloglike)

In [14]: res = mod.fit(method="nm", fargs=(data.endog,data.exog), maxiter=500)
Optimization terminated successfully.
Current function value: 12.818804
Iterations: 439
Function evaluations: 735

In [15]: res.params
Out[15]: array([ 1.62580058, 0.05172931, 1.42632242, -7.45229725])




There are still a few design issues and bugs that need to be worked out with the last example, but the basic idea is there. That's all for now.

Monday, May 31, 2010

Week 1 GSoC Update

Last week was the first of the Google Summer of Code. I spent most of the week in a Bayesian econometrics class led by John Geweke and studying for a comprehensive exam that I take this week, so progress on statsmodels was rather slow. That said, I have been able to take care of some low hanging fruit.

There are a few name changes:

statsmodels/family -> statsmodels/families
statsmodels/lib/io.py -> statsmodels/iolib/foreign.py

Also, Vincent has done a good bit of work on improving our output using the SimpleTable class from econpy. I will post some examples over the coming weeks, but SimpleTable provides an easy way to make tables in ASCII text, HTML, or LaTeX (a quick sketch follows the rename below). The SimpleTable class has been moved:

statsmodels/sandbox/output.py -> statsmodels/iolib/table.py
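As a teaser, here is a minimal sketch of the kind of thing SimpleTable can do; the constructor arguments follow the econpy version, so the exact signature in iolib may still be adjusted:

from scikits.statsmodels.iolib.table import SimpleTable

rows = [[1.30, 0.21], [-0.94, 0.35]]   # made-up numbers, for illustration only
tbl = SimpleTable(rows,
                  headers=['coef', 'std err'],
                  stubs=['x1', 'x2'],
                  title='Example table')
print(tbl.as_text())             # plain-text version
print(tbl.as_latex_tabular())    # LaTeX version; as_html() is also available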

Beyond the renames, I have removed the soft dependency on RPy for running our tests in favor of hard-coded results, refactored our tests, and added a few additional ones along the way.

We are also making an effort to keep our online documentation synced with the current trunk. The biggest change to our documentation is the addition of a developer's page for those who might like to get involved. As always, please report problems with the docs on the scipy-user list, or join the discussions of statsmodels, pandas, larry, and other topics on statistics and Python at the pystatsmodels Google group.

Saturday, May 1, 2010

Plans for the Summer

A quick update on the plans for statsmodels over the next few months.

I have been accepted for my second Google Summer of Code, which means that we will have a chance to make a big push to get a lot of our work out of the sandbox, tested, and included in the main code base.

You can see the roadmap on Google's GSoC site here. You might have to log in to view it.

The quick version follows. As far as general issues, I will be getting the code ready for Python 3 and focusing on some design issues, including an improved generic maximum likelihood framework, post-estimation testing, variable name handling, and output in text tables, LaTeX, and HTML. I will then be working to get a lot of our code out of the sandbox. This includes time series convenience functions and models such as GARCH, VARMA, the Hodrick-Prescott filter, and a state space model that uses the Kalman filter. I will be polishing the systems of equations framework and the panel (longitudinal) data estimators. We have also been working on some nonparametric estimators, including univariate kernel density estimators and kernel regression estimators. Finally, as part of my coursework I have been working toward (generalized) maximum entropy models that I hope to include, as well as some work on the scipy.maxentropy module.

I will give a quick talk on the project for the SciPy Conference in Austin.

It looks like we are set to make a good deal of progress on the code this summer.

Monday, March 8, 2010

Sparse Least Squares Solver

I have a homework assignment involving some Monte Carlo experiments with an autoregressive process of order 1, and I thought I would use it as an example to demonstrate the sparse least squares solver that Stefan committed to scipy in revision 6251.

All mistakes are mine...

Given an AR(1) process

y_t = ρ y_{t-1} + ε_t

(with ρ = 0.9 and normally distributed errors in the simulation below), we can estimate the autoregressive coefficient for each simulated series by least squares.

In [1]: import numpy as np

In [2]: np.set_printoptions(threshold=25)

In [3]: np.random.seed(1)

In [4]: # make 1000 autoregressive series of length 100

In [5]: # y_0 = 0 by assumption

In [6]: samples = np.zeros((100,1000))

In [7]: for i in range(1,100):
...: error = np.random.randn(1,1000)
...: samples[i] = .9 * samples[i-1] + .01 * error
...:

In [8]: from scipy import sparse

In [9]: from scipy.sparse import linalg as spla

In [10]: # make block diagonal sparse matrix of y_t-1

In [11]: # recommended to build as a linked list

In [12]: spX = sparse.lil_matrix((99*1000,1000))

In [13]: for i in range(1000):
....: spX[i*99:(i+1)*99,i] = samples[:-1,i][:,None]
....:

In [14]: spX = spX.tocsr() # convert it to csr for performance

In [15]: # do the least squares

In [16]: retval = spla.isolve.lsqr(spX, samples[1:,:].ravel('F'), calc_var=True)
In [17]: retval[0]
Out[17]:
array([ 0.88347438, 0.8474124 , 0.85282674, ..., 0.91019165,
0.89698465, 0.76895806])

I'm curious whether there's any downside to using sparse least squares whenever the design matrix of a least squares problem can be written in block diagonal form.
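For comparison, the same per-series estimates can be computed one small dense regression at a time; a quick sketch reusing the samples array from above:

import numpy as np

# per-series OLS slope without an intercept: rho_hat = x'y / x'x
rho_dense = np.empty(1000)
for i in range(1000):
    x = samples[:-1, i]   # y_{t-1}
    y = samples[1:, i]    # y_t
    rho_dense[i] = np.dot(x, y) / np.dot(x, x)

# these should match retval[0] from the sparse solver (up to lsqr's tolerance)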

Tuesday, February 16, 2010

scikits.statsmodels 0.2.0 release

While I find no time to blog, I thought I'd post our newest release announcement here.

We are happy to announce the 0.2.0 (beta) release of scikits.statsmodels. This is both a bug-fix and new feature release.

Download
------------

You can easy_install the package (PyPI URL:
http://pypi.python.org/pypi/scikits.statsmodels/)

Source downloads: http://sourceforge.net/projects/statsmodels/

Development branches: http://code.launchpad.net/statsmodels

Note that the trunk branch on Launchpad is almost always stable and has the most up-to-date changes, since our releases are few and far between.

Documentation
------------------

http://statsmodels.sourceforge.net/

We invite you to install, kick the tires, and make bug reports and feature requests.

Feedback can either be on scipy-user or the mailing list at
http://groups.google.com/group/pystatsmodels?hl=en
Bug tracker: https://bugs.launchpad.net/statsmodels

Main Changes in 0.2.0
---------------------------------

* Improved and expanded documentation and more examples
* Added four discrete choice models: Poisson, Probit, Logit, and Multinomial Logit.
* Added PyDTA: tools for reading Stata binary datasets (*.dta) into numpy arrays.
* Added four new datasets for examples and tests.
* Results classes have been refactored to use lazy evaluation.
* Improved support for maximum likelihood estimation.
* Bug fixes
* Renames for more consistency:
  RLM.fitted_values -> RLM.fittedvalues
  GLMResults.resid_dev -> GLMResults.resid_deviance

Sandbox
-------------

We are continuing to work on support for systems of equations models, panel data models, time series analysis, and information and entropy econometrics in the sandbox. This code is often merged into trunk as it becomes more robust.