Tuesday, June 22, 2010

Statsmodels: GSoC Week 4 Update

I spent the last week finishing up the paper that I submitted to accompany my talk at the SciPy conference. I am really looking forward to going to Austin and hearing all the great talks (plus I hear the beer is cheap and the food and music are good, which doesn't hurt). In addition to finishing up the paper, I have started to clean up our time series code.

So far this has included finishing the augmented Dickey-Fuller (ADF) test for unit roots. The big time sink here is that the ADF test statistic has a non-standard distribution in most cases. The test statistic is obtained by running the following regression:

Δy_t = α + βt + γy_{t-1} + δ_1 Δy_{t-1} + ... + δ_p Δy_{t-p} + ε_t

One approach to testing for a unit root is to test the t-statistic on γ, the coefficient on the lagged level of y. The actual distribution of this statistic, however, is not Student's t. Many software packages use the tables in Fuller (1976, or the updated 1996 edition) to obtain critical values for the test statistic given the sample size, using linear interpolation for sample sizes not included in the table. The p-values for the obtained test statistic are usually based on MacKinnon's (1994) study, which estimated regression surfaces for these distributions via Monte Carlo simulation.

While we do use MacKinnon's approximate p-values from the 1994 paper, MacKinnon wrote a note in early 2010 updating that paper with new regression surface results, and it is these newer results that we use for the critical values. Since the p-values come from the older study while the critical values come from the 2010 update, it is advised that if the p-value is close to the reject/accept boundary, the critical values should be used in place of the p-value to make the ultimate decision.

We can illustrate the use of ADF. Note that this version is only in my branch and is still in the sandbox, even though it has now been tested, because the API and returned results may change. We will demonstrate on a series that we can easily guess is non-stationary: real GDP.


In [1]: import scikits.statsmodels as sm

In [2]: from scikits.statsmodels.sandbox.tsa.stattools import adfuller

In [3]: data = sm.datasets.macrodata.load()

In [4]: realgdp = data.data['realgdp']

In [5]: adf = adfuller(realgdp, maxlag=4, autolag=None, regression="ct")

In [6]: adf
Out[6]:
(-1.8566384063254346,
 0.67682917510440099,
 4,
 198,
 {'1%': -4.0052351400496136,
  '10%': -3.1402115863254525,
  '5%': -3.4329000694218998})


The return values are the test statistic, its p-value (the null hypothesis here is that the series does contain a unit root), the number of lagged differences used, the number of observations used in the regression, and a dictionary containing the critical values at the respective significance levels. The regression option controls the type of regression (i.e., whether to include a constant or a linear or quadratic time trend), and the autolag option offers three ways of choosing the lag length used to correct for serial correlation in the regression: 'AIC', 'BIC', and 't-stat'. The first two choose the lag length that minimizes the information criterion; the last chooses the lag length based on the significance of the highest lag, starting with maxlag and working its way down. The docstring has more detailed information.
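As a quick illustration of how to read this output (the unpacked names below are just for exposition), the decision can come either from comparing the statistic to the critical values or from the approximate p-value:

# Unpack the tuple returned by adfuller above (names are illustrative)
adf_stat, pvalue, usedlag, nobs, crit = adf

# The ADF test rejects the unit-root null only when the statistic is more
# negative than the critical value; here -1.86 > -3.43, so we cannot reject
# at the 5% level, consistent with the approximate p-value of 0.68.
reject_5pct = adf_stat < crit['5%']    # False
reject_by_pvalue = pvalue < 0.05       # False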




Beyond this, I have been working on an autocorrelation function (acf), a partial autocorrelation function (pacf), and Q-statistics (the Ljung-Box test). Next up for this week is finishing my VAR class with identification schemes. After that, I will work on integrating post-estimation tests into our results classes, most likely using some sort of mix-in classes, and attach containers for test results to the results objects. Then it's off to the SciPy conference, where I will hopefully be participating in the stats sprint, helping out with the docs marathon, and discussing what we need for the future of statistics and Python.
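Since the sample acf and the Ljung-Box Q-statistic are standard textbook quantities, here is a rough numpy sketch of what these functions compute (the names below are mine for illustration, not the statsmodels API):

import numpy as np

def sample_acf(y, nlags):
    # Sample autocorrelations at lags 0..nlags (illustration only)
    y = np.asarray(y, dtype=float) - np.mean(y)
    n = len(y)
    denom = np.dot(y, y)
    return np.array([np.dot(y[k:], y[:n-k]) / denom for k in range(nlags + 1)])

def ljung_box(y, nlags):
    # Ljung-Box Q-statistic over lags 1..nlags (illustration only)
    n = len(y)
    rho = sample_acf(y, nlags)[1:]    # drop lag 0
    return n * (n + 2) * np.sum(rho**2 / (n - np.arange(1, nlags + 1)))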


Fuller, W.A. 1996. Introduction to Statistical Time Series. 2nd ed. Wiley.

MacKinnon, J.G. 1994. "Approximate Asymptotic Distribution Functions for Unit-Root and Cointegration Tests." Journal of Business and Economic Statistics 12, 167-76.

MacKinnon, J.G. 2010. "Critical Values for Cointegration Tests." Queen's University, Dept. of Economics, Working Papers. Available at http://ideas.repec.org/p/qed/wpaper/1227.html

Friday, June 11, 2010

Statsmodels: GSoC Week 3 Update

[Edit: Formatting should be fixed now. I will not be reformatting old posts though, so that they don't get reposted at Planet SciPy]



Last week was spent mainly ensuring that I pass my comps and remain a PhD student. This week was much more productive for coding. For now, all changes are in my branch and have not been merged to trunk, but I will describe the two big changes.



The first concerns the datasets package. This one is not all that exciting, but suffice it to say that the datasets are now streamlined and use the Bunch pattern to load the data. Thanks, Gaël, for pointing this out. I also rewrote a bit of David's datasets proposal from scikits-learn to reflect the current design of our datasets and our thinking on it. You can see it here (soon to be on the docs page). We are making an effort to ensure that our datasets remain similar to those of scikits-learn.
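For reference, the Bunch idea is simply a dict whose values are also reachable as attributes; a minimal sketch of the pattern (not the actual statsmodels code) looks like this:

class Bunch(dict):
    # Dict with attribute access -- minimal illustration of the pattern
    def __init__(self, **kwargs):
        dict.__init__(self, kwargs)
        self.__dict__ = self

data = Bunch(endog=[1., 0., 1.], exog=[[2.66], [2.89], [3.28]])
data.endog is data['endog']    # True -- same object either way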



The second change was an improvement of the fitting of maximum likelihood models and the start of a GenericLikelihoodModel class. Maximum likelihood based models (mainly discrete choice models in the main code base right now) can now be fit using any of the unconstrained solvers from scipy.optimize (Nelder-Mead, BFGS, CG, Newton-CG, Powell) plus Newton-Raphson. As a simple example of how this works, we can fit a Probit model.



In [1]: import scikits.statsmodels as sm

In [2]: data = sm.datasets.spector.load()

In [3]: data.exog = sm.add_constant(data.exog)

In [4]: res_newton = sm.Probit(data.endog, data.exog).fit(method="newton")
Optimization terminated successfully.
Current function value: 12.818804
Iterations 6

In [5]: res_nm = sm.Probit(data.endog, data.exog).fit(method="nm", maxiter=500)
Optimization terminated successfully.
Current function value: 12.818804
Iterations: 439
Function evaluations: 735

In [6]: res_bfgs = sm.Probit(data.endog, data.exog).fit(method="bfgs")
Optimization terminated successfully.
Current function value: 12.818804
Iterations: 15
Function evaluations: 21
Gradient evaluations: 21

In [7]: res_cg = sm.Probit(data.endog, data.exog).fit(method="cg", maxiter=250)
Optimization terminated successfully.
Current function value: 12.818804
Iterations: 188
Function evaluations: 428
Gradient evaluations: 428

In [8]: res_ncg = sm.Probit(data.endog, data.exog).fit(method="ncg", avextol=1e-8)
Optimization terminated successfully.
Current function value: 12.818804
Iterations: 12
Function evaluations: 14
Gradient evaluations: 12
Hessian evaluations: 12

In [9]: res_powell = sm.Probit(data.endog, data.exog).fit(method="powell", ftol=1e-8)
Optimization terminated successfully.
Current function value: 12.818804
Iterations: 12
Function evaluations: 568




All of the options for the solvers are available and are documented in the fit method. As you can see, some of the default values need to be changed to ensure (accurate) convergence. The Results objects that are returned have two new attributes.




In [10]: res_powell.mle_retvals
Out[10]:
{'converged': True,
 'direc': array([[  7.06629660e-02,  -3.07499922e-03,   5.38418734e-01,
          -4.19910465e-01],
        [  0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
           0.00000000e+00],
        [  1.49194876e+00,  -6.64992809e-02,  -6.96792443e-03,
          -3.22306873e+00],
        [ -5.36227277e-02,   1.18544093e-01,  -8.75205765e-02,
          -2.42149981e+00]]),
 'fcalls': 568,
 'fopt': 12.818804069990534,
 'iterations': 12,
 'warnflag': 0}

In [11]: res_powell.mle_settings
Out[11]:
{'callback': None,
'disp': 1,
'fargs': (),
'ftol': 1e-08,
'full_output': 1,
'maxfun': None,
'maxiter': 35,
'optimizer': 'powell',
'retall': 0,
'start_direc': None,
'start_params': [0, 0, 0, 0],
'xtol': 0.0001}

The dict mle_retvals contains all of the values that are returned from the solver if the full_output keyword is True. The dict mle_settings contains all of the arguments passed to the solver, including the defaults so that these can be checked after the fit. Again, all settings and returned values are documented in the fit method and in the results class, respectively.
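For example, a couple of quick post-fit checks using these attributes might look like the following (purely illustrative, using the Powell results from above):

# Did the solver report convergence, and which settings did it actually use?
converged = res_powell.mle_retvals['converged']    # True in the run above
optimizer = res_powell.mle_settings['optimizer']   # 'powell'
tolerance = res_powell.mle_settings['ftol']        # 1e-08, as passed to fit()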



Lastly, I started a GenericLikelihoodModel class. This is currently unfinished, though the basic idea is laid out. Take again the Probit example above, which uses Lee Spector's educational program data, and assume we didn't have the Probit model in statsmodels. We could use the new GenericLikelihoodModel class. There are two ways (probably more) to proceed. For those comfortable with object-oriented programming and inheritance in Python, we could subclass GenericLikelihoodModel and define our own log-likelihood method.



from scikits.statsmodels import GenericLikelihoodModel as LLM
from scipy import stats
import numpy as np

class MyModel(LLM):
    def loglike(self, params):
        """
        Probit log-likelihood
        """
        q = 2*self.endog - 1
        X = self.exog
        return np.add.reduce(stats.norm.logcdf(q*np.dot(X,params)))




Now this model could be fit, using any of the methods that only require an objective function, i.e., Nelder-Mead or Powell.



In [43]: mod = MyModel(data.endog, data.exog)

In [44]: res = mod.fit(method="nm", maxiter=500)
Optimization terminated successfully.
Current function value: 12.818804
Iterations: 439
Function evaluations: 735

In [45]: res_nm.params
Out[45]: array([ 1.62580058, 0.05172931, 1.42632242, -7.45229725])

In [46]: res.params
Out[46]: array([ 1.62580058, 0.05172931, 1.42632242, -7.45229725])




The main drawback right now is that all statistics that rely on the covariance of the parameters, etc., will use numeric gradients and Hessians, which can lessen the accuracy of those statistics. This can be overcome by providing score and hessian methods, just as loglike was provided above. Of course, for more complicated likelihood functions this can soon become cumbersome. We are working towards more accurate numerical differentiation and discussing options for automatic or symbolic differentiation.
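For the Probit case the analytic derivatives are textbook results, so a sketch of what score and hessian methods could look like alongside loglike is below (an illustration only, with a class name of my choosing, not the statsmodels implementation):

from scikits.statsmodels import GenericLikelihoodModel as LLM
from scipy import stats
import numpy as np

class MyProbit(LLM):
    def loglike(self, params):
        q = 2*self.endog - 1
        return np.add.reduce(stats.norm.logcdf(q*np.dot(self.exog, params)))

    def score(self, params):
        # Analytic gradient: sum_i lambda_i * x_i, where
        # lambda_i = q_i * phi(q_i * x_i'b) / Phi(q_i * x_i'b)
        q = 2*self.endog - 1
        w = q*np.dot(self.exog, params)
        lam = q*stats.norm.pdf(w)/stats.norm.cdf(w)
        return np.dot(lam, self.exog)

    def hessian(self, params):
        # Analytic Hessian: -sum_i lambda_i*(lambda_i + x_i'b) * x_i x_i'
        q = 2*self.endog - 1
        xb = np.dot(self.exog, params)
        lam = q*stats.norm.pdf(q*xb)/stats.norm.cdf(q*xb)
        return -np.dot(lam*(lam + xb)*self.exog.T, self.exog)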



The main advantage over simply writing your likelihood function and passing it to a solver is that you get all of the (growing number of) statistics and tests available in statsmodels right in the generic model.



I would also like to accommodate those who are less familiar with OOP and inheritance in Python. I haven't quite worked out the final design for this yet. Right now, you could do the following, though I don't think it quite meets the goal of being less complicated.


In [4]: from scikits.statsmodels.model import GenericLikelihoodModel as LLM

In [5]: import scikits.statsmodels as sm

In [6]: from scipy import stats

In [7]: import numpy as np

In [8]:

In [9]: data = sm.datasets.spector.load()

In [10]: data.exog = sm.add_constant(data.exog)

In [11]:

In [12]: def probitloglike(params, endog, exog):
   ....:     """
   ....:     Log likelihood for the probit
   ....:     """
   ....:     q = 2*endog - 1
   ....:     X = exog
   ....:     return np.add.reduce(stats.norm.logcdf(q*np.dot(X,params)))
   ....:

In [13]: mod = LLM(data.endog, data.exog, loglike=probitloglike)

In [14]: res = mod.fit(method="nm", fargs=(data.endog,data.exog), maxiter=500)
Optimization terminated successfully.
Current function value: 12.818804
Iterations: 439
Function evaluations: 735

In [15]: res.params
Out[15]: array([ 1.62580058, 0.05172931, 1.42632242, -7.45229725])




There are still a few design issues and bugs that need to be worked out with the last example, but the basic idea is there. That's all for now.