Friday, June 11, 2010

Statsmodels: GSoC Week 3 Update

[Edit: Formatting should be fixed now. I will not be reformatting old posts though, so that they don't get reposted at Planet SciPy]



Last week was spent mainly ensuring that I pass my comps and remain a PhD student. This week was much more productive for coding. For now, all changes are in my branch and have not been merged to trunk, but I will describe the two big changes.



The first concerns the datasets package. This one is not all that exciting, but suffice it to say that the datasets are now streamlined and use the Bunch pattern to load the data. Thanks, Gaël, for pointing this out. I also rewrote a bit of David's datasets proposal from scikits-learn to reflect the current design of our datasets and our thinking about them. You can see it here (soon to be on the docs page). We are making an effort to ensure that our datasets stay similar to those of scikits-learn.
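For those unfamiliar with it, the Bunch pattern is essentially a dict whose entries are also accessible as attributes. A minimal sketch of the idea (not the actual statsmodels implementation) looks like this:

# A minimal sketch of the Bunch pattern: a dict whose keys are also
# available as attributes (not the exact code used in the datasets package).
class Bunch(dict):
    def __init__(self, **kwds):
        dict.__init__(self, kwds)
        self.__dict__ = self

# hypothetical usage: a loaded dataset exposes its pieces both ways
data = Bunch(endog=[0, 1, 1], exog=[[1.], [2.], [3.]])
assert data.endog is data['endog']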



The second change was an improvement to the fitting of maximum likelihood models and the start of a GenericLikelihoodModel class. Maximum likelihood-based models (mainly the discrete choice models in the main code base right now) can now be fit using any of the unconstrained solvers from scipy.optimize (Nelder-Mead, BFGS, CG, Newton-CG, Powell) plus Newton-Raphson. To see how this works, we can fit a simple Probit model.



In [1]: import scikits.statsmodels as sm

In [2]: data = sm.datasets.spector.load()

In [3]: data.exog = sm.add_constant(data.exog)

In [4]: res_newton = sm.Probit(data.endog, data.exog).fit(method="newton")
Optimization terminated successfully.
Current function value: 12.818804
Iterations 6

In [5]: res_nm = sm.Probit(data.endog, data.exog).fit(method="nm", maxiter=500)
Optimization terminated successfully.
Current function value: 12.818804
Iterations: 439
Function evaluations: 735

In [6]: res_bfgs = sm.Probit(data.endog, data.exog).fit(method="bfgs")
Optimization terminated successfully.
Current function value: 12.818804
Iterations: 15
Function evaluations: 21
Gradient evaluations: 21

In [7]: res_cg = sm.Probit(data.endog, data.exog).fit(method="cg", maxiter=250)
Optimization terminated successfully.
Current function value: 12.818804
Iterations: 188
Function evaluations: 428
Gradient evaluations: 428

In [8]: res_ncg = sm.Probit(data.endog, data.exog).fit(method="ncg", avextol=1e-8)
Optimization terminated successfully.
Current function value: 12.818804
Iterations: 12
Function evaluations: 14
Gradient evaluations: 12
Hessian evaluations: 12

In [9]: res_powell = sm.Probit(data.endog, data.exog).fit(method="powell", ftol=1e-8)
Optimization terminated successfully.
Current function value: 12.818804
Iterations: 12
Function evaluations: 568




All of the options for the solvers are available and are documented in the fit method. As you can see, some of the default values need to be changed to ensure (accurate) convergence. The Results objects that are returned have two new attributes.




In [10]: res_powell.mle_retvals
Out[10]:
{'converged': True,
 'direc': array([[  7.06629660e-02,  -3.07499922e-03,   5.38418734e-01,
         -4.19910465e-01],
       [  0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          0.00000000e+00],
       [  1.49194876e+00,  -6.64992809e-02,  -6.96792443e-03,
         -3.22306873e+00],
       [ -5.36227277e-02,   1.18544093e-01,  -8.75205765e-02,
         -2.42149981e+00]]),
 'fcalls': 568,
 'fopt': 12.818804069990534,
 'iterations': 12,
 'warnflag': 0}

In [11]: res_powell.mle_settings
Out[11]:
{'callback': None,
 'disp': 1,
 'fargs': (),
 'ftol': 1e-08,
 'full_output': 1,
 'maxfun': None,
 'maxiter': 35,
 'optimizer': 'powell',
 'retall': 0,
 'start_direc': None,
 'start_params': [0, 0, 0, 0],
 'xtol': 0.0001}

The dict mle_retvals contains all of the values returned by the solver when the full_output keyword is True. The dict mle_settings contains all of the arguments passed to the solver, including the defaults, so that these can be checked after the fit. Again, all settings and returned values are documented in the fit method and in the results class, respectively.
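So, for example, you can verify after the fact whether a solver converged and what tolerance it was actually given. Using the objects above (purely illustrative):

# Illustrative: inspect the stored solver information after fitting.
converged = res_powell.mle_retvals['converged']    # True in the run above
ftol_used = res_powell.mle_settings['ftol']        # 1e-08, as passed to fit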



Lastly, I started a GenericLikelihoodModel class. This is currently unfinished, though the basic idea is laid out. Take again the Probit example above, using Lee Spector's educational program data, and assume we didn't have the Probit model in statsmodels. We could use the new GenericLikelihoodModel class. There are two ways (probably more) to proceed. For those comfortable with object-oriented programming and inheritance in Python, we can subclass GenericLikelihoodModel and define our own log-likelihood method.



from scikits.statsmodels import GenericLikelihoodModel as LLM
from scipy import stats
import numpy as np

class MyModel(LLM):
    def loglike(self, params):
        """
        Probit log-likelihood
        """
        q = 2*self.endog - 1
        X = self.exog
        return np.add.reduce(stats.norm.logcdf(q*np.dot(X,params)))




Now this model can be fit using any of the methods that require only an objective function, i.e., Nelder-Mead or Powell.



In [43]: mod = MyModel(data.endog, data.exog)

In [44]: res = mod.fit(method="nm", maxiter=500)
Optimization terminated successfully.
Current function value: 12.818804
Iterations: 439
Function evaluations: 735

In [45]: res_nm.params
Out[45]: array([ 1.62580058, 0.05172931, 1.42632242, -7.45229725])

In [46]: res.params
Out[46]: array([ 1.62580058, 0.05172931, 1.42632242, -7.45229725])




The main drawback right now is that all statistics that rely on the covariance of the parameters, etc., will use numeric gradients and Hessians, which can lessen the accuracy of those statistics. This can be overcome by providing score and hessian methods, just as loglike was provided above. Of course, for more complicated likelihood functions this can quickly become cumbersome. We are working towards more accurate numerical differentiation and discussing options for automatic or symbolic differentiation.
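For the Probit example, the analytic derivatives are simple enough that this is easy to do. A sketch of the subclass with score and hessian methods added (these are just the textbook Probit gradient and Hessian, not code from my branch) might look like:

from scikits.statsmodels import GenericLikelihoodModel as LLM
from scipy import stats
import numpy as np

class MyProbit(LLM):
    def loglike(self, params):
        q = 2*self.endog - 1
        X = self.exog
        return np.sum(stats.norm.logcdf(q*np.dot(X, params)))

    def score(self, params):
        # gradient: sum_i q_i * phi(q_i*x_i'b)/Phi(q_i*x_i'b) * x_i
        X = self.exog
        XB = np.dot(X, params)
        q = 2*self.endog - 1
        L = q*stats.norm.pdf(q*XB)/stats.norm.cdf(q*XB)
        return np.dot(L, X)

    def hessian(self, params):
        # Hessian: -sum_i L_i*(L_i + x_i'b) * x_i*x_i'
        X = self.exog
        XB = np.dot(X, params)
        q = 2*self.endog - 1
        L = q*stats.norm.pdf(q*XB)/stats.norm.cdf(q*XB)
        return -np.dot(L*(L + XB)*X.T, X)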



The main advantage, as opposed to just writing your likelihood and passing it to a solver, is that you have all of the (growing number of) statistics and tests available in statsmodels right there in the generic model.
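For example, the results object that comes back already carries the familiar attributes (the names shown here are the usual statsmodels ones and are meant only as an illustration):

res.params    # the estimates
res.bse       # standard errors, here based on the numeric Hessian
res.llf       # the log-likelihood at the estimates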



I would also like to accommodate those who are less familiar with OOP and inheritance in Python. I haven't quite worked out the final design for this yet. Right now, you could do the following, though I don't think it quite meets the goal of being less complicated.


In [4]: from scikits.statsmodels.model import GenericLikelihoodModel as LLM

In [5]: import scikits.statsmodels as sm

In [6]: from scipy import stats

In [7]: import numpy as np

In [8]:

In [9]: data = sm.datasets.spector.load()

In [10]: data.exog = sm.add_constant(data.exog)

In [11]:

In [12]: def probitloglike(params, endog, exog):
   ....:     """
   ....:     Log likelihood for the probit
   ....:     """
   ....:     q = 2*endog - 1
   ....:     X = exog
   ....:     return np.add.reduce(stats.norm.logcdf(q*np.dot(X,params)))
   ....:

In [13]: mod = LLM(data.endog, data.exog, loglike=probitloglike)

In [14]: res = mod.fit(method="nm", fargs=(data.endog,data.exog), maxiter=500)
Optimization terminated successfully.
Current function value: 12.818804
Iterations: 439
Function evaluations: 735

In [15]: res.params
Out[15]: array([ 1.62580058, 0.05172931, 1.42632242, -7.45229725])




There are still a few design issues and bugs that need to be worked out with the last example, but the basic idea is there. That's all for now.

4 comments:

  1. Hey Skipper, great job!

    On the Bunch pattern, current versions of Python have something called named tuples:
    http://docs.python.org/library/collections.html#collections.namedtuple

    This class is a 'high-performance' implementation of the bunch pattern.

    Keep up the good job

    Demian

    ReplyDelete
  2. Thanks for the support.

    We are still supporting Python 2.5 for now, but I will definitely keep this on the radar for when we switch to Python 2.6 and especially Python >= 3.0.

    ReplyDelete
  3. Hi Skipper,

    Just a style comment: you seem to be mixing underscore-separated words, and attached words. This seems slightly inconsistent. I personally prefer to systematically separate words by underscore. I find it easier to read.

    ReplyDelete
  4. Any in particular you find hard to read? I try to use the underscore only if it's not readable when you put them together (to avoid too much bouncing on the shift key), but I understand that letters that blend together for me are not always the same for other people.

    From PEP 8:

    "Function Names

    Function names should be lowercase, with words separated by underscores
    as necessary to improve readability."

    ReplyDelete