[Pkg-exppsy-maintainers] [SciPy-dev] Google Summer of Code and scipy.learn (another trying)

Yaroslav Halchenko lists at onerussian.com
Mon Mar 24 14:32:31 UTC 2008


Hi Anton,

Thank you for the positive feedback. Let me introduce myself briefly: I
am one of the authors of PyMVPA, and I met Jarrod at the NiPy coding
sprint in Paris. There we also met another group (Jean B. Poline's),
which develops an analogous toolbox (closed source so far, but hopefully
open source soon thanks to our persuasion). Although we had a lot of
ideas in common, some aspects were conceptually different (unfortunately
I've forgotten the exact name of their toolbox: mindmine, or something
like that)

In the case of our PyMVPA we tried to build a framework where various ML
blocks can be combined by parametrizing classes in their constructors. That
serves a role equivalent to those GUI-based frameworks where you build your
analysis from blocks by connecting them with "lines". The whole PyMVPA
computation pipeline is initiated whenever the resultant object sees some
data (e.g. within train() of a classifier). It is also somewhat similar to
the approach taken by the MDP guys (http://mdp-toolkit.sourceforge.net/).
Like the existing scikits.learn, we also abstract all relevant data within a
Dataset class. The primary user base we originally targeted with PyMVPA is
the brain-imaging research community, thus we had to provide not only simple
blocks but also software that is obvious and straightforward to use, together
with reasonable documentation. That could cost us a little generality,
although I don't see it happening yet ;-)
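
To illustrate the constructor-parametrization idea, here is a minimal sketch
(all class names are made up for illustration, not the actual PyMVPA API):
blocks are combined by passing one object into another's constructor, and
nothing is computed until the outermost object sees data via train().

```python
import numpy as np

class ZScoreMapper:
    """Preprocessing block: z-scores each feature column."""
    def forward(self, data):
        return (data - data.mean(axis=0)) / data.std(axis=0)

class MeanClassifier:
    """Toy classifier: predicts the label whose class mean is closest."""
    def __init__(self, mapper=None):
        self.mapper = mapper          # composition happens in the constructor
        self.means_ = {}

    def train(self, data, labels):
        # the pipeline runs only now, when data is seen
        if self.mapper is not None:
            data = self.mapper.forward(data)
        for label in np.unique(labels):
            self.means_[label] = data[labels == label].mean(axis=0)

    def predict(self, data):
        if self.mapper is not None:
            data = self.mapper.forward(data)
        labels = list(self.means_)
        dists = np.array([np.linalg.norm(data - self.means_[l], axis=1)
                          for l in labels])
        return [labels[i] for i in dists.argmin(axis=0)]

clf = MeanClassifier(mapper=ZScoreMapper())   # analysis built from blocks
data = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]])
labels = np.array([0, 0, 1, 1])
clf.train(data, labels)
print(clf.predict(data))
```

The point is that swapping the mapper (or nesting further blocks) changes the
analysis without touching the classifier code itself.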

In that other toolbox, they took the approach of creating a collection of
independent blocks which take just basic data structures (i.e. numpy arrays)
as their parameters -- one for data, one for labels if it is the fit() of a
classifier. I think that might be more appropriate for scikits.learn, so that
it simply provides all the necessary building blocks for any higher-level
workflow creation and reuse, with no explicit need to decide whether to store
sparse or dense arrays, etc. It is just that all blocks should have a unified
interface within classes of the same kind (and appropriate
modularization/hierarchy, of course).
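
A rough sketch of what such a plain-numpy-arrays interface could look like
(hypothetical toy classifiers, invented here purely for illustration): every
block of the same kind honors fit(data, labels) / predict(data), so a
higher-level workflow can treat them interchangeably.

```python
import numpy as np

class NearestMeanClassifier:
    """Toy block: classify by distance to per-class mean."""
    def fit(self, data, labels):
        self.labels_ = np.unique(labels)
        self.means_ = np.array([data[labels == l].mean(axis=0)
                                for l in self.labels_])
        return self

    def predict(self, data):
        # squared distance from every sample to every class mean
        d = ((data[:, None, :] - self.means_[None, :, :]) ** 2).sum(axis=2)
        return self.labels_[d.argmin(axis=1)]

class MajorityClassifier:
    """Toy block: always predict the most frequent training label."""
    def fit(self, data, labels):
        vals, counts = np.unique(labels, return_counts=True)
        self.majority_ = vals[counts.argmax()]
        return self

    def predict(self, data):
        return np.full(len(data), self.majority_)

# Any block honoring the interface plugs into the same workflow:
data = np.array([[0.0], [1.0], [9.0], [10.0]])
labels = np.array([0, 0, 1, 1])
for clf in (NearestMeanClassifier(), MajorityClassifier()):
    print(clf.fit(data, labels).predict(data))
```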

Another aspect of scikits.learn to think about is whether it should
reimplement existing ML tools (e.g. SVMs) or perhaps just provide a unified
interface to implementations elsewhere -- like the SVMs from shogun. Probably
it should do both, since some classifiers might not yet be freely available
from external libraries (like SMLR, which we have within our PyMVPA). That
would make it possible to utilize many ML tools through a single point of
entry (i.e. scikits.learn).
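
The "interface to implementations elsewhere" route could be sketched as a
thin adapter (the "external" backend below is a trivial stand-in invented
for illustration -- think shogun's SVM with its own calling conventions):

```python
import numpy as np

class ExternalThresholdModel:
    """Stand-in for a third-party library with a foreign API
    (different method names, its own conventions)."""
    def learn(self, matrix, targets):
        self.cut_ = matrix.mean()       # pretend 'training'
    def apply(self, matrix):
        return (matrix[:, 0] > self.cut_).astype(int)

class ExternalWrapper:
    """Adapter: exposes the foreign backend through the same
    fit/predict interface as natively implemented classifiers."""
    def __init__(self, backend):
        self.backend = backend
    def fit(self, data, labels):
        self.backend.learn(data, labels)
        return self
    def predict(self, data):
        return self.backend.apply(data)

clf = ExternalWrapper(ExternalThresholdModel())
data = np.array([[0.0], [1.0], [9.0], [10.0]])
labels = np.array([0, 0, 1, 1])
print(clf.fit(data, labels).predict(data))
```

Natively implemented classifiers (the SMLR case) would then simply implement
fit/predict directly, and the user never needs to care which is which.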


And a few words about PyMVPA: I think we have a few cute implementation ideas
within PyMVPA which could be borrowed/adapted within scikits.learn (if
PyMVPA is not to become an integral part of .learn ;-))

1. State variables: a classifier might internally store more than just its
own nuisance variables, but sometimes that additional data is costly either
to store or to compute. A state variable, which from the outside looks just
like a property of an instance, can be enabled, and only then does it
actually get charged with the data. That is why PyMVPA Classifiers can store
and expose lots of additional information about the internal processing which
has already happened. For instance, in recursive feature elimination it might
be interesting not only to get an optimal feature set, but to see/analyze the
feature sensitivities at each step of the elimination. In PyMVPA's RFE, if
the state 'sensitivities' is enabled, it just appends a new sensitivity each
time, so after RFE is done they can easily be extracted from that object for
further analysis. But that is not a desired behavior by default, since it
might exhaust the memory ;-)

A little more about stateful objects:
http://pkg-exppsy.alioth.debian.org/pymvpa/manual.html#stateful-objects
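
The state-variable idea could be sketched roughly like this (a toy
illustration, not PyMVPA's actual implementation): the state looks like a
plain attribute, but gets charged with data only when explicitly enabled, so
costly by-products such as per-step RFE sensitivities are skipped by default.

```python
class StateVariable:
    """Holds a value only while enabled; raises if read when disabled."""
    def __init__(self, name):
        self.name = name
        self.enabled = False
        self._value = None

    @property
    def value(self):
        if not self.enabled:
            raise RuntimeError("state %r was not enabled" % self.name)
        return self._value

    def update(self, value):
        if self.enabled:            # skip storage when disabled
            self._value = value

class RFELike:
    """Pretend recursive-feature-elimination loop."""
    def __init__(self):
        self.sensitivities = StateVariable('sensitivities')

    def run(self, n_steps):
        if self.sensitivities.enabled:
            self.sensitivities.update([])
        for step in range(n_steps):
            sens = [step * 0.1]     # pretend per-step sensitivity map
            if self.sensitivities.enabled:
                self.sensitivities.value.append(sens)

rfe = RFELike()
rfe.sensitivities.enabled = True    # opt in to collecting the by-product
rfe.run(3)
print(rfe.sensitivities.value)
```

With the state left disabled (the default), run() does the same work but
stores nothing, and reading .value raises instead of returning stale data.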

2. Debug (or simply progress) information: for quick assessment of the
internal computational workflow, seeing what takes the longest, etc.,
it is critical imho to provide a mechanism to expose what is happening
inside a classifier/feature_selection/etc. while it is being 'processed'.

In the case of PyMVPA we provide a debug target (ID) for each interesting
class, so debugging output can be emitted for anything of interest.
For example:

*$> MVPA_DEBUG=CLF,SVM MVPA_DEBUG_METRICS=asctime,reltime,vmem ./clfs_examples.py 2>&1
Haxby 8-cat subject 1: <Dataset / float32 864 x 530 uniq: 12 chunks 8 labels>
[CLF] DBG{ 'Mon Mar 24 10:16:05 2008' '0.000 sec' 'VmSize:\t  121164 kB'}: Training classifier SVMBase(kernel_type=0, C=-1.0, degree=3, weight_label=[], probability=0, shrinking=1, weight=[], eps=1e-05, svm_type=0, p=0.1, cache_size=100, nr_weight=0, coef0=0.0, nu=0.5, gamma=0.0, enable_states=['training_time', 'predicting_time', 'predictions', 'trained_labels']) on dataset <Dataset / float32 792 x 530 uniq: 11 chunks 8 labels>
[SVM] DBG{ 'Mon Mar 24 10:16:07 2008' '1.113 sec' 'VmSize:\t  131092 kB'}:   Default C computed to be 0.001509
[CLF] DBG{ 'Mon Mar 24 10:16:12 2008' '5.240 sec' 'VmSize:\t  128032 kB'}: Predicting classifier SVMBase(kernel_type=0, C=0.00150852096501, degree=3, weight_label=[], probability=0, shrinking=1, weight=[], eps=1e-05, svm_type=0, p=0.1, cache_size=100, nr_weight=0, coef0=0.0, nu=0.5, gamma=0.00188679245283, enable_states=['training_time', 'predicting_time', 'predictions', 'trained_labels']) on data (72, 530)
[CLF] DBG{ 'Mon Mar 24 10:16:12 2008' '0.581 sec' 'VmSize:\t  127880 kB'}: Training classifier SVMBase(kernel_type=0, C=0.00150852096501, degree=3, weight_label=[], probability=0, shrinking=1, weight=[], eps=1e-05, svm_type=0, p=0.1, cache_size=100, nr_weight=0, coef0=0.0, nu=0.5, gamma=0.00188679245283, enable_states=['training_time', 'predicting_time', 'predictions', 'trained_labels']) on dataset <Dataset / float32 792 x 530 uniq: 11 chunks 8 labels>
[SVM] DBG{ 'Mon Mar 24 10:16:13 2008' '1.112 sec' 'VmSize:\t  137680 kB'}:   Default C computed to be 0.001478

So here we selected to output everything about classifiers in general, and
SVMs in particular, as well as a few useful metrics (so we could see how long
things took and whether any memory leak is happening). And MVPA_DEBUG takes
Python regexps, so =.* would enable *all* debugging output if that is
necessary.

More about debug (as well as verbose and warning):
http://pkg-exppsy.alioth.debian.org/pymvpa/manual.html#progress-tracking
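
The mechanism could be sketched like this (a simplified toy, not PyMVPA's
actual debug machinery; only the MVPA_DEBUG variable name is borrowed from
the example above): active targets come from an environment variable and are
matched as Python regexps, so ".*" enables everything while "CLF" enables
just the classifier channel.

```python
import os
import re
import sys
import time

class DebugLogger:
    """Emit messages only for targets matching the configured regexps."""
    def __init__(self, env='MVPA_DEBUG'):
        spec = os.environ.get(env, '')
        # each comma-separated entry is a regexp anchored at both ends
        self.patterns = [re.compile(p + '$') for p in spec.split(',') if p]
        self.start = time.time()

    def active(self, target):
        return any(p.match(target) for p in self.patterns)

    def __call__(self, target, msg):
        if self.active(target):
            sys.stderr.write('[%s] DBG{%.3f sec}: %s\n'
                             % (target, time.time() - self.start, msg))

os.environ['MVPA_DEBUG'] = 'CLF,SVM.*'
debug = DebugLogger()
debug('CLF', 'Training classifier ...')     # emitted
debug('SVM_LIBSVM', 'Default C computed')   # emitted (matches SVM.*)
debug('DATASET', 'loading data')            # suppressed
```

Extra metrics (relative time, VmSize, etc., as in the log above) would just
be additional fields prepended to each emitted line.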

I hope that this could be interesting/useful for someone ;-)

Cheers and keep in touch!
Yarik

On Mon, 24 Mar 2008, Anton Slesarev wrote:

>    It really looks pretty good. I think that such functionality should be
>    in scipy(or scikits). I'll try to  take it into account in my proposal.

>    On Wed, Mar 19, 2008 at 4:43 AM, Jarrod Millman
>    <[1]millman at berkeley.edu> wrote:

>      Hey Anton,
>      Sorry I haven't responded sooner; I am at a coding sprint.  Anyway,
>      I am meeting with the developers of Multivariate Pattern Analysis in
>      Python:  [2]http://pkg-exppsy.alioth.debian.org/pymvpa/
>      I think that this package looks very good.  David Cournapeau will be
>      visiting Berkeley in early April and I plan to discuss whether it
>      would make sense to merge this into scikits.learn.  So it would make
>      sense for you to take a look at the manual here:
>      [3]http://pkg-exppsy.alioth.debian.org/pymvpa/manual.html
>      I am really excited that you are planning to submit a proposal for
>      the SoC.
>      Thanks,
-- 
                                  .-.
=------------------------------   /v\  ----------------------------=
Keep in touch                    // \\     (yoh@|www.)onerussian.com
Yaroslav Halchenko              /(   )\               ICQ#: 60653192
                   Linux User    ^^-^^    [175555]




