[Pkg-exppsy-maintainers] Q: Crossvalidation feature selection

Michael Hanke michael.hanke at gmail.com
Thu Dec 20 19:13:33 UTC 2007


Hi folks,

[sorry to be late]

On Thu, Dec 20, 2007 at 10:26:20AM -0500, Per B. Sederberg wrote:
> Howdy Yarik:
> 
> Thanks for the answers!  See my comments down below...
> 
> On Dec 20, 2007 10:00 AM, Yaroslav Halchenko <debian at onerussian.com> wrote:
<snip>
> > > 1) I'd like to run feature selection (I'm happy to use anything in
> > > there, such as the ANOVA or something fancier) on each training set of
> > > a N-Fold cross validation run.  I'd also like to save the mask of
> > > those features for later analysis.  Ideally, I'd like to specify a
> > > constant number of features (say 1000) to keep for each fold.
> > How do you know that you need 1000?
> >
> It's just an example, but one of the things I'm doing first is trying
> to replicate some basic classification results that I performed with
> the matlab mvpa.  For those analyses I found that approx 1000 voxels
> gave good classification, so I'd like to try to classify with
> something similar.  I totally agree that the ideal mechanism is to use
> some form of RFE.
> 
> > In any case, Michael would correct me if I am wrong, but so far we don't
> > have a Classifier which would do feature selection, i.e. you have to
> > loop through the splits manually and run RFE FeatureSelection (using
> > some SensitivityAnalyzer such as OneWayAnova or LinearSVMWeights if you
> > use SVM) on each split.
> >
> 
> OK, I can certainly do that loop and keep track of the results myself.
>  I just thought some version may already be there (and I see it may be
> soon :))
Yeah, we don't have that yet, but basically a CV with RFE and ANOVA comes
down to this ('data' is your dataset):


------------------

import numpy as N

from mvpa.clfs.svm import LinearNuSVMC
from mvpa.datasets.splitter import NFoldSplitter
from mvpa.algorithms.rfe import RFE
from mvpa.algorithms.anova import OneWayAnova
from mvpa.clfs.transerror import TransferError
from mvpa.algorithms.featsel import \
     StopNBackHistoryCriterion, FractionTailSelector

aov = OneWayAnova()
clf = LinearNuSVMC()
terr = TransferError(clf)  # clf must exist before it is wrapped here

rfe = RFE(aov,
          terr,
          feature_selector=FractionTailSelector(0.1),
          stopping_criterion=StopNBackHistoryCriterion(50, lateminimum=True))

for i, split in enumerate(NFoldSplitter(1)(data)):
    strain, stest = rfe(*split)
    # mark every surviving feature and map the mask back into volume space
    best_mask = strain.map2Nifti(N.ones(strain.nfeatures, dtype='int16'))
    best_mask.save('best_mask_%i.nii.gz' % i)

------------------
(I hope it runs ;-)

This will store the best RFE feature mask of each fold as a NIfTI file.

It looks like we need a new 'StoppingCriterion' subclass to stop when n
features are left!
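For the record, such a criterion is trivial to sketch. Something along
these lines (pure Python; the class name, the call signature, and the way
RFE would invoke it are my assumptions here, not the actual PyMVPA API):

```python
class StopNFeaturesCriterion:
    """Hypothetical stopping criterion: halt the elimination loop once
    at most `nfeatures` features survive. Names and signature are made
    up for illustration, not the real PyMVPA interface."""
    def __init__(self, nfeatures):
        self.nfeatures = nfeatures

    def __call__(self, errors, nfeatures_left):
        # RFE would call this after each elimination step;
        # returning True means "stop eliminating now"
        return nfeatures_left <= self.nfeatures


stop = StopNFeaturesCriterion(1000)
print(stop([], 5000))   # False -- keep pruning
print(stop([], 1000))   # True  -- target reached, stop here
```

That would directly cover Per's "keep exactly ~1000 features per fold"
use case once wired into RFE.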

> > I am going to hack a little wrapper called FeatureSelectionClassifier
> > which would probably make use of the already existing MappedClassifier
> > parametrized with a MaskMapper (using the mask for the features given
> > by the FeatureSelection algorithm). We just need a good example
> > afterwards for easier digestion.
> >
> 
> I'm working on an analysis right now that could easily be generalized
> into an example once this classifier is in there.  Let me know when
> you want me to try it out.
I want!

> > > 2) I'd like to keep the classifier predictions and values for each
> > > test sample of each fold.  This, too, for later inspection.
> > enable states "predictions" and "values" for the classifier at hand and
> > access them later on after it has been trained.
> >
> 
> If I do this for a CV, will it actually return N-folds predictions and
> values in the results?
> 
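Until that is wrapped up nicely, the manual bookkeeping is
straightforward. A toy sketch of the pattern (plain Python with a fake
sign-based "classifier" standing in for the real thing; nothing here is
PyMVPA-specific):

```python
# (train, test) index pairs, as a splitter would produce them
folds = [([0, 1], [2]), ([0, 2], [1]), ([1, 2], [0])]
data = [0.1, -0.3, 0.7]

results = {}
for i, (train, test) in enumerate(folds):
    # stand-in "classifier": decision value is the raw score,
    # prediction is its sign
    values = [data[j] for j in test]
    predictions = [1 if v > 0 else -1 for v in values]
    # keep both around for later inspection, per fold
    results[i] = {'predictions': predictions, 'values': values}

print(results[0])  # {'predictions': [1], 'values': [0.7]}
```

With real classifiers you would pull the same two pieces out of the
enabled states after each fold instead of computing them by hand.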
> > > 3) I'd like to know what is going on a little better.  How do I turn
> > > up a higher level of verbosity so that, for example, it tells me which
> > > fold it's currently on in the crossvalidation or which sphere it's on
> > > in the searchlight?
> > Now we have 3 types of messages (I guess I should have placed such a
> > description in the manual as well... at least as a starting point...
> > Michael - I will do that)
Cool, thanks!

<snip>

> > > 4) I'm training on subsets of a recall period, but it would be great
> > > to test on every sample from the left-out chunk, returning the
> > > predictions and values for each sample.
> > I am not clear what you mean... can you provide an example listing
> > which samples you train on, which you transfer to?
> >
> 
> Let's say I have 6 runs of data.  I'm actually only training and
> testing (via N-Fold cross validation) a subset of the TRs for each
> run.  However, for a single cross validation fold, I sometimes like to
> take the classifier trained on the selected TRs from the 5 training
> runs and then test the output of the classifier on every TR in the
> testing run.  This is not to test accuracy, but to make a cool
> visualization of the classifier over time and to see how it
> generalized to other parts of the run.
> 
> A specific thing that I have done in the past is to train a classifier
> to distinguish between semantic versus episodic memory retrieval and
> then I tested it on TRs where someone was performing a math task.
> This was a great control because the classifier was at chance for
> predicting math TRs, but was able to distinguish when people were
> actually performing retrievals.
> 
> I know how to do this if I'm performing the cross validation myself,
> but it might be cool to eventually be able to test a classifier on a
> different subset of TRs than those used for training during cross
> validation and then return the prediction values.
If I got you right, there is no problem. If you do the CV as described
above, you can take the RFE'd datasets ('strain', 'stest') and use
'strain' to train your classifier. You can also use the mapper in
'strain' to map any other data into the feature space of 'strain' and
test it on your trained classifier. I hope that helps. If not, I'll try
to explain again in more detail.
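In toy form, with plain NumPy standing in for the dataset/mapper
machinery (all variable names here are made up; only the idea carries
over to the real API):

```python
import numpy as np

rng = np.random.default_rng(42)

n_train_trs, n_voxels = 20, 50
train = rng.normal(size=(n_train_trs, n_voxels))

# Pretend RFE kept the first 10 voxels on this training split;
# this boolean mask plays the role of the dataset's mapper.
kept = np.arange(n_voxels) < 10
strain_like = train[:, kept]            # reduced training set, as 'strain' above

# Every TR of the left-out run (or of a math-task block, etc.) goes
# through the *same* selection before it is handed to the classifier
# that was trained on the reduced training set:
full_run = rng.normal(size=(100, n_voxels))
mapped = full_run[:, kept]

print(strain_like.shape, mapped.shape)  # (20, 10) (100, 10)
```

The point is simply that train-time feature selection defines a mapping,
and anything you want predictions for must pass through that same
mapping first.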

Cheers,

Michael


-- 
GPG key:  1024D/3144BE0F Michael Hanke
http://apsy.gse.uni-magdeburg.de/hanke
ICQ: 48230050


