[Debtags-devel] Bayesian Tagger

Benjamin Mesing bensmail@gmx.net
Tue, 19 Oct 2004 20:47:28 +0200


Hello, 

today I've committed another version of the bayesian tagger, which is
slowly coming close to a first "real" version. I must correct my
optimistic prognosis of an average of 95 % correctness. There may be
tags where this goal can be achieved, even exceeded (uitoolkit::qt
might be one), but the total average will be lower. Some tags, like
role::utility, will give no useful results for automated tagging.
(Last measured correctness was around 70 %.)
The new version uses packages with tags from the same facet as the tag
to be tested as bad (negative) training examples. This yields much more
reliable results (many gtk packages were tagged with qt before because
of similar libraries :-).
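
A rough illustration of that selection rule, written as a Python sketch
and not taken from the Perl scripts. The data layout, the function names
and the package names are invented for the example. It also assumes that
a package carrying the target tag always counts as a good example, even
if it has further tags from the same facet, and that packages without
any tag from the facet are simply left out.

```python
def facet_of(tag):
    """Return the facet part of a 'facet::value' tag."""
    return tag.split("::", 1)[0]


def split_examples(packages, target):
    """Split tagged packages into good and bad training examples.

    packages: dict mapping package name -> set of tags (hypothetical layout)
    target:   the tag to be learned, e.g. 'uitoolkit::qt'
    """
    facet = facet_of(target)
    positive, negative = [], []
    for name, tags in sorted(packages.items()):
        if target in tags:
            positive.append(name)
        elif any(facet_of(t) == facet for t in tags):
            # Tagged within the same facet, but not with the target tag.
            negative.append(name)
        # Packages with no tag from this facet are not used at all.
    return positive, negative


# Invented sample data, not real package tags.
packages = {
    "qt-editor": {"uitoolkit::qt", "role::program"},
    "gtk-editor": {"uitoolkit::gtk", "role::program"},
    "cli-tool": {"role::utility"},
}
positive, negative = split_examples(packages, "uitoolkit::qt")
```

Here "gtk-editor" ends up as a bad example for uitoolkit::qt, while
"cli-tool" is ignored because it says nothing about the uitoolkit facet.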
For creating the test set, I have added a new script called
./create-test-set.pl. But I think I will eventually merge the create
scripts into a single script.
If you want to try the bayesian tagger, run:
        ./create-training-set.pl uitoolkit::qt
        # the next step trains with half of the data created above and
        # tests with the other half (but this is not a good test)
        ./bayesian-tagger.pl uitoolkit::qt
        ./create-test-set.pl -k 15 uitoolkit::qt
        # this does a real test and the result should be quite useful
        # don't forget the -nt here
        ./bayesian-tagger.pl -nt uitoolkit::qt
If you want to play around a little and train again using different
parameters, do not forget to remove the *db and countfile in the tags
directory.
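
For readers who want a feeling for what "train with one half, test with
the other half" means, here is a minimal naive Bayes sketch in Python.
It is not the code of bayesian-tagger.pl. The word features, the add-one
smoothing, the function names and the sample data are all assumptions
made for the illustration.

```python
import math
from collections import Counter


def train(examples):
    """examples: list of (words, is_positive) pairs.

    Returns per-class word counts and per-class document counts.
    """
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for words, label in examples:
        word_counts[label].update(words)
        doc_counts[label] += 1
    return word_counts, doc_counts


def classify(model, words):
    """Return True if the positive class gets the higher log score."""
    word_counts, doc_counts = model
    vocab = set(word_counts[True]) | set(word_counts[False])
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        # Smoothed class prior.
        score = math.log((doc_counts[label] + 1) / (total_docs + 2))
        for w in words:
            # Add-one smoothing so unseen words do not zero out the score.
            count = word_counts[label][w] + 1
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return scores[True] > scores[False]


def accuracy(model, examples):
    """Fraction of examples whose predicted label matches the given one."""
    hits = sum(classify(model, words) == label for words, label in examples)
    return hits / len(examples)


# Invented data: words from package descriptions, True means "has the tag".
examples = [
    (["qt", "kde", "widget"], True),
    (["gtk", "gnome", "widget"], False),
    (["qt", "library", "gui"], True),
    (["gtk", "library", "gui"], False),
]
half = len(examples) // 2
model = train(examples[:half])               # train with the first half
correctness = accuracy(model, examples[half:])  # test with the other half
```

With such a tiny data set the number says very little, which is the same
reason the half/half run above is not a good test.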
The bayesian tagger is available as part of the autodebtag module:
        svn co svn+ssh://alioth.debian.org/svn/debtags/autodebtag/trunk autodebtag


Greetings Ben