[Debtags-devel] AI-Tagging further steps

Benjamin Mesing bensmail@gmx.net
Thu, 28 Oct 2004 20:54:12 +0200


Hello,

the svn version of bayesian-tagger has reached a state where all the
features I planned are implemented. It can now be trained with single
packages, avoids adding the same package twice, and lets packages be
removed again (e.g. if they were accidentally trained incorrectly).
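
To make the three operations concrete, here is a minimal Python sketch of
a per-tag model. It is not the bayesian-tagger code (which is a Perl
script); the class TagModel, its method names and the whitespace
tokenizer are assumptions made up for illustration only.

```python
import math
from collections import Counter


class TagModel:
    """Toy word-count model for ONE tag (hypothetical, for illustration)."""

    def __init__(self):
        # package name -> (is_tagged, tokens); remembered so it can be undone
        self.trained = {}
        # word counts for packages with the tag (True) and without (False)
        self.words = {True: Counter(), False: Counter()}

    def train(self, package, description, is_tagged):
        """Add a single package; refuse to add the same package twice."""
        if package in self.trained:
            return False
        tokens = description.lower().split()
        self.trained[package] = (is_tagged, tokens)
        self.words[is_tagged].update(tokens)
        return True

    def untrain(self, package):
        """Remove a package again, e.g. after it was trained incorrectly."""
        if package not in self.trained:
            return False
        is_tagged, tokens = self.trained.pop(package)
        counts = self.words[is_tagged]
        for tok in tokens:
            counts[tok] -= 1
            if counts[tok] <= 0:
                del counts[tok]
        return True

    def log_odds(self, description):
        """Naive Bayes log-odds with add-one smoothing; > 0 favours the tag."""
        vocab = len(set(self.words[True]) | set(self.words[False]))
        totals = {c: sum(self.words[c].values()) for c in (True, False)}
        n = Counter(is_tagged for is_tagged, _ in self.trained.values())
        score = math.log((n[True] + 1) / (n[False] + 1))
        for tok in description.lower().split():
            p_yes = (self.words[True][tok] + 1) / (totals[True] + vocab + 1)
            p_no = (self.words[False][tok] + 1) / (totals[False] + vocab + 1)
            score += math.log(p_yes / p_no)
        return score
```

Keeping the tokens of every trained package around is what makes removal
exact in this sketch; whether bayesian-tagger stores its data this way is
not stated here.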

We now have to think about what further steps we want to take. Using it in
the autodebtags script to verify the results is something that I
mentioned before, and I guess it would make sense. While testing I have
already discovered some packages which were incorrectly tagged by
autodebtags (e.g. snd-gtk-alsa has uitoolkit::qt). But to do this we
need to set up a place where the databases for the different tags can
be stored so that they can be distributed (the databases take around
500kB - 1MB for each tag).
Another use for the bayesian-tagger might be to help developers (or
taggers) decide which tags they should use for their package. This needs
a web frontend - perhaps a CGI script might come in handy here. If we
later want to exchange the AI engine, perhaps for something Erich will
create (btw. congratulations Erich :-), the web interface will still be
useful. So this shouldn't be a waste of time, even if my AI proves to be
too inaccurate. The web frontend should also make use of the autodebtags
script.

Some words about accuracy. The results are still not overwhelming (and
probably worse than those achieved with autodebtags, for the
cases autodebtags can handle) but should give rough guidance for
tagging. However, it is essential to allow further training, as there are
a lot of tags which don't have more than 50 packages in them (again a
web frontend could help here).
I will give you the stats for some random tags that I tested, so you
can get a feeling for how useful bayesian-tagger might be. The most
significant numbers are matches and mismatches:

        ./bayesian-tagger.pl role::server
        Tested packages: 491
        Expected to be good: 246
        Expected to be bad: 245
        Matches: 428 ^= 0.871690427698574
        Mismatches: 63 ^= 0.128309572301426
        Expected good, but wielded bad: 20 ^= 0.0813008130081301
        Expected bad, but wielded good: 43 ^= 0.175510204081633       
        
        ./bayesian-tagger.pl -nt implemented-in__c
        Tested packages: 99
        Expected to be good: 50
        Expected to be bad: 49
        Matches: 63 ^= 0.636363636363636
        Mismatches: 36 ^= 0.363636363636364
        Expected good, but wielded bad: 18 ^= 0.36
        Expected bad, but wielded good: 18 ^= 0.36734693877551
        
        ./bayesian-tagger.pl -nt media__mail
        Tested packages: 271
        Expected to be good: 136
        Expected to be bad: 135
        Matches: 230 ^= 0.848708487084871
        Mismatches: 41 ^= 0.151291512915129
        Expected good, but wielded bad: 6 ^= 0.0441176470588235
        Expected bad, but wielded good: 35 ^= 0.259259259259259
        
        ./bayesian-tagger.pl -nt uitoolkit__qt
        Tested packages: 648
        Expected to be good: 324
        Expected to be bad: 324
        Matches: 616 ^= 0.950617283950617
        Mismatches: 32 ^= 0.0493827160493827
        Expected good, but wielded bad: 24 ^= 0.0740740740740741
        Expected bad, but wielded good: 8 ^= 0.0246913580246914
        
        ./bayesian-tagger.pl special::meta
        Tested packages: 81
        Expected to be good: 41
        Expected to be bad: 40
        Matches: 76 ^= 0.938271604938272
        Mismatches: 5 ^= 0.0617283950617284
        Expected good, but wielded bad: 3 ^= 0.0731707317073171
        Expected bad, but wielded good: 2 ^= 0.05
        
        ./bayesian-tagger.pl use__configuring
        Tested packages: 126
        Expected to be good: 63
        Expected to be bad: 63
        Matches: 112 ^= 0.888888888888889
        Mismatches: 14 ^= 0.111111111111111
        Expected good, but wielded bad: 8 ^= 0.126984126984127
        Expected bad, but wielded good: 6 ^= 0.0952380952380952
        
        ~/lang/perl/autodebtag> ./bayesian-tagger.pl hwtech::cd
        Tested packages: 40
        Expected to be good: 20
        Expected to be bad: 20
        Matches: 35 ^= 0.875
        Mismatches: 5 ^= 0.125
        Expected good, but wielded bad: 1 ^= 0.05
        Expected bad, but wielded good: 4 ^= 0.2
        
        
        ./bayesian-tagger.pl interface__commandline
        Tested packages: 104
        Expected to be good: 52
        Expected to be bad: 52
        Matches: 69 ^= 0.663461538461538
        Mismatches: 35 ^= 0.336538461538462
        Expected good, but wielded bad: 11 ^= 0.211538461538462
        Expected bad, but wielded good: 24 ^= 0.461538461538462
        
        And finally there comes:
        
        ~/lang/perl/autodebtag> ./bayesian-tagger.pl data__font
        Tested packages: 30
        Expected to be good: 15
        Expected to be bad: 15
        Matches: 30 ^= 1
        Mismatches: 0 ^= 0
        Expected good, but wielded bad: 0 ^= 0
        Expected bad, but wielded good: 0 ^= 0
        
        *grin*
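
For reading the numbers above: the ratios are consistent with matches
and mismatches being divided by the number of tested packages, and the
two "Expected ..., but wielded ..." lines being divided by the size of
their own class (e.g. 20/246 and 43/245 for role::server). The Python
sketch below recomputes them from the raw counts. The function
summarize and its key names are hypothetical and not part of
bayesian-tagger.

```python
def summarize(expected_good, expected_bad, good_as_bad, bad_as_good):
    """Recompute the ratios of one test run from its four raw counts."""
    tested = expected_good + expected_bad
    mismatches = good_as_bad + bad_as_good
    matches = tested - mismatches
    return {
        "match_rate": matches / tested,
        "mismatch_rate": mismatches / tested,
        # "Expected good, but wielded bad": share of the good class missed
        "false_negative_rate": good_as_bad / expected_good,
        # "Expected bad, but wielded good": share of the bad class accepted
        "false_positive_rate": bad_as_good / expected_bad,
    }


# Counts taken from the role::server run above.
stats = summarize(expected_good=246, expected_bad=245,
                  good_as_bad=20, bad_as_good=43)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

Because the last two ratios use different denominators, they do not add
up to the overall mismatch ratio.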
        
Also note that the results might improve a little, as some false
positives might turn out to be correct. Another thing to keep in mind
is that false positives are preferable to false negatives: false
positives show irrelevant results to the user, but false negatives
hide relevant packages from the user.
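
One way such a preference could be expressed, assuming the classifier
produces a probability-like score per package, is to lower the decision
threshold so that fewer relevant packages are hidden at the cost of a
few more irrelevant ones. The Python sketch below only illustrates this
trade-off. The scores, package names and thresholds are made up, and
nothing here says that bayesian-tagger has such a threshold.

```python
def classify(scores, threshold):
    """Return the packages whose score reaches the threshold."""
    return {pkg for pkg, score in scores.items() if score >= threshold}


def errors(predicted, truly_tagged):
    """Return (false positives, false negatives) for one prediction set."""
    false_pos = len(predicted - truly_tagged)   # shown, but irrelevant
    false_neg = len(truly_tagged - predicted)   # relevant, but hidden
    return false_pos, false_neg


# Made-up scores for four hypothetical packages.
scores = {"pkg-a": 0.9, "pkg-b": 0.6, "pkg-c": 0.4, "pkg-d": 0.35}
truly_tagged = {"pkg-a", "pkg-c"}

strict = errors(classify(scores, 0.5), truly_tagged)    # (1, 1)
lenient = errors(classify(scores, 0.3), truly_tagged)   # (2, 0)
print("threshold 0.5:", strict)
print("threshold 0.3:", lenient)
```

With the lower threshold no relevant package is hidden any more, while
one additional irrelevant package is shown.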

Thanks for reading through the whole mail.

Greetings Ben