[Debtags-devel] AI Tagger

Benjamin Mesing bensmail at gmx.net
Sun Aug 14 19:54:47 UTC 2005


Today I played around with the AI tagger again. I've adjusted some
small things, but I do not intend to invest much more time in it
before its future has been discussed.

Currently the AI tagger is quite usable. I have also provided a web
interface, which I announced some months ago.

I think the first thing that must be discussed is the purpose of the AI
tagger. The first question is: do we need it at all? And the second: if
so, what for?

My answer to the first question is: we could give it a try and see if
it is useful. The use case I have in mind for the tagger is a frontend
where maintainers can enter their packages and get a suggested set of
tags for each of them. I've attached screenshots of the web interface
for this use case.

Now, a word about how the tagger works. Like most AI approaches, the
tagger needs to be trained with some known data. Currently there is a
set of scripts that automates the creation of training data from the
packages already tagged. This data is used to train the tagger. Each
tag has its own database, where the information learned for that tag is
stored. The information used for training is what is found in the
Packages file.
All this means that a training process must be performed once for each
tag. With the approach described above this can be fully automated, but
it takes some processor time and a little space (around 200-400kB per
tag, depending on the number of packages used for training).
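
The per-tag training setup described above can be sketched roughly as
follows. This is only an illustration in Python, not the actual scripts:
the post does not name the learning algorithm or the form of the
training data, so the split into positive and negative examples, the
function name build_training_sets, and the sample packages and tags are
all assumptions.

```python
def build_training_sets(packages):
    """Build one training set per tag from already tagged packages.

    `packages` maps a package name to (description, set_of_tags).
    Assumption: each tag is trained as its own yes/no problem, with
    tagged packages as positive and all others as negative examples.
    """
    all_tags = set()
    for _description, tags in packages.values():
        all_tags |= tags

    training_sets = {}
    for tag in sorted(all_tags):
        positive = [d for d, tags in packages.values() if tag in tags]
        negative = [d for d, tags in packages.values() if tag not in tags]
        training_sets[tag] = {"positive": positive, "negative": negative}
    return training_sets


# Made-up sample data standing in for entries of the Packages file.
sample = {
    "foo-editor": ("A small text editor", {"use::editing"}),
    "bar-viewer": ("An image viewer", {"use::viewing"}),
    "baz-suite": ("Edit and view images", {"use::editing", "use::viewing"}),
}

for tag, examples in build_training_sets(sample).items():
    print(tag, len(examples["positive"]), len(examples["negative"]))
```

One training set per tag is also why the work and the disk space grow
with the number of tags.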

Finally, here are some statistics from a test run I performed today. I
trained and tested 22 randomly selected tags. The results are
summarized below:
        Tested packages: 1512
        Matches: 1182 ^= 0.781746031746032
        Mismatches: 179 ^= 0.118386243386243
        Unsure: 151 ^= 0.0998677248677249
        Average match percentage: 0.76898062445829
        Average mismatch percentage: 0.151154975979158
        Average unsure percentage: 0.0798643995625524
        False positives: 124 ^= 0.082010582010582
        Average false positive percentage: 0.135687706904971
        False negatives: 55 ^= 0.0363756613756614
        Average false negative percentage: 0.18310061616581

Here the "Average {match,mismatch,unsure} percentage" is the sum of the
per-tag percentages divided by the number of tags tested. The full
results can be seen in the attached test-results-2005-08-14-beauty file.
Note that the percentages depend strongly on the parameters used for
training and testing, but the above should give a rough estimate of the
accuracy. I also expect the results to improve when more packages can
be used for training.
Also note that with an overall false positive rate of about 0.08 (about
0.14 when averaged per tag) and roughly 400 tags, we get about
0.08*400=32 wrongly proposed tags per package. This seems like a lot.
We could restrict the tagger to omit tags where the tests indicated a
precision of less than 90% (giving a false positive rate of perhaps
0.04-0.05). Still, the signal-to-noise ratio will probably remain < 1.
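
To make the difference between the overall figures and the "Average"
figures concrete, and to redo the 0.08*400 estimate, here is a small
Python sketch. It is not the test script itself: the class and function
names are invented for illustration, the toy counts are made up, and
the estimate assumes that every tag is tried on every package.

```python
from dataclasses import dataclass


@dataclass
class TagResult:
    """Test counts for one tag (field names are illustrative)."""
    matches: int
    mismatches: int
    unsure: int

    @property
    def tested(self) -> int:
        return self.matches + self.mismatches + self.unsure


def pooled_rate(results, field):
    """Sum the counts over all tags first, then divide once."""
    total = sum(r.tested for r in results)
    return sum(getattr(r, field) for r in results) / total


def per_tag_average(results, field):
    """Compute the rate for each tag, then take the mean of the rates."""
    rates = [getattr(r, field) / r.tested for r in results]
    return sum(rates) / len(rates)


def expected_wrong_tags(false_positive_rate, number_of_tags):
    """Expected wrongly proposed tags per package if every tag is tried."""
    return false_positive_rate * number_of_tags


# Toy counts, not taken from the test run: one large tag, one small tag.
toy = [TagResult(90, 5, 5), TagResult(5, 4, 1)]
print(pooled_rate(toy, "matches"))       # 95 / 110
print(per_tag_average(toy, "matches"))   # (0.9 + 0.5) / 2

# Overall false positive rate from the statistics above: 124 of 1512.
print(expected_wrong_tags(124 / 1512, 400))
```

A tag with few test packages weighs as much as a large one in the
per-tag average, which is one reason the two kinds of figures differ.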

Currently the web tagger is implemented as a collection of HTML pages,
Perl modules and Perl CGI scripts run by a Tomcat server. For broader
use, it must be placed on a server running 24/7. Is such
infrastructure available for test purposes anywhere? Also, it currently
uses a CPAN Perl module not yet packaged by Debian.
I think one day should suffice to perform the training for all tags on
my PC. However, I have no idea how long it will take to test a package
against 400 tags, because the cost grows linearly with the number of
tags. So the whole thing might be for nothing...
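
Since the cost is linear in the number of tags, one way to get a rough
answer would be to time a package against a few tags and scale up. The
Python sketch below only illustrates that idea: the classify callable
stands in for whatever tests one package against one tag, and both
function names are hypothetical. No timing has actually been measured.

```python
import time


def time_classification(classify, package_text, tags):
    """Return the seconds needed to test one package against `tags`.

    `classify(package_text, tag)` is a placeholder for the real
    per-tag test; it is not part of the existing tagger.
    """
    start = time.perf_counter()
    for tag in tags:
        classify(package_text, tag)
    return time.perf_counter() - start


def extrapolate_seconds(sample_seconds, sample_tags, total_tags):
    """Scale a measured time linearly to a larger number of tags."""
    if sample_tags <= 0:
        raise ValueError("sample_tags must be positive")
    return sample_seconds / sample_tags * total_tags


# Example with a dummy classifier that does no real work.
elapsed = time_classification(lambda text, tag: None, "some package", range(22))
print(extrapolate_seconds(elapsed, 22, 400))
```

This ignores fixed per-request overhead such as loading the per-tag
databases, so it would give a rough estimate at best.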

OK, I have babbled enough for now. Please tell me your thoughts.


-------------- next part --------------
A non-text attachment was scrubbed...
Name: web-ai-tagger-1.png
Type: image/png
Size: 14055 bytes
Desc: not available
Url : http://lists.alioth.debian.org/pipermail/debtags-devel/attachments/20050814/364293e5/web-ai-tagger-1-0001.png
-------------- next part --------------
A non-text attachment was scrubbed...
Name: web-ai-tagger-2.png
Type: image/png
Size: 27288 bytes
Desc: not available
Url : http://lists.alioth.debian.org/pipermail/debtags-devel/attachments/20050814/364293e5/web-ai-tagger-2-0001.png
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test-results-2005-08-14-beauty.gz
Type: application/x-gzip
Size: 1935 bytes
Desc: not available
Url : http://lists.alioth.debian.org/pipermail/debtags-devel/attachments/20050814/364293e5/test-results-2005-08-14-beauty-0001.bin
