Google Summer of Code

Thu Jun 1 09:07:27 UTC 2006

> [past experience] One problem I had when I tried to use dbacl to review
> tags is that in dbacl the size of the training data matters a lot.  This
> was a problem because the package data for {all packages with tag A} is
> usually much smaller than the package data for {all packages without tag
> A}, and that would produce a biased dbacl discriminator.
I worked around this by training on only as many packages without a
given tag as there are packages with it. Note that, IIRC, in theory the
good:bad training ratio should match the likelihood of the tag being on
a package. However, I found that 50:50 (or perhaps I used 50:100; I made
the ratio adjustable) worked quite well.
Anyway, if you have a small training set, the results will always be bad.
Also, some tags/facets are handled better by the AI than others. A bad
example is the distinction between role::sw:application and
role::sw:utility, which even humans don't get right; a good example is
x11::font, which was always handled well.
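
For illustration, the balancing step described above could be sketched
roughly as follows in Python. This is only a minimal sketch, not the
original script: the function name, the neg_per_pos parameter and the
fixed seed are made up for the example, and passing the resulting sets
to dbacl is left out entirely.

```python
import random


def balance_training_set(with_tag, without_tag, neg_per_pos=1.0, seed=0):
    """Downsample the packages without the tag (hypothetical helper).

    neg_per_pos is the adjustable ratio: 1.0 keeps as many untagged
    packages as tagged ones (50:50), 2.0 keeps twice as many (50:100).
    """
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    wanted = int(round(len(with_tag) * neg_per_pos))
    # Never ask for more untagged packages than actually exist.
    wanted = min(wanted, len(without_tag))
    sampled = rng.sample(list(without_tag), wanted)
    return list(with_tag), sampled
```

With 3 tagged and 100 untagged packages, the default call keeps 3
untagged packages for training, and neg_per_pos=2.0 keeps 6.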

Best regards 

Ben