[Debtags-devel] AI Tagger

Mon Aug 15 06:14:34 UTC 2005

Hello Hanna

> I work on machine learning (specifically Bayesian techniques) and had
> a brief chat with Enrico about the AI tagger for Debtags at
> Debconf5.
Great! Finally someone who knows something about this subject :-))

I am not an AI expert. I have attended only one lecture introducing the
basics, and have taken a deeper look at Naive Bayes for the AI Tagger.
So I am missing much background knowledge.

> > My answer for the first question is: we could give it a try, and see if
> > it is useful. The use case I have in mind for the tagger is, offering a
> > frontend where the maintainers can enter their packages and get a
> > suggested set of tags for each of them.
> 
> Sounds sensible. How well does the tagger correspond with human
> judgement? Have you done any evaluation of whether the tags proposed
> are the same/similar/better than those proposed by a human?
I have done very little evaluation until now. All the testing I've
done is to compare the results of the tagger with the actual tagging.
Mostly the actual tagging was done by a human, or by the autodebtags
tool, which derived tags from dependencies and other easy criteria. Of
course this is not very exact, because there might be badly tagged
packages in the database (e.g. gnucash tagged with uitoolkit::qt,
uitoolkit::gtk, suite::gnome :-)
However, doing a detailed analysis requires a lot of time...
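
A rough sketch of what such a comparison could look like, assuming the
suggested tags and the tags in the database are both available as plain
sets. The function name and the example data are made up for
illustration; this is not the code of the actual tagger.

```python
def precision_recall(suggested, actual):
    """Compare a suggested tag set with the tags stored in the database."""
    suggested, actual = set(suggested), set(actual)
    hits = len(suggested & actual)
    # Precision: share of suggested tags that also appear in the database.
    precision = hits / len(suggested) if suggested else 0.0
    # Recall: share of database tags that the tagger also suggested.
    recall = hits / len(actual) if actual else 0.0
    return precision, recall

# Illustrative data only: the database entry carries a questionable Qt tag.
suggested = {"uitoolkit::gtk", "suite::gnome"}
actual = {"uitoolkit::qt", "uitoolkit::gtk", "suite::gnome"}
precision, recall = precision_recall(suggested, actual)
```

Here the recall drops below 1 only because the database entry itself
contains a doubtful tag, which is exactly why this kind of comparison is
not very exact.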

> Let me propose another way in which machine learning could be used, in
> addition to the task your tagger is designed to solve: suggesting
> packages that may be potentially of interest to a user given some
> subset of the packages they already have installed. Essentially this
> is a clustering task and could make use of tags as input data.
This is a really great idea. There is such a thing in the debtags tool:
	debtags related <package1, package2, .. packageN>
which works simply by measuring the distance between the tag set of the
given packages and the tag sets of other candidate packages.
I think the idea of clustering is very different from that of Naive
Bayes, so if you plan to work on this, there is not too much to coordinate.
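
To illustrate the tag set distance idea, here is a minimal sketch. It
assumes the Jaccard distance as the measure, which is only a guess (the
metric that debtags really uses is not stated here), and the package
names and the tag database are invented.

```python
def jaccard_distance(a, b):
    """Distance between two tag sets: 0.0 means identical, 1.0 means disjoint."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def related(query_tags, database, limit=3):
    """Return the packages whose tag sets are closest to query_tags."""
    ranked = sorted(
        database,
        # Sort by distance first, then by name to keep the order stable.
        key=lambda pkg: (jaccard_distance(query_tags, database[pkg]), pkg),
    )
    return ranked[:limit]

# Hypothetical tag database, for illustration only.
database = {
    "pkg-a": {"uitoolkit::gtk", "suite::gnome"},
    "pkg-b": {"uitoolkit::qt"},
    "pkg-c": {"uitoolkit::gtk"},
}
query = {"uitoolkit::gtk", "suite::gnome"}
closest = related(query, database, limit=2)
```

A clustering approach would group packages by such distances up front
instead of ranking them against one query at a time.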

> I'd love to have a more detailed conversation about the machine
> learning details of your tagger -- specifically, exactly what
> technique are you using? From the Debtags list archives (thanks for
> forwarding me relevant links, Enrico!) it seems that you're using
> a Naive Bayes-based technique (usually used in spam filtering).
Exactly. I don't know if there is more to say, or if this says it
all :-) Please go ahead and ask if you would like to know any details.
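
For readers who have not seen the technique, the following is a small
sketch of a Naive Bayes classifier that decides for one tag whether to
suggest it. It assumes the features are the words of a package
description and uses add-one smoothing; both are assumptions made for
this example, as are all names and the training data. The real tagger
may well differ in its features and details.

```python
import math
from collections import Counter

def train(examples, tag):
    """Count words separately for packages that carry `tag` and those that do not."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = {True: 0, False: 0}
    for words, tags in examples:
        label = tag in tags
        doc_counts[label] += 1
        word_counts[label].update(words)
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab

def log_score(words, label, model):
    """Log of prior times word likelihoods, with add-one smoothing."""
    word_counts, doc_counts, vocab = model
    total_docs = doc_counts[True] + doc_counts[False]
    score = math.log((doc_counts[label] + 1) / (total_docs + 2))
    total_words = sum(word_counts[label].values())
    for word in words:
        if word in vocab:  # words never seen in training are ignored
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
    return score

def suggests(words, model):
    """Suggest the tag if 'has tag' scores higher than 'does not have tag'."""
    return log_score(words, True, model) > log_score(words, False, model)

# Invented training data: (description words, tags of the package).
examples = [
    (["gtk", "gnome", "editor"], {"uitoolkit::gtk"}),
    (["gnome", "gtk", "panel"], {"uitoolkit::gtk"}),
    (["qt", "kde", "editor"], {"uitoolkit::qt"}),
    (["kde", "qt", "player"], {"uitoolkit::qt"}),
]
model = train(examples, "uitoolkit::gtk")
```

One such model would be trained per tag, and the suggested tag set for a
package is then the set of tags whose model says yes.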

If you have any ideas on how to improve the tagger, or proposals for
alternative approaches to AI tagging, please go ahead and tell us. I'll
be glad to hand you the lead on this project.
However, I'd like to hear the opinions of the others about how useful this
might be at all, before we both start to put a lot of effort into this.

Greetings Ben