[Debtags-devel] tagging AI, bayes and fuzzy tags

Benjamin Mesing bensmail@gmx.net
Tue, 30 Nov 2004 20:23:53 +0100


Hello

> I see the Bayes network idea is being worked on, even if I only follow the
> list loosely. I have seen that you have slight problems with accuracy, and I
> would like to assert that part of this is due to insufficient input data per
> package. So the logical next proposal would be to widen the data base for
> individual packages. Candidates include /usr/share/doc/<package> (after
> decompression, no good feeding Bayes with gzip data :)), manpages and so
> forth. And I tend to be convinced these should improve general tagging
> accuracy.
This sounds very sensible, and I wonder why it never occurred to me!
Still, there is one problem: this needs the full packages to be available
on the training system. This might be achieved on a powerful server,
but it is not possible on my system, so there would not be much
opportunity to test this until such a server is available.
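
A minimal sketch of the data-gathering step proposed above, assuming the
package's documentation is unpacked on the training system. The function
name collect_doc_text and the demo file names are made up for
illustration; they are not part of debtags or of the existing script.
The gzip files are decompressed before the text would be handed to the
Bayesian filter.

```python
import gzip
import tempfile
from pathlib import Path


def collect_doc_text(doc_dir):
    """Concatenate the text of every file below doc_dir, gunzipping *.gz files.

    Hypothetical helper: the name and behaviour are assumptions.
    """
    chunks = []
    for path in sorted(Path(doc_dir).rglob("*")):
        if not path.is_file():
            continue
        raw = path.read_bytes()
        if path.suffix == ".gz":
            # No good feeding the filter with gzip data, so decompress first.
            raw = gzip.decompress(raw)
        # Replace undecodable bytes instead of failing on binary files.
        chunks.append(raw.decode("utf-8", errors="replace"))
    return "\n".join(chunks)


# Demonstration on a throwaway directory standing in for
# /usr/share/doc/<package>.
with tempfile.TemporaryDirectory() as tmp:
    doc_dir = Path(tmp)
    (doc_dir / "README").write_text("A small command line utility.")
    (doc_dir / "changelog.gz").write_bytes(gzip.compress(b"Initial release."))
    text = collect_doc_text(doc_dir)

print(text)
```

The resulting string could be appended to the package description before
training. Manpages would need similar treatment.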

> Also, the
> script raises another question. How useful is a tag (say, role::utility), if
> the tags within tend to be rather unrelated? How is utility defined? I tend
> to think that tags that give very poor results with the Bayesian filter tend
> to be rather loosely defined and will be troublesome for human editors as
> well. Since there is no strong definition of role::utility, it is left to the
> judgement of the editor whether to assign it or not. However, there is also a
> user, whose judgement may be different, thus my worry that such poorly
> defined tags cause more harm than good.
Hmm, I tend to agree with this. Nevertheless, I think there are
cases which are quite well defined but where the Bayesian filter might
fail due to too much diversity.

> Hmm, as I typed in the subject, I got another idea. It may be useful to add
> fuzzy tags. Currently tags are discrete values, 0 or 1, for each
> package. What about allowing the real range 0-1 there? ;). Well, this is a bit
> of an off-the-cuff idea, but IMHO it makes sense, though I'm not sure where
> its place is ;). I leave judgment of the implications of this to you, dear
> reader.
This sounds cool, as it would allow ranking searches with the most
relevant results at the top! But I would definitely not want to implement
this, let's see what Enrico says :-) Another problem is that it would
make Bayesian tagging unsuitable.
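
To illustrate the ranking idea, here is a small sketch that assumes fuzzy
tags are stored as a weight between 0 and 1 per package and tag. The
package names, the weights, the second tag and the function rank_by_tag
are all invented for the example; nothing like this exists in debtags.

```python
def rank_by_tag(fuzzy_tags, tag):
    """Return (package, weight) pairs carrying tag, most relevant first."""
    hits = [
        (pkg, weights[tag])
        for pkg, weights in fuzzy_tags.items()
        if weights.get(tag, 0.0) > 0.0
    ]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Hypothetical data: each package maps tags to a weight in the range 0-1.
fuzzy_tags = {
    "pkg-b": {"role::utility": 0.3, "role::documentation": 0.8},
    "pkg-a": {"role::utility": 0.9},
    "pkg-c": {"role::documentation": 1.0},
}

ranking = rank_by_tag(fuzzy_tags, "role::utility")
for pkg, weight in ranking:
    print(pkg, weight)
```

With discrete 0/1 tags, every match would carry the same weight and the
order of the results would be arbitrary.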

Greetings Ben