[Debtags-devel] tagging AI, bayes and fuzzy tags
Benjamin Mesing
bensmail@gmx.net
Tue, 30 Nov 2004 20:23:53 +0100
Hello
> I see the Bayes network idea is being worked on, even though I only follow
> the list loosely. I have seen that you have slight problems with accuracy, and
> I would assert that part of this is due to insufficient input data per package.
> So the logical next proposal would be to widen the data base for individual
> packages. Candidates include /usr/share/doc/<package> (after decompression,
> no good feeding Bayes with gzip data :)), manpages and so forth. I tend
> to be convinced these should improve general tagging accuracy.
This sounds very sensible, and I wonder why it never occurred to me!
Still, there is one problem: this needs the full packages to be available
on the training system. This might be achievable on a powerful server,
but it is not possible on my system, so there would not be much
possibility to test this stuff until such a server is available.
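
To make the proposal a bit more concrete, here is a minimal sketch in
Python of how the extra input text could be gathered for one installed
package. The function names are made up for illustration, and it assumes
the package is actually unpacked under /usr/share/doc on the training
system, which is exactly the limitation mentioned above. It only covers
the doc directory, not manpages.

```python
import gzip
from pathlib import Path


def read_doc_text(path: Path) -> str:
    """Return the text of one documentation file, decompressing .gz first."""
    if path.suffix == ".gz":
        # Feed the classifier the decompressed text, never the raw gzip bytes.
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
            return fh.read()
    # Undecodable bytes are replaced rather than raising an error.
    return path.read_text(encoding="utf-8", errors="replace")


def collect_package_text(package: str,
                         doc_root: Path = Path("/usr/share/doc")) -> str:
    """Concatenate all documentation text found for one package.

    Hypothetical helper: returns an empty string if the package has no
    documentation directory on this system.
    """
    pkg_dir = doc_root / package
    if not pkg_dir.is_dir():
        return ""
    parts = []
    for path in sorted(pkg_dir.rglob("*")):
        if path.is_file():
            parts.append(read_doc_text(path))
    return "\n".join(parts)
```

The resulting string could then be appended to the package description
before it is handed to the filter. Binary files in the doc directory would
need to be skipped in a real version.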
> Also, the
> script raises another question. How useful is a tag (say, role::utility) if
> the tags within tend to be rather unrelated? How is utility defined? I tend
> to think that tags that give very poor results with the Bayesian filter tend
> to be rather loosely defined and will be troublesome for human editors as well.
> Since there is no strong definition of role::utility, it is left to the
> judgement of the editor whether to assign it or not. However, there is also a
> user, whose judgement may be a different one, thus my worry that such poorly
> defined tags cause more harm than good.
Hmm, I tend to agree with this. Nevertheless, I think there are tags
which are quite well defined where the Bayesian filter might still fail
because the packages they cover are too diverse.
> Hmm, as I typed in the subject, I got another idea. It may be useful to add
> fuzzy tags. Currently tags are discrete values, 0 or 1, for each
> package. What about allowing the real range 0-1 there? ;) Well, this is a bit
> of an off-the-cuff idea, but IMHO it makes sense, though I'm not sure where
> its place is ;). I leave judgment of the implications of this to you, dear reader.
This sounds cool, as it would allow ranking search results with the most
relevant at the top! But I would definitely not want to implement this; let's
see what Enrico says :-) Another problem is that it would make Bayesian
tagging unsuitable.
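
Just to illustrate what ranking with fuzzy tags could look like, here is a
small Python sketch. Nothing like this exists in debtags as far as I know;
the data layout (a weight between 0 and 1 per package and tag) and the
choice to multiply the weights of all requested tags are assumptions made
only for this example.

```python
def rank_packages(fuzzy_tags, wanted_tags):
    """Rank packages by how strongly they carry all of the wanted tags.

    fuzzy_tags maps package name -> {tag: weight in [0, 1]}.
    A missing tag counts as weight 0. The score is the product of the
    weights (an assumed combination rule; taking the minimum would be
    another option).
    """
    scored = []
    for package, weights in fuzzy_tags.items():
        score = 1.0
        for tag in wanted_tags:
            score *= weights.get(tag, 0.0)
        if score > 0.0:
            scored.append((package, score))
    # Highest score first; ties are broken alphabetically by package name.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored


# Made-up packages and weights, purely for illustration.
example = {
    "pkg-a": {"role::utility": 0.9, "use::searching": 0.5},
    "pkg-b": {"role::utility": 0.4, "use::searching": 1.0},
    "pkg-c": {"role::utility": 1.0},
}
ranking = rank_packages(example, ["role::utility", "use::searching"])
```

With the current 0/1 tags every score would be either 0 or 1, so the same
function would degrade to a plain filter without any ordering.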
Greetings Ben