[Debtags-devel] AI for tag generation

Benjamin Mesing bensmail@gmx.net
Thu, 30 Sep 2004 10:00:08 +0200


Hello,


> >         non-spam training is quite useless, but I might be wrong there.
> 
> You are wrong there. When it comes to differentiating similar terms
> this gets important. I.e. when you have one word which is
> characteristic for two cases.
Thanks for the enlightenment.


> With bayesian filters, the best approach is to use just *everything*
> and have the filter learn which of it is useless (for example the
> "Package:" word)
> Well, dropping the record markers probably is a good idea, i.e. s/^[^ ]*://
Well, that is mainly what my little program does. It also drops things
like size and MD5sum, which are definitely of no use.
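
To make this concrete, here is a rough Python sketch of such a filter. It
is not the actual program: the function name clean_record, the DROP_FIELDS
set and the handling of continuation lines are assumptions made for
illustration. Only the marker-stripping regex and the Size/MD5sum fields
come from this thread.

```python
import re

# Fields assumed to carry no tagging information. Size and MD5sum are the
# ones named in the thread; anything else added here would be a guess.
DROP_FIELDS = {"Size", "MD5sum"}

# Python equivalent of s/^[^ ]*:// : strip a leading "Field:" marker.
MARKER = re.compile(r"^[^ ]*:")


def clean_record(record):
    """Return the record text with markers stripped and useless fields dropped."""
    kept = []
    skipping = False
    for line in record.splitlines():
        if line.startswith((" ", "\t")):
            # Continuation line: it belongs to the previous field.
            if not skipping:
                kept.append(line.strip())
            continue
        name = line.split(":", 1)[0]
        skipping = name in DROP_FIELDS
        if not skipping:
            kept.append(MARKER.sub("", line, count=1).strip())
    return "\n".join(kept)
```

Everything that survives is handed to the filter unchanged, so it can
still learn by itself which of the remaining words are useless.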

> It is more important to define what makes up for good tokens. For
> example, we should not use a dash as token-separator (to catch package
> names as tokens)
I considered white space to be the token separator. It never occurred
to me that there might be other possibilities :-) I agree with not using
the dash as a separator here. Breaking package names as Enrico suggests
could destroy the strong correlation of some related packages, e.g. if a
package depends on another one whose name contains a dash. But perhaps
we should use both for package names, i.e. add the variant with and the
variant without the dash. A problem might be that this would give more
weight to the entry, but perhaps this is not so bad, as package names
and dependencies are quite a strong hint towards the tags of a package.
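
A small Python sketch of that "use both" idea, under a few assumptions:
the function name tokenize is hypothetical, "without the dash" is read
here as "split at the dash", and stripping punctuation and lowercasing
are additions that were not discussed in this thread.

```python
from typing import List


def tokenize(text: str) -> List[str]:
    """Split on white space; keep dashed names whole and also add their parts."""
    tokens = []
    for raw in text.split():
        # Assumption: strip surrounding punctuation and lowercase the word.
        word = raw.strip(",;()").lower()
        if not word:
            continue
        tokens.append(word)  # e.g. "libapt-pkg" stays one token
        if "-" in word:
            # Also emit the pieces, so "libapt" and "pkg" are seen as well.
            tokens.extend(part for part in word.split("-") if part)
    return tokens
```

With this, a dashed name yields three tokens where a plain name yields
one, which is exactly the extra weight mentioned above.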


> I'm wondering whether we should actually try not to take the "naive
> bayes" approach, but maybe take one level of complexity more (note:
> this increases complexity from O(n) to O(n^2) in terms of
> number-of-tokens) Maybe we'll need to add a significance value, too,
> for dealing with rarely-encountered tokens and tags.
Well, the main problem I see here now is not the computational
complexity but the level of knowledge that is required. I could have
imagined developing something based on the naive approach myself
(perhaps using the bmf tool), but a more sophisticated one is quite a
bit over my head. You seem to have some deeper knowledge of this stuff,
though, so this might be no problem at all.
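
For reference, the naive approach for a single tag could look roughly
like the Python sketch below. It is a simplified variant that only looks
at tokens present in a package and treats each one independently, which
is where the O(n) comes from. The class name NaiveBayesTag, its methods
and the add-one smoothing are assumptions for illustration; this is not
how bmf works internally.

```python
import math
from collections import Counter


class NaiveBayesTag:
    """Naive Bayes for one tag: does a package carry it or not?"""

    def __init__(self):
        # Number of training packages containing each token, per class.
        self.token_counts = {True: Counter(), False: Counter()}
        # Number of training packages per class.
        self.doc_counts = {True: 0, False: 0}

    def train(self, tokens, has_tag):
        self.doc_counts[has_tag] += 1
        # Count each token once per package, however often it occurs.
        self.token_counts[has_tag].update(set(tokens))

    def log_odds(self, tokens):
        """Return a score; positive means the tag is the more likely case."""
        # Prior odds, smoothed so that an empty class cannot give log(0).
        score = math.log((self.doc_counts[True] + 1) / (self.doc_counts[False] + 1))
        for token in set(tokens):
            # Smoothed estimate of P(token present | class).
            p_yes = (self.token_counts[True][token] + 1) / (self.doc_counts[True] + 2)
            p_no = (self.token_counts[False][token] + 1) / (self.doc_counts[False] + 2)
            # Independence assumption: each token just adds its own term.
            score += math.log(p_yes / p_no)
        return score
```

Training on packages both with and without the tag is what lets a token
that is characteristic for two cases end up with a score near zero.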

>   To understand recursion you first need to understand recursion.   //\
>   Wo befreundete Wege zusammenlaufen, da sieht die ganze Welt für   V_/_
>         eine Stunde wie eine Heimat aus. --- Herrmann Hesse
*g*
Well, it seems that these two are not translations of each other :-)

Greetings Ben