[Debtags-devel] AI for tag generation

Erich Schubert <erich.schubert@gmail.com>
Wed, 29 Sep 2004 17:34:35 +0200


Hi,

>      1. We need to find something more efficient than to create a
>         different training database for each tag (I guess there are
>         other BF approaches for categorization which might come in
>         handy). I also think that in contrast to the spam approach the

Well, most do an either-or categorization into two parts. But we
basically need this about 1000 times over (we have 432 tags right now,
~30 of which are special)
- usually this means the whole process will take 1000 times longer
(which isn't that much; we don't need to process thousands of packages
an hour the way spam filters have to)
From a pure complexity standpoint, we should actually be able to afford
one level of combined statistics.
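
To make that concrete, here is a rough sketch (Python, names invented,
not actual debtags code) of what "one binary filter per tag" could look
like - the token counting is shared work, only the per-tag counts differ:

  from collections import defaultdict

  class TagFilter:
      """One binary has-tag / lacks-tag Bayesian filter."""
      def __init__(self):
          self.with_tag = defaultdict(int)     # token -> count in packages carrying the tag
          self.without_tag = defaultdict(int)  # token -> count in all other packages
          self.n_with = 0
          self.n_without = 0

      def train(self, tokens, has_tag):
          if has_tag:
              self.n_with += 1
              counts = self.with_tag
          else:
              self.n_without += 1
              counts = self.without_tag
          for t in tokens:
              counts[t] += 1

  # ~432 of these, one per tag; training them all is still a single pass
  # over the package data, just with 432 counters being updated.
  filters = defaultdict(TagFilter)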

>         non-spam training is quite useless, but I might be wrong there.

You are wrong there. The negative training becomes important when it
comes to differentiating similar terms, i.e. when one word is
characteristic of two cases.
For example, "mail" is characteristic of both SMTP and POP3.
Mentioning "mail" increases the likelihood that an application is an
SMTP daemon; mentioning "pop3" at the same time decreases this
likelihood - and using "smtp" itself will of course try to make up for
that...

>         Actually we want to do a x->y mapping where x is the input text
>         and y is a vector of all tags for the package which strongly
>         reminds me of hetero associative memory using perceptrons - but
>         this might be more complicated than bayesian filters.

It's not too different. The strength of Bayesian filters is their
efficiency and the ease of training.

>      2. It might be useful to have a database for the packages mapped to
>         their text and tags so one could efficiently access the data for
>         training.

To do proper training we need a set of packages which are carefully tagged.
If you train your filter on incorrect data, it will perform much worse.
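
Something along these lines (a sketch only - the file name and the
"package: tag1, tag2" line format are assumptions, and get_package_text()
is a placeholder) would give us the package -> (text, tags) mapping:

  def load_tags(path):
      tags = {}
      with open(path) as f:
          for line in f:
              if ":" not in line:
                  continue
              pkg, taglist = line.split(":", 1)
              tags[pkg.strip()] = {t.strip() for t in taglist.split(",") if t.strip()}
      return tags

  # training_set = {pkg: (get_package_text(pkg), tagset)
  #                 for pkg, tagset in load_tags("tags-reviewed").items()}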

>      3. We need to define which information should occur in the text for
>         the package used for the BF, e.g. the description, the
>         dependencies, the maintainer, but perhaps also the filenames which
>         occur, and others.

With Bayesian filters, the best approach is to use just *everything*
and have the filter learn which parts are useless (for example the
"Package:" word).
Well, dropping the record markers is probably a good idea, i.e. s/^[^ ]*://

It is more important to define what makes a good token. For example,
we should not use the dash as a token separator (so that package names
are caught as single tokens).
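
Roughly this kind of tokenizer is what I have in mind (just a sketch):
drop the "Field:" markers, lowercase, and split on everything except
letters, digits, dots, pluses and dashes, so names like "libgtk2.0-dev"
stay in one piece:

  import re

  def tokenize(record):
      tokens = []
      for line in record.splitlines():
          line = re.sub(r"^[^ ]*:", "", line)        # s/^[^ ]*:// - drop record markers
          tokens += re.findall(r"[a-z0-9.+-]+", line.lower())
      return tokens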

>      4. For some tags we still need larger training sets as some only
>         have few packages tagged with them, which would make the
>         training quite a joke.

Yes, this is the difficult part: we have very little training data. To
get a good spam filter you should feed it at least a hundred spam and a
hundred non-spam mails.

I'm wondering whether we should actually not take the "naive Bayes"
approach, but go one level of complexity further (note: this increases
complexity from O(n) to O(n^2) in the number of tokens). Maybe we'll
need to add a significance value, too, for dealing with rarely
encountered tokens and tags.

For those who don't know about the "naive Bayes" thing:
The usual Bayesian approach used in spam filters discards any correlation.
I.e. if "George" implies a likelihood of .8 that the mail talks about
George Bush, and "Bush" implies a likelihood of .9, then the combined
likelihood is assumed to follow from the product .8*.9 (suitably
normalized).
This is clearly wrong from a statistical point of view (the occurrence
of "George" and "Bush" in mails dealing with "George Bush" certainly
correlates). But handling the full - statistically correct - amount of
data would make the process way too slow, while only improving the
quality a little.
This of course applies to any natural text - well, to just about any
information.
Since processing times of a few seconds, even minutes, would be okay
for our approach, we could also calculate scores for word combinations,
i.e. we would have a likelihood for "George" AND "Bush" as well as for
only "George" and only "Bush".

Greetings,
Erich Schubert
--
    erich@(mucl.de|debian.org)      --      GPG Key ID: 4B3A135C    (o_
  To understand recursion you first need to understand recursion.   //\
  Wo befreundete Wege zusammenlaufen, da sieht die ganze Welt für     V_/_
        eine Stunde wie eine Heimat aus. --- Hermann Hesse