[Debtags-devel] Re: Do we need better documentation about our subsections?

Erich Schubert <erich.schubert@gmail.com>
Mon, 27 Sep 2004 02:57:04 +0200


Hi,
I have written a Bayesian network simulator before.
If we use the usual "naive" approach employed by spam filters
(which is not mathematically correct, since it treats the words as
independent of each other), this shouldn't be too hard...

The basic concept is this: for each word (token, whatever) in the
package metadata, you count for each tag how many packages containing
that word have the tag and how many do not have it.
If the correlation is high, you add the tag. For example, if 95% of the
packages that have the word "ocaml" in their description carry the tag
langdevel::ocaml, then an untagged package with "ocaml" in its
description gets the tag langdevel::ocaml as well.
To combine the evidence, you multiply the correlation values of all the
words you looked at.
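
To make the counting and multiplying concrete, here is a rough sketch in
Python. It is only an illustration under some assumptions of mine, not
tested against the real tag database: the function names, the +1/+2
smoothing (so a word seen once does not give 0% or 100%), and the
spam-filter style normalisation of the product are all my own choices,
and the sample packages are made up.

```python
from collections import defaultdict

def train(packages):
    """Count per word how many packages contain it, and how many of
    those carry each tag. `packages` is a list of (words, tags) pairs."""
    word_total = defaultdict(int)
    word_tag = defaultdict(int)  # keyed by (word, tag)
    all_tags = set()
    for words, tags in packages:
        all_tags.update(tags)
        for word in set(words):
            word_total[word] += 1
            for tag in tags:
                word_tag[(word, tag)] += 1
    return {"word_total": dict(word_total),
            "word_tag": dict(word_tag),
            "tags": all_tags}

def word_probability(model, word, tag):
    """Smoothed share of packages containing `word` that have `tag`.
    Returns None for words never seen in training."""
    total = model["word_total"].get(word, 0)
    if total == 0:
        return None
    having = model["word_tag"].get((word, tag), 0)
    return (having + 1) / (total + 2)  # assumed smoothing, not from the mail

def tag_score(model, words, tag):
    """Multiply the per-word values and normalise against the product
    of their complements (the usual naive spam-filter combination)."""
    yes, no, seen = 1.0, 1.0, False
    for word in set(words):
        p = word_probability(model, word, tag)
        if p is None:
            continue  # unknown words carry no evidence
        seen = True
        yes *= p
        no *= 1.0 - p
    if not seen or yes + no == 0.0:
        return None
    return yes / (yes + no)

def suggest_tags(model, words, threshold=0.95):
    """Return the tags whose combined score reaches the threshold."""
    result = []
    for tag in sorted(model["tags"]):
        score = tag_score(model, words, tag)
        if score is not None and score >= threshold:
            result.append(tag)
    return result

# Made-up training data: (words in the description, tags of the package).
PACKAGES = [
    ({"ocaml", "compiler"}, {"langdevel::ocaml"}),
    ({"ocaml", "library"}, {"langdevel::ocaml"}),
    ({"ocaml", "bindings", "library"}, {"langdevel::ocaml"}),
    ({"python", "library"}, {"langdevel::python"}),
]
MODEL = train(PACKAGES)
# The threshold is lowered here because four packages are far too few
# to reach 95% with smoothing.
print(suggest_tags(MODEL, {"ocaml", "bindings"}, threshold=0.8))
```

With real descriptions containing many words, the plain product can
underflow, so summing logarithms instead of multiplying would probably
be the safer variant.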

Regards,
Erich Schubert
--
    erich@(mucl.de|debian.org)      --      GPG Key ID: 4B3A135C    (o_
  To understand recursion you first need to understand recursion.   //\
  Wo befreundete Wege zusammenlaufen, da sieht die ganze Welt für   V_/_
        eine Stunde wie eine Heimat aus. --- Hermann Hesse