[Debtags-devel] AI for tag generation

Enrico Zini zinie@cs.unibo.it
Wed, 29 Sep 2004 19:22:35 +0200



On Wed, Sep 29, 2004 at 05:34:35PM +0200, Erich Schubert wrote:

> It is more important to define what makes up for good tokens. For
> example, we should not use a dash as token-separator (to catch package
> names as tokens)

Or we can decide that it makes more sense to split package names, so
that gnome-session matches "gnome" and "session", which is not bad.
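
To make the two options concrete, here is a minimal sketch of a
tokenizer that supports both. The function name tokenize and the
split_dashes switch are made up for illustration; nothing like this
exists in debtags yet.

```python
import re

def tokenize(text, split_dashes=True):
    """Return lowercase word tokens from a description.

    split_dashes=True  : "gnome-session" -> "gnome", "session"
    split_dashes=False : "gnome-session" stays a single token
    (Hypothetical helper, only meant to compare the two choices.)
    """
    if split_dashes:
        pattern = r"[a-z0-9]+"
    else:
        # Allow dashes inside a token, but not at its start or end.
        pattern = r"[a-z0-9]+(?:-[a-z0-9]+)*"
    return re.findall(pattern, text.lower())

print(tokenize("gnome-session starts GNOME"))
print(tokenize("gnome-session starts GNOME", split_dashes=False))
```

With splitting, the package name contributes "gnome" and "session" as
separate evidence; without it, only an exact "gnome-session" counts.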


> Yes, this is the difficult part, that we do have very little training
> data. To get a good spam filter you should feed it like hundred spam
> and nonspam mails at least.

We could add a special::complete tag to the packages we have explicitly
tagged with extra care, evaluating every facet we have, and then use all
of these to train the Bayesian filter (stripping the special tag, of
course ;).
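
A rough sketch of that selection step, assuming the tag database can be
loaded as a dict mapping package name to (description, set of tags).
That layout, the function name training_set and the sample tags other
than special::complete are assumptions for illustration only.

```python
COMPLETE = "special::complete"

def training_set(packages):
    """Keep only packages marked complete, with the marker stripped.

    packages: dict of name -> (description, set of tags).
    This layout is hypothetical, not an existing debtags format.
    """
    selected = {}
    for name, (description, tags) in packages.items():
        if COMPLETE in tags:
            # The marker itself must not become something to learn.
            selected[name] = (description, tags - {COMPLETE})
    return selected

# Made-up sample data.
packages = {
    "foo": ("a mail reader", {"example::mail", COMPLETE}),
    "bar": ("an image viewer", {"example::image"}),
}
print(training_set(packages))
```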

At the beginning, we'll have 10 complete packages, giving a bad robot.
Then, as time goes on and we tag more packages better, the robot will
grow with us.  The nice thing is that the robot will grow without us
putting any special effort into it, beyond continuing to do what we
already do.


As a side note, it struck me that if we make a web interface to the
filter, it could also work the other way round: when I'm writing a
package description, I can see which tags it would be connected to,
given the whole metadata database, and check whether my description is
good enough or whether I made mistakes that suggest something else.
Using automatic tags to test package descriptions...  I find it
fascinating!
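
As a toy illustration of checking a draft description this way, here is
a very small naive Bayes sketch: for each tag it compares how likely the
draft's words are among packages carrying the tag versus the rest, with
add-one smoothing. The function names, the data layout and the example
tags are all invented here; a real filter would need far more training
data and tuning.

```python
import math
import re
from collections import Counter

def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def train(examples):
    """examples: list of (description, set of tags). Hypothetical layout."""
    with_tag = {}           # tag -> token counts in descriptions with the tag
    all_tokens = Counter()  # token counts over every description
    tag_docs = Counter()    # tag -> number of descriptions carrying it
    for description, tags in examples:
        toks = tokens(description)
        all_tokens.update(toks)
        for tag in tags:
            with_tag.setdefault(tag, Counter()).update(toks)
            tag_docs[tag] += 1
    return {"with_tag": with_tag, "all": all_tokens,
            "tag_docs": tag_docs, "n": len(examples)}

def suggest(model, description, threshold=0.0):
    """Return tags whose log-odds for this description exceed threshold."""
    vocab = len(model["all"]) or 1
    draft = tokens(description)
    scores = {}
    for tag, counts in model["with_tag"].items():
        without = model["all"] - counts  # counts in descriptions lacking the tag
        n_with = sum(counts.values())
        n_without = sum(without.values())
        docs_with = model["tag_docs"][tag]
        docs_without = model["n"] - docs_with
        # Smoothed prior odds, then smoothed per-token likelihood ratios.
        score = math.log((docs_with + 1) / (docs_without + 1))
        for tok in draft:
            p_with = (counts[tok] + 1) / (n_with + vocab)
            p_without = (without[tok] + 1) / (n_without + vocab)
            score += math.log(p_with / p_without)
        scores[tag] = score
    good = [tag for tag, score in scores.items() if score > threshold]
    return sorted(good, key=lambda tag: -scores[tag])

# Made-up training data with made-up tags.
examples = [
    ("mail client for reading email", {"example::mail"}),
    ("email reader with imap", {"example::mail"}),
    ("image viewer for photos", {"example::image"}),
    ("photo editor for image files", {"example::image"}),
]
model = train(examples)
print(suggest(model, "a small email reader"))
```

If a draft description comes back with tags the author did not expect,
that is a hint the wording points somewhere else.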


Ciao,

Enrico

--
GPG key: 1024D/797EBFAB 2000-12-05 Enrico Zini <enrico@debian.org>
