[Debtags-devel] AI::Categorizer perl module

Enrico Zini enrico@enricozini.org
Sun, 10 Oct 2004 12:40:04 +0200


Hello,

I just found this, but I still haven't looked into it:

http://search.cpan.org/~kwilliams/AI-Categorizer-0.07/lib/AI/Categorizer.pm

NAME

AI::Categorizer - Automatic Text Categorization

DESCRIPTION

AI::Categorizer is a framework for automatic text categorization. It
consists of a collection of Perl modules that implement common
categorization tasks, and a set of defined relationships among those
modules. The various details are flexible - for example, you can choose
what categorization algorithm to use, what features (words or otherwise)
of the documents should be used (or how to automatically choose these
features), what format the documents are in, and so on.

The basic process of using this module will typically involve obtaining
a collection of pre-categorized documents, creating a "knowledge set"
representation of those documents, training a categorizer on that
knowledge set, and saving the trained categorizer for later use. There
are several ways to carry out this process. The top-level
AI::Categorizer module provides an umbrella class for high-level
operations, or you may use the interfaces of the individual classes in
the framework.
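
To make the process described above concrete, here is a minimal sketch of the same four steps (collect pre-categorized documents, build a "knowledge set", train, save) in plain Python. It does not use AI::Categorizer or reproduce its API: the function names, the sample package descriptions and the choice of a small naive Bayes classifier are all assumptions made for illustration.

```python
import math
import pickle
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: features are simply lowercase whitespace-separated words.
    return text.lower().split()


def build_knowledge_set(labeled_docs):
    """Turn (text, category) pairs into per-category word and document counts."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in labeled_docs:
        word_counts[category].update(tokenize(text))
        doc_counts[category] += 1
    return {"word_counts": dict(word_counts), "doc_counts": dict(doc_counts)}


def train(knowledge_set):
    """Derive a naive Bayes model (log priors plus counts) from the knowledge set."""
    vocab = set()
    for counts in knowledge_set["word_counts"].values():
        vocab.update(counts)
    total_docs = sum(knowledge_set["doc_counts"].values())
    return {
        "vocab": vocab,
        "word_counts": knowledge_set["word_counts"],
        "totals": {c: sum(w.values()) for c, w in knowledge_set["word_counts"].items()},
        "log_priors": {c: math.log(n / total_docs)
                       for c, n in knowledge_set["doc_counts"].items()},
    }


def categorize(model, text):
    """Return the most probable category, using add-one smoothing."""
    scores = {}
    for category, log_prior in model["log_priors"].items():
        counts = model["word_counts"][category]
        denom = model["totals"][category] + len(model["vocab"])
        score = log_prior
        for word in tokenize(text):
            if word in model["vocab"]:  # words never seen in training are skipped
                score += math.log((counts.get(word, 0) + 1) / denom)
        scores[category] = score
    return max(scores, key=scores.get)


def save_model(model, path):
    # Persist the trained categorizer for later use.
    with open(path, "wb") as handle:
        pickle.dump(model, handle)


def load_model(path):
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Hypothetical pre-categorized documents (invented package descriptions).
training_docs = [
    ("mail client with imap support", "mail"),
    ("smtp mail transfer agent", "mail"),
    ("web browser with tabs", "web"),
    ("http web server", "web"),
]
model = train(build_knowledge_set(training_docs))
```

With this toy data, categorize(model, "imap mail reader") picks "mail". A real corpus would need far more documents per category and, as the disclaimer in the description says, human review of the results.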

A simple sample script that reads a training corpus, trains a
categorizer, and tests the categorizer on a test corpus, is distributed
as eg/demo.pl.

Disclaimer: the results of any of the machine learning algorithms are
far from infallible (close to fallible?). Categorization of documents is
often a difficult task even for humans well-trained in the particular
domain of knowledge, and there are many things a human would consider
that none of these algorithms consider. These are only statistical tests
- at best they are neat tricks or helpful assistants, and at worst they
are totally unreliable. If you plan to use this module for anything
really important, human supervision is essential, both of the
categorization process and the final results.


Ciao,

Enrico

--
GPG key: 1024D/797EBFAB 2000-12-05 Enrico Zini <enrico@debian.org>
