[Debtags-devel] Another automatic classification library

Enrico Zini enrico@enricozini.org
Sun, 6 Feb 2005 00:08:21 +0100


Hello,

I feel stupid for having missed it so far:

$ apt-cache show libbow
[...]
Description: Bag of Words Library
 `Libbow' is a library of C code intended for writing statistical
 text-processing programs. Provided in the library distribution,
 there are currently three executable programs based on it.
    * Rainbow is a  program that does document classification.
      While mostly designed for classification by naive
      Bayes, it also provides TFIDF/Rocchio, Probabilistic Indexing and
      K-nearest neighbor.
    * Arrow an Altavista-like program for document retrieval. It
      currently only performs simple TFIDF-based retrieval.
    * Crossbow: a program for document clustering (and also classification).
 .
 Homepage: http://www.cs.cmu.edu/~mccallum/bow/

This library was suggested almost 2 years ago by Javier Fernandez Sanguino
Peña, but at that time my brain was unable to digest the description.  Now,
thanks to Benjamin's efforts at Bayesian tag inference, I can, and I feel
stupid.
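
For those who, like me, needed a while to digest the idea: here is a toy
sketch of naive Bayes classification over package descriptions.  It is not
what Rainbow or Benjamin's code actually does, just the textbook
multinomial variant with add-one smoothing; the labels and descriptions
are made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train(examples):
    """Build a model from a list of (label, text) pairs."""
    doc_counts = Counter()              # number of documents per label
    word_counts = defaultdict(Counter)  # word frequencies per label
    vocab = set()
    for label, text in examples:
        words = tokenize(text)
        doc_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return doc_counts, word_counts, vocab


def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    # Words never seen in training carry no information, so drop them
    words = [w for w in tokenize(text) if w in vocab]
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in words:
            # Add-one (Laplace) smoothing avoids zero probabilities
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data: (hypothetical tag, description)
examples = [
    ("mail", "text based mail reader"),
    ("mail", "mail transport agent"),
    ("image", "image manipulation program"),
    ("image", "fast image viewer"),
]
model = train(examples)
print(classify(model, "small mail filter"))
print(classify(model, "image editor"))
```

With real data one would train on the descriptions of already tagged
packages and ask for a tag suggestion on the untagged ones.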

The library comes with some binaries (even if the name doesn't suggest it)
which seem to be worth trying.

So far I tried this script to split the /var/lib/dpkg/available file into
one file per package:

-------------------
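#!/usr/bin/perl -w

use strict;
use warnings;

# Slurp the whole input at once instead of reading it line by line
undef $/;

my $in = <>;

# Records in the available file are separated by blank lines
for my $rec (split(/\n\n+/, $in))
{
	# Skip anything that does not look like a package record
	next unless $rec =~ /^Package: (\S+)/m;
	my $pkg = $1;
	# The pkgs/ directory must already exist
	open my $out, '>', "pkgs/$pkg" or die "Can't open pkgs/$pkg: $!";
	print $out "$rec\n";
	close $out;
}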
-------------------

Then I ran "crossbow -i pkgs": it said it indexed things.

I then ran "crossbow -c": it went through 207 iterations printing
incomprehensible stuff, and then it finished.  I have no idea how to read
the output.  "crossbow --classify" segfaults.

Alternatively, I did "arrow -i pkgs", then with "arrow -q" I can do nice
queries on the packages.
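
To give an idea of what "simple TFIDF-based retrieval" means, here is a
minimal sketch of it.  It only assumes the usual TF-IDF weighting with
cosine similarity and makes no claim about how arrow is really
implemented; the three miniature "descriptions" are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Return per-document TF-IDF vectors and the IDF table.

    docs maps a document name to its text.
    """
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    df = Counter()  # in how many documents each term appears
    for c in counts.values():
        df.update(c.keys())
    n = len(docs)
    idf = {term: math.log(n / d) + 1.0 for term, d in df.items()}
    vectors = {
        name: {t: tf * idf[t] for t, tf in c.items()}
        for name, c in counts.items()
    }
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def query(vectors, idf, text):
    """Return the names of matching documents, best match first."""
    terms = Counter(tokenize(text))
    # Terms that appear in no document cannot match anything
    q = {t: tf * idf[t] for t, tf in terms.items() if t in idf}
    ranked = [(cosine(q, v), name) for name, v in vectors.items()]
    return [name for score, name in sorted(ranked, reverse=True) if score > 0]


# Invented miniature "package descriptions", only for illustration
docs = {
    "mutt": "text based mail reader with threading support",
    "gimp": "image manipulation program for photo editing",
    "feh": "fast image viewer for the console user",
}
vectors, idf = build_index(docs)
print(query(vectors, idf, "image viewer"))
```

On the real data, docs would be filled from the files in pkgs/ produced
by the script above.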

"archer -i" and "archer -q 'altavista-style query'" also work nicely.

It sounds quite interesting: who would like to play with it?


Ciao,

Enrico

--
GPG key: 1024D/797EBFAB 2000-12-05 Enrico Zini <enrico@enricozini.org>
