[Debtags-devel] Another automatic classification library
Enrico Zini
enrico@enricozini.org
Sun, 6 Feb 2005 00:08:21 +0100
Hello,
I feel stupid for having missed it so far:
$ apt-cache show libbow
[...]
Description: Bag of Words Library
`Libbow' is a library of C code intended for writing statistical
text-processing programs. Provided in the library distribution,
there are currently three executable programs based on it.
* Rainbow is a program that does document classification.
While mostly designed for classification by naive
Bayes, it also provides TFIDF/Rocchio, Probabilistic Indexing and
K-nearest neighbor.
* Arrow an Altavista-like program for document retrieval. It
currently only performs simple TFIDF-based retrieval.
* Crossbow: a program for document clustering (and also classification).
.
Homepage: http://www.cs.cmu.edu/~mccallum/bow/
This library was suggested almost 2 years ago by Javier Fernandez Sanguino
Peña, but at that time my brain was unable to digest the description. Now,
thanks to Benjamin's efforts at Bayesian tag inference, I can and I feel
stupid.
The library comes with some binaries (even if the name doesn't suggest it)
which seem to be worth trying.
So far I tried this script to split the /var/lib/dpkg/available file into
one file per package:
-------------------
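#!/usr/bin/perl -w
use strict;
use warnings;

# Slurp the whole input in one go.
undef $/;
my $in = <>;

# Records are separated by blank lines; the pkgs/ directory must already exist.
for my $rec (split(/\n\n+/, $in))
{
    # Skip any record that has no Package: field.
    next unless $rec =~ /^Package: (\S+)/m;
    open(my $out, '>', "pkgs/$1") or die "Can't open pkgs/$1: $!";
    print $out $rec;
    close $out;
}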
-------------------
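The same split can also be sketched in shell, using awk in paragraph mode
(empty RS, so records are separated by blank lines). This is only a rough
sketch: the function name split_available and its two arguments are invented
here for illustration, and it assumes that every record of interest has a
"Package: " field at the start of a line.

```shell
#!/bin/sh
# Sketch: split a dpkg-style control file into one file per package.
# split_available is a hypothetical helper name, not part of any package.
split_available() {
    in=$1    # input file, e.g. /var/lib/dpkg/available
    out=$2   # output directory, created if missing
    mkdir -p "$out" || return 1
    awk -v RS= -v out="$out" '
    {
        # Look for the Package: field among the lines of this record.
        name = ""
        n = split($0, lines, "\n")
        for (i = 1; i <= n; i++) {
            if (lines[i] ~ /^Package: /) {
                name = substr(lines[i], 10)      # text after "Package: "
                sub(/[ \t]+$/, "", name)         # drop trailing blanks
                break
            }
        }
        if (name == "") next                     # no Package: field, skip
        file = out "/" name
        print $0 > file
        close(file)                              # avoid running out of fds
    }' "$in"
}

# Run only when called with exactly two arguments: INPUT OUTDIR
if [ "$#" -eq 2 ]; then
    split_available "$1" "$2"
fi
```

Saved under any name, it would be run as
"sh thescript /var/lib/dpkg/available pkgs", after which "pkgs" can be fed to
the indexing commands.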
Then I ran "crossbow -i pkgs": it said it indexed things.
I then ran "crossbow -c": it went through 207 iterations printing
incomprehensible stuff, then finished. I have no idea how to read the
output. "crossbow --classify" segfaults.
Alternatively, I did "arrow -i pkgs", then with "arrow -q" I can do nice
queries on the packages.
"archer -i" and "archer -q 'altavista-style query'" also work nicely.
It sounds quite interesting: who would like to play with it?
Ciao,
Enrico
--
GPG key: 1024D/797EBFAB 2000-12-05 Enrico Zini <enrico@enricozini.org>