[Debtags-devel] Using Python NLTK for tag generation [was: AI for tag generation].

Erich Schubert <erich.schubert@gmail.com>
Fri, 1 Oct 2004 22:53:21 +0200


Hi,

> - Split the package descriptions (corpora) of already tagged packages in
> tokens;

A regular expression can do that, too, or just a normal tokenizer. No
need for NLTK for this.
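As a rough sketch of what I mean (Python only because that is what the
thread is about; the same few lines work in Perl or C++), using nothing but
the standard `re` module. The token pattern and the name `tokenize` are my
own assumptions, not anything Debtags prescribes:

```python
import re

# Assumed token definition: runs of letters/digits, optionally joined by
# '+', '.' or '-' (so "gtk-based" stays one token). Adjust to taste.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:[+.-][a-z0-9]+)*")

def tokenize(description):
    """Split a package description into lowercase tokens."""
    return TOKEN_RE.findall(description.lower())

print(tokenize("A fast, GTK-based mail client."))
```

Punctuation and case are thrown away here; whether that is the right
choice depends on how the tokens are used afterwards.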

> - Associate these tokens with their tags;

Trivial to do using hashes (associative arrays) in C (glib),
C++ (STL), or Perl.
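For illustration, a minimal sketch of such a token-to-tag table, again in
plain Python with no NLTK. The function names, the simple tokenizer, and
the example tags are all made up for this sketch; a real run would read
descriptions and tags from the package and Debtags databases:

```python
import re
from collections import Counter, defaultdict

def tokenize(description):
    # Deliberately simple tokenizer (an assumption): lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", description.lower())

def build_token_tag_counts(tagged_packages):
    """Map each token to a Counter of the tags it co-occurs with.

    tagged_packages is an iterable of (description, tags) pairs.
    """
    counts = defaultdict(Counter)
    for description, tags in tagged_packages:
        # set(): count each token at most once per package.
        for token in set(tokenize(description)):
            counts[token].update(tags)
    return counts

# Hypothetical sample data, not real Debtags entries.
packages = [
    ("A mail client for GNOME", ["mail", "gui"]),
    ("Console mail reader", ["mail", "console"]),
]
counts = build_token_tag_counts(packages)
print(counts["mail"].most_common())
```

The whole thing is a hash of hashes, which is why I consider this step
trivial in any of the languages above.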

> Why hard work writing grammar rules are needed here?

We don't care about natural language. In fact, we are very interested in
the other metadata, such as dependencies, as well.

From what I can tell, NLTK has dozens of features we don't need or want
to use; the remaining parts are trivial to do yourself. So I don't see
much to gain here (except that we would need to use Python and introduce
a dependency...)

If you look at the URL you posted before, around listing 7 it moves on
to real natural language processing, i.e. classifying words into types
(nouns, attributes...) and then constructing trees from them using
grammar rules.

Gruß,
Erich Schubert
--
    erich@(mucl.de|debian.org)      --      GPG Key ID: 4B3A135C    (o_
  To understand recursion you first need to understand recursion.   //\
  Wo befreundete Wege zusammenlaufen, da sieht die ganze Welt für   V_/_
        eine Stunde wie eine Heimat aus. --- Herrmann Hesse