[Debtags-devel] AI for tag generation

Benjamin Mesing bensmail@gmx.net
Wed, 29 Sep 2004 16:14:55 +0200


--=-3ez8kQmFTS/hyhRxEmlL
Content-Type: text/plain
Content-Transfer-Encoding: 7bit

Hello,

I have taken a closer look at creating tags via Bayesian filtering
(BF). I have still only scratched the surface, so I can offer only
some guesses. BF looks useful, and I have found a little tool which
classifies arbitrary text as spam or not spam:
http://sourceforge.net/projects/bmf (there is even a Debian package,
which makes it quite appealing :-)
I have played around with it briefly, defining packages with the
security::firewall tag to be spam and all others not. As this tool
only supports binary queries, this is all one can get. The results were
encouraging: even with the small training set I used (perhaps 10 for
spam and 10 for non-spam), and feeding it only the descriptions, the
test packages yielded quite good results.
I will list some points that I think we need to consider if we really
want to use BF for tagging:
     1. We need to find something more efficient than creating a
        different training database for each tag (I guess there are
        other BF approaches for categorization which might come in
        handy). I also think that, in contrast to the spam approach,
        the non-spam training is quite useless, but I might be wrong
        there. Actually we want to do an x->y mapping, where x is the
        input text and y is a vector of all tags for the package. This
        strongly reminds me of hetero-associative memory using
        perceptrons, but that might be more complicated than Bayesian
        filters.
     2. It might be useful to have a database for the packages mapped to
        their text and tags so one could efficiently access the data for
        training.
     3. We need to define which information should go into the text
        used for the BF for each package, e.g. the description, the
        dependencies, the maintainer, but perhaps also the names of
        the files it contains, and so on.
     4. For some tags we still need larger training sets, as some have
        only a few packages tagged with them, which would make the
        training quite a joke.
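
To make point 1 a bit more concrete, here is a minimal sketch in Python
of a naive Bayes scorer that keeps one shared word-count table for all
tags instead of one training database per tag. It is only an
illustration under assumptions, not how bmf works: the names TagScorer,
tokenize and EXAMPLES, the tokenization rule, the add-one smoothing and
the example texts are all made up here. For each tag, every package
that does not carry the tag serves as the "non-spam" side, so no
separate negative training run is needed.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class TagScorer:
    """Hypothetical helper: one shared word-count table for all tags."""

    def __init__(self):
        self.total_docs = 0                      # packages seen so far
        self.tag_docs = Counter()                # tag -> number of packages
        self.all_words = Counter()               # word -> count, all packages
        self.word_counts = defaultdict(Counter)  # tag -> word -> count

    def train(self, text, tags):
        """Add one package text together with all of its tags."""
        words = tokenize(text)
        self.total_docs += 1
        self.all_words.update(words)
        for tag in tags:
            self.tag_docs[tag] += 1
            self.word_counts[tag].update(words)

    def score(self, text, tag):
        """Log-odds that `tag` applies; above 0 means 'rather yes'."""
        n_in = self.tag_docs[tag]
        n_out = self.total_docs - n_in
        result = math.log((n_in + 1) / (n_out + 1))  # smoothed prior odds
        in_counts = self.word_counts.get(tag, Counter())
        total_in = sum(in_counts.values())
        total_out = sum(self.all_words.values()) - total_in
        vocab = len(self.all_words)
        for word in tokenize(text):
            if word not in self.all_words:
                continue  # words never seen in training carry no evidence
            # Add-one smoothed word probabilities inside and outside the tag.
            p_in = (in_counts[word] + 1) / (total_in + vocab)
            out_count = self.all_words[word] - in_counts[word]
            p_out = (out_count + 1) / (total_out + vocab)
            result += math.log(p_in / p_out)
        return result

    def suggest(self, text, threshold=0.0):
        """Return all known tags scoring above `threshold`, best first."""
        scored = [(self.score(text, tag), tag) for tag in list(self.tag_docs)]
        return [tag for s, tag in sorted(scored, reverse=True) if s > threshold]


# Made-up training data, for illustration only.
EXAMPLES = [
    ("iptables firewall rules packet filter", ["security::firewall"]),
    ("firewall configuration tool for packet filtering", ["security::firewall"]),
    ("arcade game with sound and levels", ["game::arcade"]),
    ("fast arcade shooter game", ["game::arcade"]),
]
```

After calling train() for each entry in EXAMPLES, suggest() on a new
description returns the vector of tags whose log-odds are positive,
which is the x->y mapping described in point 1. With training sets as
small as the ones mentioned in point 4, the scores would of course be
very unreliable.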

I have written a small application which collects some useful
information about the packages and discards unnecessary things. It is a
quick and dirty hack and would perhaps best be written in Perl, as it
mainly does text processing, but as I could reuse the code from my
program I wrote it in C++ using Qt. The program writes all information
about each package to a text file, along with its tags, which could be
used to feed the BF for training. I have attached it to this mail so
you can try it. Currently the results are written to ./test.txt
Take a look at the Makefile, adjust it and run make to build the tool.
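
As a rough illustration of how such a text file could feed a training
database (point 2 in the list above), here is a small Python sketch.
The layout of ./test.txt is not described in this mail, so the record
format below (blank-line separated stanzas with Package, Tags and Text
fields), the table names, and the functions parse_records,
build_database and texts_for_tag are all assumptions made up for this
example. The sample package names are invented as well.

```python
import re
import sqlite3

# Made-up sample in an ASSUMED record format, not the tool's real output.
SAMPLE = """\
Package: examplefw
Tags: security::firewall, admin::configuring
Text: Example firewall configuration tool

Package: examplegame
Tags: game::arcade
Text: Example arcade game
"""


def parse_records(raw):
    """Split blank-line separated stanzas into (package, tags, text) tuples."""
    records = []
    for stanza in re.split(r"\n\s*\n", raw.strip()):
        fields = {}
        for line in stanza.splitlines():
            # Split at the first colon only, so "::" inside tags survives.
            key, _, value = line.partition(":")
            fields[key.strip().lower()] = value.strip()
        tags = [t.strip() for t in fields.get("tags", "").split(",") if t.strip()]
        records.append((fields["package"], tags, fields.get("text", "")))
    return records


def build_database(records):
    """Load the records into an in-memory SQLite database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE package (name TEXT PRIMARY KEY, text TEXT)")
    db.execute("CREATE TABLE package_tag (name TEXT, tag TEXT)")
    for name, tags, text in records:
        db.execute("INSERT INTO package VALUES (?, ?)", (name, text))
        db.executemany(
            "INSERT INTO package_tag VALUES (?, ?)",
            [(name, tag) for tag in tags],
        )
    return db


def texts_for_tag(db, tag):
    """Return the training texts of all packages carrying `tag`."""
    rows = db.execute(
        "SELECT p.text FROM package p "
        "JOIN package_tag t ON p.name = t.name "
        "WHERE t.tag = ? ORDER BY p.name",
        (tag,),
    )
    return [row[0] for row in rows]
```

Calling build_database(parse_records(SAMPLE)) and then texts_for_tag()
for a given tag yields the texts to train that tag with. Keeping the
tags in a separate table makes it cheap to ask how many packages carry
each tag, which helps to spot the tags with too little training data.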

If all issues mentioned above are solved, we could write a little GUI
for doing the actual training and testing the result of the BF for new
packages.

Those are my thoughts so far.

Greetings Ben

P.S. Why must informatics be so cross-connected? AI, DB, usability and
categorization. How are we supposed to be experts in all this stuff :-)

--=-3ez8kQmFTS/hyhRxEmlL
Content-Disposition: attachment; filename=informationcollector.tar.bz2
Content-Type: application/x-bzip-compressed-tar; name=informationcollector.tar.bz2
Content-Transfer-Encoding: base64


--=-3ez8kQmFTS/hyhRxEmlL--