[Debtags-devel] First Preview

Benjamin Mesing bensmail@gmx.net
Mon, 11 Oct 2004 21:33:47 +0200


Hello,

Thanks to the encouragement from the list (thanks Thaddeus and especially
Enrico ;-), I have been working hard for the last few hours to get the
first preview ready.

The two scripts can be used to create training data for, and to test, any
tag you wish. Run
	./createTrainingSet.pl -k 40 data::font
to create the test set (a directory data__font will be created where
everything will be stored). There will be a good.list containing all
packages returned by "debtags grep data::font", and a bad list containing
every 40th (that is what the magic number -k 40 is used for...) package
returned by
	debtags grep "! data::font && ! special::not-yet-tagged"
After this, run
	./bayesian-tagger.pl data::font
and the first half of the list will be used for training and the second
half for testing, so the training set is somewhat smaller than the full
list.
There are some command line switches for bayesian-tagger.pl, notably
'-v' or '-v -v' for verbose and very verbose output. Using -nt skips
training and tests all packages. The -p|--package option does not work yet.
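
The sampling and splitting described above can be sketched roughly as follows. This is only an illustration in Python, not the code of createTrainingSet.pl or bayesian-tagger.pl: the function names are made up, and whether the real scripts start counting at the first or at the k-th package, or how they round an odd-sized list, is an assumption.

```python
def every_kth(packages, k):
    """Keep every k-th package (the k-th, 2k-th, ...) of a list.

    Assumption: counting starts so that the k-th entry is the first one kept.
    """
    return packages[k - 1::k]


def split_half(packages):
    """Return (training, testing): first half and second half of the list.

    Assumption: with an odd number of entries the extra one goes to testing.
    """
    middle = len(packages) // 2
    return packages[:middle], packages[middle:]


# Hypothetical stand-in for the packages returned by the negated query.
candidates = [f"pkg{i}" for i in range(1, 201)]

bad = every_kth(candidates, 40)
training, testing = split_half(bad)
print("bad list:", bad)
print("training:", training)
print("testing: ", testing)
```

A larger -k value thins out the bad list further, which keeps it from dwarfing the good list for rarely used tags.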

There is still much to be done before the script can really be used, but
the first results aren't too bad. Even when it is complete, I think the AI
engine will have a difficult time with some tags, while others will be
quite an easy job.
Fortunately I used the data::font tag for testing, which yielded
extraordinarily good results - which kept my motivation high *g*:
        ~/lang/perl/bayesianTagger> ./bayesian-tagger.pl data::font
        <skipped warnings>
        BAD: bad package gsfonts-x11 did match!
        BAD: bad package xfonts-konsole did match!
        BAD: bad package xfonts-bolkhov-koi8u-75dpi did match!
        BAD: bad package xfonts-intl-phonetic did match!
        BAD: bad package gmt-coast-low did match!
        BAD: bad package sylpheed-claws-i18n did match!
        
        Tested packages: 201
        Expected to be good: 15
        Expected to be bad: 186
        Matches: 195 ^= 0.970149253731343
        Mismatches: 6 ^= 0.0298507462686567
        Expected good, but wielded bad: 0 ^= 0
        Expected bad, but wielded good: 6 ^= 0.032258064516129
As you can see, the first four errors the categorizer complained about
are packages which really should be tagged, so it told us that we should
tag those with data::font :-) The last two are real errors.
Unfortunately, none of the other tags I tested produced results that
good :-(

        ~/lang/perl/bayesianTagger> ./bayesian-tagger.pl uitoolkit::gtk
        <skipped bad complaints>
        Tested packages: 716
        Expected to be good: 615
        Expected to be bad: 101
        Matches: 476 ^= 0.664804469273743
        Mismatches: 240 ^= 0.335195530726257
        Expected good, but wielded bad: 210 ^= 0.341463414634146
        Expected bad, but wielded good: 30 ^= 0.297029702970297
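
For reading the figures in the two runs above: the overall ratios are divided by the number of tested packages, while the two per-class ratios are divided by the size of the respective class. The short Python sketch below recomputes them from the counts in the output. The function name and dictionary keys are made up for illustration and are not taken from bayesian-tagger.pl.

```python
def summarize(expected_good, expected_bad, good_but_bad, bad_but_good):
    """Recompute the ratios printed in the test summary from raw counts."""
    tested = expected_good + expected_bad
    mismatches = good_but_bad + bad_but_good
    return {
        "matches": (tested - mismatches) / tested,
        "mismatches": mismatches / tested,
        # Per-class error rates use the class size as denominator.
        "expected good, but yielded bad": good_but_bad / expected_good,
        "expected bad, but yielded good": bad_but_good / expected_bad,
    }


# Counts copied from the two runs shown above.
runs = {
    "data::font": summarize(15, 186, 0, 6),
    "uitoolkit::gtk": summarize(615, 101, 210, 30),
}

for tag, ratios in runs.items():
    print(tag)
    for name, value in ratios.items():
        print(f"  {name}: {value:.4f}")
```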
        
Still, there are a lot of improvements to come. E.g. currently I train
only on the package description, but adding the dependencies should help.
Perhaps even the files in a package should be considered. Maintainer and
section will play their role too, I hope.
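
To make the idea of training on descriptions and later adding dependencies a bit more concrete, here is a minimal naive Bayes sketch in Python. It is not the code of bayesian-tagger.pl. The class, the "dep:" feature prefix, the smoothing and the example packages are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


class NaiveBayesTagger:
    """Tiny two-class naive Bayes: 'has the tag' versus 'does not'."""

    def __init__(self):
        self.feature_counts = {True: Counter(), False: Counter()}
        self.package_counts = {True: 0, False: 0}

    @staticmethod
    def features(description, depends=()):
        words = re.findall(r"[a-z0-9]+", description.lower())
        # Prefix dependency names so they never collide with description words.
        return words + [f"dep:{name}" for name in depends]

    def train(self, has_tag, description, depends=()):
        self.feature_counts[has_tag].update(self.features(description, depends))
        self.package_counts[has_tag] += 1

    def log_score(self, has_tag, features):
        counts = self.feature_counts[has_tag]
        vocabulary = set(self.feature_counts[True]) | set(self.feature_counts[False])
        total = sum(counts.values())
        # Add-one smoothing for the prior and for every feature; one extra
        # slot in the denominator is reserved for features never seen before.
        prior = (self.package_counts[has_tag] + 1) / (sum(self.package_counts.values()) + 2)
        score = math.log(prior)
        for feature in features:
            score += math.log((counts[feature] + 1) / (total + len(vocabulary) + 1))
        return score

    def matches(self, description, depends=()):
        features = self.features(description, depends)
        return self.log_score(True, features) > self.log_score(False, features)


tagger = NaiveBayesTagger()
# Made-up training data, not real package records.
tagger.train(True, "Bitmap fonts for the X Window System", depends=["xutils"])
tagger.train(True, "TrueType font collection")
tagger.train(False, "GTK+ based mail client", depends=["libgtk2.0-0"])
tagger.train(False, "Coastline data for the Generic Mapping Tools")

print(tagger.matches("Phonetic fonts for X"))
print(tagger.matches("GTK mail client"))
```

Maintainer, section or file names could be fed in the same way, each with its own prefix, so that the classifier treats them as separate evidence.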

I think I will implement the extension of the training data soon. I am
quite busy at the moment, but really eager to improve the stuff *g*.

I hope you will find some time to play around and tell me what you
think. By the way, do not train on a tag more than once (or delete its
directory between runs), because the results accumulate in the database.
Actually this is not a bug, but will become a feature in the future :-)
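
Why repeated training skews the results can be illustrated with a small Python sketch. The storage format of the real database is not described here, so the counts.json file and the train function below are purely hypothetical. Only the directory name data__font follows the naming mentioned earlier.

```python
import json
import shutil
import tempfile
from collections import Counter
from pathlib import Path


def train(tag_dir, words):
    """Add word counts to a (hypothetical) counts.json kept in tag_dir."""
    tag_dir.mkdir(parents=True, exist_ok=True)
    db = tag_dir / "counts.json"
    counts = Counter(json.loads(db.read_text())) if db.exists() else Counter()
    counts.update(words)
    db.write_text(json.dumps(counts))
    return counts


workdir = Path(tempfile.mkdtemp())
tag_dir = workdir / "data__font"
words = ["font", "font", "x11"]

first = train(tag_dir, words)
second = train(tag_dir, words)   # same data again: every count doubles
shutil.rmtree(tag_dir)           # deleting the directory resets the database
fresh = train(tag_dir, words)

print(first["font"], second["font"], fresh["font"])
shutil.rmtree(workdir)           # clean up the temporary directory
```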

Greetings Ben









[Attachment: bayesianTagger_2004-10-11.tar.bz2 (application/x-bzip-compressed-tar)]