[Teammetrics-discuss] Updates.
Sukhbir Singh
sukhbir.in at gmail.com
Thu Jun 30 19:08:14 UTC 2011
Hi,
repository.update()
As you must have noticed, it is July 1st (here as of now). So we can
now 'officially' parse the June teammetrics-mailing list ;)
I have added the signature metric, so here are the results:
     name      | frequency | rawlen | quotelen | blanklen | siglen
---------------+-----------+--------+----------+----------+--------
 Sukhbir Singh |        77 |  58673 |      998 |     1248 |   1248
 Andreas Tille |        46 |  66462 |      946 |     1590 |    854
 Scott Howard  |         4 |   4318 |       48 |       91 |     91
As you can see, 'siglen == blanklen' for Scott because he doesn't have
a signature (it's just `~Scott`), while Andreas and I do have one. That
explains the difference in the `siglen` column and perhaps why it is
important. I feel the metrics are pretty conclusive for a mailing
list. The rest you can see for yourself. Here is a summary once again:
rawlen   -- total number of characters in the message body.
blanklen -- total number of lines in the body, excluding blank lines.
quotelen -- total number of lines, excluding blank lines AND lines
            starting with '>'.
siglen   -- total number of lines, excluding blank lines AND lines
            starting with '>', counting only up to the '-- '
            signature delimiter.
So 'siglen' is the _complete_ metric.
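
To make the four definitions above concrete, here is a minimal sketch in
Python. It is not the actual teammetrics code; the function name
`body_metrics` and the strict match on a line that is exactly '-- ' are
assumptions made for illustration.

```python
SIG_DELIMITER = "-- "  # signature delimiter as described above (assumed exact match)


def body_metrics(body):
    """Return rawlen, blanklen, quotelen and siglen for one message body."""
    lines = body.splitlines()

    # blanklen: lines that are not blank.
    non_blank = [line for line in lines if line.strip()]

    # quotelen: non-blank lines that do not start with '>'.
    unquoted = [line for line in non_blank if not line.startswith(">")]

    # siglen: unquoted, non-blank lines before the signature delimiter.
    before_signature = []
    for line in unquoted:
        if line == SIG_DELIMITER:
            break
        before_signature.append(line)

    return {
        "rawlen": len(body),  # every character in the body, newlines included
        "blanklen": len(non_blank),
        "quotelen": len(unquoted),
        "siglen": len(before_signature),
    }
```

For a message without a '-- ' line, the loop never breaks, so siglen
comes out equal to quotelen.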
For lists.debian.org, I investigated using the NNTP interface.
That works perfectly. We get exactly what we want, it's fast, and it
doesn't strain the Gmane server (40,000 Subject/From fields in ~10
seconds). There is only one drawback, and that is the obfuscation of
the mail addresses, which I saw in only one of the six lists I
checked. I didn't note which list it was (sorry).
So what I suggest now is that we go with NNTP access only. I think
that obfuscation is a rarity and we should go ahead with this. For
starters, you can point me to some mailing lists that you would want
to parse first so I can check for obfuscation. Then at DebConf, we can
take up how to parse these lists or request mbox archives.
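
Checking a list for obfuscation could be sketched roughly like this in
Python. The heuristic is an assumption: it treats a From field as
obfuscated when it contains no plain `user@host.tld` address, which
catches forms like "user at example.org" but would miss any scheme that
still produces a syntactically valid address. The names
`looks_obfuscated` and `obfuscated_share` are hypothetical.

```python
import re

# A deliberately loose pattern for a plain, unobfuscated address.
PLAIN_ADDRESS = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def looks_obfuscated(from_field):
    """True if no plain e-mail address can be found in the From field."""
    return PLAIN_ADDRESS.search(from_field) is None


def obfuscated_share(from_fields):
    """Fraction of From fields in a list that look obfuscated."""
    if not from_fields:
        return 0.0
    flagged = sum(1 for field in from_fields if looks_obfuscated(field))
    return flagged / len(from_fields)
```

Running `obfuscated_share` over the From fields fetched for each list
would give a quick number to decide whether that list is usable via NNTP.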
I will be investigating the CGI thing tomorrow.
--
Sukhbir.