[Teammetrics-discuss] Converter for mboxes (Was: Debian mailing lists archives as mbox)

Andreas Tille andreas at an3as.eu
Mon Aug 29 12:45:52 UTC 2011


Hi again,

is there any news about the mbox conversion?  If it is a matter of time on
the listmaster side, I'd volunteer to run the code on the real mboxes with
the real skip files (if you tell me where I can find them).  I would
also volunteer to make the interface friendlier for your plans if you
simply specify what it should do (for instance, where the converted mboxes
should finally end up in the file system and how you plan to make them
accessible to non-DDs).

It might also make sense to continue the discussion about what people
think of more general access to the archives.  Should we hold this
discussion on debian-private or debian-project?  IMHO it is a topic
for debian-project.  Because the outcome of the discussion might
influence the output of the conversion tool, we might consider starting
it right now.

Kind regards

      Andreas.

On Tue, Aug 16, 2011 at 04:34:19PM +0200, Andreas Tille wrote:
> Hi,
> 
> as requested by listmaster in this longish thread, we wrote a
> converter which strips certain tags from mboxes of lists.debian.org.
> The code can be found in the attached tgz.  You can find it as well in
> 
>   git://git.debian.org/git/teammetrics/teammetrics.git
> 
> in the directory mbox-tools.  The actual filter is mboxfilter.py.  It
> takes an mbox (unzipped for the moment - feel free to ask for support
> of gzipped files) and outputs an mbox with the extension '.converted' -
> not a very cool name, but you gave no specification.  It's easy to
> adapt to your needs (better name / stdout / whatever).
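
A minimal sketch of the filtering step described in the quoted paragraph,
using Python's standard mailbox module.  This is not the code from
mboxfilter.py; the function name filter_mbox and its parameters are
assumptions made up for illustration.

```python
import mailbox


def filter_mbox(src_path, skip_ids, suffix=".converted"):
    """Copy src_path to src_path + suffix, dropping unwanted messages.

    skip_ids is a set of Message-ID strings (including angle brackets).
    Hypothetical helper, not the interface of mboxfilter.py.
    """
    dst_path = src_path + suffix
    src = mailbox.mbox(src_path, create=False)  # fail if the input is missing
    dst = mailbox.mbox(dst_path)  # created if absent, appended to otherwise
    kept = 0
    try:
        for msg in src:
            # Drop every message whose Message-ID is listed for removal.
            if msg.get("Message-ID", "").strip() in skip_ids:
                continue
            dst.add(msg)
            kept += 1
        dst.flush()
    finally:
        dst.close()
        src.close()
    return dst_path, kept
```

Writing to stdout or to a different name would only change how dst is
opened; the loop itself stays the same.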
> 
> For the moment it takes a single file for specifying the Message-IDs
> which should be deleted.  This is called messageid and contains *only*
> the Message-IDs (not the prefix Skip-Spam-Message-Id: as written below).
> It is not clear to us whether this prefix is always the same - that
> seems unlikely because it would be just redundant.  If the
> exclusion files contain those prefixes, can we safely assume that
> we get the Message-ID with the following regexp:
> 
>     ^Skip-.*-Message-Id: (.*)$
> 
> ?  If not please be more verbose or tell me where I can find those
> exclusion files on master.
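
A small sketch of how the regexp quoted above could be applied to the
lines of a skip file.  Whether all skip files really follow this prefix
shape is exactly the open question in the thread; the function name
read_skip_ids and the sample Message-ID are made up for illustration.

```python
import re

# Pattern proposed in the quoted mail; it assumes a prefix of the form
# "Skip-<something>-Message-Id: ", which listmaster has not confirmed.
SKIP_LINE = re.compile(r"^Skip-.*-Message-Id: (.*)$")


def read_skip_ids(lines):
    """Collect the Message-IDs from an iterable of skip-file lines."""
    ids = set()
    for line in lines:
        match = SKIP_LINE.match(line.strip())
        if match:
            ids.add(match.group(1).strip())
    return ids


# Made-up sample input, not taken from a real skip file.
sample = [
    "Skip-Spam-Message-Id: <example.1234@localhost>",
    "this line does not match and is ignored",
]
skip_ids = read_skip_ids(sample)
```

Since the function accepts any iterable of lines, several skip files can
be handled by calling it once per file and taking the union of the
resulting sets.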
> 
> Moreover you were speaking about more than one exclusion file.  Do you
> mean *several* exclusion files per mbox or just one per mbox which has a
> defined naming scheme?
> 
> Regarding the fields which are copied into the converted mbox: At
> the beginning of mboxfilter.py you find a list HEADERS which specifies
> the headers that are kept.  I also added a list
> possible_HEADERS which contains fields which might make sense to keep
> for certain reasons.  This is just for documentation currently.
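
A sketch of how such a whitelist could be applied to a single message.
The contents of HEADERS below are an assumption chosen for illustration
(the real list is in mboxfilter.py and is not shown in this thread), and
strip_headers is a hypothetical helper name.

```python
from email.message import Message

# Assumed whitelist for illustration only; see mboxfilter.py for the
# list that is actually used.
HEADERS = ["From", "To", "Subject", "Date", "Message-ID"]


def strip_headers(msg, keep=HEADERS):
    """Delete every header of msg whose name is not in keep."""
    keep_lower = {name.lower() for name in keep}
    # Iterate over a copy of the names because we delete while looping.
    for name in set(msg.keys()):
        if name.lower() not in keep_lower:
            del msg[name]  # removes all occurrences, case-insensitively
    return msg
```

Header names are compared case-insensitively, so "Message-Id" and
"Message-ID" are treated as the same field.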
> 
> I tested the filter with random mboxes (from different lists, different
> times, different sizes):
> 
> 	debian-accessibility.200406
> 	debian-announce.200902
> 	debian-devel.199808
> 	debian-devel.200704
> 	debian-devel.201106
> 	debian-jr.200609
> 	debian-med.200609
> 	debian-ocaml-maint.200408
> 
> using the messageid file in the attached tarball and found it working
> for these.  This messageid file was created using the script
> mbox-potential-spam-ids (just to have some input), and I checked the
> result with mbox-diff-check to be able to detect potential problems.
> My tests did not reveal anything unexpected.
> 
> Please tell us how to proceed from here.
> 
> Kind regards
> 
>          Andreas.
> 
> On Thu, Aug 04, 2011 at 11:32:42AM +0200, Alexander Wirt wrote:
> > Sukhbir Singh wrote on Thursday, 04 August 2011:
> > 
> > > Hi Alex,
> > > 
> > > Can we have some prototype/ format of the Message-IDs that you want us
> > > to strip? It would be beneficial for both sides because then we can
> > > show you what we will be handling and you can tell if something else
> > > needs to be taken care of.
> > Sure. We have several files with entries like:
> > Skip-Spam-Message-Id: <4610e762.1f8f12a6.0218.7af1 at mx.google.com>
> > Skip-Spam-Message-Id: <8600e4c3dd4c62fb51f343ac020608e3 at gmail.com>
> > Skip-Spam-Message-Id: <CA287EE3.7684.AC15C2D5 at localhost>
> > 
> > It would be best if the converter accepts a message box and several skip
> > files. I'll write a wrapper that does the dirty details on the filesystem.
> > (Explaining everything in detail would take more time than writing a script).
> > 
> > Alex
> > 
> > 
> > > 
> > > Thanks for the help,
> > > 
> > > -- 
> > > Sukhbir
> > > 
> > 
> > 
> > -- 
> > To UNSUBSCRIBE, email to debian-devel-REQUEST at lists.debian.org
> > with a subject of "unsubscribe". Trouble? Contact listmaster at lists.debian.org
> > Archive: http://lists.debian.org/20110804093242.GM3348@smithers.snow-crash.org
> > 
> > 
> 
> -- 
> http://fam-tille.de



-- 
http://fam-tille.de


