Google Summer of Code

Wed May 31 15:21:03 UTC 2006

On Wed, May 31, 2006 at 01:50:32AM +0200, Alex de Landgraaf wrote:

[I didn't see this message land in the list, so I'm bouncing it there]

> The tag-approval part of debtags seems like a good place to start, as it
> sounds like it is holding debtags back most. I've played around a bit
> with your process script, let me reiterate the steps to make sure I'm
> right on this (please correct me where I'm wrong):
> - tags are added or removed via debtags submit (or the online
> packagebrowser)

Yes.  Each change goes in unchecked.  In case a big mess goes in
unchecked, we have daily backups.

> - the current tags are uploaded to /tags/tags-current.gz, probably a
> daily database snapshot

Exactly, generated every night.

> - tagcoll is used to diff the tags between tags-current.gz and those in
> Packages (the tags visible via debtags et al)

Yes.  Or actually, those in the svn repo, which then go into the
Packages file.  That patch is what needs to be reviewed.
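
To make the diff step concrete, here is a minimal Python sketch of how
such a patch could be computed.  It is not how tagcoll itself is
implemented; it assumes a "package: tag1, tag2" line format for the
collections and a "package: +tag, -tag" notation for the patch, and the
function names (parse_collection, diff_collections, format_patch) are
made up for illustration.

```python
def parse_collection(text):
    """Parse 'package: tag1, tag2' lines into {package: set_of_tags}."""
    collection = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Package names contain no colon, so the first ':' ends the name.
        package, _, tag_list = line.partition(":")
        tags = {t.strip() for t in tag_list.split(",") if t.strip()}
        collection[package.strip()] = tags
    return collection


def diff_collections(reviewed, current):
    """Return {package: (added, removed)} turning `reviewed` into `current`."""
    patch = {}
    for package in sorted(set(reviewed) | set(current)):
        old = reviewed.get(package, set())
        new = current.get(package, set())
        added, removed = new - old, old - new
        if added or removed:
            patch[package] = (added, removed)
    return patch


def format_patch(patch):
    """Render the patch as 'package: +tag, -tag' lines (assumed notation)."""
    lines = []
    for package in sorted(patch):
        added, removed = patch[package]
        changes = ["+" + t for t in sorted(added)]
        changes += ["-" + t for t in sorted(removed)]
        lines.append(package + ": " + ", ".join(changes))
    return lines


# Toy data standing in for the svn repo and the nightly snapshot.
reviewed = parse_collection("mutt: role::program\nclustalw: field::biology\n")
current = parse_collection(
    "mutt: role::program, works-with::mail\nclustalw: role::program\n"
)
patch = diff_collections(reviewed, current)
```

Calling format_patch(patch) then gives the lines that a reviewer would
have to approve or reject.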

> - these differences are (currently manually) either approved and moved
> into SVN (to be packaged, I presume) or rejected, in which case they are
> removed from the central database

Yes.  I generate a patch with only the manual corrections and submit it
back to the database.

This has some extra details that can be interesting:

 - manual corrections are preserved and can be reapplied in the future
 - I found it more efficient to perform manual corrections only on a
   subset of the patch to be reviewed.  For example, I can review only
   the changes to one tag, or review only the changes to all tags in one
   facet.  This provides a common context for all the changes I'm
   reviewing, and saves me from handling, for example, a mail reader
   and a DNA sequencing library in the same batch.
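
Restricting the review to one tag or one facet could look roughly like
the Python sketch below.  It assumes the patch is held as
{package: (added_tags, removed_tags)} and that the facet is the part of
a tag before "::"; restrict_patch and facet_of are hypothetical helper
names, not part of any existing tool.

```python
def facet_of(tag):
    """Return the facet part of a 'facet::tag' name."""
    return tag.split("::", 1)[0]


def restrict_patch(patch, facet=None, tag=None):
    """Keep only the changes that touch the given tag or the given facet."""
    def wanted(candidate):
        if tag is not None:
            return candidate == tag
        if facet is not None:
            return facet_of(candidate) == facet
        return True

    subset = {}
    for package, (added, removed) in patch.items():
        kept_added = {t for t in added if wanted(t)}
        kept_removed = {t for t in removed if wanted(t)}
        if kept_added or kept_removed:
            subset[package] = (kept_added, kept_removed)
    return subset


# Toy patch mixing unrelated changes.
example_patch = {
    "mutt": ({"works-with::mail", "interface::text-mode"}, set()),
    "clustalw": ({"field::biology"}, {"works-with::mail"}),
}
mail_only = restrict_patch(example_patch, facet="works-with")
biology_only = restrict_patch(example_patch, tag="field::biology")
```

Each subset then shares one context, so the reviewer only has to think
about a single tag or facet at a time.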

> If this was the general idea and Erich doesn't disagree I'll see if I
> can string together a proof-of-concept (try to have the classifier
> review the changes for a single facet), should be fun,

It could even be an instance of the same problem: suppose I have an
oracle that can tell me whether a tag change is good or not.  Then I can
use it for tagging by asking whether it would approve the patch +tag or
the patch -tag.
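
As a rough Python sketch of that idea: given any function that approves
or rejects a single tag change, tagging a package reduces to asking it
about every candidate change.  The oracle interface
approves(package, sign, tag) and the keyword-based toy_oracle are
assumptions made up for this example; a real classifier would take
their place.

```python
def suggest_patch(package, current_tags, vocabulary, approves):
    """Ask the oracle about +tag for missing tags and -tag for present ones."""
    added = {
        t for t in vocabulary
        if t not in current_tags and approves(package, "+", t)
    }
    removed = {t for t in current_tags if approves(package, "-", t)}
    return added, removed


# Toy stand-in for a real reviewer or classifier (hypothetical data).
DESCRIPTIONS = {"mutt": "text-based mail reader"}
KEYWORDS = {"works-with::mail": "mail", "field::biology": "dna"}


def toy_oracle(package, sign, tag):
    """Approve +tag if the keyword is in the description, -tag if it is not."""
    keyword = KEYWORDS.get(tag)
    if keyword is None:
        return False  # no opinion on tags it knows nothing about
    matches = keyword in DESCRIPTIONS.get(package, "")
    return matches if sign == "+" else not matches


suggestion = suggest_patch(
    "mutt", {"field::biology"}, set(KEYWORDS), toy_oracle
)
```

The same approves() function can be pointed at a submitted patch to
review it, or at the whole vocabulary to propose tags.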

[past experience] One problem I had when I tried to use dbacl to review
tags is that in dbacl the size of the training data matters a lot.  This
was a problem because the package data for {all packages with tag A} is
usually much smaller than the package data for {all packages without tag
A}, and that would produce a biased dbacl discriminator.
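
The imbalance is easy to measure before training anything.  The Python
sketch below only compares the amount of text on each side for one tag;
it does not call or model dbacl, and the package descriptions and the
training_sizes helper are invented for illustration.

```python
def training_sizes(descriptions, tags, tag):
    """Return (chars of text with `tag`, chars of text without `tag`)."""
    with_tag = without_tag = 0
    for package, text in descriptions.items():
        if tag in tags.get(package, set()):
            with_tag += len(text)
        else:
            without_tag += len(text)
    return with_tag, without_tag


# Toy data: one package carries the tag, three do not.
descriptions = {
    "clustalw": "dna alignment",
    "mutt": "mail reader",
    "vim": "text editor",
    "gcc": "c compiler",
}
tags = {"clustalw": {"field::biology"}}

sizes = training_sizes(descriptions, tags, "field::biology")
```

With the real archive the second number would be far larger than the
first for almost every tag, which is the skew described above.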

Ciao,

Enrico

-- 
GPG key: 1024D/797EBFAB 2000-12-05 Enrico Zini <enrico at debian.org>