[newmaint-site] Matching emails of same contributor ala carnivore - Was: Re: contributors.debian.org milestones

Olivier Berger olivier.berger at telecom-sudparis.eu
Thu Nov 21 14:53:16 UTC 2013


Hi.

(FYI, I had sent this only to Enrico while Alioth lists were down, but
hopefully everyone gets it now.)

Initial mail :
-----------

I'd like to try contributing code that automatically matches the
different emails of the same contributor, based on the Debian keyrings,
in a way similar to the UDD carnivore table.

It would allow us to automatically add email Identifiers to the same
Contributor, based on the various user IDs of a single public key in
the keyring.
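
A minimal sketch of that matching, to make the idea concrete. It is not
the ddportfolio code: it assumes the keyring has been dumped with
something like `gpg --no-default-keyring --keyring <file> --with-colons
--fingerprint --list-keys`, and that user IDs appear in separate `uid`
records (as GnuPG 2.x prints them). The function name is hypothetical.

```python
import re
from collections import defaultdict

# Extracts the address between angle brackets in a user ID such as
# "Jane Doe <jane@example.org>".
EMAIL_RE = re.compile(r"<([^<>\s]+@[^<>\s]+)>")


def emails_by_fingerprint(colon_listing):
    """Map each primary key fingerprint to the set of emails in its uids."""
    groups = defaultdict(set)
    fingerprint = None
    expect_primary_fpr = False
    for line in colon_listing.splitlines():
        fields = line.split(":")
        record = fields[0]
        if record == "pub":
            # A new key starts; its fingerprint follows in the next fpr record.
            fingerprint = None
            expect_primary_fpr = True
        elif record == "sub":
            # fpr records after a subkey describe the subkey, so ignore them.
            expect_primary_fpr = False
        elif record == "fpr" and expect_primary_fpr and len(fields) > 9:
            fingerprint = fields[9]
            expect_primary_fpr = False
        elif record == "uid" and fingerprint and len(fields) > 9:
            match = EMAIL_RE.search(fields[9])
            if match:
                groups[fingerprint].add(match.group(1).lower())
    return dict(groups)
```

Every set with more than one address is then a candidate group of
Identifiers for a single Contributor.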

Actually, there's some code in ddportfolio [0] that can be borrowed to
do so (ddportfolioservice/model/keyringanalyzer.py), so I hope to have
something real soon. The code is originally under Affero, but I foresee
no real problems in reusing it.

It could be run as a management command, much like the uids injection,
either creating links between Contributors and Identifiers directly, or
storing the matches in a cache in the DB, to be looked up whenever a
mail is added as an identifier. The latter could be better, if we
consider that people list many email addresses in their GPG pubkey ids
that will never actually be used for Debian contributions, but I have
no statistics to support this assumption.
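
To illustrate the second option, here is a rough sketch of such a
lookup, using an in-memory structure as a stand-in for the DB cache.
The class and method names are hypothetical and nothing here is tied to
the actual DC models.

```python
class KeyUidCache:
    """In-memory stand-in for a DB cache of emails that share a public key."""

    def __init__(self, groups):
        # groups: mapping of key fingerprint -> iterable of emails.
        self._emails_by_fpr = {
            fpr: {email.lower() for email in emails}
            for fpr, emails in groups.items()
        }
        # Reverse index: email -> fingerprints of the keys listing it.
        self._fprs_by_email = {}
        for fpr, emails in self._emails_by_fpr.items():
            for email in emails:
                self._fprs_by_email.setdefault(email, set()).add(fpr)

    def related_emails(self, email):
        """Return the other emails found on any key that lists this email."""
        email = email.lower()
        related = set()
        for fpr in self._fprs_by_email.get(email, ()):
            related |= self._emails_by_fpr[fpr]
        related.discard(email)
        return related
```

With this approach, addresses that only ever appear in a key and never
in a contribution stay in the cache and are never turned into
Identifiers.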

Tell me if you think it's interesting in the context of the identity
management, hoping that no one else is working on such an idea.


Maybe we could think about some other tighter links between
ddportfolio and the DC app, but I suspect this has been mentioned
before...


Best regards,

[0] http://debianstuff.dittberner.info/gitweb.cgi?p=ddportfolioservice.git;a=summary

----------
Then, I added some days later :
----------

Here's a first attempt at storing the pubkeys' different uids in the DB:
https://gitorious.org/olberger/dc/commits/carnivore

I haven't yet tied it to the rest of the code. For the moment, it will
just register 4000+ emails of the 1200+ keys in the keyrings, ready to
be matched as multiple identifiers of single contributors.
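
For readers who do not want to open the branch, the kind of storage
involved can be sketched as below. This is not the code from the branch
above (which uses Django): it is a self-contained illustration with the
standard `sqlite3` module, and the table, column, and function names
are assumptions.

```python
import sqlite3


def create_schema(conn):
    # One row per (key, email) pair; the constraint makes reloads idempotent.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS keyring_uid (
            fingerprint TEXT NOT NULL,
            email TEXT NOT NULL,
            UNIQUE (fingerprint, email)
        )
        """
    )


def store_uids(conn, groups):
    """Store a mapping of fingerprint -> emails, skipping rows already present."""
    rows = [
        (fpr, email.lower())
        for fpr, emails in groups.items()
        for email in emails
    ]
    conn.executemany(
        "INSERT OR IGNORE INTO keyring_uid (fingerprint, email) VALUES (?, ?)",
        rows,
    )
    conn.commit()


def emails_sharing_key(conn, email):
    """Return, sorted, the other emails listed on the same key(s) as email."""
    cursor = conn.execute(
        """
        SELECT DISTINCT other.email
        FROM keyring_uid AS mine
        JOIN keyring_uid AS other ON other.fingerprint = mine.fingerprint
        WHERE mine.email = ? AND other.email <> mine.email
        ORDER BY other.email
        """,
        (email.lower(),),
    )
    return [row[0] for row in cursor]
```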

Any comments or suggestions are very welcome: I'm learning (Python
again, and Django for the first time) as I go.


Enrico Zini <enrico at enricozini.org> writes:

> Hello,
>
> I spent some time thinking about what's missing to get DC going, and I
> came up with 3 milestones. Here they are:
>
> Milestone 1: proof of concept
>
>  - We show only data from @debian.org and @users.alioth.debian.org
>    accounts, which we can safely assume can be made public without
>    asking first.
>  - We document how to build data mining scripts.
>  - We call for teams to start experimenting with sending data to the
>    site.
>
> Milestone 2: moar data sources
> (possibly taking advantage of the minidebconf in Cambridge)
>
>  - We lobby teams for setting up data mining scripts and posting data to
>    the site. We can help them set these up, but they should ultimately
>    be the responsibility of the teams themselves.
>
> Milestone 3: moar identifiers
>
>  - Get more kinds of identifiers into the mix: emails, gpg fingerprints,
>    wiki names.
>
>    This needs figuring out both privacy requirements and integrity
>    requirements: we need to avoid opening trolling avenues, like sending
>    one silly bugreport a week as debiansux at ownyouftw.troll to get into
>    the list. Identifiers should be somehow tied to reputation that is
>    built up with constructive work: if one wants to have
>    debiansux at ownyouftw.troll end up in the list, they need to earn it
>    honestly.
>
>    Two possible ideas:
>
>     - one needs to have a gpg key with a trust path leading to the
>       strongly connected set;
>     - the initial opt-in is initiated with a mail from the Debian
>       Welcome Team, and they might decide to wait a bit and see when
>       they notice a suspicious identifier.
>
>    But really, different identifiers may have different requirements,
>    we'll see it when we get there. As data flows in from new data
>    sources, we should start getting some idea.
>
>    For example, emails in debian/changelog, since a DD signs for its
>    integrity, can be trusted differently than emails from the BTS, where
>    anyone can post.
>
>  - Identity management needs to be implemented, and this probably means
>    waiting until after the single signon sprint meeting that should
>    happen in January. Too much information is missing now to make good
>    tradeoffs.
>

-- 
Olivier BERGER 
http://www-public.telecom-sudparis.eu/~berger_o/ - OpenPGP-Id: 2048R/5819D7E8
Ingenieur Recherche - Dept INF
Institut Mines-Telecom, Telecom SudParis, Evry (France)



