[Popcon-developers] Traces for P2P pub/sub project
Spyros Voulgaris
spyros--alioth at cs.vu.nl
Mon Jun 15 22:38:13 UTC 2009
Hi Bill and Petter, and thanks for your replies!
> Hello Spyros,
>
> I think pushing the use of P2P to every bandwidth-intensive network
> transfer is a good idea. However did you consider the security
> implication of allowing computer A to know which packages are installed
> on computer B ?
We do not focus on security in this phase of the project, only on
the algorithm for decentralized pub/sub. Security will definitely be
added if we come up with a promising algorithm in the current phase.
>> In this context, I would like to ask you if we could get hold of your
>> raw, pre-aggregated traces. That is, the listing of installed packages
>> explicitly listed _per_ _user_. Of course, any personal information you
>> may be collecting (user name, IP, etc.) can be anonymized, we just need
>> an arbitrary user ID. In fact, package names can also be anonymized if
>> necessary, although we would prefer not.
>
> Due to the above security implication (and basic privacy expectation of
> popularity-contest users) it is not possible for us to publish a
> per-user list of packages, and unfortunately it is not possible to
> anonymize the packages and the users in an information-theoretically
> safe way.
>
> Suppose package #123 has 141 installations and 121 votes: you can just
> look up the aggregated popcon result and get a small list of candidate
> packages, so it is not really anonymized.
OK, I could _guess_ that the package with ID #123 corresponds to package
xyz, but then what? I would just know that the users who (I _suspect_)
have installed xyz are the ones with IDs #4013, #6405, #8667, etc., but
I would have no mapping from these arbitrarily chosen identifiers to
real people. Would this breach privacy? :)
Btw, if you remove a random subset of users, e.g., 20% or 30%, without
even revealing what percentage you removed, guessing the package based on
the number of votes would become a very wild guess. Needless to say,
de-anonymizing the data is far from our intentions, but even if it were
our goal, this data wouldn't get us that far.
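To make the suggestion concrete, here is a rough sketch of the kind of
preprocessing I have in mind. It is only an illustration: the function
name, the input layout (a dict from submitter to a set of package names)
and the 20%-30% range are my assumptions, not a description of how popcon
stores its data.

```python
import random

def anonymize_trace(per_user_packages, min_drop=0.2, max_drop=0.3, seed=None):
    """Return {arbitrary user id: set of arbitrary package ids}.

    per_user_packages is assumed to map a submitter key to the set of
    package names that submitter reported (hypothetical layout).
    """
    rng = random.Random(seed)

    # Drop a random, undisclosed fraction of the users so that per-package
    # counts no longer match the published aggregates exactly.
    drop_fraction = rng.uniform(min_drop, max_drop)
    users = list(per_user_packages)
    rng.shuffle(users)
    kept = users[int(len(users) * drop_fraction):]

    # Replace package names by arbitrary integers (shuffled, so the id
    # carries no alphabetical hint about the name).
    names = sorted({p for u in kept for p in per_user_packages[u]})
    rng.shuffle(names)
    package_id = {name: i for i, name in enumerate(names)}

    # User ids are simply positions in the shuffled list of kept users.
    return {
        uid: {package_id[p] for p in per_user_packages[u]}
        for uid, u in enumerate(kept)
    }
```

The output keeps exactly what we need, namely which anonymous users share
which anonymous packages, and nothing else.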
> Suppose that all systems which have foo and bar installed
> also have baz installed. If you guess that a popcon submitter has
> foo and bar installed, then you deduce they also have baz.
Yes, this is correct. Our clustering would try to infer such
relationships, to optimize the linking between peers based on how much
overlap they have in terms of installed packages.
But again, inferring that whoever has gimp also has libgimp2.0
(_guessing_ that package #65375 refers to gimp and #2738 to libgimp2.0),
doesn't buy us much.
Basically, providing us with such anonymized traces does not sacrifice
privacy. Instead, it makes a contribution to research and to a potential
next generation software distribution framework.
> But surely you are interested in some statistic on the data rather than
> in the data itself. Maybe you could tell us what statistic you want and
> a Debian developer could compute it for you without giving access to the
> data. The developer would have to check that the output would not breach
> privacy.
Unfortunately we would need the actual data, or part of it. The whole
point is to let peers self-organize in a P2P overlay where they select
neighbors that have highly overlapping sets of installed packages.
Then, when some package is updated, peers would have to intelligently
propagate this update (or a notification about it) to all users that
have that package. So, aggregated data would not be of much use to us...
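As a toy illustration of why we need per-user sets, the sketch below
ranks candidate neighbors by the overlap of their installed packages.
This is not our algorithm: Jaccard similarity, the function names and the
global `installed` dict are assumptions made for the example, and a real
peer would only see a small sample of other peers rather than everybody.

```python
def jaccard(a, b):
    """Overlap of two package sets: |a & b| / |a | b| (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def select_neighbors(peer, candidates, installed, k=3):
    """Pick the k candidates whose package sets overlap most with peer's."""
    ranked = sorted(
        (c for c in candidates if c != peer),
        key=lambda c: jaccard(installed[peer], installed[c]),
        reverse=True,
    )
    return ranked[:k]

def subscribers(package, installed):
    """All peers that should eventually hear about an update of package."""
    return {peer for peer, pkgs in installed.items() if package in pkgs}

# Tiny made-up trace: anonymous user id -> anonymous package ids.
installed = {
    1: {10, 11, 12},
    2: {10, 11, 12, 13},
    3: {10},
    4: {99},
}
print(select_neighbors(1, installed, installed, k=2))
print(subscribers(10, installed))
```

Both functions need to know which packages each individual user has.
Aggregated install counts per package cannot provide that.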
To Petter: In your mail you mentioned that per-user data had been
made public in the past. If that data is still available, could we maybe
get hold of it? As mentioned, our target is to design a self-organizing
algorithm. We don't really need to have the latest data, but some
*real-world* data that represents the level of overlap in users'
installed packages.
If the current traces could be made available, that would be best. What is
important to us is the number of users. We are targeting a very scalable
system, and evaluating it with a large dataset is crucial. If you think
it could be possible, we would greatly appreciate it!
Guys, thanks again for your time!
-Spyros