[Popcon-developers] Traces for P2P pub/sub project

Mon Jun 15 22:38:13 UTC 2009

Hi Bill and Petter, and thanks for your replies!

> Hello Spyros,
> 
> I think pushing the use of P2P to every bandwidth-intensive network
> transfer is a good idea. However did you consider the security
> implication of allowing computer A to know which packages are installed
> on computer B ?

We do not focus on security in this phase of the project, but just on 
the algorithm for decentralized pub/sub. Security will be a definite 
addition, if we come up with a promising algorithm in the current phase.

>>   In this context, I would like to ask you if we could get hold of your 
>> raw, pre-aggregated traces. That is, the listing of installed packages 
>> explicitly listed _per_ _user_. Of course, any personal information you 
>> may be collecting (user name, IP, etc.) can be anonymized, we just need 
>> an arbitrary user ID. In fact, package names can also be anonymized if 
>> necessary, although we would prefer not.
> 
> Due to the above security implication (and basic privacy expectation of
> popularity-contest users) it is not possible for us to publish a
> per-user list of packages and unfortunately it is not possible to
> anonymize in an information-theoretically safe way the packages and the
> users.
> 
> Suppose package #123 has 141 installations and 121 votes: you can just
> look up the aggregated popcon result and get a small list of candidate
> packages so it is not really anonymized

OK, I could _guess_ that the package with ID #123 corresponds to package 
xyz, but then what? I would just know that the users who (I _suspect_) 
have installed xyz are the ones with IDs #4013, #6405, #8667, etc., but 
I would have no mapping from these arbitrarily chosen identifiers to 
real people. Would this breach privacy? :)

Btw, if you remove a random subset of users, e.g., 20% or 30%, without 
even revealing what percentage you removed, guessing the package based on 
the number of votes would become a very wild guess. Needless to say, 
de-anonymizing the data is far from our intention, but even if it were, 
this data wouldn't get us very far.
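
To illustrate the point about removing users, here is a minimal Python 
sketch. The function names and the counts are made up for illustration 
(they are not real popcon numbers), and it ignores sampling noise, so if 
anything it understates how ambiguous the guess becomes.

```python
import random

def subsample_users(traces, drop_fraction, seed=None):
    """Drop each user independently with probability drop_fraction."""
    rng = random.Random(seed)
    return {uid: pkgs for uid, pkgs in traces.items()
            if rng.random() >= drop_fraction}

def candidates(published_counts, observed_count, max_drop=0.0):
    """Package names whose published install count is compatible with
    observed_count if at most max_drop of the users were removed."""
    low = observed_count
    high = observed_count / (1.0 - max_drop)
    return sorted(name for name, count in published_counts.items()
                  if low <= count <= high)

# Hypothetical published aggregate counts (not real popcon numbers).
published = {"gimp": 141, "vim": 150, "emacs": 170, "mutt": 200}

print(candidates(published, 141))       # exact counts: one candidate
print(candidates(published, 113, 0.3))  # up to 30% removed: more candidates
```

With exact counts the lookup pins down one package; once an unknown 
fraction of users is missing, every package in a whole range of counts 
becomes a plausible match.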

> Suppose that all systems which have foo and bar installed 
> also have baz installed. If you guess that a popcon submitter has 
> foo and bar installed, then you deduce they also have baz. 

Yes, this is correct. Our clustering would try to infer such 
relationships, to optimize the linking between peers based on how much 
overlap they have in terms of installed packages.

But again, inferring that whoever has gimp also has libgimp2.0 
(_guessing_ that package #65375 refers to gimp and #2738 to libgimp2.0) 
doesn't buy us much.
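
For concreteness, the kind of relationship discussed here can be checked 
with a few lines of Python. This is only a sketch on toy data; the 
function name and the traces are hypothetical, and it is not the 
clustering algorithm we are designing.

```python
def implied_packages(traces, premise):
    """Packages installed on every system that has all packages in premise."""
    premise = set(premise)
    matching = [set(pkgs) for pkgs in traces.values() if premise <= set(pkgs)]
    if not matching:
        return set()
    # Whatever all matching systems share, beyond the premise itself.
    return set.intersection(*matching) - premise

# Toy traces keyed by arbitrary user ID (hypothetical data).
traces = {
    4013: {"foo", "bar", "baz"},
    6405: {"foo", "bar", "baz", "qux"},
    8667: {"foo", "qux"},
}

print(implied_packages(traces, {"foo", "bar"}))
```

On this toy data, every system with foo and bar also has baz, which is 
exactly the inference described in the quoted text.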

Basically, providing us with such anonymized traces does not sacrifice 
privacy. Instead, it makes a contribution to research and to a potential 
next generation software distribution framework.

> But surely you are interested by some statistic on the data rather than
> on the data itself. Maybe you could give us what statistic you want and
> a Debian developer could compute it for you without giving access to the
> data. The developer would have to check the output would not breach
> privacy.

Unfortunately we would need the actual data, or part of it. The whole 
point is to let peers self-organize in a P2P overlay where they select 
neighbors that have highly overlapping sets of installed packages.
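
As a rough sketch of what "highly overlapping" could mean, one option is 
the Jaccard similarity between package sets. The code below assumes a 
global view of all traces purely for illustration (a real overlay would 
discover candidates through the P2P protocol itself), and the names and 
data are hypothetical.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pick_neighbors(peer_id, traces, k):
    """Return the k other peers whose package sets overlap most with peer_id."""
    mine = traces[peer_id]
    scored = [(jaccard(mine, pkgs), uid)
              for uid, pkgs in traces.items() if uid != peer_id]
    # Highest similarity first; ties broken by user ID for determinism.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [uid for _, uid in scored[:k]]

# Toy per-user traces with arbitrary user IDs (hypothetical data).
traces = {
    1: {"gimp", "libgimp2.0", "vim"},
    2: {"gimp", "libgimp2.0"},
    3: {"vim", "mutt"},
    4: {"emacs"},
}

print(pick_neighbors(1, traces, 2))
```

This is why we need per-user package lists: the similarity is computed 
between individual users, which aggregated counts cannot provide.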

Then, when some package is updated, peers would have to intelligently 
propagate this update (or a notification about it) to all users that 
have that package. So aggregated data would not be of much use to us...
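
A very simplified sketch of such a propagation follows. It assumes each 
peer forwards a notification only to neighbors that have the package, 
which is one possible strategy and not a settled design; the overlay, 
traces and function name are hypothetical.

```python
from collections import deque

def propagate(overlay, traces, source, package):
    """Breadth-first spread of an update notice from source, forwarding
    only to neighbors that have the package installed."""
    reached = {source}
    queue = deque([source])
    while queue:
        peer = queue.popleft()
        for nb in overlay.get(peer, ()):
            if nb not in reached and package in traces[nb]:
                reached.add(nb)
                queue.append(nb)
    return reached

# Toy overlay (peer -> neighbors) and per-user traces (hypothetical data).
overlay = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2]}
traces = {1: {"gimp"}, 2: {"gimp"}, 3: {"vim"}, 4: {"gimp"}}

print(propagate(overlay, traces, 1, "gimp"))
```

Whether every interested user is reached depends on how well the overlay 
links users with the same packages, and that is what we can only 
evaluate with real per-user data.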

To Petter: In your mail you mentioned that per-user data had been 
made public in the past. If that data is still available, could we maybe 
get hold of it? As mentioned, our target is to design a self-organizing 
algorithm. We don't really need the latest data, just some 
*real-world* data that represents the degree of overlap among users' 
installed packages.

If the current traces could be made available, it would be best. What is 
important to us is the number of users. We target a very scalable 
system, and evaluating it with a large dataset is crucial. If you think 
it could be possible, we would highly appreciate it!

Guys, thanks again for your time!
-Spyros