[Debtorrent-devel] Fwd: BitTorrent Protocol Expansion (Google SoC)

Cameron Dale camrdale at gmail.com
Fri Apr 13 01:48:58 UTC 2007


---------- Forwarded message ----------
From: Cameron Dale <camrdale at gmail.com>
Date: Apr 5, 2007 2:23 PM
Subject: Re: BitTorrent Protocol Expansion (Google SoC)
To: Anthony Towns <aj at azure.humbug.org.au>


Anthony Towns wrote:
> On Sun, Apr 01, 2007 at 02:19:27PM -0700, Cameron Dale wrote:
>> "a lot of packages are too small"
>
> I think I did some stats a while ago trying to get a handle on this
> to work out piece sizes. No idea what I did with the data then, but
> redoing it now seems straightforward. If I use /var/lib/dpkg/available
> (which is in Packages file format):
>
>     $ sed < /var/lib/dpkg/available  -ne 's/^Size: //p' | sort -n > foo.csv
>
> and run that through gnumeric's "statistical analysis" stuff, I get:
>
>       Mean                               757,299.01
>       Standard Error                      26,260.22
>       Median                              94,697.00
>       Mode                                   792.00
>       Standard Deviation               3,546,976.60
>       Sample Variance         12,581,043,007,634.50
>       Kurtosis                               633.33
>       Skewness                                19.95
>       Range                          161,312,492.00
>       Minimum                                736.00
>       Maximum                        161,313,228.00
>       Sum                         13,816,163,106.00
>       Count                               18,244.00
>
>       95% CI for the Mean from        705,826.52
>       to                              808,771.50
>
> A mean package size of 757kB with a std-dev of 3MB is probably noteworthy;
> the minimum size of 736 bytes compared to a maximum of 161MB is probably
> likewise interesting.

I've done some too, but I don't think I've saved them (they're simple
anyway). I think a lot of the evaluation of different proposals can come
down to statistics though.

>> "Proposal A"
>> "communicate all torrent information in the Packages file"
>> "pieces can no longer be numbered"
>
> The latter isn't actually true -- if you have a Packages file like:
>
>       Package: foo
>       Size: 5341873
>       SHA1: 38170c08cb458fd4879c34b6f608294c50312bbb
>       SHA1-pieces:
>        e5fa44f2b31c1fb553b6021e7360d07d5d91ff5e 1048576
>        7448d8798a4380162d4b56f9b452e2f6f9e24e7a 1048576
>        a3db5c13ff90a36963278c6a39e4ee3c22e2a436 1048576
>        9c6b057a2b9d96a4067a749ee3b3b0158d390cf1 1048576
>        5d9474c0309b7ca09a182d888f73b37a8fe1362c 1048576
>        ccf271b7830882da1791852baeca1737fcbe4b90 98993
>
>       Package: bar
>       Size 72856
>       SHA1: 9425fa8de16f6283365f6bee87f405da16a203e6
>
> then you have 7 pieces all up, five of size 1048576, one of size 98993
> and one of size 72856, and you can number them in order, ie:
>
>       0 -> foo[0]
>       1 -> foo[1]
>       2 -> foo[2]
>       3 -> foo[3]
>       4 -> foo[4]
>       5 -> foo[5]
>       6 -> bar[0]
>
> You're depending on your Packages file being in the same order on
> different hosts, but that's more or less ok anyway. The major thing
> that changes in that scenario is that _all_ the pieces can be "short",
> rather than just the last.

I think a same-ordered Packages file is something we definitely can't
depend on. If I download the Packages file today and download bar, then
bar updates overnight, and tomorrow you download the new Packages file
and try to download the new bar from me because I tell you I have piece
6, there's going to be problems. Things get even worse if a package
grows in size and requires a new piece, then all packages after it in
the Packages file are offset by one.

I think the only way to number the pieces might be to use a number once
only. So in a Packages file with k pieces, if bar gets updated over
night, the new bar piece is now numbered k+1 instead of 6. The Packages
file could then be like:

>       Package: foo
>       Size: 5341873
>       SHA1: 38170c08cb458fd4879c34b6f608294c50312bbb
>       SHA1-pieces:
>        0 e5fa44f2b31c1fb553b6021e7360d07d5d91ff5e 1048576
>        1 7448d8798a4380162d4b56f9b452e2f6f9e24e7a 1048576
>        2 a3db5c13ff90a36963278c6a39e4ee3c22e2a436 1048576
>        3 9c6b057a2b9d96a4067a749ee3b3b0158d390cf1 1048576
>        4 5d9474c0309b7ca09a182d888f73b37a8fe1362c 1048576
>        5 ccf271b7830882da1791852baeca1737fcbe4b90 98993
>
>       Package: bar
>       Size 72856
>       SHA1-piece-number: 18001
>       SHA1: 9425fa8de16f6283365f6bee87f405da16a203e6

where the new numbers in the SHA1-pieces indicate the piece number, and
similarly for SHA1-piece-number. Now the piece number uniquely
identifies the package and version.

Since these numbers are kind of arbitrary, I was suggesting we just do
away with them and instead use the SHA1 as the piece number. Then
instead of saying "I have piece 18001" you would say "I have piece
9425fa8de16f6283365f6bee87f405da16a203e6". The only reason to have piece
numbers that I can see is for the BITFIELD communication (more below on
that).

BITFIELD is useful though, so maybe numbering should be kept. The key is
whether it can be unique and not grow to too large a size. Some more
statistics are needed here, ;) such as how many new piece numbers would
we need per day. There would probably have to be a point too at which
numbers could be reused, say if they haven't been used to indicate a
piece in more than a year. I'm not sure if the Bitfield is worth this,
but it's something to consider.

>> "Cons"
>> "...difficult to find rare pieces"
>
> A simpler approach might be to communicate "I'm planning on downloading
> the entire torrent" or "I have downloaded the entire torrent", and
> prioritise those peers. We have a bunch of well-connected mirrors around
> already and I wouldn't expect that to change, so there's no reason not
> to make use of it. And we have lots of people who have a full mirror
> for their architecture(s) too who would participate in a p2p scheme,
> so if you had a bit to flag those hosts, you'd probably be pretty okay.

Good idea. We'd have to be careful though not to create a system where
the mirrors get overworked by this, but that should be easily possible.

> Another approach is to have the existing mirror network act as a
> backchannel, so that if you can't download foo.deb from any peers in
> reasonable time, you grab it from a regular http mirror instead.

I like this solution! We're not designing a replacement for http
distribution, so it will always be available too. Maybe some statistics
on whether the largest packages in the system are the rarest too, as we
may just end up falling back on http too often. Still, if they're rare,
it can't be too much traffic.

>> "how do peers communicate BITFIELD information of all the pieces they
>>  have when the pieces are no longer numbered"
>
> Probably by sending a lot of "HAVE ..." notices?

Yeah, like a few thousand, every time you connect to a new peer? That
would probably mean a lot of wasted bandwidth and connections. If we use
the unique piece numbering I mentioned above, it could also grow to the
point where we have bitfields that are too long. BitTorrent currently
sends bitfields as "001010..." indicating that it has pieces 2 and 4,
but needs 0,1,3,5. So, our bitfields could grow to the point of being a
100 KB in length or much more. It should be pretty easy to convert this
bit string into a more efficient Hex or even binary representation to
save on communication costs though.

Just so you know, I recently received word of a very similar product to
my Enigmail proposal to Mozilla that was just released (just last month,
in fact, which is bad timing for me :( ) that will probably affect my
chances of acceptance by them. Therefore, I'm even more committed to
(and more dependent on) this project, and (don't tell Mozilla yet, but)
I would choose this project if both were accepted and I had to choose
between them. Just something to think about.

Cameron
-------------- next part --------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with SUSE - http://enigmail.mozdev.org

iD8DBQFGFWjiDx924g0gNq0RAvL3AJsGINzWZ6t78s5vuM06UpawfEiSJQCfbUu6
U3fWfqBdqDNko7ilA1VJN5M=
=7sWH
-----END PGP SIGNATURE-----


More information about the Debtorrent-devel mailing list