[Debtorrent-devel] Fwd: BitTorrent Protocol Expansion (Google SoC)

Cameron Dale camrdale at gmail.com
Fri Apr 13 01:52:47 UTC 2007


---------- Forwarded message ----------
From: Cameron Dale <camrdale at gmail.com>
Date: Apr 10, 2007 12:16 AM
Subject: Re: BitTorrent Protocol Expansion (Google SoC)
To: Anthony Towns <aj at azure.humbug.org.au>


On 4/9/07, Anthony Towns <aj at azure.humbug.org.au> wrote:
> On Mon, Apr 09, 2007 at 02:43:16PM -0700, Cameron Dale wrote:
> > Since the first step is variable size pieces, I'm wondering, are we sure this is
> > the way to go? What about my alternate suggestion on the wiki for a smart
> > ordering of pieces that would minimize the wasted bandwidth? Maybe we should
> > generate some statistics on this to check?
>
> lenny's active, so monitoring sid and lenny for a week should be a good
> start. The size of all debs in a Packages file (arch:i386 and arch:all)
> is 14,337,091,622B by my count, so around 14GB. That should give you a start,
> afaics.

I did some preliminary analysis of the efficiency of this strategy.
Using my own install (an amd64 desktop) as a test case, I examined
different ordering strategies for pieces in the main-amd64 Packages
file (which accounts for 1486 of the 1551 packages I have installed).
The total size for this configuration is 1.1489 GB. I tried piece
sizes from 32 KB to 512 KB, which give the following numbers of
pieces for the entire torrent:

 Piece Size | Total Pieces
------------+--------------
   32 KB    |    466630
   64 KB    |    233315
  128 KB    |    116658
  256 KB    |     58329
  512 KB    |     29165
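
For reference, these counts come from a calculation along these lines
(a rough sketch, not the actual script; it assumes one big torrent
where pieces run straight across package boundaries):

    def total_pieces(package_sizes, piece_size):
        # One contiguous torrent: sum all package sizes, then round
        # up to whole pieces.
        total_bytes = sum(package_sizes)
        return (total_bytes + piece_size - 1) // piece_size

    # package_sizes would be the Size: fields parsed out of the
    # Packages file; a toy example:
    package_sizes = [34512, 1048576, 123456789]
    for piece_kib in (32, 64, 128, 256, 512):
        print(piece_kib, 'KB:', total_pieces(package_sizes, piece_kib * 1024))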

I would say that over 100,000 pieces is getting quite large, so we
would probably want to go with 256 KB or 512 KB pieces. The attached
graph shows the results for 5 different orderings. The best is
sorting by Priority, then Section, then Source package (I guess
because I install multiple packages from the same source package
quite a bit, which is interesting in itself). Even the best strategy
shows a wastage of at least 10% at 256 KB. Not too good, in my
opinion, but what do you think?
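
To be clear about what the graph measures, the wastage calculation is
roughly the following (a simplified sketch, not the exact script; it
treats every piece, including the torrent's last one, as full size):

    def wasted_fraction(ordered_packages, wanted, piece_size):
        # ordered_packages: list of (name, size) in torrent order.
        # wanted: set of names of the packages the user installs.
        offset = 0
        needed_pieces = set()
        wanted_bytes = 0
        for name, size in ordered_packages:
            if name in wanted:
                wanted_bytes += size
                first = offset // piece_size
                last = (offset + size - 1) // piece_size
                needed_pieces.update(range(first, last + 1))
            offset += size
        # Every piece touched by a wanted package must be fetched
        # whole; whatever else is in those pieces is wasted download.
        downloaded = len(needed_pieces) * piece_size
        return (downloaded - wanted_bytes) / wanted_bytes

Each ordering is then just a different sort key on the package list,
e.g. (priority, section, source) for the best one above.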

I haven't tried sorting by popularity (hard) or by dependencies
(harder) yet, so they might be better. If you have any other ideas
for statistics you'd like to see related to this, let me know.

> The unavoidable drawback, afaics, is that piece boundaries will almost
> never start on a file boundary, so there's no way to reuse the sha1 info
> from the Packages file as the "piece" checksum info. Presumably we're
> talking 1MB piece sizes, so that's an extra 14k sha1sums that need
> to be calculated and distributed every pulse, by sha1'ing the entire
> archive and adding a new 14k*20B so an extra 280kB per suite. But it's
> the regenerating sha1s for something like 100GB of the archive twice
> (or more) a day that sounds impossible to me.

Maybe; I'm not sure what kinds of machines or workloads we're talking
about. Does this make the implementation unworkable? Should I bother
checking other ordering strategies?

Have you thought about how this SHA1'ing will be done for the
variable piece sizes we've been talking about? For small packages the
SHA1 is already in the Packages file, but what about generating
multiple piece SHA1s for larger ones?
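
The obvious approach (just a sketch of what I have in mind, nothing
settled) would be to hash each piece-sized chunk of the .deb
separately, and keep the existing Packages file SHA1 for anything
that fits in a single piece:

    import hashlib

    def piece_sha1s(deb_path, piece_size):
        # One SHA1 per piece_size chunk of the .deb, in order.
        hashes = []
        with open(deb_path, 'rb') as f:
            while True:
                chunk = f.read(piece_size)
                if not chunk:
                    break
                hashes.append(hashlib.sha1(chunk).hexdigest())
        return hashes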

I guess the easiest would be to pick one piece size and stick with it
for everything, rather than using a bigger piece size for large
packages and a smaller one for small packages. That gives the
following for the 19091 packages in the same main-amd64 Packages file:

 Piece  |  Total # |         Number (%) of Packages that are
  Size  | of Pieces|     1 piece    |  2-8 pieces  | > 8 pieces
------------------------------------------------------------------
 32 KB  |  476464  |   5388 (28.2%) | 7789 (40.8%) | 5914 (31.0%)
 64 KB  |  243926  |   8098 (42.4%) | 7008 (36.7%) | 3985 (20.9%)
128 KB  |  128265  |  10814 (56.6%) | 5906 (30.9%) | 2371 (12.4%)
256 KB  |   71209  |  13177 (69.0%) | 4586 (24.0%) | 1328 ( 7.0%)
512 KB  |   43331  |  15106 (79.1%) | 3321 (17.4%) |  664 ( 3.5%)
  1 MB  |   29920  |  16720 (87.6%) | 2053 (10.7%) |  318 ( 1.7%)
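
(This breakdown was computed roughly like this; the sketch assumes
each package starts on a piece boundary, so its last piece can be
short, which is why the totals are a bit higher than the
contiguous-torrent counts earlier:)

    def piece_histogram(package_sizes, piece_size):
        total, one, few, many = 0, 0, 0, 0
        for size in package_sizes:
            # Pieces per package, rounded up; empty packages count as 1.
            n = max(1, (size + piece_size - 1) // piece_size)
            total += n
            if n == 1:
                one += 1
            elif n <= 8:
                few += 1
            else:
                many += 1
        return total, one, few, many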

This again suggests 256 KB or 512 KB is the way to go, as anything
smaller creates too many pieces, and anything larger means almost all
packages are a single piece. Again, are there any other statistics
you'd like to see here? I had a hard time figuring out which to
present, as the mean/median number of pieces per package aren't
really informative here.

Cameron
-------------- next part --------------
A non-text attachment was scrubbed...
Name: wasted_download.png
Type: image/png
Size: 7292 bytes
URL: http://lists.alioth.debian.org/pipermail/debtorrent-devel/attachments/20070412/16a8c1e9/wasted_download-0001.png

