[Debtorrent-devel] Fwd: BitTorrent Protocol Expansion (Google SoC)

Cameron Dale camrdale at gmail.com
Fri Apr 13 01:53:08 UTC 2007


---------- Forwarded message ----------
From: Anthony Towns <aj at azure.humbug.org.au>
Date: Apr 10, 2007 2:06 AM
Subject: Re: BitTorrent Protocol Expansion (Google SoC)
To: Cameron Dale <camrdale at gmail.com>


On Tue, Apr 10, 2007 at 12:16:26AM -0700, Cameron Dale wrote:
> I did some preliminary analysis on the efficiency of this strategy.
> Using my install as a test (amd64 desktop), I examined different
> ordering strategies for pieces in the main-amd64 Packages file (which
> accounts for 1486 out of the 1551 packages I have installed). The
> total size for this configuration is 1.1489 GB. I tried out different
> piece sizes from 32 KB to 512 KB, which gives the following number of
> pieces for the entire torrent:
>
> Piece Size 32 KB   466630 pieces
> Piece Size 64 KB   233315 pieces
> Piece Size 128 KB   116658 pieces
> Piece Size 256 KB   58329 pieces
> Piece Size 512 KB   29165 pieces
>
> I would say that over 100,000 pieces is getting quite large, and we
> would probably want to go with 256 KB or 512 KB piece sizes. The
> attached graph shows the results for 5 different orderings.

60k pieces at 20B per sha1 is an extra 1.2MB download, which bothers me
a bit; halving that to 600kB with 512kB pieces makes me more comfortable.
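(A quick sketch of that arithmetic, using the piece counts from the table
above; purely illustrative, not from the actual analysis script.)

SHA1_LEN = 20  # bytes of metadata per piece that every client must fetch

piece_counts = [("32 KB", 466630), ("64 KB", 233315), ("128 KB", 116658),
                ("256 KB", 58329), ("512 KB", 29165)]

for size, pieces in piece_counts:
    overhead_kb = pieces * SHA1_LEN / 1024.0
    print("%6s pieces: %6d  sha1 overhead: %7.1f KB" % (size, pieces, overhead_kb))

# 256 KB -> ~1139 KB (the ~1.2MB figure above); 512 KB -> ~570 KB (~600kB).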

> The best
> is sorting by Priority, then Section, then by Source package (I guess
> because I install multiple packages from the same source package quite
> a bit, which is interesting in itself). The results show a wastage of
> at least 10% for 256 KB, even for the best strategy. Not too good, in
> my opinion, but what do you think?

Yeah, I agree. Surprised at how much wastage there is even with 512kB.
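(For reference, a hypothetical sketch of how that wastage could be measured
for a given ordering: lay the packages out in order, then count the unwanted
bytes inside any piece that contains at least some wanted data. The function
and tuple layout are illustrative, not taken from the actual analysis.)

def wastage(packages, piece_size):
    """packages: ordered list of (name, size_bytes, installed) tuples."""
    offset = 0
    wanted_in_piece = {}  # piece index -> wanted bytes landing in that piece
    for name, size, installed in packages:
        if installed:
            start, end = offset, offset + size
            first, last = start // piece_size, (end - 1) // piece_size
            for p in range(first, last + 1):
                p_start, p_end = p * piece_size, (p + 1) * piece_size
                overlap = min(end, p_end) - max(start, p_start)
                wanted_in_piece[p] = wanted_in_piece.get(p, 0) + overlap
        offset += size
    total = offset
    wasted = 0
    for p, wanted in wanted_in_piece.items():
        piece_len = min((p + 1) * piece_size, total) - p * piece_size
        wasted += piece_len - wanted  # unwanted bytes in a piece we still fetch
    return wasted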

> I haven't tried the sorting by popularity (hard), or by dependencies
> (harder) yet, so they might be better. If you have any other ideas for
> statistics you'd like to see related to this, let me know.

Can't you just grab the data from http://popcon.debian.org/by_vote for
sorting by popularity?

Actually doing that might start getting too complicated on the server
side, though.
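(If it were done client-side instead, a minimal sketch of the idea, working
from a saved copy of the by_vote listing; it assumes the first two
whitespace-separated columns are rank and package name, which may need
adjusting to the real file format.)

def popcon_ranks(path):
    """Parse a saved popcon by_vote listing into {package: rank}."""
    ranks = {}
    for line in open(path):
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[0].isdigit():
            ranks[fields[1]] = int(fields[0])
    return ranks

# e.g. packages.sort(key=lambda pkg: ranks.get(pkg, 999999))  # unknowns last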

> >The unavoidable drawback, afaics, is that piece boundaries will almost
> >never start on a file boundary, so there's no way to reuse the sha1 info
> >from the Packages file as the "piece" checksum info. Presumably we're
> >talking 1MB piece sizes, so that's an extra 14k sha1sums that need
> >to be calculated and distributed every pulse, by sha1'ing the entire
> >archive and adding a new 14k*20B (an extra 280kB) per suite. But it's
> >the regenerating sha1s for something like 100GB of the archive twice
> >(or more) a day that sounds impossible to me.
> Maybe; I'm not sure what kind of machines/workloads we're talking
> about. Does this make the implementation unworkable? Should I bother
> trying to check other ordering strategies?
>
> Have you thought about how this SHA1'ing will be done for the variable
> piece sizes we've been talking about?

apt-ftparchive keeps a cache of the sha1's of packages, and just dumps
them out of the cache when it's generating Packages files. I figure
we'd just add the piece sha1's to that cache and be happy. Since it's
the same sha1's across many Packages files, very little recalculation
has to be done.
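(A rough sketch of what the per-piece hashing and caching could look like,
assuming fixed-size pieces hashed per package file with a short last piece.
The cache keying below is hypothetical and not apt-ftparchive's actual cache
format.)

import hashlib
import os

PIECE_SIZE = 512 * 1024

def piece_sha1s(path, piece_size=PIECE_SIZE):
    """SHA1 of each fixed-size piece of one file."""
    hashes = []
    f = open(path, "rb")
    while True:
        chunk = f.read(piece_size)
        if not chunk:
            break
        hashes.append(hashlib.sha1(chunk).hexdigest())
    f.close()
    return hashes

_cache = {}  # (path, size, mtime) -> list of piece hashes

def cached_piece_sha1s(path):
    """Only rehash a package when its size or mtime changes."""
    st = os.stat(path)
    key = (path, st.st_size, st.st_mtime)
    if key not in _cache:
        _cache[key] = piece_sha1s(path)
    return _cache[key]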

> I guess the easiest would be to pick a piece size and stick with it
> for everything, rather than use a bigger piece size for large
> packages, and a smaller one for small packages.

Yeah. I can't see any reason to do anything else.

> This gives the
> following for the 19091 packages in the same main-amd64 Packages file:
>
> Piece  |  Total # |         Number (%) of Packages that are
>  Size  | of Pieces|     1 piece    |  2-8 pieces  | > 8 pieces
> ------------------------------------------------------------------
> 32 KB  |  476464  |   5388 (28.2%) | 7789 (40.8%) | 5914 (31.0%)
> 64 KB  |  243926  |   8098 (42.4%) | 7008 (36.7%) | 3985 (20.9%)
> 128 KB |  128265  |  10814 (56.6%) | 5906 (30.9%) | 2371 (12.4%)
> 256 KB |   71209  |  13177 (69.0%) | 4586 (24.0%) | 1328 ( 7.0%)
> 512 KB |   43331  |  15106 (79.1%) | 3321 (17.4%) |  664 ( 3.5%)
>  1 MB  |   29920  |  16720 (87.6%) | 2053 (10.7%) |  318 ( 1.7%)
>
> This is again looking like 256 KB or 512 KB are the way to go, as any
> smaller creates too many pieces, and larger means almost all the
> packages are a single piece.

Is that a bad thing? "I download dpkg, you download coreutils, then we
swap" doesn't sound any worse than downloading half of coreutils each,
then swapping.

I was figuring that "1GB pieces" (ie, every package is a single piece)
will be the first step anyway, just to get something working.
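(For anyone wanting to reproduce the table above, a hypothetical sketch that
buckets packages by piece count straight from the Size: fields of a Packages
file; the bucket boundaries match the columns used above.)

def piece_distribution(packages_path, piece_size):
    """Count total pieces and bucket packages by pieces needed."""
    buckets = {"1": 0, "2-8": 0, ">8": 0}
    total_pieces = 0
    for line in open(packages_path):
        if line.startswith("Size:"):
            size = int(line.split()[1])
            pieces = (size + piece_size - 1) // piece_size  # ceiling division
            total_pieces += pieces
            if pieces <= 1:
                buckets["1"] += 1
            elif pieces <= 8:
                buckets["2-8"] += 1
            else:
                buckets[">8"] += 1
    return total_pieces, buckets

# for kb in (32, 64, 128, 256, 512, 1024):
#     print(kb, piece_distribution("Packages", kb * 1024))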

> Again, any other statistics you'd like to
> see here? I had a hard time figuring out which to use, as the
> mean/median number of pieces per package are not really informative
> here.

Looks good; want to put what you've got up on the wiki?

Cheers,
aj

