[Debtorrent-devel] BitTorrent peer measurement, & debtorrent

Cameron Dale camrdale at gmail.com
Thu May 31 03:34:41 UTC 2007


(CCing the debtorrent-devel list)

On 5/29/07, John Gilmore <gnu at toad.com> wrote:
> I stumbled onto DebTorrent tonight (I think via the Debian blog).
> You're trying to do something almost like what I hired Tom Jennings to
> do a few years ago (he failed).  In his case I was trying to get Red
> Hat's up2date to use BitTorrent.

I (Google) was unable to find anything relating to this. Is there any
available information left somewhere? I'm always looking for ideas.

> I was one of Bram Cohen's sponsors while he was writing BitTorrent, so
> I've been watching the program and the protocol for a long time.

[snip]

> A fundamental problem in the design, as I
> understand it after reading the mailing list and the wiki, is that you
> haven't specified WHAT torrent(s) Debian is going to create.  It's all
> handwaving at this point.

The torrents in this case don't really exist (though it would be
possible to create them, just not necessary). We are currently using
the Packages files distributed by a Debian archive to get the
information normally obtained from a torrent file. So, you can think
of the Packages files as the torrent files. I think you knew that
though, I'm just stating it for clarity.

> E.g. a Torrent contains an MD5 hash (for the entire thing).  This hash
> is what identifies the Torrent to the tracker.  Every time you
> update Packages, you're going to calculate an MD5 hash over the entire
> multi-GB collection?  And only people who have the same Packages file can
> share with each other, even though they all have 99% the same files?
> This doesn't seem workable, but you haven't specified any way around it.

Actually, the torrent hash is only over the "info" dictionary in the
torrent file, which contains hashes of all the files in the torrent.
So, in one sense it is a hash over the entire multi-GB collection, but
to recalculate it you only need to hash the files that have changed
(all the unchanged file hashes are cached), and then hash the hashes.

The sharing with others who have mostly the same files is a definite
concern we've identified, since the testing/unstable Packages files
are currently updated twice a day (and in the future could be more
frequently). The current implementation ignores this, but we have some
ideas for the future to improve this. One is to have downloaders run
the n (maybe 10) most recent Packages/torrents, though this does not
solve the problem completely, and wastes resources on duplication.

The solution that I like best is to use unique piece numbers that
don't get reused as the torrent evolves. As updated/new packages get
added, they are given new piece numbers and added to the end of the
torrent. Peers all participate in the same torrent, but have different
ideas of how big the torrent is. When peers exchange piece
information, they disregard piece numbers they aren't aware of.
Implementing this is non-trivial however. For example, one issue is
that the piece numbers may get too large over time, which may
necessitate resetting (creating a new) torrent periodically.

> I do like the idea of the Torrent being a large collection of .deb's.
> Individual debs are too small for the BT protocol to work with.
>
> It seems to me that the hardest part of this project is going to be
> how you allocate .deb's into .torrent's.  You want a method that doesn't
> change the .torrent too often (so lots of people can share the stuff it
> points to), but that still gets updates fairly frequently.  You could
> try to extend BitTorrent to handle torrents of big files that get appended
> to (but never overwritten), by always adding new packages onto the end of
> your torrent image, perhaps -- but that may be hard.  Sticking at first
> with torrents that never change (once created) will simplify things.

I think you're suggesting the unique piece numbers I mentioned above,
in which case it is definitely hard, and we will be keeping things
simple for now with plans to complicate it later. ;)

> I generally download both the source and the binary CD/DVD images of
> Linux distibutions I plan to use, via BitTorrent.  Sometimes,
> particularly in beta periods, they are not served up as torrents, and
> I use "jigdo".  Jigdo, the Jigsaw Downloader, pulls down a "template"
> file that contains the "overhead" part of the .iso file.  It then
> starts downloading Deb packages from the mirrors, and stuffing them
> into the proper byte offsets into the .iso file.  Ultimately you get
> the entire .iso image that way.  This is fast because a lot more
> mirrors serve up the .deb's than those that serve up the .iso's.  See:
>
>   http://www.debian.org/CD/jigdo-cd/
>   http://www.einval.com/~steve/software/JTE/
>
> DebTorrent could do Jigdo in reverse: use BitTorrent to pull
> individual packages out of a Torrent containing a .iso image.  (The
> individual packages would not be adjecent in the image, since they'd
> be padded to iso9660 sector or filesystem block boundaries; but jigdo
> shows that the overhead between the packages is pretty minimal.)
>
> The variable-sized-piece extension to BitTorrent would not be needed in
> this case.  Knowing where in the .iso image your .debs of interest exist,
> you could just pull down all pieces that overlap with your .debs.

This is similar to an idea I had before, which was that instead of
creating variable-sized pieces you would do a regular BitTorrent
download of a specially ordered torrent. (I hadn't thought of
combining with the ISO image torrents though, which is a good idea.)
The problem is that a lot of the packages are smaller than any
normally used piece sizes, which results in a large amount of wasted
download bandwidth. I did some statistics (see the wiki page) and
found that even with the best possible ordering of packages, the
wasted download bandwidth is more than 10% for a typical server
configuration (and with a desktop configuration and a poor ordering of
packages, this could be as high as 50%). This seems to me to be
unworkable.

> (If you do make the variable-sized-piece extension, I *strongly* suggest
> implementing it cleanly in .torrent files, and submitting the
> extension to Bram for standardization.  For example, if the "piece
> length" field in the .torrent file is a list, then it must contain as
> many elements as the "pieces" field, and each one is the length of that
> piece.)

More on this below.

> Another way to manage the creation of .torrent files is to create them
> with different longevities.  For example, when a release CD is cut, make
> a long-lived torrent for the stuff in it.  Every day, make a torrent for
> the files that were modified that day.  Every week, make a torrent for the
> files that were modified between the last CD and now.  This means you
> could get most any reasonably current .deb by looking in about ten torrents.
> The server would stop advertising last week's torrents when it created this
> week's, though the tracker would continue tracking them.

This solves some problems relatively simply, and sounds better than
just tracking the n most recent ones, but still has some
disadvantages. Running multiple torrents means additional overhead in
communication and resources, and invalidates the tit-for-tat in
BitTorrent. For example, if A and B are both in several of the same
torrent swarms (which they probably will be), they may have multiple
connections to each other, each one taking up resources and requiring
communication overhead. Also, if A is uploading to B in one torrent, B
will not be encouraged (by the tit-for-tat choking mechanism) to
upload to A in another torrent. This method also suffers if the update
rate for the archive increases. For example, (I'm told) Ubuntu updates
their archive every 30 minutes, which would require running many
additional torrents.

> By the way, if you figure out a way to extend (lengthen) the file that
> a .torrent describes, you can define a BitTorrent protocol extension
> that asks, "Please send me a copy of your .torrent file", to get the
> later .torrent metafile from a peer.  I have wanted such a protocol
> extension before, when a tracker would tell me the interesting hashes
> and filenames it was serving up, but I couldn't find the .torrent
> files themselves.

This would be interesting, though as you mentioned in your other email
there is a trust issue here. One way to get around the trust issue is
to sign the torrent files (with GPG) on the server before distributing
them, so that all torrents can be verified by the clients. In fact,
all Packages files are hashed in signed Release files, so there may
even be a way to use that to verify them.

> There is no need to send a long bitmap if it's sparse.  Send "Have none"
> and then a bunch of "Have piece X" messages instead, if that's shorter.
> But that's a minor optimization; save it for last.
>
> If the user has a .iso image on their hard drive, this daemon CAN and
> SHOULD be offering up access to it -- both the entire image, and the
> .deb files within the image.  As soon as I install a Linux system
> these days, I always read in the .iso onto the hard drive.  Hard drive
> space is cheap, and when I later want to install some package, I have
> my apt-sources set up to get it from the .iso if it hasn't been
> updated.  And it makes it easy to burn another DVD for a friend if I
> ever want to.  And the image is easier to find on disk, than in my old
> random DVD pile.

This is a good idea, but will probably also get added at the end for simplicity.

> The people who designed GitTorrent didn't understand some parts of the
> BitTorrent protocol.  They (and you!) should be using the standard
> BitTorrent protocol, with an extension bit set in the handshake.  Why
> make a totally incompatible protocol when your changes are really
> pretty minor?  This means: use the same version string (0x19
> "BitTorrent protocol").  Get an extension bit assigned by Bram for
> your extensions; that way, you don't need to put a "version number"
> into the protocol string.  This is why the extension bitmap is there.
> Don't re-use any of the message encodings (or directory keys) from the
> existing protocol -- there are lots of byte values left to pick among.
> That way, the dozen+ implementations of BitTorrent will be able to
> easily add your extensions if they choose to.  Rather than, ahem,
> having to fork the code base, as you have done with BitTornado.

This is really interesting. I actually hadn't considered this as the
changes we were planning to make are quite large (variable sized
pieces, not using torrent files, adding pieces to a running torrent,
tracker modifications, etc...). Certainly not reusing message
encodings and keys is easy, and it should be relatively simple to
follow the standard on most things. I will definitely try to do this.

Thanks for the ideas,
Cameron



More information about the Debtorrent-devel mailing list