[Debtorrent-devel] Fwd: BitTorrent Protocol Expansion (Google SoC)

Cameron Dale camrdale at gmail.com
Fri Apr 13 01:51:04 UTC 2007


---------- Forwarded message ----------
From: Cameron Dale <camrdale at gmail.com>
Date: Apr 8, 2007 11:08 AM
Subject: Re: BitTorrent Protocol Expansion (Google SoC)
To: Anthony Towns <aj at azure.humbug.org.au>


Anthony Towns wrote:
> I think we've got a bit of a disconnect here -- "all downloaders of
> ___ need to be aware of all other downloaders" doesn't sound like what
> happens in bittorrent at all to me -- you just get a random selection
> of other downloaders and go from there. Feel free to go into academic
> p2p lecture mode on that for a paragraph if you like :)

Sorry, that's some bad wording on my part. Basically I'm just saying
that all downloaders of a package should be in the same swarm as all
other downloaders of the package, so communication doesn't always
happen, but is always possible.

>> Some numbers on this would be nice, but
>> I have no idea where we could get them from.
>
> IP analysis of ftp.debian.org logs could work.

Sounds good.

> I guess my theory is that if you're going to have a daemon running
> constantly sharing files with other people on the net, you're going to
> be fairly up to date anyway -- and if you're only up once a month or
> whatever, you might as well be sharing the current files then anyway.

That's true. We could also build in some automatic updating
functionality for Packages files to help ensure this (something for the
future though, not needed now). The old versions of packages you have
downloaded wouldn't be useful anymore, since if everyone is up to date
then no one will request them. That's probably not a bad thing.

> At any rate, sounds like we've got two plausible implementations, that
> aren't really all that different, so seems worth analysing tradeoffs, no?
>
> My main concerns are implementational:
>
>       - changing piece sizes from "constant + one small piece at the end"
>         to variable is a major change
>
>       - sharing files between torrents seems a worry, but a necessary
>         one since we'll want to not double the space people need
>         to watch testing and unstable.

Something like the pooling done by the archive would seem to make sense.
We just need to be careful not to run into problems where two torrents
are trying to update the same pool file. There might also be some way to
run testing and unstable as a single torrent, using the unique piece
numbers. Just a thought.
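To illustrate the pool-contention concern: a minimal sketch (purely hypothetical, not DebTorrent code) of guarding a shared pool file with an advisory lock so that two torrent sessions, say testing and unstable, can't interleave writes to the same file:

```python
# Hypothetical sketch: serialize writes to a shared pool file so two
# torrents updating the same file don't corrupt it. Paths and the
# function name are illustrative assumptions.
import fcntl
import os

def write_piece(pool_path, offset, data):
    """Write a downloaded piece into a shared pool file under an
    exclusive advisory lock (blocks until any other writer is done)."""
    fd = os.open(pool_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)   # exclusive lock across processes
        os.lseek(fd, offset, os.SEEK_SET)
        os.write(fd, data)
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```

Advisory locks only help if every writer uses them, so this would have to be the single choke point through which all torrents touch the pool.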

>       - extending torrents as time passes seems "new" and might be
>         difficult to implement; possibly that should be left 'til later

An excellent point. Simplify now, complicate later.

>       - there's a real issue if the torrents have the same "file"
>         with different contents (an old torrent had a file in the pool,
>         which was deleted, then later recreated with new contents
>         and included in a new torrent, eg. *should* never happen,
>         but not 100% assured)

I'm having trouble visualizing this happening. Could you give a more
concrete example? I would think that the new torrent would either be a
replacement for the old one, so you would never run both at once, or it
would not exist at the same time as the old one, so when you update to
get the new one you would drop the old one.

> I think the goal has to be getting something that works (and can
> reasonably be made more efficient in future) than getting something that's
> as good as possible first. But you know a lot more about bittornado than
> I do.

Agreed. Getting a working version should be priority one. New additions
to improve it can be made afterwards, as long as we don't shoot
ourselves in the foot by making a choice now that prevents future
improvements. ;)

> Well, we can always make the initial requirement/expectation be that
> everyone mirrors the entire torrent as far as possible. Even if not
> everyone does, it'll increase the odds enough to be sustainable, and
> for a first version, I think that's fairly reasonable anyway.

This seems like an over-simplification that might not be necessary. The
wasted download bandwidth could be huge, and it basically removes the
requirement of any communication with apt about which files to download
(since all of them are). I think we can say that some communication from
apt about which packages to download is necessary, and the ability to
selectively download files from a torrent is already built in to most
clients.

I think a better initial requirement/expectation is that
torrents/Packages are kept up to date. This could later be maintained
using some kind of automatic updating, but is not necessary at the
beginning.

>>> Treating the path in the pool
>>> (pool/main/g/gamin/libgamin0_0.1.7-4_powerpc.deb)
>>> as unique-per-file should be fine for that in almost all cases, fwiw.
>> If you mean communicating the path as the unique piece identifier,
>> then this is the same as using the SHA1 hash of the piece as the piece
>> number, instead of using some kind of sequential piece numbering.
>
> Oh, that's not actually sufficient btw -- we can easily have two files
> in the pool with the same SHA1. This'll particularly happen if two source
> packages use the same upstream (eg, contrib/f/foo/foo.orig.tar.gz becomes
> main/f/foo/foo.orig.tar.gz when one of its dependencies becomes free,
> or a source package gets renamed within a component without its upstream
> changing). You can handle that adequately on the client side of course,
> without worrying about it in the protocol.

So now we can't use the SHA1 hash or the path to the file as unique
identifiers. How about a combination of the two? It seems that should be
unique.
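Something along these lines, assuming the identifier scheme sketched here rather than any agreed protocol format: hash the pool path together with the file contents, so two pool files with identical contents (the renamed-upstream case above) still get distinct piece identifiers, and a reused path with new contents does too.

```python
# Hypothetical sketch: a piece identifier combining pool path and
# contents, since neither the SHA1 nor the path is unique on its own.
import hashlib

def piece_id(pool_path, contents):
    """Return a hex identifier: SHA1 over path + separator + contents."""
    h = hashlib.sha1()
    h.update(pool_path.encode("utf-8"))
    h.update(b"\0")          # separator so path/contents boundaries can't collide
    h.update(contents)
    return h.hexdigest()
```

The same upstream tarball under contrib/ and main/ then yields two different identifiers, at the cost of having to know the contents (or at least their hash) before the piece can be named.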

>>> sid and experimental don't have a defined endpoint; I'm not sure what
>>> you'd want to do about them. I'm not sure what (if anything) you'd do
>>> when a new suite (like lenny) gets introduced either.
>> I'm not sure what to do with sid and experimental either.
>
> experimental's small enough you can just ignore it.

That still leaves sid, though, which is troublesome. Perhaps a new
torrent could be created every time there is a release, even though
nothing in the archive changes at that point. Or there might be some way
to include testing and unstable in the same torrent, which would again
lead to a new torrent after a release. Since we're not focusing on this
issue now, it's only something to think about for the future.

>> Any idea you can give me
>> on this project's chances?
>
> So far it seems fine, no negative comments, a variety of support;
> currently #4, but I think we'll have to drop one of #2/#3. None of that's
> really meaningful until we do a final review though.

Sounds good so far. Fingers crossed. :)

>> I'm having some trouble judging the tone of your emails sometimes. We
>> seem to be going back and forth a lot on the same issue, and I'm not
>> sure if it's because you really dislike my proposal, don't understand
>> it, or are just trying to generate discussion (maybe as a test?).
>
> (a) I think we're covering a bunch of the issues that'll end up being
>     important, including how we "name" pieces, and what we expect peers
>     to actually be doing
>
> (b) Dealing with the ways the archive changes is important and difficult,
>     so seems worth discussing up front
>
> (c) I like discussing the concepts heavily up front prior to implementation,
>     consider it a character flaw and don't think it's a reason to stop from
>     diving in to implementation, particularly if it can be changed later :)
>
> (d) It's not my pet implementation, of course I dislike it :)
>
> I'm presuming that since it's _your_ pet implementation, you're more
> than happy to keep defending it :)

Of course. :) Thanks for the explanation.

> ] create a torrent for every combination of suite
> ] (stable/testing/unstable) and architecture, including separate ones for
> ] architecture:all and source
>
> If that's going to happen, it seems to me like the way to do it is to
> add a feel to the top level Release file (dists/testing/Release etc)
> like "Torrent-Prefix: xyzzy" and have the torrent be identified using
> that string, the component, (main, contrib, etc), and the architecture;
> all sha1'ed or whatever as appropriate. That makes it fairly easy to
> choose when to reset the torrent, and also lets you share a torrent if
> you like (ie, testing and unstable could both use the same prefix).

Sounds like a good plan.
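A minimal sketch of the naming scheme described above, with the caveat that the "Torrent-Prefix" field and the exact hashing are assumptions, not an agreed format:

```python
# Hypothetical sketch: derive a torrent identifier from a
# "Torrent-Prefix" field in the Release file plus component and
# architecture. The key layout is an assumption for illustration.
import hashlib

def torrent_identifier(prefix, component, architecture):
    """Hash (prefix, component, arch) into a stable torrent id."""
    key = "%s %s %s" % (prefix, component, architecture)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()
```

Resetting a torrent is then just a matter of changing the prefix in the Release file, and testing and unstable share a torrent whenever they carry the same prefix.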

> You also need /some/ way to identify pieces, which is presumably going to
> be a long string (SHA1 of contents, name from pool etc) or an "arbitrary"
> piece number that's going to have to be kept somewhere and distributed
> as part of the Packages file. The latter is something that would have to
> be stored in dak (the archive management scripts/database) and added to
> the Packages files through apt-ftparchive somehow. I'm not sure that'll
> be easy, so I'd be really cautious about letting it be a showstopper
> for GSoC.

A lot of the changes we've talked about (including the one you mentioned
in the previous paragraph) require some kind of modification to the
archive software, and I haven't yet considered how easy, fast, or even
possible these changes will be. For now, we can use the current Packages files
as single-piece-per-package torrent files (though some pieces would be
very large), but eventually some functionality would need to be added to
dak/apt-ftparchive to implement more interesting/efficient features. Who
would we need to talk to about this, and how responsive will they be to
supporting something that will be alpha/beta for a long time?
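To make the single-piece-per-package idea concrete: each stanza in a Packages file already carries Filename, Size, and SHA1 fields, which is exactly a variable-sized piece list. A rough sketch (simplified field handling, not DebTorrent code):

```python
# Rough sketch: treat a Packages file as a torrent with one piece per
# package, reading each stanza's Filename/Size/SHA1 as a piece entry.
def packages_to_pieces(packages_text):
    """Parse Packages-style stanzas into (filename, size, sha1) tuples."""
    pieces = []
    for stanza in packages_text.split("\n\n"):
        fields = {}
        for line in stanza.splitlines():
            # continuation lines (e.g. Description) start with a space
            if ":" in line and not line.startswith(" "):
                key, _, value = line.partition(":")
                fields[key] = value.strip()
        if "Filename" in fields:
            pieces.append((fields["Filename"],
                           int(fields["Size"]),
                           fields.get("SHA1", "")))
    return pieces
```

No archive-side change is needed for this step, which is what makes it a plausible first milestone; the cost is that a piece can be as large as the biggest .deb in the index.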

> Another thought: having a deliberate beta with one torrent per Packages
> file with the explicit assumption that it'll be a lot less than optimal
> would let us get some real measurements of what people actually do,
> just by monitoring the tracker.

This sounds like it might be a good goal for the mid-term (July 9)
evaluation point. A working program that does the following:

 * uses current Packages files as torrent files

 * implements variable size pieces for all packages

 * runs as a daemon, started from /etc/init.d

 * receives input from apt about which packages/pieces to download

 * will share all downloaded pieces with other interested clients

What do you think? From there we can decide which additional
functionality to implement and get something added to dak/apt-ftparchive
to support it.

Cameron