[Debtorrent-devel] Fwd: BitTorrent Protocol Expansion (Google SoC)

Cameron Dale camrdale at gmail.com
Fri Apr 13 01:51:21 UTC 2007


---------- Forwarded message ----------
From: Anthony Towns <aj at azure.humbug.org.au>
Date: Apr 8, 2007 4:17 PM
Subject: Re: BitTorrent Protocol Expansion (Google SoC)
To: Cameron Dale <camrdale at gmail.com>


On Sun, Apr 08, 2007 at 11:08:22AM -0700, Cameron Dale wrote:
> Anthony Towns wrote:
> > I think we've got a bit of a disconnect here -- "all downloaders of
> > ___ need to be aware of all other downloaders" doesn't sound like what
> > happens in bittorrent at all to me -- you just get a random selection
> > of other downloaders and go from there. Feel free to go into academic
> > p2p lecture mode on that for a paragraph if you like :)
> Sorry, that's some bad wording on my part. Basically I'm just saying
> that all downloaders of a package should be in the same swarm as all
> other downloaders of the package, so communication doesn't always
> happen, but is always possible.

I'm not sure I agree, but from the sounds of the rest of the mail, might
be better to leave worrying about that 'til we can argue over beers or
something :)

> >     - sharing files between torrents seems a worry, but a necessary
> >       one since we'll want to not double the space people need
> >       to watch testing and unstable.
> Something like the pooling done by the archive would seem to make sense.
> We just need to be careful not to run into problems where 2 torrents
> are trying to update the same pool file.

I guess we could just do locking?
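
(Purely as an illustration, here's a minimal sketch of what that locking
might look like, using Python's advisory flock; the function and its name
are made up for the example, not existing code:)

    import fcntl, os

    def update_pool_file(path, new_data):
        """Rewrite a shared pool file while holding an advisory lock, so
        two torrent sessions updating the same file can't interleave."""
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX)   # blocks until any other writer finishes
            os.ftruncate(fd, 0)              # throw away the old contents
            os.write(fd, new_data)
        finally:
            os.close(fd)                     # closing the fd also drops the lock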

> >     - there's a real issue if the torrents have the same "file"
> >       with different contents (eg, an old torrent had a file in the
> >       pool, which was deleted, then later recreated with new contents
> >       and included in a new torrent; this *should* never happen,
> >       but it's not 100% assured)
> I'm having trouble visualizing this happening. Could you give a more
> concrete example? I would think that the new torrent would either be a
> replacement for the old one, so you would never run both at once,

One scenario is something like foo_1.0.orig.tar.gz gets uploaded for
foo 1.0, then obsoleted by foo_1.1.orig.tar.gz, but then an epoch gets
added and a different foo_1.0.orig.tar.gz gets uploaded for foo 1:1.0.

The archive won't allow that to happen simultaneously, but it could happen
within a week or so, so that the old foo_1.0.orig.tar.gz might still be on
your system while you're trying to get the new foo_1.0.orig.tar.gz.

> or it
> would not exist at the same time as the old one, so when you update to
> get the new one you would drop the old one.

It's more a question of making sure you _do_ drop the old one.
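
(Only a sketch of the kind of client-side check that could enforce that:
before reusing a pool file, compare it against the hash the current
Packages data expects, and delete it if it doesn't match. The helper and
its name are hypothetical:)

    import hashlib, os

    def ensure_pool_file_current(path, expected_sha1):
        """Drop a stale pool file whose contents no longer match what the
        current torrent/Packages data says should live at that path."""
        if not os.path.exists(path):
            return
        sha = hashlib.sha1()
        with open(path, 'rb') as f:
            for block in iter(lambda: f.read(65536), b''):
                sha.update(block)
        if sha.hexdigest() != expected_sha1:
            os.remove(path)   # e.g. the old foo_1.0.orig.tar.gz; refetch the new one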

> > Well, we can always make the initial requirement/expectation be that
> > everyone mirrors the entire torrent as far as possible. Even if not
> > everyone does, it'll increase the odds enough to be sustainable, and
> > for a first version, I think that's fairly reasonable anyway.
> This seems like an over-simplification that might not be necessary. The
> wasted download bandwidth could be huge, and this basically removes the
> requirement of any communication with apt about which files to download
> (since all are). I think we can say that some communication from apt
> about which packages to download is necessary, and the ability to
> selectively download files from a torrent is already built into most
> clients.
>
> I think a better initial requirement/expectation is that
> torrents/Packages are kept up to date. This could later be maintained
> using some kind of automatic updating, but is not necessary at the
> beginning.

Sure. I'd have no complaints about having both expectations initially
even. I'd be very surprised if early adopters stuck to them anyway,
but at least then it's their own fault if things don't work properly :)

> > Oh, that's not actually sufficient btw -- we can easily have two files
> > in the pool with the same SHA1. This'll particularly happen if two source
> > packages use the same upstream (eg, contrib/f/foo/foo.orig.tar.gz becomes
> > main/f/foo/foo.orig.tar.gz when one of its dependencies becomes free,
> > or a source package gets renamed within a component without its upstream
> > changing). You can handle that adequately on the client side of course,
> > without worrying about it in the protocol.
> So now we can't use the SHA1 hash or the path to the file as unique
> identifiers. How about a combination of the two? It seems that should be
> unique.

On the client side you can say "I need to fill up file <path> with
contents <sha>; dear torrent, give me <sha>" and then just write that
into <path>; the fact that other peers might be giving you <sha> from
a different path doesn't matter, since the content is the same anyway.
It means you need to keep a local map from <sha> <-> <path>, but you have
that in the Packages file anyway. It also nominally means you can avoid
unnecessarily downloading the same contents twice for two different paths.

That's only relevant if you don't have piece numbers though, I think.
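
(A rough sketch of that client-side map, assuming the (path, sha1) pairs
have already been parsed out of the Packages file; none of these names
are real debtorrent code:)

    from collections import defaultdict

    def build_sha_map(packages_entries):
        """Map SHA1 -> list of pool paths that want that content, from
        (path, sha1) pairs taken out of the Packages file."""
        sha_to_paths = defaultdict(list)
        for path, sha1 in packages_entries:
            sha_to_paths[sha1].append(path)
        return sha_to_paths

    def store_content(sha1, data, sha_to_paths):
        """Write content received from the swarm (identified only by its
        SHA1) into every local path that needs it, fetching it once."""
        for path in sha_to_paths.get(sha1, []):
            with open(path, 'wb') as f:
                f.write(data)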

> That still leaves sid though, which is troublesome. [...]
> Since we're not focusing on this issue now,
> this is only something to think about for the future though.

Sounds fine, yeah.

> A lot of the changes we've talked about (including the one you mentioned
> in the previous paragraph) require some kind of modification to the
> archive software, and I haven't yet considered how easy/fast/possible
> these changes will be.

Adding information that's calculated from the .deb to the Packages file
(like separate sha1's for each x kB block in the .deb, for some constant
x) is easy enough; adding information that's based on the package but not
specific to a version is easy too; adding information that's specific
to a particular file but can't be calculated from the file directly is
new and presumably hard.
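
(For the easy case, a sketch of how that extra data could be generated;
the 32 kB block size is just a stand-in for the constant x:)

    import hashlib

    BLOCK_SIZE = 32 * 1024   # the constant "x kB"; arbitrary choice here

    def per_block_sha1s(deb_path):
        """Return one SHA1 hex digest per fixed-size block of the .deb,
        i.e. the extra fields this change would add to Packages."""
        digests = []
        with open(deb_path, 'rb') as f:
            for block in iter(lambda: f.read(BLOCK_SIZE), b''):
                digests.append(hashlib.sha1(block).hexdigest())
        return digests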

> For now, we can use the current Packages files
> as single-piece-per-package torrent files (though some pieces would be
> very large), but eventually some functionality would need to be added to
> dak/apt-ftparchive to implement more interesting/efficient features. Who
> would we need to talk to about this, and how responsive will they be to
> supporting something that will be alpha/beta for a long time?

The easy changes can be done with me, mvo and daniel without much problem.

The hard change needs changes to apt which might be difficult to code,
and will require mvo checking them in some detail at least; and also
will require changes to dak, which will require more testing and review
by ftpmaster (me, James Troup, Ryan Murray).

> This sounds like it might be a good goal for the mid-term (July 9)
> evaluation point. A working program that does the following:
>  * uses current Packages files as torrent files
>  * implements variable size pieces for all packages
>  * runs as a daemon, started from /etc/init.d
>  * receives input from apt about which packages/pieces to download
>  * will share all downloaded pieces with other interested clients
> What do you think?

I'll be very impressed if that much is done by then.

> From there we can decide which additional
> functionality to implement and get something added to dak/apt-ftparchive
> to support it.

Sounds like the next steps would be:

   * break packages into smaller pieces

   * determine usage patterns

   * based on measured usage patterns analyse ways to optimise sharing
     amongst different Packages files (arch:all, different versions,
     testing/unstable, testing/stable at release)

Hrm. I'm not sure "determine usage patterns" can happen quickly enough for
the "analyse" step to actually happen as part of the GSoC. Any thoughts?

For the first half, ordering as:

   1. will share all downloaded pieces with other interested clients
        (already what bittornado does!)
   2. implements variable size pieces for all packages
        (use hacked up .torrent files that will get you an "interesting" bit
         of the archive)
   3. uses current Packages files as torrent files
        (add Packages -> torrent parsing into bittornado so you don't need to
         download duplicate information)
   4. receives input from apt about which packages/pieces to download
        (add separate scripts to parse /var/lib/dpkg/available and/or apt
         to prioritise pieces?)
   5. runs as a daemon, started from /etc/init.d
        (automate it all)

might work well -- that way it's possible to start releasing usable
betas right from step (2).
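
(To make steps (3) and (4) a bit more concrete, a hypothetical sketch of
turning a Packages file into per-package "pieces" and filtering them by
what apt asked for; the field names are standard Packages fields,
everything else is invented for the example:)

    def parse_packages(packages_text):
        """One 'piece' per package: (name, pool path, size, sha1), i.e.
        the information a .torrent file would otherwise carry."""
        pieces = []
        for stanza in packages_text.split('\n\n'):
            fields = dict(line.split(': ', 1)
                          for line in stanza.splitlines()
                          if ': ' in line and not line.startswith(' '))
            if 'Filename' in fields:
                pieces.append((fields['Package'], fields['Filename'],
                               int(fields['Size']), fields.get('SHA1')))
        return pieces

    def prioritise(pieces, wanted_packages):
        """Step (4): download first whatever apt said it wants."""
        return [p for p in pieces if p[0] in wanted_packages]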

Cheers,
aj


