Redesigning the autopkgtest controller/workers CI system for Ubuntu and Debian -- call for spec review

Martin Pitt mpitt at debian.org
Wed Apr 2 20:31:31 UTC 2014


Hey all,

Vincent Ladeuil [2014-04-02 20:22 +0200]:
>     > We want to separate and encode the test platforms in AMQP queue
>     > names IMHO, which is much more flexible and easier to maintain.
> 
> I think I understand where you're going but an example would help I
> think ;)
> 
> You're saying that debci/britney create all combinations of (platform,
> release, etc) and put the test requests in all matching queues, and the
> queues filter/transport that to the available workers ?

Right. If we want to gate unstable->testing or proposed->release
propagation on e.g. i386, amd64, and armhf, britney would send test
requests for package foo to the sid-i386-foo, sid-amd64-foo, and
sid-armhf-foo queues. If our manually maintained test override mapping
calls for it, it would instead put the request into the
sid-xorg-{i386,amd64}-{nvidia,amd,intel} queues, and so on.
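
To make this concrete, here is a rough sketch of what the britney side
could look like. This is just Python with the pika AMQP client for
illustration; the queue names, the override mapping, and
request_tests() are made up, not actual britney code:

    # Publish one test request per matching queue; the broker then
    # delivers each request to exactly one worker.
    import pika

    ARCHES = ['i386', 'amd64', 'armhf']  # statically maintained list

    # hypothetical override mapping: xorg goes to GPU-specific queues
    OVERRIDES = {
        'xorg': ['sid-xorg-%s-%s' % (arch, gpu)
                 for arch in ('i386', 'amd64')
                 for gpu in ('nvidia', 'amd', 'intel')],
    }

    def request_tests(channel, package):
        queues = OVERRIDES.get(
            package, ['sid-%s-%s' % (arch, package) for arch in ARCHES])
        for queue in queues:
            channel.queue_declare(queue=queue, durable=True)
            channel.basic_publish(
                exchange='',  # default exchange routes by queue name
                routing_key=queue,
                body=package.encode(),
                properties=pika.BasicProperties(delivery_mode=2))

    connection = pika.BlockingConnection(
        pika.ConnectionParameters('localhost'))
    request_tests(connection.channel(), 'foo')
    connection.close()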

> I have a vague feeling that setting up new types of worker will require
> creating new queues making the system a bit more static than it could
> be.
> 
> I think it boils down to: either debci/britney knows which combinations
> are supported and can emit only the relevant messages and they will just
> reach the workers that can handle them.
> 
> But maybe I'm thinking of a system where all combinations should be
> supported and their testbeds created on demand.

I don't think we can create testbeds for new architectures on demand.
But perhaps encoding the architecture into the queue name isn't
necessary; I did that to ensure that a request in a queue gets
consumed exactly once. With a fanout exchange (e.g. sid-xorg), workers
for all available architectures would grab it, but we would then have
to verify by polling that results arrived for every architecture,
instead of relying on AMQP acknowledgements and getting re-queueing on
failing workers for free. So my gut feeling is that maintaining a
static list of architectures (it doesn't change that often) is worth
the little overhead, in return for atomic acks and automatic
requeueing on failures.

But I'm happy to change that if a fanout exchange makes more sense here.
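
For completeness, here is the worker side of that trade-off. With
manual acknowledgements and a prefetch count of 1, a worker takes one
request at a time and only acks it after the test ran, so a request
that a worker dies on is automatically redelivered to another worker.
Again just a sketch on top of pika (1.x API); run_test() is a stand-in
for the actual adt-run invocation:

    # Consume one request at a time; ack only after the test ran, so
    # the broker requeues the message if this worker dies mid-test.
    import pika

    def run_test(package):
        pass  # invoke adt-run on the package here

    def on_request(channel, method, properties, body):
        run_test(body.decode())
        channel.basic_ack(delivery_tag=method.delivery_tag)

    connection = pika.BlockingConnection(
        pika.ConnectionParameters('localhost'))
    channel = connection.channel()
    channel.queue_declare(queue='sid-amd64-foo', durable=True)
    channel.basic_qos(prefetch_count=1)  # one in-flight request/worker
    channel.basic_consume(queue='sid-amd64-foo',
                          on_message_callback=on_request)
    channel.start_consuming()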

> Right, I was talking about "will flag the node for manual
> inspection/fixing. " followed by "During that time, the node won't
> accept".
> 
> I.e., does manual inspection/fixing require putting the node offline?

Not necessarily; I guess usually you want to inspect the worker in the
exact failure state it is in at that moment. So in the best case you
can just kill the hanging process, let the test fail, and have the
worker continue. Of course this should only really be an issue for the
schroot runner; LXC and QEMU provide enough isolation that tests
should never be able to circumvent adt-run's builtin timeout.
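
As a backstop for the schroot case, the worker could also wrap adt-run
in an outer watchdog of its own: if adt-run's builtin timeout somehow
fails to fire, kill the whole process group, record the test as
failed, and carry on. A rough sketch; the adt-run command line and the
timeout value here are made up:

    # Outer watchdog around adt-run for the schroot runner. If
    # adt-run's own timeout is broken, kill its whole process group
    # so the worker can mark the test as failed and continue.
    import os
    import signal
    import subprocess

    CMD = ['adt-run', 'foo', '---', 'schroot', 'sid-amd64']  # made up
    OUTER_TIMEOUT = 3 * 3600  # generously above adt-run's own timeout

    proc = subprocess.Popen(CMD, preexec_fn=os.setsid)  # own group
    try:
        returncode = proc.wait(timeout=OUTER_TIMEOUT)
    except subprocess.TimeoutExpired:
        os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
        returncode = proc.wait()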

> In our case, this requires having a pool of similar hosts and making
> sure we don't run the same test on all the hosts in the pool. For
> example: an image that can't be booted from for phones.

Yes, we need redundancy in the workers anyway, as a single runner
won't be able to keep up (and also would be a SPOF).

> But anyway, I think I'd rather make the failures easier to reproduce
> than dealing with hosts blocked waiting for a human intervention. Not
> sure I'm still on-topic here though.

The case that a worker hangs indefinitely and adt-run's timeout is
also broken should really be quite rare. If it actually happens, it's
well worth investigating why; there's most likely a kernel or hardware
bug coming into play.

Thanks!

Martin

-- 
Martin Pitt                        | http://www.piware.de
Ubuntu Developer (www.ubuntu.com)  | Debian Developer  (www.debian.org)


