Redesigning the autopkgtest controller/workers CI system for Ubuntu and Debian -- call for spec review

Vincent Ladeuil vila+ci at canonical.com
Wed Apr 2 18:22:43 UTC 2014


>>>>> Martin Pitt <mpitt at debian.org> writes:

<snip/>

    >> Overall, it's still unclear to me whether adt-run should support
    >> a way to select tests or if the test themselves need to be able
    >> to skip when they miss support from the testbed. (I'm talking
    >> about dep8 tests here even if the same issues exist for
    >> individual tests in a test suite).

    > The latter IMHO. But these two are pretty much equivalent;
    > skipping a test which can't be run is fairly similar to actively
    > selecting which tests to run?

In theory, yes; in practice, no. For example, a test suite may contain a
single test that requires a specific platform. In that case it's easier
to let that one test skip itself than to explicitly select all the others.

On the other hand, as a dev, I don't want to run all the tests all the
time, so I prefer to be able to select a single test to run or just the
tests that failed in the previous run.
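
To make that concrete (plain Python unittest here, nothing dep8-specific,
and the module name below is made up):

    import sys
    import unittest

    class TestSuiteExample(unittest.TestCase):

        # This one skips itself when the testbed lacks what it needs;
        # the rest of the suite still runs.
        @unittest.skipUnless(sys.platform.startswith("linux"),
                             "needs a Linux testbed")
        def test_platform_specific(self):
            self.assertTrue(True)

        # This one runs everywhere.
        def test_generic(self):
            self.assertEqual(1 + 1, 2)

    if __name__ == "__main__":
        unittest.main()

Skipping is decided inside the suite; selecting is done from the outside,
e.g. "python -m unittest example_tests.TestSuiteExample.test_generic" if
the file above is saved as example_tests.py.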

    >> = Design =
    >> 
    >> == Data structure ==
    >> 
    >> I think the data store supports a subset of a DFS so if we try to use
    >> the data store API as much as possible, the parts that requires a DFS
    >> will be narrowed.

    > I interpret DFS as "distributed file system"; what do you mean with
    > "will be narrowed"? All the files (artifacts) that we produce will
    > need to be put into a distributed fs.

The code requiring a real file system will be narrowed down; I think it
ends up being a smaller part (but I may be wrong).

<snip/>

    >> I suspect that different users will want different orders or filters
    >> there.

    > Perhaps, but I hope not. It's really difficult to reorganize files in
    > swift. Which other use cases do you see?

We're talking past each other ;)

    > Note that this doesn't directly get exposed to users;

That's what matters.

<snip/>

    >> === Job consumption ===
    >> 
    >> "A worker node knows its supported architecture(s), the ADT
    >> schroots/containers/VMs it has available and their releases"
    >> 
    >> I'm worried that making the worker handle that added complexity will
    >> make things harder in various places (schroots failures vs worker
    >> failure for example).

    > Which added complexity? By "know" I mean "set in a configuration
    > file", where various values such as architecture and available
    > schroots/containers can have sensible defaults based on what's
    > available on the worker host.

Sorry, that was unclear. I was thinking about the case where a guest
running tests crashes its host: all the guests on that host die at once,
which makes the failures more complex to handle. We avoid that in the ci
engine by having controllers and workers always run on different hosts.

    >> Why not defining a worker as supporting a single architecture and a
    >> single release and rely on the FQDN to carry that info ?

Forget about that; I was trying to find a middle ground between
decorating the domain name and the FS layout. In the ci engine, host
names carry no info at all: hosts get arbitrary but unique names, and
they can't be described in a config file either since they are created
on demand. Nothing here is incompatible with your proposal.
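
(Just to check that I read "know" the same way, here is the kind of thing
I have in mind -- a sketch only; the path, key names and [DEFAULT]-style
format are invented, not the actual worker format:)

    import configparser
    import platform

    # Defaults guessed from the host; a config file only overrides them.
    DEFAULTS = {
        "architectures": platform.machine(),   # e.g. "x86_64"
        "releases": "trusty",
        "backends": "schroot",
    }

    def load_worker_config(path="/etc/adt-worker.conf"):
        parser = configparser.ConfigParser(DEFAULTS)
        parser.read(path)          # a missing file simply keeps the defaults
        return {key: [value.strip()
                      for value in parser.get("DEFAULT", key).split(",")]
                for key in DEFAULTS}

From that the worker could derive which queues to subscribe to, which is
where your queue-name encoding comes in.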

    > We want to separate and encode the test platforms in AMQP queue
    > names IMHO, which is much more flexible and easier to maintain.

I think I understand where you're going, but an example would help ;)

You're saying that debci/britney create all combinations of (platform,
release, etc.), put the test requests in all matching queues, and the
queues transport them to the available workers?

I have a vague feeling that setting up new types of worker will require
creating new queues, making the system a bit more static than it could
be.

I think it boils down to this: debci/britney has to know which
combinations are supported so it can emit only the relevant messages,
which will then reach the workers that can handle them.

But maybe I'm thinking from the point of view of a system where all
combinations should be supported and their testbeds created on demand.
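
Something like this is how I picture it (a sketch only -- the queue
naming and the use of pika are my assumptions, not necessarily what
debci will do):

    import json
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()

    # One test request published to every matching (release, arch) queue.
    request = {"package": "hello", "trigger": "hello/2.9-1"}
    for release in ("trusty",):
        for arch in ("amd64", "i386"):
            queue = "debci-tests-%s-%s" % (release, arch)   # invented naming
            channel.queue_declare(queue=queue, durable=True)
            channel.basic_publish(exchange="",   # default exchange routes by queue name
                                  routing_key=queue,
                                  body=json.dumps(request))

    connection.close()

A worker configured for trusty/amd64 would then consume only from
"debci-tests-trusty-amd64", and setting up a new kind of worker means
declaring a new queue -- which is the static aspect I was worried about
above.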

<snip/>

    >> === test hangs indefinitely ===
    >> 
    >> "At least in the first stage of the implementation it is recommended to
    >> rely on manually retrying tests after inspecting the reason."
    >> 
    >> By manually you mean from the highest level by end users ?

    > Where "end user" is someone like Antonio, Jean-Baptiste, or me; i. e.
    > the administrator of the CI engine and someone who regularly reviews
    > failure notifications.

Great.

    >> "During that time, the node won't accept any other request"
    >> 
    >> You mean the queue is closed ? Does that imply you always have available
    >> workers for any set of arch/release/platform flags ?

    > The queue never is closed in the sense that it can't take new
    > requests. But while a worker processes a test, it won't accept the
    > next test until the current run finishes. I. e. nothing fancy here,
    > just the usual "first worker node to finish one test will get the next
    > one" round-robin.

Right, I was talking about "will flag the node for manual
inspection/fixing" followed by "During that time, the node won't
accept".

I.e. does manual inspection/fixing require putting the node offline?

In our case, this requires having a pool of similar hosts and making
sure we don't run the same test on all the hosts in the pool. For
example: a phone image that can't be booted from.

But anyway, I think I'd rather make the failures easier to reproduce
than deal with hosts blocked waiting for human intervention. Not sure
I'm still on-topic here, though.
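
(Coming back to the "first worker node to finish one test will get the
next one" part: my mental model is simply an AMQP consumer with a
prefetch count of one -- a sketch with the same invented queue name as
above, pika-style API assumed, run_test() standing in for the adt-run
invocation:)

    import pika

    def run_test(body):
        print("would run adt-run for: %r" % body)   # placeholder

    def on_request(channel, method, properties, body):
        run_test(body)
        # Only after this ack will the broker deliver the next request.
        channel.basic_ack(delivery_tag=method.delivery_tag)

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.basic_qos(prefetch_count=1)   # one unacknowledged test at a time
    channel.basic_consume(queue="debci-tests-trusty-amd64",
                          on_message_callback=on_request)
    channel.start_consuming()

With prefetch_count=1 the broker withholds further messages until the
current one is acknowledged, which matches "won't accept the next test
until the current run finishes".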

    >> === all AMQP brokers go down ===
    >> 
    >> +1 though you should probably mention that the same is true for swift.

    > Right, in this case the workers queue up the results locally. That's
    > mentioned in the goals: "If that's [the net fs] currently down, cache
    > it locally and flush it on the next run".

Right, with persistent storage that works; it's one more thing we need
to think about for the ci engine, which uses ephemeral workers.
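
(The cache-and-flush part could be as simple as the sketch below -- all
names are invented, upload_to_swift() is whatever actually uploads a
result directory -- and it only helps if the spool directory survives
until the next run, which is exactly what our ephemeral workers don't
guarantee:)

    import os
    import shutil

    SPOOL_DIR = "/var/spool/adt-worker/pending-results"   # invented path

    def store_results(result_dir, upload_to_swift):
        os.makedirs(SPOOL_DIR, exist_ok=True)
        # First, retry anything left over from previous runs.
        for pending in sorted(os.listdir(SPOOL_DIR)):
            path = os.path.join(SPOOL_DIR, pending)
            try:
                upload_to_swift(path)
                shutil.rmtree(path)
            except IOError:
                break          # still down, keep it for later
        try:
            upload_to_swift(result_dir)
        except IOError:
            # Swift is down: park the results locally for the next run.
            shutil.move(result_dir,
                        os.path.join(SPOOL_DIR, os.path.basename(result_dir)))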

    >> == controller problems ==
    >> 
    >> "britney or CI request a test ... for which no workers are available"
    >> 
    >> May be that's the answer to my previous question ;)

    > There I mean "we don't have any workers which could process that
    > queue", not "all of the suitable ones are currently busy".

Ok, makes more sense now in light of the way you use queues.

    Vincent


