Redesigning the autopkgtest controller/workers CI system for Ubuntu and Debian -- call for spec review

Mon Mar 31 13:59:03 UTC 2014

>>>>> Martin Pitt <mpitt at debian.org> writes:

    > Hey all,
    > thanks everyone for the meeting this week. Based on the discussion on
    > this mail thread, the hangout, various IRC meetings, and some
    > investigations I now drafted a spec about the goals, design, data
    > structure, failure case analysis, and debci TODO list. I would
    > appreciate if you could take some minutes to review this, check if you
    > agree on the design, try to come up with other scenarios what can go
    > wrong, etc:

    >   https://wiki.debian.org/MartinPitt/DistributedDebCI

Great stuff, lots to read ;)

    > It's a wiki, so please feel free to correct/adjust/amend stuff.

My wiki account verification has been pending for the last few hours,
so here is the thing I wanted to fix:

= Queue structure =

"so that we can support workers which can only certain releases,"

*can only {support,serve} ?

    > Discussion here is also welcome, of course.

A few remarks below but nothing blocking.

While your document aims at sharing design/code between
debci/britney/auto-package-testing, I've tried to think about reducing
divergences between them and the uci-engine we're building.

TL;DR: Apart from the health check and some minor details in how we use
queues, nothing incompatible caught my eye.

= Goals =

Selecting which dep8 tests to run comes up a few times, sometimes mixed
with scheduling them.

"QEMU with KVM is only available on some architectures, on the others we
want to run all tests which don't need QEMU and skip the rest. "

How do we skip ?

"Or some tests may want to assume a fully running graphical desktop
environment, while most tests just want to run in a minimal
deboostrap-like environment."

How do we select them or distribute them to the right workers ?

"strictly separate the test execution backends, ... and the policy
(i. e. when to run which test)"

Where is the policy defined ?

Overall, it's still unclear to me whether adt-run should support a way
to select tests or if the tests themselves need to be able to skip when
the testbed lacks the support they need. (I'm talking about dep8 tests
here, even if the same issues exist for individual tests in a test
suite).
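
A rough sketch of the first option (selection on the runner side),
assuming each dep8 test declares the testbed capabilities it needs and
each testbed advertises what it provides. The capability names ("qemu",
"desktop") and the select_tests() helper are hypothetical; they are not
existing adt-run options or DEP-8 fields.

```python
def select_tests(tests, testbed_caps):
    """Split tests into (runnable, skipped) for one testbed.

    tests: dict mapping test name -> set of required capabilities
    testbed_caps: set of capabilities the testbed provides
    """
    runnable, skipped = [], []
    for name, needs in sorted(tests.items()):
        missing = needs - testbed_caps
        if missing:
            # Record why the test was skipped so it can be reported.
            skipped.append((name, sorted(missing)))
        else:
            runnable.append(name)
    return runnable, skipped


# Hypothetical test declarations, for illustration only.
tests = {
    "smoke": set(),
    "boot-vm": {"qemu"},
    "gui-login": {"desktop", "qemu"},
}
runnable, skipped = select_tests(tests, {"desktop"})
```

With the other option, each test would do the equivalent check itself
and exit with a "skipped" status, which keeps the runner simpler but
spreads the policy over all packages.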

= Design =

== Data structure ==

I think the data store supports a subset of a DFS, so if we try to use
the data store API as much as possible, the parts that require a DFS
will be narrowed.

That may also allow debci to not need to set up swift at all (and will
avoid downloads/uploads to swift) since the files will just be produced
locally.

"release/architecture/sourcepkg/YYYYMMDD_HHMMSS_hostname[_tag1[_tag2]]/autopkgtest_output_files"

I suspect that different users will want different orders or filters
there.

How about /release/sourcepkg/version/hostname/YMD_HMS ?

Since architecture and tags are specific to the host (correct ?), I
don't feel they should appear here.

But also see below for more about hostnames.
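
To make the alternative layout concrete, a minimal sketch that builds
such a directory name. The result_dir() helper and the sample values
(package, version, host name) are made up for illustration; only the
ordering of the components comes from the suggestion above.

```python
from datetime import datetime, timezone


def result_dir(release, sourcepkg, version, hostname, when):
    """Build release/sourcepkg/version/hostname/YMD_HMS."""
    stamp = when.strftime("%Y%m%d_%H%M%S")
    return "/".join([release, sourcepkg, version, hostname, stamp])


# Hypothetical sample values.
path = result_dir(
    "trusty",
    "glib2.0",
    "2.40.0-1",
    "worker1.example.com",
    datetime(2014, 3, 31, 13, 59, 3, tzinfo=timezone.utc),
)
```

Other orders or filters can then be offered as views computed from
these components rather than being encoded in the stored path.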

== Job distribution/management ==

=== Queue structure ===

I'm not sure there is a distinction between the queue name and the host
name in your proposal so I assumed they were the same.

How about using FQDNs for host names, so that a proper name space is
defined, and letting the hosts provide their specific tags on request ?
This is related to test scheduling, no ? (at least that's how I read
'platform tags' in 'Job creation')

=== Job consumption ===

"A worker node knows its supported architecture(s), the ADT
schroots/containers/VMs it has available and their releases"

I'm worried that making the worker handle that added complexity will
make things harder in various places (schroot failures vs worker
failures for example).

Why not define a worker as supporting a single architecture and a
single release, and rely on the FQDN to carry that info ?
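
A sketch of how that could look, assuming a purely hypothetical naming
scheme of <host>.<arch>.<release>.<domain>; neither the scheme nor the
worker_info() helper exists in debci or the spec.

```python
def worker_info(fqdn):
    """Extract arch and release from a worker FQDN.

    Assumes the hypothetical scheme <host>.<arch>.<release>.<domain>;
    raises ValueError if the name has fewer than four components.
    """
    host, arch, release, domain = fqdn.split(".", 3)
    return {"host": host, "arch": arch, "release": release,
            "domain": domain}


# Hypothetical worker name.
info = worker_info("adt01.amd64.trusty.ci.example.com")
```

A scheduler could then match a request for a given arch/release
against the worker name alone, without asking the worker what it
supports.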

= Health check =

Great work. We don't cover as much in the uci-engine so far, so I don't
have a lot of feedback here. We're actively discussing the same kind of
issues, but we have different kinds of workers in addition to the ones
running the tests.

== Test-induced problems ==

=== test hangs indefinitely ===

"At least in the first stage of the implementation it is recommended to
rely on manually retrying tests after inspecting the reason."

By manually you mean from the highest level by end users ?

"During that time, the node won't accept any other request"

You mean the queue is closed ? Does that imply you always have available
workers for any set of arch/release/platform flags ?

== AMQP problems ==

+1

=== all AMQP brokers go down ===

+1 though you should probably mention that the same is true for swift.

== controller problems ==

"britney or CI request a test ... for which no workers are available"

Maybe that's the answer to my previous question ;)

Sorry for the lengthy email, but your document was also 10 pages long
(and full of great stuff ;)

      Vincent