Redesigning the autopkgtest controller/workers CI system for Ubuntu and Debian -- call for spec review

Martin Pitt mpitt at debian.org
Tue Apr 1 10:29:45 UTC 2014


Hey Vincent,

thanks for the review!

Vincent Ladeuil [2014-03-31 15:59 +0200]:
> "so that we can support workers which can only certain releases,"
> 
> *can only {support,serve} ?

Fixed in
https://wiki.debian.org/MartinPitt/DistributedDebCI?action=diff&rev2=12&rev1=11

> While your document aim at sharing design/code between
> debci/britney/auto-package-testing, I've tried to think about reducing
> divergences between them and the uci-engine we're building.

That's a great perspective indeed, thanks for checking it from that
PoV.

> "QEMU with KVM is only available on some architectures, on the others we
> want to run all tests which don't need QEMU and skip the rest. "
> 
> How do we skip ?

Also clarified in the above update. Tests can declare an
"isolation-container" or "isolation-machine" restriction, and on
testbeds that don't satisfy it (e. g. schroot for isolation-container,
or lxc for isolation-machine) adt-run will skip the test. Example:

  https://jenkins.qa.ubuntu.com/job/trusty-adt-udisks2-ppc64el/23/console
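
For reference, a test declares such a requirement in its
debian/tests/control stanza, along these lines (a minimal sketch; the
test name and dependency are made up):

  Tests: smoke
  Depends: @
  Restrictions: isolation-container

adt-run then reports the test as skipped on testbeds which cannot
provide the requested isolation level.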

> "Or some tests may want to assume a fully running graphical desktop
> environment, while most tests just want to run in a minimal
> deboostrap-like environment."
> 
> How do we select them or distribute them to the right workers ?

My proposal for that is in the "Job creation" subsection, in
particular "there needs to be a manually maintained mapping of source
packages to these tags, e. g. "xorg-server: desktop_amd desktop_nvidia
desktop_intel")". This can just live in some VCS, a bit similar to the
old cupstream2distro branch. Unlike the isolation restriction, this is
a property which isn't inherent to the test itself; it depends on
where we want to run the tests and which hardware we have available
(where "we" also differs between Debian and Ubuntu, and even between
different releases).
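
To make that concrete, the mapping file could be as trivial as this
(hypothetical format and extra entries; only the xorg-server line is
from the spec):

  # source package -> platform tags
  xorg-server: desktop_amd desktop_nvidia desktop_intel
  unity:       desktop_intel
  # packages without an entry just run on the default queues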

> "strictly separate the test execution backends, ... and the policy
> (i. e. when to run which test)"
> 
> Where is the policy defined ?

For Ubuntu that's in britney; for Debian that's in the bin/debci batch
runner (the short and much simplified version: run a package's and all
of its rdepends' tests for any upload). There's also some kind of
"ad-hoc" policy, i. e. the maintainer of the CI engine can decide when
it's worth re-running a failed test manually (via a CLI tool or some
web UI button).
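
Very roughly, the Debian-side policy boils down to something like
this (a pseudo-Python sketch, not the actual debci code; the function
and the data shapes are made up):

  def packages_to_test(uploaded_source, rdepends, has_tests):
      """Return the packages whose tests an upload should trigger."""
      candidates = {uploaded_source}
      candidates |= rdepends.get(uploaded_source, set())
      return sorted(p for p in candidates if p in has_tests)

  # e. g. an upload of glib2.0 re-runs its own tests plus udisks2's
  print(packages_to_test("glib2.0",
                         {"glib2.0": {"udisks2", "gvfs"}},
                         {"glib2.0", "udisks2"}))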

> Overall, it's still unclear to me whether adt-run should support a way
> to select tests or if the test themselves need to be able to skip when
> they miss support from the testbed. (I'm talking about dep8 tests here
> even if the same issues exist for individual tests in a test suite).

The latter, IMHO. But the two are pretty much equivalent: skipping a
test which can't be run comes down to actively selecting which tests
to run, doesn't it?

> = Design =
> 
> == Data structure ==
> 
> I think the data store supports a subset of a DFS so if we try to use
> the data store API as much as possible, the parts that requires a DFS
> will be narrowed.

I interpret DFS as "distributed file system"; what do you mean by
"will be narrowed"? All the files (artifacts) that we produce will
need to be put into a distributed fs.

> That may also allow debci to just not need to setup swift (and will
> avoid downloads/uploads to swift) since the files will just be produced
> locally.

Correct. That's why I want to separate the "local <-> distributed FS"
bits; I tried to point that out in the spec. If the worker and debci
run on the same machine, it won't be required at all.

> "release/architecture/sourcepkg/YYYYMMDD_HHMMSS_hostname[_tag1[_tag2]]/autopkgtest_output_files"
> 
> I suspect that different users will want different orders or filters
> there.

Perhaps, but I hope not. It's really difficult to reorganize files in
swift. Which other use cases do you see?

Note that this doesn't directly get exposed to users; the web UI
should present this more appropriately. But this structure is meant to
represent roughly how a developer would like to browse the data.
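
Concretely, a single run would then end up under a path like this
(host name and tag are made-up examples, following the scheme above):

  trusty/amd64/udisks2/20140401_102945_worker1_desktop_nvidia/<adt-run output>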

> How about /release/sourcepkg/version/hostname/YMD_HMS ?

I don't want to make the hostname a "first-class" thing here. It's
only there to disambiguate timestamps (i. e. avoid race conditions),
and host names get added, removed, or changed over time. Also, it's
really uninteresting where a test ran; the interesting bit is just
which release/architecture/platform it ran on.

> Since architecture and tags are specific to the host (correct ?), I
> don't feel they should appear here.

I think they do. As a developer I want to see how my test performs on
ARM or on an NVidia box. I don't care at all about how the CI system
names its worker nodes.

> === Queue structure ===
> 
> I'm not sure there is a distinction between the queue name and the host
> name in your proposal so I assumed they were the same.

They aren't. Queue names encode the kind of platform that a test
wants to run on; host names are just an internal implementation
detail (see above). Conceptually it's best to ignore them entirely
when looking at the data, except when collecting the set of host
names for the health check.

> How about using FQDN for host names so a proper name space is defined
> and leave the hosts provide their specific tags on request ? This is
> related to test scheduling no ? (at least that's how I read 'platform
> tags' in 'Job creation')

For the reason above, I wouldn't like to do that.

> === Job consumption ===
> 
> "A worker node knows its supported architecture(s), the ADT
> schroots/containers/VMs it has available and their releases"
> 
> I'm worried that making the worker handle that added complexity will
> make things harder in various places (schroots failures vs worker
> failure for example).

Which added complexity? By "know" I mean "set in a configuration
file", where various values such as architecture and available
schroots/containers can have sensible defaults based on what's
available on the worker host.
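
I. e. something along these lines (hypothetical file name and keys,
just to illustrate the shape):

  # /etc/debci/worker.conf (made-up example)
  architectures = i386 amd64
  releases      = precise trusty
  backends      = schroot lxc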

> Why not defining a worker as supporting a single architecture and a
> single release and rely on the FQDN to carry that info ?

I think we must differentiate between a worker node as in (1) a piece
of iron that runs testbeds and adt-run, which has an FQDN (which is
generally uninteresting); and (2) an instance of debci that listens to
a particular queue (e. g. trusty-armhf-lxc). It's possible and very
likely that one host runs many debci instances: e. g. an amd64 host
can run both i386 and amd64 tests, and it might have schroots or
containers for multiple distro releases available.

As such the FQDN is a bad place to encode all that, IMHO. We don't
want to assume that we can call a machine
"adt-precise-trusty-saucy-i386-amd64-schroot-lxc-none-nvidia-7":
that's hard to decode, gets in the way of how the data center admins,
nova, or the cloud provider assign machine names, and ignores the 1:n
relation of worker hosts to debci worker instances. Encoding the test
platforms in AMQP queue names instead is much more flexible and
easier to maintain.
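
I. e. the host from the example above would simply subscribe to one
queue per platform it can serve, along the lines of (the exact naming
scheme is only a sketch):

  precise-i386-schroot
  precise-amd64-schroot
  trusty-amd64-lxc
  trusty-amd64-lxc-nvidia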

> = Health check =
> 
> Great work. We don't cover as much in the uci-engine so far so I don't
> have a lot of feedback here (but we're actively discussing the same kind
> of issues but we have different kind of workers in addition to the ones
> running the tests).

Thanks. Indeed this won't be part of the first version, but it can be
added later. Initially, a simple nagios-based ping check of the
workers should suffice.

> === test hangs indefinitely ===
> 
> "At least in the first stage of the implementation it is recommended to
> rely on manually retrying tests after inspecting the reason."
> 
> By manually you mean from the highest level by end users ?

Where "end user" is someone like Antonio, Jean-Baptiste, or me; i. e.
the administrator of the CI engine and someone who regularly reviews
failure notifications.

> "During that time, the node won't accept any other request"
> 
> You mean the queue is closed ? Does that imply you always have available
> workers for any set of arch/release/platform flags ?

The queue is never closed in the sense that it can't take new
requests. But while a worker processes a test, it won't accept the
next one until the current run finishes. I. e. nothing fancy here,
just the usual "the first worker node to finish a test gets the next
one" round-robin.

> === all AMQP brokers go down ===
> 
> +1 though you should probably mention that the same is true for swift.

Right. In that case the workers queue up the results locally; that's
mentioned in the goals: "If that's [the net fs] currently down, cache
it locally and flush it on the next run".
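
A sketch of that cache-and-flush idea (the spool directory and the
upload function are made up):

  import os, shutil

  SPOOL = "/var/cache/debci/outgoing"   # hypothetical local spool dir

  def publish(result_dir, upload_to_swift):
      """Upload a result dir; on failure keep it in the local spool."""
      try:
          upload_to_swift(result_dir)
      except Exception:          # network/auth failure of any kind
          os.makedirs(SPOOL, exist_ok=True)
          shutil.move(result_dir, SPOOL)

  def flush_spool(upload_to_swift):
      """Run before publishing new results: retry cached uploads."""
      if not os.path.isdir(SPOOL):
          return
      for entry in os.listdir(SPOOL):
          path = os.path.join(SPOOL, entry)
          upload_to_swift(path)
          shutil.rmtree(path)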

> == controller problems ===
> 
> "britney or CI request a test ... for which no workers are available"
> 
> May be that's the answer to my previous question ;)

There I mean "we don't have any workers which could process that
queue", not "all of the suitable ones are currently busy".

Thanks!

Martin

-- 
Martin Pitt                        | http://www.piware.de
Ubuntu Developer (www.ubuntu.com)  | Debian Developer  (www.debian.org)


