Redesigning the autopkgtest controller/workers CI system for Ubuntu and Debian

Wed Mar 19 22:38:32 UTC 2014

On Fri, Mar 14, 2014 at 01:13:35PM +0100, Martin Pitt wrote:
> >   But right now you can already drive test runs for multiple
> >   architectures from a single debci setup. e.g. you could for instance
> >   already run tests for any architecture supported by qemu user
> >   emulation -- although I didn't test that yet, nor think it's useful as
> >   a test scenario.
> 
> By "single debci setup" do you mean a single server? Or a group of
> servers where one acts as a controller and the rest as workers (which
> could have different architectures and/or adt-runners)? I suppose
> the former, but I'd like to confirm.

With the code from _today_, the former.

> > - multiple backends, so it's now already possible to implement a new
> >   backend that will run the tests in other ways than just using
> >   adt-virt-schroot locally.
> 
> Right, I've seen that. That looks straightforward to extend to e. g.
> using LXC etc., from a worker machine POV. As far as I can see, there
> is no job distribution to a set of workers on different machines yet,
> right? That's the part that I'd like to use AMQP for.

Right. My idea is to enable this job distribution by implementing a
`remote` backend that will distribute the jobs to other workers, but in
a way that is transparent to the debci master/controller.

> > >  * A swift installation, providing sufficient storage space and
> > >    redundancy. We already have one for CI/QA in Ubuntu, and we'll need
> > >    to set up one for Debian (that's the only bit that actually
> > >    requires some thought and knowledge).
> > 
> > > How much space do you expect this to require?
> 
> However much space we currently need to keep all the logs and all the
> artifacts from all the runs at least for one given release, times the
> number of replications. After a distro release we can probably delete
> most of the logs and just keep the most recent one for each package.
> So maybe 50 GB or so, with three replicas to provide redundancy
> and failover?
> 
> > Is something more complicated than a simple filesystem location
> > really needed?
> > 
> > I mean, that filesystem location could be backed up by any distributed
> > FS, really, but do you really need the tools to care?
> 
> Network file systems have different semantics than local file systems,
> so I think we do need to care. E. g. what we discussed about creating
> unique file names and locking, that's something which is easy on a
> single local fs and hard/impossible on a distributed FS.
> 
> Other than that, swift is "just" a distributed FS, but one which
> avoids SPOFs (single points of failure), unlike e.g. NFS. We are using
> that as standard technology in Canonical. But if you don't like it for
> Debian, it's fairly easy to provide support for both. After all, the
> interface is rather small:
> 
>   store_logs_for_test(local_directory)
> 
> for the workers, and something like
> 
>   get_package_list()
>   get_logs_for_package(package)
> 
> for the web UI and britney (probably a little refined to also query
> for arch, etc.)

Fair enough. Let's see how this plays out.
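
To make the "simple filesystem location" option concrete, here is a rough
sketch of a local-directory backend for the three calls quoted above. The
class name, the root/<package>/<run_id>/ layout and the extra `package` and
`run_id` parameters are assumptions for illustration only; neither debci nor
the PoC worker is known to look like this.

```python
import os
import shutil


class LocalLogStore:
    """Local-directory backend for the three-call interface quoted above.

    Hypothetical layout: <root>/<package>/<run_id>/ holds one test run.
    """

    def __init__(self, root):
        self.root = root

    def store_logs_for_test(self, local_directory, package, run_id):
        # package and run_id are assumed extra parameters; the quoted
        # interface only names local_directory.
        dest = os.path.join(self.root, package, run_id)
        # copytree creates missing parent directories and fails if dest
        # already exists, so a run is never silently overwritten.
        shutil.copytree(local_directory, dest)
        return dest

    def get_package_list(self):
        if not os.path.isdir(self.root):
            return []
        return sorted(
            name for name in os.listdir(self.root)
            if os.path.isdir(os.path.join(self.root, name))
        )

    def get_logs_for_package(self, package):
        pkg_dir = os.path.join(self.root, package)
        if not os.path.isdir(pkg_dir):
            return []
        return sorted(os.path.join(pkg_dir, run) for run in os.listdir(pkg_dir))
```

A swift-backed class with the same three methods could then be swapped in
without the callers noticing.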

> > I think an architecture very similar to this could be implemented by
> > extending debci:
> 
> Yes, I agree. Some notes:
> 
> > 
> > - add a new `remote` backend, which will implement test runs as
> >   follows:
> > 
> >   - publish a message to an autopkgtest_* queue asking for a given
> >     package to be tested
> 
> Right, this will be some 10 lines of Python to connect to an AMQP
> server and issue a request.

Exactly. :)
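
Something along these lines, as a sketch only. The queue naming scheme
(autopkgtest_ + architecture) and the JSON message format are made up here,
and `channel` is assumed to be any object with pika-style queue_declare()
and basic_publish() methods:

```python
import json


def request_test(channel, package, arch, queue_prefix="autopkgtest_"):
    """Publish a test request for one package on a per-architecture queue.

    `channel` is any object offering pika-style queue_declare() and
    basic_publish(); the queue name and message body are hypothetical.
    """
    queue = queue_prefix + arch
    # Declaring the queue is idempotent; durable=True asks the broker to
    # keep the queue itself across restarts.
    channel.queue_declare(queue=queue, durable=True)
    body = json.dumps({"package": package, "arch": arch})
    # The default exchange ("") routes by queue name.
    channel.basic_publish(exchange="", routing_key=queue, body=body)
    return queue
```

With pika, the channel would come from
pika.BlockingConnection(pika.ConnectionParameters("localhost")).channel().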

> Also, I wouldn't like only a debci setup to be able to issue requests.
> We really want britney (both in Debian and Ubuntu) to issue requests,
> and it also seems useful to have some command line admin tools for
> "retrigger test for package foo", or "retrigger all packages", or
> "retrigger all failed packages since yesterday" (e. g. if you fix a
> bug which broke testing).

This already exists. debci has 2 binaries for this: `debci`, which
triggers a run of the entire archive, and `debci-test`, which
processes a single package (and is used by the former). In both cases,
a package is only actually processed if something in its dependency
chain (including the package itself) has changed since the last time
it was processed.
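
The idea behind that check, reduced to a sketch (this is not the actual
debci code, and the name -> version dictionaries are an assumed
representation of the dependency chain):

```python
def needs_run(current_versions, last_tested_versions):
    """Return True if the package or anything it depends on has changed.

    Both arguments map package name -> version string for the package under
    test plus its whole dependency chain (hypothetical representation).
    last_tested_versions is None when the package was never tested.
    """
    if last_tested_versions is None:
        return True
    # Any added, removed or re-versioned entry makes the snapshots differ.
    return current_versions != last_tested_versions
```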

I am still working on this external API, so e.g. these binaries should
probably get names that better reveal their intent.

> >   - wait for the test to finish
> 
> As I wrote I'd like to implement this by checking the results in the
> distributed FS, so that stateless clients like britney can do this at
> any time, and be robust against temporary failures. If we'd send a
> message back from the worker to the requestor, this could easily get
> lost.
> 
> >   - collect results (log file + adt-run exit status)
> 
> That would then come from the distributed FS.

I see your point. I still need to figure out how exactly to implement
this on debci, but it shouldn't be hard.
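
One possible shape for it is a plain polling loop on the requestor side. In
this sketch `fetch_result` is a hypothetical callable that looks in the
result store and returns None while nothing is there yet; the timeout and
interval defaults are arbitrary:

```python
import time


def wait_for_result(fetch_result, package, timeout=3600.0, interval=30.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll the result store until a result for `package` shows up.

    fetch_result(package) returns the result, or None if it is not there
    yet. Returns the result, or None once `timeout` seconds have passed.
    """
    deadline = clock() + timeout
    while True:
        result = fetch_result(package)
        if result is not None:
            return result
        if clock() >= deadline:
            return None
        sleep(interval)
```

Since nothing is kept in memory between polls, a requestor that crashes can
simply start polling again later.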

> > - add a remote worker daemon that would listen to the queue, run the
> >   tests against a local backend (schroot, or lxc/kvm when those are
> >   available) and send the results back in a results queue (log file +
> >   autopkgtest exit status)
> 
> Right. That's the toy PoC worker that I have in my +junk branch. With
> AMQP doing all the real work, this is really just a bit of glue code
> between AMQP, adt-run, and the distributed fs (swift in the PoC). I
> must say it's delightfully simple, the whole worker has less code than
> a single Jenkins job configuration XML :-)

:)
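
For the record, the glue I have in mind is roughly the following. This is a
sketch and not the code from the +junk branch: `run_test` and `store_logs`
are hypothetical callables standing in for the adt-run invocation and the
upload to the result store, and the JSON message format is assumed.

```python
import json
import tempfile


def handle_request(body, run_test, store_logs):
    """Process one queue message: run the test, then store its output.

    run_test(package, output_dir) -> exit status of the test run
    store_logs(package, output_dir, exit_status) -> uploads the results
    Both callables are placeholders supplied by the caller.
    """
    request = json.loads(body)
    package = request["package"]
    # The scratch directory is removed once the results have been stored.
    with tempfile.TemporaryDirectory(prefix="adt-") as output_dir:
        exit_status = run_test(package, output_dir)
        store_logs(package, output_dir, exit_status)
    return exit_status
```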

> > - make britney read from debci data API (which is the plan AFAICT for
> >   using the autopkgtest test results in Debian testing migration)
> 
> I suppose you mean the .json files? Yes, that would work. In fact, I
> think we should write the workers so that they already spit out the
> data format that we desire (i. e. json). Then all this would happen
> completely asynchronously and not depend on a central debci instance to
> collect, read and convert the data from the workers (remember, we are
> trying to eliminate single points of failure).
> 
> So the "data API" would then just be "check the distributed FS", or
> rather the above mini-API like get_logs_for_package(package) which could
> then be backed by swift or a simple local fs (if you like that better
> for Debian).

Sure. Let's see how it plays out.

> > The only point that doesn't fit with your ideas is the fact that each
> > debci run currently _waits_ for all package tests to finish. The reason
> > for that is very prosaic: it's just to be able to generate the needed
> > indexes (i.e. consolidated .json files in /data/$suite-$arch/) in a
> > concurrency-free way (also the overall run is inside a critical section
> > so you cannot have two concurrent debci runs for the same suite/arch
> > pair).
> 
> What is a "debci run"? A test run for a set of packages? Then this indeed
> sounds like a SPOF again: we don't want to block the complete
> machinery because a single test worker goes AWOL, or e. g. the workers
> for a particular architecture are currently under maintenance.
> 
> Or do you mean "debci run" == some regular invocation of debci to
> update the aggregated .json and web ui files for the latest runs?
> That's presentation only and sounds fine, as it neither interferes
> with nor blocks britney and the workers.

Currently a debci run does both:

  - processing every package that has a reason to have its tests run
    (updated dependencies, enough time since last run etc)
  - creating the global data based on the .json files for individual
    packages.

I agree that these should be decoupled.
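
Once decoupled, the index generation could be a standalone step that only
reads whatever per-package files exist at that moment. A sketch, assuming
one <package>.json file per package in a single directory (not necessarily
the layout debci uses) and an index file that lives outside that directory:

```python
import glob
import json
import os


def build_index(packages_dir, index_path):
    """Merge every per-package .json file under packages_dir into one index.

    Returns the number of packages written to the index.
    """
    entries = []
    for path in sorted(glob.glob(os.path.join(packages_dir, "*.json"))):
        with open(path) as f:
            entries.append(json.load(f))
    tmp_path = index_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(entries, f)
    # os.replace is atomic on POSIX when both paths are on the same
    # filesystem, so readers never see a half-written index.
    os.replace(tmp_path, index_path)
    return len(entries)
```

This never waits for running tests, so it could be run from cron as often
as needed.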

> I intend the whole machinery to be 3 different parts which are completely
> independent of each other: britney (or a CLI, or a button on some web
> UI) triggers requests, AMQP does the task distribution to worker nodes,
> and the web UI is only for presenting the data (well, it could also
> grow some "admin" functionality to re-trigger tests etc., but that
> wouldn't be its main purpose). That makes it easy to test new web UI
> versions locally or for the web UI to go down without breaking britney
> and propagation of packages into the archive/testing. It also keeps
> the presentation layer from being a SPOF (such as Jenkins currently is).

Yeah. One of the reasons to make the web UI static HTML + javascript was
to keep it purely about presentation. But then stuff like search etc.
gets complicated. We will have a GSoC student looking into this.

> > Depending on the latency that you need, that may not even be a problem.
> > For Debian, given that the archive updates 4 times a day, with
> > enough workers running in parallel that would not be a problem IMO. How
> > often does the Ubuntu archive update?
> 
> It updates about two or three times per hour, so about 50 times a day.
> But that only matters for the delay that you have between a package
> upload and the package (and its binaries) appearing on the http
> archive so that the tests can actually install them.
> 
> The test machinery shouldn't care. It should get triggered by britney
> when new packages to be tested get built and are found to be
> installable. So in Ubuntu we get a stream of package requests all the
> time, while in Debian we get a big batch of test requests four times a
> day; the overall worker load (i. e. how many we need per arch) should
> be roughly the same though?

Probably. Since debci will already pick all packages that have reasons to
be tested, one could schedule it to run every minute, and it will test
stuff as soon as it gets into the archive.

> > > If you think it's helpful, we can also organize a Google Hangout and
> > > talk face to face sometime soon?
> > 
> > That would be nice I think. Late next week should work for me.
> 
> Great! Do you have a Google account for hangout? (Fine to send it to me
> by private mail, of course).

My work email is a google account: antonio.terceiro at linaro.org

-- 
Antonio Terceiro <terceiro at debian.org>