[Daca-general] scan-build and metrics gsoc proposals and DACA

Raphael Geissert geissert at debian.org
Sun Mar 17 17:46:38 UTC 2013


Hi Sylvestre, Zack, everyone,

While going through the list of GSoC proposals I found two that are
closely related to my idea of DACA.

Sylvestre's scan-build proposal faces essentially the same problems DACA
does (job scheduling and data reporting). Moreover, if the tool evolves and
is run as a service, it will probably run into the same problems DACA has
faced: data storage, and tracking package and tool versions.

The current project proposal seems a bit thin to me, at least as far as
the description goes; scalability is not even mentioned.
The results can obviously be published somewhere under qa.d.o/daca; but
just so that there is no confusion: there is no "magic" behind the HTML
reports, only some makefiles and PHP scripts whose output is stored as
static files.
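
To make that concrete, the report generation amounts to roughly this
(a sketch in Python; the directory layout and file names are invented
for illustration, it is not the actual DACA code):

    import os

    RESULTS_DIR = '/srv/daca/results'  # hypothetical: stored tool output
    OUTPUT_DIR = '/srv/daca/htdocs'    # hypothetical: static docroot

    def render_package(pkg, lines):
        # One page per package: a plain list of the reported issues.
        items = '\n'.join('<li>%s</li>' % l.strip() for l in lines)
        return '<html><body><h1>%s</h1><ul>%s</ul></body></html>' % (pkg, items)

    for pkg in sorted(os.listdir(RESULTS_DIR)):
        report = os.path.join(RESULTS_DIR, pkg, 'report.txt')
        if not os.path.exists(report):
            continue
        with open(report) as f:
            html = render_package(pkg, f.readlines())
        with open(os.path.join(OUTPUT_DIR, pkg + '.html'), 'w') as f:
            f.write(html)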

Then there is Zack's proposal about metrics, which addresses another goal
DACA aims at (admittedly, there hasn't been much visible progress on that
side).
DACA is a potential source of plenty of data points: How many issues of
type X did tool T report for a given (pkg, pkgversion, tool, toolversion)
tuple? How does that compare to the same package and version run with a
newer or older version of the tool, or with a different set of options
(e.g. experimental options)? What about two tools reporting an issue on
the same line of code? Has the number of issues reported by a given tool
decreased over time? Is there a bump on .0 versions? And so on.
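
Just to illustrate the kind of queries I mean, here is a sketch over a
made-up relational layout (the schema and the example tool/issue names
are invented for the example; this is not how DACA stores anything today):

    import sqlite3

    db = sqlite3.connect(':memory:')
    db.execute('''CREATE TABLE issues (
        pkg TEXT, pkgversion TEXT, tool TEXT, toolversion TEXT,
        issue_type TEXT, file TEXT, line INTEGER)''')

    # How many issues of type X did tool T report, per
    # (pkg, pkgversion, tool, toolversion) tuple?
    per_tuple = db.execute('''
        SELECT pkg, pkgversion, toolversion, COUNT(*)
        FROM issues
        WHERE tool = ? AND issue_type = ?
        GROUP BY pkg, pkgversion, toolversion''',
        ('cppcheck', 'nullPointer')).fetchall()

    # Two tools reporting an issue on the same line of code:
    overlap = db.execute('''
        SELECT a.pkg, a.file, a.line, a.tool, b.tool
        FROM issues a JOIN issues b
          ON a.pkg = b.pkg AND a.pkgversion = b.pkgversion
         AND a.file = b.file AND a.line = b.line
        WHERE a.tool < b.tool''').fetchall()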

And that's just the tip of the iceberg. The real problem is proper job
scheduling and data processing (to, e.g., generate the "dumb HTML"
reports, combined views, etc.).

So here's where, I believe, all three projects meet: there is currently no 
proper infrastructure for doing that kind of thing.

For DACA I have an initial implementation of such a system that uses
gearman jobs to do everything from notifying of a new package version to
responding to that "job" and triggering further jobs ("get the list of
tools", "call every tool", "store result", "notify of new result", "get
the list of result analysers", "trigger new jobs for every single tool",
etc.).
This started well, and the idea seemed good at first sight, since you can
connect multiple job servers and workers; but that covers only one part
of the problem.
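
For reference, the shape of it is roughly this (a minimal sketch using
python-gearman; the task names and the hard-coded tool list are
placeholders, not the real implementation):

    import gearman

    TOOLS = ['cppcheck', 'scan-build']  # stand-in for "get the list of tools"

    client = gearman.GearmanClient(['localhost:4730'])

    def on_new_package(worker, job):
        # "notify of new package version": fan out one job per tool
        for tool in TOOLS:
            client.submit_job('run_tool', '%s %s' % (tool, job.data),
                              background=True)
        return 'queued'

    def on_run_tool(worker, job):
        tool, pkg = job.data.split(' ', 1)
        # ... run the tool on the package and store the result here ...
        client.submit_job('new_result', job.data, background=True)
        return 'done'

    worker = gearman.GearmanWorker(['localhost:4730'])
    worker.register_task('new_package_version', on_new_package)
    worker.register_task('run_tool', on_run_tool)
    worker.work()  # blocks, serving jobs from the gearman server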
It seems like what is actually needed is something like Hadoop and
friends, and that's the point where I'm currently stuck with DACA. We
don't even have the proper stack in Debian.

What do you think about all this? Would it be better to rethink the
proposals a bit and try to come up with something bigger (but split up so
that more than one student can work on it)?


Cheers,
-- 
Raphael Geissert - Debian Developer
www.debian.org - get.debian.net


