[Buildd-tools-devel] RFC: Creation of a PostgreSQL database schema for wanna-build data

Sun Jan 25 19:05:58 UTC 2009

On Sat, Jan 24, 2009 at 05:58:46PM +0100, Adeodato Simó wrote:
> Hello, Roger, thanks for pushing this forward!
> 
> > We would be grateful for any comments you might have, particularly
> > if you have used the existing wanna-build and can see any defects,
> > omissions, or any suggestions for anything we missed or could
> > implement in a more optimal fashion.
> 
> Okay, I have a number of comments, suggestions, wishes, and patches.
> They're quite a few, so please take your time. Hopefully more people can
> join, letting us know what they think of them, and verifying I'm not
> talking shit.

Most of the suggestions are good and well reasoned, though I'm not sure
I agree completely with all of them.  However, I don't make any claims
to being a database expert, so I'm always willing to change to doing
things another way if I've not fully considered the best way to do
something.

I applied all three of the patches you sent.  The current state of the
schema is at this head:

  http://git.debian.org/?p=users/rleigh/sbuild.git;a=shortlog;h=refs/heads/wb-backend

with

  http://git.debian.org/?p=users/rleigh/sbuild.git;a=tree;f=db;h=d16c7e18d8bde6349fbd2323f256767dfc92794a;hb=692678942060152824a727b20b9aa69e14444191

being the current tree.  As I mentioned on IRC, this tree contains three
implementations of the debversion type for Postgres, implemented in
PL/Perl, PL/PerlU and C.  Once we have decided on the appropriate
speed/security/code duplication tradeoff, we can pick one of these to
go with.

> Also, I'm happy to provide patches for all of these issues once we have
> reached an agreement on the list, so please don't take all this as a
> "fix it fix it fix it" mail, and feel free to just comment on these
> issues. OTOH, if you prefer to prepare the patches yourself, I'm very
> much okay with that too, though I'd appreciate if you posted them here
> first (with git-send-email or similar), so there's a chance for review
> before committing.

No problem.  One thing following from your patches and this thread is
that there is a lot of similarity between the dak projectb schema and
this one.  As a result, I think it makes sense to do things like dak
does where possible, to make things as compatible as possible.  Quite
a few of your suggestions take us in this direction anyway.

I've attached a copy of some changes I make to do this, but this is
just for a couple of tables so far.  Comments?

> So, in no particular order, let's go over them:
> 
> 1. Permission handling. I see nothing about this in the design, but I
>    thought we'd be needing something in the DB? (In particular, I don't
>    have a clear picture how updates to the DB are going to be handled,
>    eg. vorlon at raff giving back a package under his UID.)
> 
>    Some thoughts for this: I would like for the builder table to have
>    "arch" and "admin" columns, so that buildd ownership can become the
>    primary source of information about who can do what. And then we need
>    a place to store permissions for users that don't admin buildds, or
>    extra permissions for certain admins.
> 
>    "admin" in builder should be a foreign key to a developers or similar
>    table, which should have at least an "address" NOT NULL column (since
>    I don't want to duplicate the information of who should be behind
>    each of the <arch>@buildd.d.o aliases).

I hadn't considered this.  Your suggestion for the builder table sounds
fine.  Would it be appropriate for non-admin developers to have the
ability to do some tasks (such as sheduling a binNMU on all buildds)?

> 2. [trivial] I'd like for the builder table to have an additional
>    "address" NOT NULL column, as to not duplicate the list of builders
>    in the DB and the aliases file.

Sure.  Where is the aliases file currently located?  Is this just a
standard /etc/aliases file?

Does this want to be a one-to-many mapping to cope with multiple
admins for a system?

> 3. I'm doubtful about the "all" and "any" rows in the architectures
>    table. architectures is being referenced from tables with Packages
>    or Sources information, but also from stuff like build_jobs (and, as
>    per above, possibly builder too). And "all" or "any" don't really
>    make sense for this latter case.

>    Maybe we should create a view "pkg_architectures" that grabs
>    architectures and unions "any" and "all" on top of it, and have
>    sources and binaries reference the view? (Can a foreign key point to
>    a view?)

I'm not sure, I'll have to check.  If it's not possible, then we could
have two separate tables for the two separate purposes.  Or alternatively,
add an extra column to architectures to determine where it may be used
(and add a constraint on that).

>    A case could be made that "all" really belong in architectures, since
>    there could be an autobuilder for arch:all packages, but we can
>    easily change that later on. (And I don't think an arch:all
>    autobuilder would be handled that way anyway.)

Agreed, you would probably just configure one architecture to build arch-all
packages in addition to arch-any.  I'll check up on the use of views (do you
have an example of how to union data in a view?).  My current thought is that
the primary key would be a bit screwy, but maybe it would be OK.

> 4. Would it be useful to have the priority sortable from within the
>    database? Then it doesn't have to be done in software. (This is just
>    an idea, and I'm not completely sure it's a good one. Note: if we do
>    this, ensure auto-insertion works well with it.)

We can easily add a numerical priority and sort on that.  I guess we could
just default the priority to allow easy insertion at a low priority.

> 5. [nitpick] Could we do without listing all package sections? They'll
>    be auto-inserted anyway.

Yes, I already added the auto insertion, which adds package sections and
priorities if missing:

============================================================================
CREATE OR REPLACE FUNCTION package_checkrel() RETURNS trigger AS $package_checkrel$
BEGIN
  PERFORM name FROM package_sections WHERE (name = NEW.pkg_section_name);
  IF FOUND = 'f' THEN
    INSERT INTO package_sections (name) VALUES (NEW.pkg_section_name);
  END IF;
  PERFORM name FROM package_priorities WHERE (name = NEW.pkg_priority_name);
  IF FOUND = 'f' THEN
    INSERT INTO package_priorities (name) VALUES (NEW.pkg_priority_name);
  END IF;
  RETURN NEW;
END;
$package_checkrel$ LANGUAGE plpgsql;
COMMENT ON FUNCTION package_checkrel ()
  IS 'Check foreign key references (package sections and priorities) exist';

CREATE TRIGGER checkrel BEFORE INSERT OR UPDATE ON sources
  FOR EACH ROW EXECUTE PROCEDURE package_checkrel();
COMMENT ON TRIGGER checkrel ON sources
  IS 'Check foreign key references (package sections and priorities) exist';

CREATE TRIGGER checkrel BEFORE INSERT OR UPDATE ON binaries
  FOR EACH ROW EXECUTE PROCEDURE package_checkrel();
COMMENT ON TRIGGER checkrel ON binaries
  IS 'Check foreign key references (package sections and priorities) exist';
============================================================================

I'll remove the redundant list.

> 6. Maintainer and Uploaders handling. I would like for "uploaders" to be
>    a proper one-to-many relationship, because the new web frontend will
>    be querying by uploader.

Agreed.  This is already how dak does it, and it's the best way to do it.
We should just copy what dak does here.

>    Also, I don't think we need per-version maintainer/uploaders
>    information. Instead, the tool that populates the DB should just use
>    the information from the most recent version across suites.

This might make insertion of data quite slow.  The current schema would
allow one to write everything from Pacages/Sources into binaries/sources
and then update the dist_binaries/dist_sources to update the current
packages in a given component/section.

On the other hand, if we were to write the entire Packages/Sources into
a temporary table, it could be fairly simple to do the version checks.

>    (Oh, and we should just keep the maintainter/uploaders *addresses*.)

What is the reasoning for this?

> 7. [distant future, no action needed now, but see the last paragraph in
>    this item] I think we should consider killing the P-a-s file as
>    input, and put that information in the DB. (P-a-s or similar would
>    still be created as output for third parties.)
> 
>    Then, source_architectures would be a view of (eg.)
>    source_arch_from_control and source_arch_override (P-a-s).
> 
>    (Or, well: without the above, where is P-a-s information merged into
>    the DB? Directly into source_arch? Maybe we can retain P-a-s as
>    input, but use already the two-table scheme I mentioned in the
>    paragraph above, writing P-a-s into source_arch_override.)

I think having a general mechanism to dump/restore P-a-s would be useful.
I think having a separate list of arches (control and override) is
sensible.  I also like the idea of having a view on this, but depending on
what we want to do, we might again need to check about foreign keys and
views.

> 8. Is binary package information needed for anything else than
>    dep-waits? </ignorance> Does that table need section and priority?

I don't think it's strictly needed, but might possibly be useful for
front-ends.  What do others think?

> 9. History. In your introductory mail, you mention that the database is
>    meant to keep full history; I infer this mainly refers to the
>    "build_jobs" table. I have reservations about this design (not about
>    the idea of keeping full history itself), which I'll explain now.
> 
>    I don't like very much the idea of keeping the state change history
>    (which is what build_jobs seems to do) together with the information
>    about the current (most recent) state. One concern would be
>    performance: how will fetching the current state for one package
>    perform in 5 years time? (This concern may just be due to me being
>    ignorant about PostgreSQL real capabilities. Also, I see you
>    mentioned pruning the table on IRC; see below about this.)

Indexing on source package and time would allow one to easily pull out
the last (current) job state.

There is also another concen about space usage.  If we never delete old
data, the database could blow up to quite a large size.  While it's
hard to predict just how much space would be used until it's tested,
I think having a cron job running weekly/monthly to remove older data
would mitigate this (and also have the side effect of making queries
faster).

If you check the schema, you'll see quite a few of the foreign key
references have ON DELETE CASCADE added to the constraints.  For
the build jobs, this means that if an architecture or component is
removed, all jobs referring to the deleted arch/component in this
table will automatically be deleted.  Additionally, when a source
package is removed from sources, the state changes associated with
that pacakge will auto trigger automatic deletion.  As a result,
keeping /n/ previous revisions and/or having a time limit on
source package lifetime will result in automatic pruning of the
jobs table.

>    Another concern would be clumsiness of queries. For example, how
>    would one obtain the current state in unstable for packages foo, bar,
>    and baz? (I know current state is easy to obtain for a single
>    package, with eg. "ORDER BY ctime DESC LIMIT 1", but I can't figure
>    out how to get it for multiple packages with one query. And the
>    front-end *will* want to query for multiple packages. This, again,
>    may be a lack of SQL knowledge on my part, apologies if that's the
>    case.)

My SQL knowledge is by no means expert, but I think you could use
GROUP BY and an aggregate function such as max() to obtain the
information.  However, I think a view on this would provide a much
nicer interface :)  And views also have the advantage that the
database can optimise things behind the scenes.

>    What I had imagine we would have, prior to reading this design, was a
>    build_state or similar table, that just records the current state for
>    each source/distribution/arch. And a separate build_state_history
>    table that would record every state change.

With a view, we would have just this.  OTOH, if this turns out to
be inefficient and/or ugly, I would not object to another table.

One question I do have is exactly what we want to store in build_state_history?
Would it be a subset or superset of what's in build_jobs?

>    On IRC, there was talk about periodically pruning the sources table,
>    and hence pruning the history on build_jobs as well. I think sources
>    should always be prunned as soon as a version is no longer present in
>    any of the distributions, but I also think build_state_history
>    shouldn't be prunned at all, and it should just lose referential
>    integrity after prunning sources. (This is of course my opinion, and
>    the matter is open for discussions.)

I'm not sure that's possible with postgres WRT losing referential
integrity.  You can choose to set the column to NULL or DEFAULT, but you
can't keep the invalid column(s) when the reference goes away.

We could for this table not have any referential integrity at all I
guess, but that might make cleaning it up annoying, as well as
slowing down joins when querying it.

>    Marc said he wouldn't know what to do with that state change history,
>    but I'd say "why not have it?". There's always time for pruning if
>    nobody ever finds a use for it, or it takes up too much space.
> 
>    Thoughts?

On buildd.debian.org, you can currently look at the history going back a
few versions for packages on all arches (this is just the build logs).
Having a more detailed view of that would be useful for maintainers to
check for problems with their packages being built, and also for looking
at the performance of buildds themselves, such as the delay between
needs-build and building, build times, etc.

> 10. Build logs. I'd really like for the DB to store a table with all the
>     available logs: source, version, arch, date, state, and path.

This could be stored as a separate table, or in build_job_properties.
A separate table would probably be cleaner.

I'd also like to store the build time and space statistics (currently
stored by sbuild in a GDBM database) stored here along with the logs.

> 11. I see from your 38dd87a commit message that the "log" table is meant
>     for "storing changelog". I've searched the IRC logs of your
>     conversations about this schema, but I couldn't find a reference to
>     it. Storing changelog of what, precisely?

This is to store what's currently in the "transaction log", i.e. a
log of all main database activity such as package additions, state
changes etc.  Perhaps the database isn't the best place for this, but
since you might be connecting remotely, I wasn't sure if a logfile
was that great either.  This would record "when", "who" and "what"
for each main change, useful for debugging or tracking down mistakes.

I'll be happy to remove it if this is a Bad Idea.

> 12. binNMU handling. Marc mentioned en-passé that the binNMU
>     version/level should be a column in build_job/build_state, and I
>     agree. I'm just not very sure if we want history of these, and how
>     they'd fit with/into build_state_history.

I'm not sure why you wouldn't want to store them in the history.  Surely
they should show up as a needs-build state change to trigger the build,
with subsequent state changes (uploaded/installed etc.) tracked?  Since
you can't store them in packages, I think they should be installed in the
job/state record (perhaps defaulting to NULL for the typical non-NMU
case).

> 13. The _name suffix everywhere annoys me a bit, since I like having
>     tables that can be natural join'ed, or using(foo) join'ed. Maybe
>     this is just me, and the _name suffix is excelent SQL practice, but
>     I thought I'd mention to see what other think.

If it makes queries simpler, and more understandable, then I'm all for it.

> 14. I'm attaching 3 trivial patches that hopefully nobody will find
>     objectionable.

These are now applied.

> 15. Could we name the database "wannab"?

I'm not too bothered what it's called.  Any other suggestions?
On the wanna-build side, this will be made configurable in any case,
so you can name it whatever you like in the connect string.

Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: db-dakconv.patch
Type: text/x-diff
Size: 9717 bytes
Desc: not available
Url : http://lists.alioth.debian.org/pipermail/buildd-tools-devel/attachments/20090125/473fc9df/attachment.patch 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 197 bytes
Desc: Digital signature
Url : http://lists.alioth.debian.org/pipermail/buildd-tools-devel/attachments/20090125/473fc9df/attachment.pgp