[Buildd-tools-devel] RFC: Creation of a PostgreSQL database schema for wanna-build data

Sat Jan 24 16:58:46 UTC 2009

Hello, Roger, thanks for pushing this forward!

> We would be grateful for any comments you might have, particularly
> if you have used the existing wanna-build and can see any defects,
> omissions, or any suggestions for anything we missed or could
> implement in a more optimal fashion.

> [1] http://git.debian.org/?p=users/rleigh/sbuild.git;a=tree;f=db;h=26f19b8aef033bc07fdba1c33fa32c80b78ce468;hb=38dd87a21b925eca2c936adb9530530b1854cd07

Okay, I have a number of comments, suggestions, wishes, and patches.
They're quite a few, so please take your time. Hopefully more people can
join, letting us know what they think of them, and verifying I'm not
talking shit.

Also, I'm happy to provide patches for all of these issues once we have
reached an agreement on the list, so please don't take all this as a
"fix it fix it fix it" mail, and feel free to just comment on these
issues. OTOH, if you prefer to prepare the patches yourself, I'm very
much okay with that too, though I'd appreciate if you posted them here
first (with git-send-email or similar), so there's a chance for review
before committing.

So, in no particular order, let's go over them:

1. Permission handling. I see nothing about this in the design, but I
   thought we'd be needing something in the DB? (In particular, I don't
   have a clear picture how updates to the DB are going to be handled,
   eg. vorlon at raff giving back a package under his UID.)

   Some thoughts for this: I would like for the builder table to have
   "arch" and "admin" columns, so that buildd ownership can become the
   primary source of information about who can do what. And then we need
   a place to store permissions for users that don't admin buildds, or
   extra permissions for certain admins.

   "admin" in builder should be a foreign key to a developers or similar
   table, which should have at least an "address" NOT NULL column (since
   I don't want to duplicate the information of who should be behind
   each of the <arch>@buildd.d.o aliases).

2. [trivial] I'd like for the builder table to have an additional
   "address" NOT NULL column, as to not duplicate the list of builders
   in the DB and the aliases file.

3. I'm doubtful about the "all" and "any" rows in the architectures
   table. architectures is being referenced from tables with Packages
   or Sources information, but also from stuff like build_jobs (and, as
   per above, possibly builder too). And "all" or "any" don't really
   make sense for this latter case.

   Maybe we should create a view "pkg_architectures" that grabs
   architectures and unions "any" and "all" on top of it, and have
   sources and binaries reference the view? (Can a foreign key point to
   a view?)

   A case could be made that "all" really belong in architectures, since
   there could be an autobuilder for arch:all packages, but we can
   easily change that later on. (And I don't think an arch:all
   autobuilder would be handled that way anyway.)

4. Would it be useful to have the priority sortable from within the
   database? Then it doesn't have to be done in software. (This is just
   an idea, and I'm not completely sure it's a good one. Note: if we do
   this, ensure auto-insertion works well with it.)

5. [nitpick] Could we do without listing all package sections? They'll
   be auto-inserted anyway.

6. Maintainer and Uploaders handling. I would like for "uploaders" to be
   a proper one-to-many relationship, because the new web frontend will
   be querying by uploader.

   Also, I don't think we need per-version maintainer/uploaders
   information. Instead, the tool that populates the DB should just use
   the information from the most recent version across suites.

   (Oh, and we should just keep the maintainter/uploaders *addresses*.)

7. [distant future, no action needed now, but see the last paragraph in
   this item] I think we should consider killing the P-a-s file as
   input, and put that information in the DB. (P-a-s or similar would
   still be created as output for third parties.)

   Then, source_architectures would be a view of (eg.)
   source_arch_from_control and source_arch_override (P-a-s).

   (Or, well: without the above, where is P-a-s information merged into
   the DB? Directly into source_arch? Maybe we can retain P-a-s as
   input, but use already the two-table scheme I mentioned in the
   paragraph above, writing P-a-s into source_arch_override.)

8. Is binary package information needed for anything else than
   dep-waits? </ignorance> Does that table need section and priority?

9. History. In your introductory mail, you mention that the database is
   meant to keep full history; I infer this mainly refers to the
   "build_jobs" table. I have reservations about this design (not about
   the idea of keeping full history itself), which I'll explain now.

   I don't like very much the idea of keeping the state change history
   (which is what build_jobs seems to do) together with the information
   about the current (most recent) state. One concern would be
   performance: how will fetching the current state for one package
   perform in 5 years time? (This concern may just be due to me being
   ignorant about PostgreSQL real capabilities. Also, I see you
   mentioned pruning the table on IRC; see below about this.)

   Another concern would be clumsiness of queries. For example, how
   would one obtain the current state in unstable for packages foo, bar,
   and baz? (I know current state is easy to obtain for a single
   package, with eg. "ORDER BY ctime DESC LIMIT 1", but I can't figure
   out how to get it for multiple packages with one query. And the
   front-end *will* want to query for multiple packages. This, again,
   may be a lack of SQL knowledge on my part, apologies if that's the
   case.)

   What I had imagine we would have, prior to reading this design, was a
   build_state or similar table, that just records the current state for
   each source/distribution/arch. And a separate build_state_history
   table that would record every state change.

   On IRC, there was talk about periodically pruning the sources table,
   and hence pruning the history on build_jobs as well. I think sources
   should always be prunned as soon as a version is no longer present in
   any of the distributions, but I also think build_state_history
   shouldn't be prunned at all, and it should just lose referential
   integrity after prunning sources. (This is of course my opinion, and
   the matter is open for discussions.)

   Marc said he wouldn't know what to do with that state change history,
   but I'd say "why not have it?". There's always time for pruning if
   nobody ever finds a use for it, or it takes up too much space.

   Thoughts?

10. Build logs. I'd really like for the DB to store a table with all the
    available logs: source, version, arch, date, state, and path.

11. I see from your 38dd87a commit message that the "log" table is meant
    for "storing changelog". I've searched the IRC logs of your
    conversations about this schema, but I couldn't find a reference to
    it. Storing changelog of what, precisely?

12. binNMU handling. Marc mentioned en-passé that the binNMU
    version/level should be a column in build_job/build_state, and I
    agree. I'm just not very sure if we want history of these, and how
    they'd fit with/into build_state_history.

13. The _name suffix everywhere annoys me a bit, since I like having
    tables that can be natural join'ed, or using(foo) join'ed. Maybe
    this is just me, and the _name suffix is excelent SQL practice, but
    I thought I'd mention to see what other think.

14. I'm attaching 3 trivial patches that hopefully nobody will find
    objectionable.

15. Could we name the database "wannab"?

Thanks,

-- 
Adeodato Simó                                     dato at net.com.org.es
Debian Developer                                  adeodato at debian.org

Testing can show the presence of bugs, but not their absence.
                -- Dijkstra