[Teammetrics-discuss] Updating commitstats

Andreas Tille andreas at an3as.eu
Tue Apr 23 09:51:32 UTC 2013


Hi Sukhbir,

On Tue, Apr 23, 2013 at 12:45:55AM -0400, Sukhbir Singh wrote:
> > status on several places is (obviously) the wrong approach.  If we build
> > a database on two different hosts (say for testing purpose or as we do
> > now an initial import) this will break import data.  Simply assume I
> > would create a test database instance at home and will run a daily
> > import.  The consequence would be that the status on vasks.d.o will be
> > updated daily and if our production import runs at the beginning of a
> > month it will see no new commits.  That's broken design.
> 
> I think I couldn't get my message across clearly. Maybe I did, but let
> me reiterate it:
> 
> The SVN data is stored in the database *only* (on blends) and the state
> is saved on vasks.

I *perfectly* understood this.  Whatever you store on vasks is
*metadata* that is obviously *relevant* for the datastore we are
creating (which was just proven by the initial database creation on the
new host).

> The reason we did this is because we were taking into
> account the number of lines changed for SVN commits, which is an
> intensive task for repositories that have thousands of commits and
> thousands of lines changed per commit.

Whatever might be the reason - the data should *not* be on vasks.  The
simplest way to approach this is to rsync / scp these metadata to the
same host that is creating the database (=blends.debian.net).

> So when we run svnstat, instead of fetching the state from blends, we
> just get it locally from vasks, and if a commit is found, we skip it and
> don't parse it (and therefore don't send it back).
> 
> > The consequence needs to be that we do the housekeeping *inside* each
> > database because the housekeeping and the data belong together and it
> > actually needs to be done in a *transaction* to make sure that the
> > housekeeping will fit the data status exactly.
> 
> I am not sure what you mean, but there is just one database that we
> populate and that is on vasks. Also, if a commit has already been parsed
> on vasks, it is not sent back, so the consistency remains.

Assume we have *two* hosts trying to create the teammetrics database:

  1. blends.debian.net
  2. competing.testhost.at.home

blends.debian.net runs the update job once per month and
competing.testhost.at.home runs the update at random times - say
twice a week.  *Both* hosts are updating the status on vasks.  If
competing.testhost.at.home runs the SVN statistics right now and
blends.debian.net runs on May 1st as usual, blends gets only those SVN
commits that were made between now (2013-04-23 11:45) and May 1st,
because it wrongly trusts the status on vasks, which claims that all
data from April 1st until now have already been recorded.  But they
were recorded by competing.testhost.at.home and not by
blends.debian.net.  So both hosts that are trying to build a database
get only a portion of the data they need to fetch.
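
To make the failure concrete, here is a minimal sketch in Python.  It
is not the real svnstat code: commits are reduced to plain revision
numbers, and the helper import_new() and the dictionaries standing in
for the state on vasks are hypothetical names made up for
illustration.

```python
def import_new(commits, state, key):
    """Return the commits newer than state[key] and advance state[key].

    `state` stands in for whatever "last seen commit" record is kept;
    this helper is hypothetical, not part of teammetrics.
    """
    last = state.get(key, 0)
    new = [rev for rev in commits if rev > last]
    if new:
        state[key] = max(new)
    return new


april = [1, 2, 3]   # revisions committed before the test host runs
may = [4, 5]        # revisions committed afterwards, before May 1st

# Broken design: one state record shared by every importing host.
shared_state = {}
home_db = import_new(april, shared_state, "svn")          # test host, today
blends_db = import_new(april + may, shared_state, "svn")  # blends, May 1st

# Alternative: each database keeps its own state.
blends_state = {}
blends_fixed = import_new(april + may, blends_state, "svn")

print(blends_db)     # the April revisions are missing
print(blends_fixed)  # complete
```

With the shared state blends never sees revisions 1 to 3, because the
test host has already marked them as done.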

> So technically, yes, I can just decide to save the state entirely on
> blends, but I am very sure that there was some reason we didn't do this.

IMHO the only reason we decided that way could be that we never thought
about such competing hosts.  I do not think that this situation is very
probable, but I admit I had considered it for some time, and if I had
been in Vipin's shoes I would definitely have done so.  Our initial
database creation is just a special case of this situation, which is
bound to fail.

> In fact, I had to put in extra work just to handle this special case of
> saving the state on vasks for SVN commits.

Hmmm, I do not remember this specific thing.  Sorry.

> > So the very quick hack to cure the situation above would be to also
> > store the svn data in /var/cache/teammetrics.
> 
> Ok. I can do that.

I'd be happy if you would do this.

> > The "real" solution would probably as I mentioned briefly in my past
> > mail that we need to store also these data inside the database rather
> > than in /var/cache/teammetrics.  This would enable us to do clean
> > backups of the database.  OK, with some proper backup method we could
> > also keep the dir /var/cache/teammetrics - but hmmm, I'm somehow lacking
> > the motivation to keep one part of the data in files and the other part
> > inside the database.
> 
> So do you mean to save the state in the database too? Other than the
> backup thing, is there any reason why you would want to do that?

I was just keeping consistency in mind.  IMHO it is a good principle to
keep your data in one place, and these housekeeping metadata are also
data that are relevant for more than just logging purposes (as we now
see).  I have no idea how much time you want to invest into this, but
if I were starting the GSoC project again now I would design it this
way.
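
A rough sketch of what I mean, in Python.  It uses an in-memory SQLite
database only so that it is self-contained; the table names
commitstat and import_state, the column names, and the function
import_commits() are assumptions for illustration and not the actual
teammetrics schema.  The point is merely that the new commits and the
"last imported revision" are written in one transaction, so the
housekeeping always fits the data.

```python
import sqlite3


def import_commits(conn, repo, commits):
    """Store commits newer than the recorded state and update that
    state in the same transaction.  `commits` is a list of
    (revision, author) tuples.  Returns the number of rows added."""
    with conn:  # commit on success, roll back if anything raises
        row = conn.execute(
            "SELECT last_revision FROM import_state WHERE repo = ?",
            (repo,),
        ).fetchone()
        last = row[0] if row else 0
        new = [(repo, rev, author) for rev, author in commits if rev > last]
        conn.executemany(
            "INSERT INTO commitstat (repo, revision, author) VALUES (?, ?, ?)",
            new,
        )
        if new:
            conn.execute(
                "INSERT OR REPLACE INTO import_state (repo, last_revision) "
                "VALUES (?, ?)",
                (repo, max(rev for _, rev, _ in new)),
            )
    return len(new)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE commitstat (repo TEXT, revision INTEGER, author TEXT)"
)
conn.execute(
    "CREATE TABLE import_state (repo TEXT PRIMARY KEY, last_revision INTEGER)"
)

# Hypothetical repository name and authors.
first = import_commits(conn, "example-team", [(1, "alice"), (2, "bob")])
second = import_commits(
    conn, "example-team", [(1, "alice"), (2, "bob"), (3, "alice")]
)
total = conn.execute("SELECT COUNT(*) FROM commitstat").fetchone()[0]
print(first, second, total)
```

Since the state lives in the same database as the data, a second
database on another host simply has its own state, and a backup of
the database contains everything needed to continue the import.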

> > BTW, in the debian-l10n team there need to be some names adjusted.
> > There is some Nicolas_F and Nicolas_F? as well as a user fzt.  In
> > debian-science there is sebastien-guest and sebastien and barbier-guest.
> 
> I will update it.

Thanks.

Kind regards

       Andreas.

-- 
http://fam-tille.de


