[Pkg-mediawiki-devel] [Xmldatadumps-l] WP-MIRROR 0.6 Released

wp mirror wpmirrordev at gmail.com
Mon Jun 3 15:31:05 UTC 2013

Dear Ariel,

1) Profiling

WP-MIRROR 0.6 saw the introduction of a `--profile' command-line
option.  Which provides a detailed breakdown of where time is spent
during a mirror build.  Unfortunately, WP-MIRROR 0.5 did not have this
feature, so only aggregate comparisons are possible.

2) Performance studies

Much of the Winter and Spring was devoted to time trials.  These are
all documented in WP-MIRROR 0.6 Reference Manual, Appendix G.  The
following may be very interesting for you

2.1) G.8 Experiments with InnoDB data compression

Currently, InnoDB offers two on-dist storage formats, Antelope and
Barracuda.  The later offers data compress.  I performed experiments
on many of the largest wiki tables (e.g. categorylinks, image,
langlinks, pagelinks, templatelinks, text) to determine the space
savings and time penalty. For a summary of results, please take a look
at Table G.1 Database table size v. ROW_FORMAT and KEY_BLOCK_SIZE.

2.2) G.9 Experiments with commonswiki.image

For several reasons, I wanted to import commonswiki.image as part of
the mirror build process.  That table is large and takes time to
import. However, once imported WP-MIRROR 0.6 saves a greater amount of
time elsewhere (e.g. I do not have to scrape the XML dumps for image
file names).

Because of the size of commonswiki.image, I performed a lengthy series
of experiments to determine the fastest way of importing it.  Many of
the best methods make use of features first offered in MySQL 5.5 and
InnoDB 1.1 (e.g fast index creation).  For a summary of results,
please take a look at Figure G.1 InnoDB Experiments Importing
commonswiki.image Table.

3) Documentation

The above mentioned WP-MIRROR 0.6 Reference Manual may be found at
<http://www.nongnu.org/wp-mirror/manual/>.  It is also included in
with the DEB package.

4) Questions

4.1) mwxml2sql

While reading the code for `mwxml2sql.c', I noticed `-n, --nodrop'
option which sets  KEY_BLOCK_SIZE=16 for the `text' table.  From my
experiments (see Appendix G.8) I do not think that any compression
takes place.  KEY_BLOCK_SIZE=4 might be a better choice.


For the initial mirror build, INSERT INTO commands get the job done.
However, for updates to an existing mirror, I would prefer not to DROP
TABLE every time.  For this reason, I rewrite the INSERT INTO commands
into REPLACE INTO commands.  This rewrite works fine.

What I would like to know from you is this:  for which table is INSERT
IGNORE a better choice than REPLACE INTO.  I can see that the
`revision' and `text' table are candidates because they are never
updated, only added to.  But what about the other tables?  Any advise
would be appreciated.

Sincerely Yours,

On 6/3/13, Ariel T. Glenn <ariel at wikimedia.org> wrote:
> Στις 03-06-2013, ημέρα Δευ, και ώρα 10:22 -0400, ο/η wp mirror έγραψε:
>> Dear list members,
>> I am pleased to announce the release of WP-MIRROR 0.6.
>> The main design objective was this:  PERFORMANCE.  WP-MIRROR 0.6 now
>> builds the `enwiki' (which is the most demanding case) with 80% less
>> time and 75% less memory than v0.5.
> Sounds great! Can you give us some benchmarks?  I'm particularly
> interested in length of time with the old and nw versions of your
> package for the various stages of setting up a dump of current pages for
> the English language Wikipedia, on whatever hardware you are using for
> testing.
> Ariel

More information about the Pkg-mediawiki-devel mailing list