[Pkg-mediawiki-devel] [Xmldatadumps-l] Preparing for wp-mirror-0.7

gnosygnu gnosygnu at gmail.com
Sat Jan 11 19:11:59 UTC 2014


Hi Kent.

I'll respond to a few points. The rest are outside my knowledge / scope
(EX: mwxml2sql; Debian).

Hope this helps, and good luck on wp-mirror.

----

> interlanguage links have been removed to the wikidata project, the
> rendering of which requires mediawiki-1.21+;
You will need a local instance of http://www.wikidata.org . You could
probably build it using the files from here:
http://dumps.wikimedia.org/wikidatawiki/

Each wiki would also need the wikibase extension:
https://www.mediawiki.org/wiki/Extension:Wikibase.

Note that wikidata is also being used in infoboxes. For example,
https://simple.wikipedia.org/w/index.php?title=Google&action=edit now has
multiple {{#property}} statements.

> infoboxes now require the scribunto extension which requires
> mediawiki-1.20+
I believe most of the wikis have moved the infobox generation from
Templates to Modules. They've moved a lot of other functionality as well
(for example: references and message boxes).

The Scribunto extension is at
https://www.mediawiki.org/wiki/Extension:Scribunto . It transforms a
{{#invoke:Module_name|function_name|arguments}} into the appropriate text.
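
To make the shape of such a call concrete, here is a minimal Python sketch
that splits a flat {{#invoke:...}} call into its module name, function name,
and arguments. It is only an illustration: the helper name parse_invoke is
made up, it ignores nested templates and pipes inside links, and it is not
how Scribunto itself parses wikitext.

```python
import re

# Hypothetical helper for illustration only; real parsing is done by MediaWiki.
INVOKE_RE = re.compile(
    r"\{\{\s*#invoke:\s*([^|{}]+?)\s*\|\s*([^|{}]+?)\s*((?:\|[^{}]*)?)\}\}"
)

def parse_invoke(wikitext):
    """Return (module, function, args) for the first flat #invoke call, or None."""
    match = INVOKE_RE.search(wikitext)
    if match is None:
        return None
    module, function, raw_args = match.groups()
    # raw_args is "" or "|a|b=c"; drop the leading pipe before splitting.
    args = [a.strip() for a in raw_args[1:].split("|")] if raw_args else []
    return module, function, args
```

A mirror could use something like this to list which modules a page depends
on before deciding whether the Scribunto extension is required for it.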

> category - dump files now have 5 fields, whereas the database schema has
> 6 fields;

I believe they removed the cat_hidden field, which was effectively
deprecated. A category's hidden status is saved in page_props.
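
If that is right, one possible workaround for the "Column count doesn't
match" error is to name the target columns explicitly when loading the dump,
so the 5 dumped values no longer have to line up with all 6 schema columns.
The sketch below is only that, a sketch: the function name is made up, the
column list assumes the dump is the usual category table minus cat_hidden,
and it relies on cat_hidden having a default value in the local schema.

```python
# Columns assumed to remain in the 5-field dump, i.e. the schema minus cat_hidden.
# This is an assumption based on the belief above, not a verified dump layout.
CATEGORY_COLUMNS = ("cat_id", "cat_title", "cat_pages", "cat_subcats", "cat_files")

def add_column_list(sql_line, table="category", columns=CATEGORY_COLUMNS):
    """Name the target columns in a mysqldump-style INSERT for `table`."""
    prefix = "INSERT INTO `%s` VALUES" % table
    if not sql_line.startswith(prefix):
        return sql_line  # leave every other statement untouched
    column_list = ", ".join("`%s`" % c for c in columns)
    rest = sql_line[len(prefix):]  # the "(...),(...);" tuples, unchanged
    return "INSERT INTO `%s` (%s) VALUES%s" % (table, column_list, rest)
```

Applied line by line to the category dump, this leaves the value tuples
alone and only changes the statement header.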

>  The large image dump tarballs are now a year old.

I raised this issue back in July. See here for Kevin Day's response (from
your.org):
http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-July/000861.html
I think there's still some infrastructure work that needs to be done on the
Wikimedia side.

> We are beginning to see thumb dumps from the xowa project.

I've been uploading thumbs for the major wikis to archive.org. See here for
a summary: https://archive.org/search.php?query=xowa

I'm planning to upload all thumbs for all the major languages that are
listed as > 200,000 on https://en.wikipedia.org/wiki/Main_Page . There are
roughly 27 languages listed there. My progress has been about 1 wiki per
week (I'm also uploading sister wikis for a given language). I've done 4 so
far. I'm hoping to be done with the other 23 sometime by mid-year.

At about that time, I'm hoping to have a more automated way of generating
updates. Currently, I'm only releasing monthly updates for
en.wikipedia.org and quarterly updates for the other main wikis.

Note that the thumbs are uploaded as sqlite databases. I chose sqlite b/c
tarballs are slower to extract / update / query. The database schema is
fairly basic, and you should be able to retrieve any file with a sqlite
library.
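
As a rough idea of what retrieval could look like, here is a minimal Python
sketch using the standard sqlite3 module. The table and column names
(thumbs, name, data) are placeholders invented for this example and are not
the actual xowa schema; check the real layout first, for instance with
"SELECT name, sql FROM sqlite_master".

```python
import sqlite3

def fetch_thumb(conn, file_name):
    """Return the stored bytes for file_name, or None if it is absent."""
    # Table and column names here are placeholders, not the real xowa schema.
    row = conn.execute(
        "SELECT data FROM thumbs WHERE name = ?", (file_name,)
    ).fetchone()
    return None if row is None else bytes(row[0])

# Self-contained demo against an in-memory database with the placeholder schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE thumbs (name TEXT PRIMARY KEY, data BLOB)")
conn.execute("INSERT INTO thumbs VALUES (?, ?)", ("Example.png", b"\x89PNG"))
print(fetch_thumb(conn, "Example.png"))
```

With a downloaded database, the sqlite3.connect call would take the path to
that file instead of ":memory:".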



On Fri, Jan 10, 2014 at 2:43 AM, wp mirror <wpmirrordev at gmail.com> wrote:

> Dear Ariel,
>
> Happy New Year.  I am gearing up for wp-mirror-0.7.  To that end, I would
> like to list some issues that I see; and I would like to offer my help in
> solving them.
>
> 0) Problem Statements
>
> 0.1) Page Rendering.  Wp-mirror-0.6 works well in the sense that it builds
> a faithful mirror of any of your wikis.  However, during 2013 the rendering
> of pages eroded materially.  For example,
>
>      o interlanguage links have vanished both from rendered pages and from
> dump files;
>      o infoboxes are no longer rendered;
>      o most transclusions now render as redlinks even though the templates
> are easily found in the underlying database; etc.
>
> I understand that this erosion occurred because wp-mirror-0.6 still uses
> mediawiki-1.19, whereas WMF has moved on to mediawiki-1.23.  For example, I
> understand that:
>
>      o interlanguage links have been removed to the wikidata project, the
> rendering of which requires mediawiki-1.21+;
>      o infoboxes now require the scribunto extension which requires
> mediawiki-1.20+
>
> 0.2) Database Schema.  Some differences in database schema have appeared.
>
>      o category - dump files now have 5 fields, whereas the database
> schema has 6 fields;
>      o externallinks - dump files now have 4 fields, whereas the database
> schema has 3 fields.
>
> Loading these two tables generates the error message:  ``Column count
> doesn't match value at row 1.''
>
> 0.3) Version Lifecycle.  According to <
> http://www.mediawiki.org/wiki/Version_lifecycle> mediawiki 1.23 LTS is
> slated for May 2014.  However, the Debian packaging team is silent as to
> their plans for a transition from mediawiki-1.19 LTS to mediawiki-1.23 LTS.
>
> 0.4) Image Dumps.  The large image dump tarballs are now a year old.  This
> means that, while wp-mirror still downloads the bulk of its images from
> these tarballs, there are a growing number that must be downloaded
> individually from WMF.
>
> 0.5) Thumbs.  One person has asked me if dump files of thumbs could be
> made available. We are beginning to see thumb dumps from the xowa project.
>
> 0.6) IPv6.  I am glad to see that <gerrit.wikimedia.org> has an IPv6
> address.  However, <bastion.wmflabs.org> still does not.  My internal
> network is IPv6 only.
>
> 1) mwxml2sql
>
> This utility from Ariel Glenn has proved invaluable to the wp-mirror
> project. This utility, together with MySQL 5.5 fast index creation, allows
> wp-mirror to build mirrors much faster than before (80% less time).
>
> 1.1) Need for update.  According to its version information, mwxml2sql may
> only be valid through mediawiki-1.21.
>
> (shell)$ mwxml2sql --version
> mwxml2sql 0.0.2
> Supported input schema versions: 0.4 through 0.8.
> Supported output MediaWiki versions: 1.5 through 1.21.
>
> Since I am looking forward to mediawiki-1.23 LTS (see below), I would
> like to know if mwxml2sql should be updated.
>
> 1.2) Help Offer.  If mwxml2sql does need updating, I would be happy to
> help with this; and to package it for Debian as I have done before. Perhaps
> we could call it mwxml2sql-0.0.3.
>
> 2) mediawiki-1.23 LTS.
>
> 2.1) Vision. I would like wp-mirror-0.7 to be able to build a mirror that
> serves pages that look no different than those served by WMF.
>
> 2.2) DEB package.  To that end, I am thinking of packaging mediawiki-1.23
> together with the extensions needed for rendering WMF wikis with wikidata
> content, infoboxes, math, transclusions, etc.   Given WMF's ``continuous
> integration'' development model, I would like to be able to automatically
> generate a tarball and DEB package each time WMF pushes an update to its
> servers.
>
> 2.3) Debian package repository.  Such a DEB package would be distributed
> with wp-mirror. In preparation for this, I have set up a Debian package
> repository at <http://download.savannah.gnu.org/releases/wp-mirror/>.  It
> is currently used to distribute wp-mirror-0.6 and an unstable version of
> wp-mirror-0.7.  Home page <http://www.nongnu.org/wp-mirror/>.
>
> 2.4) Help Offer.  I am happy to do most of this work myself.  However, I
> will need some guidance on interacting with the appropriate GIT
> repositories.  I hope that you can put me in touch with someone involved in
> the ``continuous integration'' process.
>
> 3) Media dumps
>
> I am thinking that updating the image dumps annually would be adequate.
>  Including thumbs in those dumps would materially assist the off-line
> community.  I could easily update wp-mirror-0.7 to give the user a choice
> (no media files, thumbs only, full size media files).
>
> Sincerely Yours,
> Kent