[Pkg-mediawiki-devel] [Xmldatadumps-l] Preparing for wp-mirror-0.7
gnosygnu
gnosygnu at gmail.com
Sat Jan 11 19:11:59 UTC 2014
Hi Kent.
I'll respond to a few points. The rest are outside my knowledge / scope
(for example, mwxml2sql and Debian packaging).
Hope this helps, and good luck on wp-mirror.
----
> interlanguage links have been removed to the wikidata project, the
> rendering of which requires mediawiki-1.21+;
You will need a local instance of http://www.wikidata.org . You could
probably build it using the files from here:
http://dumps.wikimedia.org/wikidatawiki/
Each wiki would also need the wikibase extension:
https://www.mediawiki.org/wiki/Extension:Wikibase.
Note that wikidata is also being used in infoboxes. For example,
https://simple.wikipedia.org/w/index.php?title=Google&action=edit now has
multiple {{#property}} statements.
> infoboxes now require the scribunto extension which requires
> mediawiki-1.20+
I believe most of the wikis have moved the infobox generation from
Templates to Modules. They've moved a lot of other functionality as well
(for example: references and message boxes).
The Scribunto extension is at
https://www.mediawiki.org/wiki/Extension:Scribunto . It transforms a
{{#invoke:Module_name|function_name|arguments}} into the appropriate text.
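To make the shape of an #invoke call concrete, here is a minimal sketch in Python that splits one into its module name, function name, and arguments. It is only an illustration and not how Scribunto itself parses wikitext: the function name parse_invoke is hypothetical, and the naive split on "|" does not handle nested templates or links inside arguments.

```python
def parse_invoke(wikitext):
    """Split '{{#invoke:Module|function|arg|name=value}}' into its parts.

    Hypothetical helper for illustration only; real wikitext parsing
    must also cope with nested {{...}} and [[...|...]] in arguments.
    """
    prefix, suffix = "{{#invoke:", "}}"
    text = wikitext.strip()
    if not (text.startswith(prefix) and text.endswith(suffix)):
        raise ValueError("not an #invoke call")
    parts = text[len(prefix):-len(suffix)].split("|")
    if len(parts) < 2:
        raise ValueError("missing function name")
    module, function = parts[0].strip(), parts[1].strip()
    positional, named = [], {}
    for raw in parts[2:]:
        if "=" in raw:
            # Named argument: split on the first '=' only.
            key, value = raw.split("=", 1)
            named[key.strip()] = value.strip()
        else:
            positional.append(raw.strip())
    return module, function, positional, named
```

For example, parse_invoke("{{#invoke:Infobox|infobox|Google|founded=1998}}") yields the module "Infobox", the function "infobox", one positional argument, and one named argument. Scribunto then runs that Lua function and substitutes its return value into the page.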
> category - dump files now have 5 fields, whereas the database schema has
> 6 fields;
I believe they removed the cat_hidden field, which was effectively
deprecated. A category's hidden status is now saved in page_props.
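As a rough sketch of where the hidden flag can be looked up instead, the Python below joins page to page_props and filters on the hiddencat property. The assumptions: an in-memory sqlite database stands in for the real MySQL mirror, the two tables are cut down to only the columns used, the sample rows are invented, and I'm assuming the property name is 'hiddencat'.

```python
import sqlite3

# In-memory stand-in for the mirror database (assumption: the real
# mirror is MySQL; only the columns needed here are created).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY,
                   page_namespace INTEGER, page_title TEXT);
CREATE TABLE page_props (pp_page INTEGER, pp_propname TEXT, pp_value TEXT);
INSERT INTO page VALUES (1, 14, 'Hidden_categories');
INSERT INTO page VALUES (2, 14, 'Search_engines');
INSERT INTO page_props VALUES (1, 'hiddencat', '');
""")

def hidden_categories(conn):
    """Return titles of category pages (namespace 14) flagged as hidden."""
    rows = conn.execute("""
        SELECT page_title
        FROM page JOIN page_props ON pp_page = page_id
        WHERE page_namespace = 14 AND pp_propname = 'hiddencat'
        ORDER BY page_title
    """)
    return [row[0] for row in rows]
```

With the invented rows above, hidden_categories(conn) returns only the one category that has a hiddencat row in page_props.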
> The large image dump tarballs are now a year old.
I raised this issue back in July. See here for Kevin Day's response (from
your.org):
http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-July/000861.html
I think there's still some infrastructure work that needs to be done on the
Wikimedia side.
> We are beginning to see thumb dumps from the xowa project.
I've been uploading thumbs for the major wikis to archive.org. See here for
a summary: https://archive.org/search.php?query=xowa
I'm planning to upload all thumbs for all the major languages that are
listed with more than 200,000 articles on
https://en.wikipedia.org/wiki/Main_Page . There are roughly 27 languages
listed there. My progress has been about 1 wiki per week (I'm also
uploading sister wikis for a given language). I've done 4 so far. I'm
hoping to be done with the other 23 by mid-year.
At about that time, I'm hoping to have a more automated way of generating
updates. Currently, I'm only releasing monthly updates for
en.wikipedia.org and quarterly updates for the other main wikis.
Note that the thumbs are uploaded as sqlite databases. I chose sqlite because
tarballs are slower to extract / update / query. The database schema is
fairly basic, and you should be able to retrieve any file with a sqlite
library.
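For instance, retrieving one thumb with Python's built-in sqlite3 module might look like the sketch below. The table and column names (thumb, file_name, width, data) are hypothetical placeholders and not the actual xowa schema, so check the schema of the downloaded database and adjust the query accordingly.

```python
import sqlite3

def fetch_thumb(conn, file_name, width):
    """Return the thumb bytes for (file_name, width), or None if absent.

    Assumes a hypothetical table thumb(file_name, width, data); the
    real xowa databases may be laid out differently.
    """
    row = conn.execute(
        "SELECT data FROM thumb WHERE file_name = ? AND width = ?",
        (file_name, width),
    ).fetchone()
    return None if row is None else bytes(row[0])

# Tiny in-memory database so the sketch can be run without a download.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE thumb (file_name TEXT, width INTEGER, data BLOB)")
demo.execute(
    "INSERT INTO thumb VALUES (?, ?, ?)",
    ("Example.png", 220, b"\x89PNG"),
)
```

Against a real file you would open it with sqlite3.connect on the downloaded database path and write the returned bytes to disk.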
On Fri, Jan 10, 2014 at 2:43 AM, wp mirror <wpmirrordev at gmail.com> wrote:
> Dear Ariel,
>
> Happy New Year. I am gearing up for wp-mirror-0.7. To that end, I would
> like to list some issues that I see; and I would like to offer my help in
> solving them.
>
> 0) Problem Statements
>
> 0.1) Page Rendering. Wp-mirror-0.6 works well in the sense that it builds
> a faithful mirror of any of your wikis. However, during 2013 the rendering
> of pages eroded materially. For example,
>
> o interlanguage links have vanished both from rendered pages and from
> dump files;
> o infoboxes are no longer rendered;
> o most transclusions now render as redlinks even though the templates
> are easily found in the underlying database; etc.
>
> I understand that this erosion occurred because wp-mirror-0.6 still uses
> mediawiki-1.19, whereas WMF has moved on to mediawiki-1.23. For example, I
> understand that:
>
> o interlanguage links have been removed to the wikidata project, the
> rendering of which requires mediawiki-1.21+;
> o infoboxes now require the scribunto extension which requires
> mediawiki-1.20+
>
> 0.2) Database Schema. Some differences in database schema have appeared.
>
> o category - dump files now have 5 fields, whereas the database
> schema has 6 fields;
> o externallinks - dump files now have 4 fields, whereas the database
> schema has 3 fields.
>
> Loading these two tables generates the error message: ``Column count
> doesn't match value at row 1.''
>
> 0.3) Version Lifecycle. According to <
> http://www.mediawiki.org/wiki/Version_lifecycle> mediawiki 1.23 LTS is
> slated for May 2014. However, the Debian packaging team is silent as to
> their plans for a transition from mediawiki-1.19 LTS to mediawiki-1.23 LTS.
>
> 0.4) Image Dumps. The large image dump tarballs are now a year old. This
> means that, while wp-mirror still downloads the bulk of its images from
> these tarballs, there are a growing number that must be downloaded
> individually from WMF.
>
> 0.5) Thumbs. One person has asked me if dump files of thumbs could be
> made available. We are beginning to see thumb dumps from the xowa project.
>
> 0.6) IPv6. I am glad to see that <gerrit.wikimedia.org> has an IPv6
> address. However, <bastion.wmflabs.org> still does not. My internal
> network is IPv6 only.
>
> 1) mwxml2sql
>
> This utility from Ariel Glenn has proved invaluable to the wp-mirror
> project. This utility, together with MySQL 5.5 fast index creation, allows
> wp-mirror to build mirrors much faster than before (80% less time).
>
> 1.1) Need for update. According to its version information, mwxml2sql may
> only be valid through mediawiki-1.21.
>
> (shell)$ mwxml2sql --version
> mwxml2sql 0.0.2
> Supported input schema versions: 0.4 through 0.8.
> Supported output MediaWiki versions: 1.5 through 1.21.
>
> Since I am looking forward to mediawiki-1.23 LTS (see below), I would
> like to know if mwxml2sql should be updated.
>
> 1.2) Help Offer. If mwxml2sql does need updating, I would be happy to
> help with this; and to package it for Debian as I have done before. Perhaps
> we could call it mwxml2sql-0.0.3.
>
> 2) mediawiki-1.23 LTS.
>
> 2.1) Vision. I would like wp-mirror-0.7 to be able to build a mirror that
> serves pages that look no different than those served by WMF.
>
> 2.2) DEB package. To that end, I am thinking of packaging mediawiki-1.23
> together with the extensions needed for rendering WMF wikis with wikidata
> content, infoboxes, math, transclusions, etc. Given WMF's ``continuous
> integration'' development model, I would like to be able to automatically
> generate a tarball and DEB package each time WMF pushes an update to its
> servers.
>
> 2.3) Debian package repository. Such a DEB package would be distributed
> with wp-mirror. In preparation for this, I have set up a Debian package
> repository at <http://download.savannah.gnu.org/releases/wp-mirror/>. It
> is currently used to distribute wp-mirror-0.6 and an unstable version of
> wp-mirror-0.7. Home page <http://www.nongnu.org/wp-mirror/>.
>
> 2.4) Help Offer. I am happy to do most of this work myself. However, I
> will need some guidance on interacting with the appropriate GIT
> repositories. I hope that you can put me in touch with someone involved in
> the ``continuous integration'' process.
>
> 3) Media dumps
>
> I am thinking that updating the image dumps annually would be adequate.
> Including thumbs in those dumps would materially assist the off-line
> community. I could easily update wp-mirror-0.7 to give the user a choice
> (no media files, thumbs only, full size media files).
>
> Sincerely Yours,
> Kent