Offline copies of Wikipedia

By | Nov 25, 2009 | kiwix, shuttleworth, wikipedia

I have been involved for a number of years with Hilton Theunissen and the Shuttleworth Foundation and their efforts to bring computers to township schools. Part of the software suite on those computers was an offline copy of Wikipedia.

Early attempts

I have blogged before about my own project, Wizzy Digital Courier, which puts thin client labs into South African classrooms. That also included a copy of the English-language Wikipedia.

Initially, in 2003, I took the whole of the then-existing English Wikipedia and installed a copy of the MediaWiki software, with MySQL and Apache as database and web server respectively. The whole thing was around 18 gigabytes - quite a handful.

It worked well, but I had various complaints about the unsuitability of the material - it was a single snapshot that had not been (and could not be) proofread, so it had vandalism, and quite explicit articles about sex. Oops. But it had search, and it had a vast amount of useful information on all manner of subjects.

Very soon the English Wikipedia bloomed to hundreds of gigabytes, making it completely unmanageable in terms of size. I couldn’t download it, and I couldn’t proof it. What to do?

Wikipedia Version 1.0

Wikipedia has its own community of people, and among them I connected with Andrew Cates, of the SOS Children website, who came up with a selection of articles (1000 or so) as an HTML dump that he and some others painstakingly proofread for suitability as a children’s educational resource. (He has a larger article collection now.) Jonathan Carter helped package this for the tuXlab installs for the Shuttleworth Foundation.

There is a lot of work to do preparing such a collection.

Which articles should be selected?
All the article text in the HTML dump must be stripped of links to articles not in the dump
It must be proofread
Associated pictures must be incorporated
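The link-stripping step could be sketched roughly as below. This is only an illustration, not the tooling actually used for the collection: it assumes simple `<a href="...">` tags and a hypothetical set of file names that are present in the dump, and it uses a regular expression where a real HTML dump would call for a proper parser.

```python
import re

# Matches a simple anchor tag, capturing the href and the link text.
LINK_RE = re.compile(r'<a\s+[^>]*?href="([^"]*)"[^>]*>(.*?)</a>',
                     re.IGNORECASE | re.DOTALL)

def strip_missing_links(html, available):
    """Replace links whose target is not in `available` with their plain text."""
    def fix(match):
        target = match.group(1)
        if target in available:
            return match.group(0)  # target is in the dump: keep the link
        return match.group(2)      # target is missing: keep only the link text
    return LINK_RE.sub(fix, html)

# Made-up sample page; only Rugby.html is assumed to be in the dump.
page = 'See <a href="Rugby.html">rugby</a> and <a href="Cricket.html">cricket</a>.'
print(strip_missing_links(page, {"Rugby.html"}))
# prints: See <a href="Rugby.html">rugby</a> and cricket.
```

Note that this sketch would also unlink external URLs, since they are not in the set either. For an offline collection that is probably what you want.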

People became involved, in particular Martin Walker from the State University of New York at Potsdam. Systems were put in place to help with article selection. A project was started - the Wikipedia Version 1.0 Editorial Team. Articles were assessed both for quality (from Featured Article down to Stub) and for importance (from Top to None). These assessments are placed on the article Talk page, and a robot goes through them all on a regular basis and collects the results on project pages, conveniently doing all the heavy lifting to present sortable tables of the state of all the articles.

Thus you can find Top Importance articles of poor quality, and can highlight that page for improvement. You can cherry-pick Featured Articles to add to the collection. These tools made the article selection process far more manageable.
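To make that concrete, here is a small sketch of the kind of filtering the assessment tables allow. The records and the `needs_improvement` and `featured` helpers are invented for illustration, and the list of quality grades between Featured Article and Stub is my assumption about the scale. The real data lives on the article Talk pages and the robot-generated project pages.

```python
# Hypothetical assessment records, not real assessment data.
articles = [
    {"title": "Water", "quality": "Stub", "importance": "Top"},
    {"title": "Rugby", "quality": "FA", "importance": "Mid"},
    {"title": "Africa", "quality": "Start", "importance": "Top"},
    {"title": "Oxygen", "quality": "FA", "importance": "Top"},
]

# Assumed quality scale, best first (FA = Featured Article).
QUALITY_ORDER = ["FA", "A", "GA", "B", "C", "Start", "Stub"]

def needs_improvement(records, worst_acceptable="B"):
    """Top-importance articles whose quality is below `worst_acceptable`."""
    cutoff = QUALITY_ORDER.index(worst_acceptable)
    return [r["title"] for r in records
            if r["importance"] == "Top"
            and QUALITY_ORDER.index(r["quality"]) > cutoff]

def featured(records):
    """Featured Articles, candidates to cherry-pick for the collection."""
    return [r["title"] for r in records if r["quality"] == "FA"]

print(needs_improvement(articles))  # prints: ['Water', 'Africa']
print(featured(articles))           # prints: ['Rugby', 'Oxygen']
```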

Offline Wikipedia was identified as a priority by the Wikimedia Foundation.

I assisted in the post-processing by writing a script that searches all the chosen articles for ‘badwords’ - an indication that an article has been vandalised - so that a cleanup crew can go through the flagged articles to check and possibly remove material.
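A script of that kind might look something like the sketch below. It is not the original script: the word list, the sample articles and the `find_badwords` name are placeholders, and it assumes the article text is already available as plain strings.

```python
import re

def find_badwords(articles, badwords):
    """Return {title: sorted list of flagged words} for articles with any hit."""
    if not badwords:
        return {}
    # Whole-word, case-insensitive match against the word list.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in badwords) + r")\b",
        re.IGNORECASE,
    )
    flagged = {}
    for title, text in articles.items():
        hits = sorted({m.group(1).lower() for m in pattern.finditer(text)})
        if hits:
            flagged[title] = hits
    return flagged

# Placeholder word list and made-up article text.
badwords = ["poop", "stupid"]
articles = {
    "Moon": "The Moon is Earth's only natural satellite.",
    "Sun": "The Sun is a star. JOHN IS STUPID.",
}
print(find_badwords(articles, badwords))  # prints: {'Sun': ['stupid']}
```

A hit is only a hint. Plenty of legitimate articles contain words from such a list, which is why a human still has to look at each flagged article.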

Now to package all this conveniently. A French company called Linterweb came up with Okawix - all these articles and pictures packaged in a file, with a cross-platform reader to navigate the collection. Why do we need a reader? To implement search.

Search

For many of the places we put an offline Wikipedia down, it became the ‘Internet’ for the children in the classroom. They had no net connection, but the principles of self-paced learning, hyperlinks for tangential information, and other net paradigms made it the ‘killer app’ of their little school. For Internet, you need search. For a Wikipedia collection of a few thousand articles, you need search. Search needs software running on a computer - a CD, DVD or USB stick can only hold the data.

For a standalone computer, a reader is needed to perform this function. From a basic HTML dump, navigated with a browser, JavaScript can be pressed into service, but I have found it inadequate. In the tuXlab thin client labs, a small network of old computers is connected to a powerful server - and I want the Wikipedia collection to be browsed via HTTP, and the search to be performed server-side.
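As a minimal sketch of what server-side search involves, the code below builds a small inverted index (word to article titles) and answers queries that must match every word. It is an illustration only: the function names and sample articles are made up, and a real install would more likely use an existing indexer behind the web server.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(articles):
    """Map each word to the set of article titles whose text contains it."""
    index = defaultdict(set)
    for title, text in articles.items():
        for word in tokenize(text):
            index[word].add(title)
    return index

def search(index, query):
    """Return sorted titles of articles containing every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)

# Made-up sample collection.
articles = {
    "Vistula": "The Vistula is a river in Poland.",
    "Nile": "The Nile is a river in Africa.",
    "Poland": "Poland is a country in Europe.",
}
index = build_index(articles)
print(search(index, "river poland"))  # prints: ['Vistula']
```

The index is built once on the server. Each thin client then only needs a browser and a search form that sends the query over HTTP.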

Metadata

In computer jargon, this is called metadata. It is structure beyond the mere linking of articles. There is other metadata - like the Importance assessment scale. We need to extract all that metadata and place it alongside the article tree so it can be used for indexes.

On the 24th of November we had an IRC meeting - an online chat between all interested parties spread around the world. Much of this was discussed, and one thing became clear: the Wikimedia Foundation needs to concentrate more on the process of generating a release than on an end product like Okawix. That means tools to work with the metadata, and tools to package the pictures and article references in such a way that they can be optional. Perhaps targeted article collections, like Mathematics, Chemistry, Africa, Oceans. Let other organisations do the work of packaging and marketing.

To allow computers to do the work, we need good metadata. Assess articles. Rugby articles are not Top importance, except in the context of sport.

I think the article collection should consist of a number of different pieces, to be incorporated as necessary.

The text of the selected articles
Pictures for those articles
Metadata to support this collection
A text search index, like one created by Namazu, for those tools that can use it

Future efforts

Though a lot of effort on offline Wikipedia collections is targeted at schools and the Third World, there are other target markets. One we have not really addressed yet is the cellphone as a Wikipedia platform. A cellphone implies connectivity, but these days it is becoming a universal platform - a camera, a music player, a gaming box, GPS. Personally, I would like a text-only Wikipedia collection of lots of articles, but with only the lede paragraph - the first section of a Wikipedia page, which introduces the subject. It is a song by the Black Eyed Peas. It is a river in Poland. That way, I can carry all of this on my phone without paying for airtime.
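Pulling out just the lede could be sketched as follows. This assumes each article is an HTML page whose lede is its first non-empty `<p>` element, which is my simplification. The `lede` function and the sample page are made up for illustration.

```python
import re
from html import unescape

def lede(html):
    """Return the text of the first non-empty <p> element, with tags removed."""
    paragraphs = re.finditer(r"<p\b[^>]*>(.*?)</p>", html,
                             re.IGNORECASE | re.DOTALL)
    for match in paragraphs:
        text = re.sub(r"<[^>]+>", "", match.group(1))  # drop inline tags
        text = " ".join(unescape(text).split())        # tidy whitespace
        if text:
            return text
    return ""

# Made-up sample page.
page = (
    "<h1>Vistula</h1>\n<p> </p>\n"
    '<p>The <b>Vistula</b> is a river in\n<a href="Poland.html">Poland</a>.</p>\n'
    "<p>It flows north.</p>"
)
print(lede(page))  # prints: The Vistula is a river in Poland.
```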

Cellphones

Cellphones have huge penetration in the Third World. I tell tourists I take to the townships that South Africans spend their money on cellphones and hair. Maybe we should concentrate there, as much as schools?

Update:

Okawix offline reader standardises on OpenZIM data format.

Okawix has transformed into Kiwix