Public.MobiPedia r1.1 - 21 May 2010 - 19:32 - TWikiGuest


WPMP - Encyclopedias for the road

Important:

  • There will be no new release in the foreseeable future. This project is currently suspended.
  • Please read the "Download and installation" section to find out what to download and what to do with the downloaded file. See the "Latest releases" table for the currently suggested (stable) German encyclopedia release.


Latest releases:

Language   Stable        Experimental
German     2008/06_1e    2008/06_2e1
French     2008/06_1e3   -
English    -             2008/06_1e2

News

  • 13.08.2008 Bug-fix release of the German encyclopedia. These bugs were fixed:
    • Fixed broken links in disambiguations.
    • Fixed invalid characters in indices (actually a bug in the alias function of the article parser).
  • 02.07.2008 The 400k version of the German encyclopedia is released. This is an eBook containing a filtered set of articles (the 400,000 most viewed). This version needs ~740MB of memory card space.
  • 02.07.2008 A bug was discovered which causes articles containing a ":" to be removed before conversion. This will be fixed in the next major release of each language.

What's WPMP about?

WPMP is a project aiming to provide encyclopedias which can be viewed on mobile computers like cell phones or PDAs. The focus is on a fully automated procedure which makes regular updates easy.

The encyclopedias are distributed in a highly compressed eBook format called Mobipocket, which can be read by proprietary but freely usable reading software from Mobipocket.com. However, the data included in those eBooks is free (as in free speech). It was compiled from a snapshot of Wikipedia taken in August 2007. As soon as more recent snapshots become available, the WPMP project will release new encyclopedia eBooks, too. In effect, the WPMP eBooks are mobile versions of Wikipedia - however, the WPMP project is in no way associated with the Wikimedia Foundation (the organisation behind Wikipedia).

All eBooks contain an Index - a search dialog for easy lookup of topics/words.

Which languages are supported?

Latest completed dump id: 2008/06

This is the list of languages in which the WPMP project has already released encyclopedia eBooks:

  • German
  • English
  • French
  • Dutch
  • Italian
  • Bavarian
  • Esperanto
  • Finnish
  • Spanish
  • Portuguese

Download and installation

Overview: You'll have to download a single archive file (.tgz), decompress it and copy the resulting folder and its subfolders to your mobile computer's memory card or internal memory. Make sure to read the whole "Download and installation" section before asking questions.

Encyclopedia FTP from the GWDG

Ready-to-use encyclopedias are in the ebooks directory, in a subdirectory named after the ISO language ID of the language version you want - e.g. de for German.

The encyclopedias come as .tgz archives (a tar archive compressed with gzip). They can be decompressed with, for example:

  • The tar utility - if you're on Linux, Unix or Mac OS X
  • 7z - if you're a Windows user

Hints:

  • One file on the file server is one encyclopedia - you only have to download one.
  • This one file is a compressed tar/gzip archive. Some dumb browsers might save the file with the extension .gz. In this case you'll have to change the extension from .gz to .tgz before decompressing.
  • Each encyclopedia is an archive that consists of several files. They have to be put on a memory card, into the /ebooks folder. Make sure to keep the directory structure. Some of the encyclopedia releases rely on a subdocs folder which has to contain the real data files.
  • All encyclopedia language releases have unique filenames and IDs, which means different language versions can be used together on one device.
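
For readers who prefer to script the installation, the steps above can be sketched in Python. This is only an illustration under assumptions: the function name install_encyclopedia and the example paths are made up here, and the archive is assumed to contain one top-level folder, as described in the overview. Only unpack archives you trust.

```python
import tarfile
from pathlib import Path


def install_encyclopedia(archive: str, card_root: str) -> Path:
    """Unpack a WPMP archive into <card_root>/ebooks, keeping its directory layout.

    `archive` and `card_root` are hypothetical example parameters.
    """
    target = Path(card_root) / "ebooks"
    target.mkdir(parents=True, exist_ok=True)
    # "r:gz" decides by content, not by file name, so an archive that a
    # browser saved as .gz instead of .tgz can be read without renaming.
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target)
    return target
```

A call could look like install_encyclopedia("wpmp_de.tgz", "/media/memorycard"), where both the file name and the mount point are placeholders for your own.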

The reader software for your mobile device is available here:

Support

If you want to support my work, help is very welcome. I'm not allowed to accept any money, but I do need hardware (esp. memory cards and cell phones/PDAs) to test the eBooks on. If you feel the urge to help, please contact me :-) .

This is what people already sent me:

  • A Series 60.2 cell phone. Thank you, Gottfried B.

You want to contact me?

Please use the mailing list whenever possible.

Important:

  • Please contact me in either English or German. Please read the "Download and installation" section for information on how to download encyclopedia files and install them on your mobile device.
  • This project is not just about a German encyclopedia but about other languages, too. This is why everything on this page is in English.
  • This project is a work in progress, as open source projects usually are. You won't get the perfectly working product you would get (or not) if you paid for it. You get eBooks made in hard-working people's spare time, which you can help improve by sending feedback.
  • I will not change the archive format from tar/gzip to e.g. .zip or 7z!

What do I need to make those eBooks myself?

  • My conversion scripts, which are freely available. They can be found in an svn repository: https://fbo.no-ip.org/svn/fbo/wp2prc .
  • A fast internet connection: the conversion script will download the HTML dump, static images and formulas - together ~8 GiB (German Wikipedia).
  • One Linux machine (or more, for cluster compression mode) with wine installed
  • For cluster compression mode:
    • ssh + either public-key authentication or Kerberos authentication to connect to cluster nodes automatically
  • 1GB of memory on every cluster node that compresses parts of an encyclopedia
  • Lots of CPU time (-> weeks; in the case of the English encyclopedia: months)

How does the conversion process work?

  • The conversion script converts a static dump of Wikipedia (see http://static.wikipedia.org) into an electronic book in which words can be looked up (an electronic dictionary). The book is in Mobipocket format.
  • The static dump consists of one or more .7z files containing all the articles.
  • The filesystem that the static dumps are decompressed into must perform really well: it has to deal with millions of files.
  • The HTML dump is scanned for articles that are "useless" for mobile use, such as discussion pages and templates. These are removed.
  • The static HTML dump is then scanned: each article is read and classified as either a real article or a symbolic link to another article. Everything is written into a raw article list (articles.raw).
  • The raw article list is processed further: symbolic links are resolved along a possible symlink chain, so that each symlink either points to a real article or is classified as broken. Broken ones are not processed further.
  • All real articles are written into one or more (usually very big) HTML files. Articles are sorted and written into those files one after another until the current HTML file is too big, which means a changeable limit (--chaptersize) is exceeded. When an HTML file is too big, the next one is created. The idea is to split the encyclopedia across multiple Mobipocket data files, because of some restrictions in the Mobipocket format and because the Mobipocket compiler performs badly or doesn't work at all with very big files.
  • There is no simple connection between an article's name and the data file it resides in (which means you can't say "find A..C in the first file, D..F in the second, ...").
  • A Windows-only command-line tool called mobigen.exe is used to convert the data file sources into real Mobipocket data files. wine is used to execute that tool on Linux computers. With the default HTML file size limit, the compression process itself needs about 700MB of physical memory on a Linux machine. Together with a reasonable amount of buffer cache, you're going to need 1GB of memory.
  • The conversion tool can distribute the compression load to multiple hosts in a network, which is highly advisable because compiling a whole Wikipedia (even the German one) takes weeks of CPU time. The memory requirements of the cluster nodes are equal to those of a single machine. Job distribution is done via ssh and scp. Don't forget to install wine on all the machines and to have a lot of space available in /tmp.
  • The conversion tool can use an SMP machine multiple times in the cluster (multiple eBook parts are compressed in parallel on it). This works for multicore machines, too. The memory requirement has to be multiplied by the number of jobs, of course ;-) .
  • It's hard to tell how long a compression step will take: this depends entirely on the text material. Pure ASCII (which applies to most of the English WP) is usually easier to process than something very "unicody" (e.g. Arabic).
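
Two of the steps above, resolving symlink chains and splitting the sorted articles into size-limited files, can be illustrated with a short Python sketch. It is not taken from the wp2prc scripts: the function names, the dictionary-based data model and the exact splitting rule (start a new chapter before the limit would be exceeded) are assumptions made for illustration only.

```python
from typing import Dict, List, Optional


def resolve(name: str, redirects: Dict[str, str], articles: Dict[str, str]) -> Optional[str]:
    """Follow a symlink chain; return the real article name, or None if broken."""
    seen = set()
    while name in redirects:
        if name in seen:  # a loop of symlinks never reaches a real article
            return None
        seen.add(name)
        name = redirects[name]
    return name if name in articles else None


def split_into_chapters(articles: Dict[str, str], chapter_size: int) -> List[List[str]]:
    """Pack sorted article titles into chapters limited to chapter_size bytes.

    chapter_size plays the role of --chaptersize; a single article larger
    than the limit still gets a chapter of its own.
    """
    chapters: List[List[str]] = [[]]
    used = 0
    for title in sorted(articles):
        size = len(articles[title].encode("utf-8"))
        if chapters[-1] and used + size > chapter_size:
            chapters.append([])  # current file is full: start the next one
            used = 0
        chapters[-1].append(title)
        used += size
    return chapters
```

Because the chapters are filled by size rather than by letter range, the file an article ends up in cannot be predicted from its name alone, which matches the note above.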

Why... ?

... HTML dumps? - there are well structured XML-Dumps!

The Mobipocket conversion tool needs HTML as the source for eBooks. HTML dumps just have to be stripped of the navigation menu and some other stuff, and voilà: the article is suitable for inclusion in an eBook. For the Japanese encyclopedia, for example, this took 2.9h (about 30min CPU time).

XML dumps have to be preprocessed - templates have to be resolved, and lists/tables/style elements/... have to be rendered to HTML. This would take at least ten times longer than the simple HTML stripping - and the process is more difficult to parallelize.
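
To give an idea of how little work the HTML stripping involves, here is a rough Python sketch. It is not the code used by the conversion scripts. It assumes the article body sits inside a div with the id "content", which is a hypothetical marker chosen for this example; the real dump layout may differ. Everything outside that div (navigation menu, footer and so on) is dropped, and the markup inside is kept.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Keep only the markup inside <div id=content_id> (hypothetical id)."""

    def __init__(self, content_id: str = "content") -> None:
        # Keep entities such as &amp; untouched so they can be re-emitted as-is.
        super().__init__(convert_charrefs=False)
        self.content_id = content_id
        self.depth = 0   # <div> nesting depth inside the content div
        self.parts = []  # pieces of the article HTML that are kept

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.parts.append(self.get_starttag_text())
            if tag == "div":
                self.depth += 1
        elif tag == "div" and dict(attrs).get("id") == self.content_id:
            self.depth = 1  # entering the article body; wrapper is not emitted

    def handle_startendtag(self, tag, attrs):
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:  # closing tag of the content div itself
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth:
            self.parts.append(f"&#{name};")


def strip_navigation(html: str, content_id: str = "content") -> str:
    """Return the inner HTML of the content div, or "" if it is missing."""
    parser = ContentExtractor(content_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Each article can be stripped independently of all others, which is why this step is easy to run in parallel.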

You want something done differently?

The whole conversion process is fully automated. There's no way to e.g. exclude articles or include newer versions of articles. It's a conversion of a full HTML snapshot to an encyclopedia. However, if you've got improvement suggestions that can be implemented by changing the conversion process, like "Article headings are too big and waste space" or "Please include formulas" - please tell us.

Bugs and Un-nice-ities

  • A Mobipocket index can only hold a maximum of 255 different characters. This is enough for most Latin-script languages like German or English. But as soon as too many special characters are used, it's impossible to build an index. This is currently a problem in the French encyclopedia, which I worked around by removing all the symbolic links.
  • Formulas contained in articles are not shown in older encyclopedias. Newer ones (like the 08/2007 German one) do contain formulas as images. The images are LaTeX-based and taken from the respective Wikipedia.
  • Very large tables, like periodic tables, are unreadable. I currently don't know how to fix that - it's a problem in the Mobipocket HTML renderer.
  • For Symbian Series 60 (S60) 3rd edition, Mobipocket Reader 5.3 build 576 has some known bugs affecting the use of WPMP encyclopedias. You cannot search the book text, and "start" and "first page" do not work. You have to manually create the directory ebooks in the root of your memory card and, inside it (or in a subfolder), wpmp in order to have the encyclopedia listed in the library. (Bug information by Georg Dembowski)
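
The 255-character index limit from the first item above can be checked before compiling. The following Python sketch counts the distinct characters used by a list of index entries. It assumes the limit applies to distinct Unicode characters, case-sensitively, which may not be exactly how the Mobipocket compiler counts; the function names are made up for this example.

```python
from typing import Iterable, Set

MAX_INDEX_CHARS = 255  # limit for Mobipocket indices as stated above


def index_alphabet(entries: Iterable[str]) -> Set[str]:
    """Return the set of distinct characters used across all index entries."""
    chars: Set[str] = set()
    for entry in entries:
        chars.update(entry)
    return chars


def fits_index(entries: Iterable[str]) -> bool:
    """True if the entries use no more than MAX_INDEX_CHARS distinct characters."""
    return len(index_alphabet(entries)) <= MAX_INDEX_CHARS
```

Running index_alphabet separately on article titles and on symbolic-link titles would show which group contributes the extra characters.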

Other eBooks

Besides multilingual encyclopedias, the WPMP project covers mobile knowledge in general. Here is a list of what we are planning and what we've accomplished so far:

  • Mobile dictionaries
    • Freedict: look for ebooks/various/freedict.mobi on a WPMP mirror.
    • TU-Chemnitz: This one will be less economical because I will have to do a "search" for every contained word and present a linked list of the results (the Ding way).
    • Wiktionary: Seems very complicated because I have to render pages from MediaWiki notation. However, someone has done it before - it should be possible.
  • Other databases
    • IMDB: This one is tricky for several reasons:
      • The source files are a mess. It is extremely difficult to parse them.
      • There are legal problems (AFAIK it's not allowed to distribute the data).
      • Making the eBook will use huge amounts of RAM or lots of time - I'd prefer the lots-of-RAM way :-) .

Links

Todo

  • Remove relative links which are not intra-article
  • Fix stupid math-image-url-bug
  • Find bug in wine-call preventing ssh from returning
  • finish clustering getformulas and getimages

Big Changes

  • test everything with a small language version on a regular basis
