WPMP - Encyclopedias for the road
Important:
- Please read the Download section to find out what to download and what to do with the downloaded file. Have a look at the table of latest releases to find the currently suggested (stable) German encyclopedia release.
Latest releases:
| Language | Stable | Experimental |
| German | 2007/08_8, 2007/08-small_1 | 2008/06_1e |
| French | 2007/09_3 | 2008/06_1e3 |
| English | - | 2008/06_1e2 |
News
- 02.07.2008 A 400k version of the German encyclopedia is released. This is an eBook containing a filtered set of articles (the 400,000 most viewed). This version needs ~740 MB of memory card space.
- 02.07.2008 A bug was discovered which causes articles containing a ":" to be removed before conversion. This will be fixed in the next major release of each language.
- 02.07.2008 Fixed some search-index-related bugs in the English and French encyclopedias. New releases are en-1e2 and fr-1e3. Please send feedback.
- 01.07.2008 Spanish encyclopedia is released. Please send feedback.
- 30.06.2008 Added the missing search index to the French release - new version is 1e2.
- 30.06.2008 Catalan and English (finally a working one - I hope) encyclopedias are released. Please send feedback.
What's WPMP about?
WPMP is a project that aims to provide encyclopedias which can be viewed on mobile computers like cell phones or PDAs. The focus is on a fully automated procedure which makes regular updates easy.
The encyclopedias are distributed in a highly compressed eBook format called Mobipocket, which can be read with proprietary but freely usable reading software from Mobipocket.com.
However, the data included in those eBooks is free (as in free speech). It was compiled from a snapshot of the Wikipedia which was taken in August 2007.
As soon as more recent snapshots become available, the WPMP project will release new encyclopedia eBooks, too.
In effect, the WPMP eBooks are mobile versions of the Wikipedia - however, the WPMP project is in no way associated with the Wikimedia Foundation (the organisation behind the Wikipedia).
All eBooks contain an index - a search dialog for easy lookup of topics/words.
Which languages are supported?
Latest completed dump id: 2008/06
This is the list of languages in which the WPMP project has already released encyclopedia eBooks:
- German
- English
- French
- Dutch
- Italian
- Bavarian
- Esperanto
- Finnish
- Spanish
These are the encyclopedia languages which are currently being processed or tested:
Download and installation
Overview: You'll have to download a single archive file (.tgz), decompress it and copy the resulting folder and its subfolders to your mobile computer's memory card or internal memory. Make sure to read the whole Download and installation section before asking questions.
Encyclopedia FTP from the GWDG
Ready-to-use encyclopedias are in ebooks, in a subdirectory named after the ISO language ID of the language version you want - e.g. de for German.
The encyclopedias come as .tgz archives (a tar archive compressed via gzip). They can be decompressed, for example, with
- the tar utility - if you're on Linux, Unix or Mac OS X
- 7z - if you're a Windows(tm) user
Hints:
- One file on the file server is one encyclopedia - you only have to download one.
- This one file is a compressed tar/gzip archive. Some dumb browsers might save the file with the extension .gz. In this case you'll have to change the extension from .gz to .tgz before decompressing.
- Each encyclopedia is an archive that consists of several files. They have to be put on a memory card, into the /ebooks folder. Make sure to keep the directory structure. Some of the encyclopedia releases rely on a subdocs folder which has to contain the real data files.
- All encyclopedia language releases have unique filenames and IDs, which means different language versions can be used together on one device.
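The unpacking step described in the hints can also be scripted. The following is a minimal Python sketch, not part of the WPMP scripts: the function name unpack_encyclopedia and the example paths are made up for illustration, and it assumes the memory card is mounted somewhere on your computer. Because the archive is opened explicitly as a gzip-compressed tar file, it should not matter whether the browser saved it as .tgz or .gz.

```python
import tarfile
from pathlib import Path


def unpack_encyclopedia(archive, card_root):
    """Extract a downloaded encyclopedia archive into <card_root>/ebooks.

    The directory structure inside the archive (including a possible
    subdocs folder) is kept as-is. Returns the target folder.
    """
    target = Path(card_root) / "ebooks"
    target.mkdir(parents=True, exist_ok=True)
    # "r:gz" forces gzip-compressed tar, independent of the file extension.
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target)
    return target


# Hypothetical usage (both paths are placeholders):
# unpack_encyclopedia("downloaded-archive.tgz", "/media/memorycard")
```

Only extract archives from a source you trust, since a tar archive decides for itself which file names it writes.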
The reader software for your mobile device is available here:
Support
If you want to support my work, help is very welcome. However, I'm not allowed to accept any money, but I do need hardware (esp. memory cards and cell phones/PDAs) to test the eBooks on. If you feel the urge to help, please contact me :-) .
This is what people already sent me:
- A Series 60.2 cell phone. Thank you, Gottfried B.
You want to contact me?
Please use the mailing list whenever possible.
Important:
- Please use either English or German to contact me. Please read the Download section for information on how to download encyclopedia files and install them on your mobile device.
- This project is not just about a German encyclopedia but about other languages, too. This is why everything on this page is in English.
- This project is work in progress - like e.g. open source projects are. You won't get a perfectly working product like the one you would get (or not) if you paid for it. You get eBooks made in hard-working people's spare time, which you can help improve by sending feedback.
- I will not change the archive format from tar/gzip to e.g. .zip or 7z!
What do I need to make those eBooks myself?
- My conversion scripts, which are freely available. They can be found in an svn repository: https://fbo.no-ip.org/svn/fbo/wp2prc .
- A fast internet connection: the conversion script will download the HTML dump, static images and formulas - together ~8 GiB (German Wikipedia).
- One (or more, for cluster compression mode) Linux machines with wine installed
- For cluster compression mode:
- ssh + either public-key authentication or Kerberos authentication to automatically connect to cluster nodes
- 1 GB of memory on any cluster node compressing parts of any encyclopedia
- Lots of CPU time (-> weeks, in case of the English encyclopedia: months)
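Before starting a cluster run it can help to check that every node really accepts an unattended ssh login. The following Python sketch is only an illustration and not taken from the conversion scripts: the function names and host names are hypothetical. It relies on the standard OpenSSH client options BatchMode=yes (fail instead of asking for a password) and ConnectTimeout.

```python
import subprocess


def ssh_check_command(host):
    """Build an ssh call that fails instead of prompting for a password."""
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10", host, "true"]


def can_login_unattended(host):
    """Return True if 'ssh host true' succeeds without any user interaction."""
    try:
        result = subprocess.run(
            ssh_check_command(host),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        # ssh is not installed, or the node did not answer in time.
        return False
    return result.returncode == 0


# Hypothetical usage (host names are placeholders):
# unreachable = [h for h in ["node1", "node2"] if not can_login_unattended(h)]
```

A node that fails this check would need its public key or Kerberos setup fixed before it can take part in cluster compression.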
How does the conversion process work?
- The conversion script will convert a static dump (see http://static.wikipedia.org) of the Wikipedia to an electronic book with the possibility to look up words (electronic dictionary). The format of the book will be Mobipocket.
- The static dump consists of one or more .7z files containing all the articles.
- The filesystem into which the static dumps are decompressed must perform really well. It has to deal with millions of files.
- The HTML dump will be scanned for articles that are "useless" for mobile use, like discussion pages and templates - they'll be removed. Unfortunately the names of those "articles" are language dependent, which makes it difficult to remove them without having some information about the processed language. Supported languages are:
- German
- English
- Italian
- French
- Portuguese
- Japanese
- Spanish
- Remember: If your Wikipedia language is not supported, this just means the resulting eBook will be unnecessarily large - it will still work.
- The static HTML dump will be scanned; each article will be read and classified as either a real article or a symbolic link to another article. Everything will be written into a raw article list (articles.raw).
- The raw article list will be processed further - symbolic links will be resolved along a possible symlink chain, leaving each symlink either pointing to a real article or classified as broken. Broken ones won't be processed further.
- All real articles will be written into one or more, usually very big, HTML files. Articles are sorted and written into those files one after another - until a given HTML file is too big (which means a changeable limit (--chaptersize) is exceeded). When an HTML file is too big, the next one will be created. The idea behind that is to split the encyclopedia across multiple Mobipocket datafiles because of some restrictions in the Mobipocket format and the Mobipocket compiler, which performs badly or doesn't work at all with very big files.
- There is no simple connection between an article's name and the datafile it resides in (which means you can't say "find A..C in the first file, D..F in the second, ...").
- A Windows-only command-line tool called mobigen.exe is used to convert the datafile sources to real Mobipocket datafiles. 'wine' is used on Linux computers to execute that tool. Using the default HTML file size limit, the compression process itself needs about 700 MB of physical memory on a Linux machine. Together with a reasonable amount of buffer cache you're going to need 1 GB of memory.
- The conversion tool is able to distribute compression load to multiple hosts in a network, which is highly advisable because compiling a whole Wikipedia (even the German one) takes weeks of CPU time. Memory requirements of the cluster nodes are equal to the requirements of a single machine. Job distribution is done via ssh and scp. Don't forget to install wine on all the machines and to have a lot of space in /tmp ready.
- The conversion tool is able to use an SMP machine multiple times in the cluster (multiple eBook parts are compressed in parallel on it). This works for multicore machines, too.
- A single work unit takes ~13 hours on a P4-3.2GHz (which is the lower time bound for creating an encyclopedia - even if you have thousands of PCs ;-) ).
- (The WPMP eBooks were compressed on a cluster of Debian Etch machines.)
- Another Windows-only command-line tool called prcgen.exe (the predecessor of mobigen.exe) is used to merge the datafiles' index information into a standalone index file.
- The complete eBook is then archived into a .tar.gz file.
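Two of the steps above, resolving symlink chains and splitting the sorted articles into size-limited HTML files, can be sketched in a few lines of Python. This is a simplified illustration under assumptions, not the actual conversion script: the in-memory dictionary stands in for articles.raw (whose real format is not described here), the function names are made up, and chapter_size plays the role of the --chaptersize limit.

```python
def resolve(name, entries):
    """Follow a symlink chain starting at name.

    entries maps a name to ("article", html) or ("link", target_name).
    Returns the name of the real article, or None if the link is broken
    (dangling target or a cycle).
    """
    seen = set()
    while name in entries and name not in seen:
        seen.add(name)
        kind, value = entries[name]
        if kind == "article":
            return name
        name = value  # a link: continue with its target
    return None


def split_into_chapters(articles, chapter_size):
    """Distribute sorted (name, html) pairs over size-limited chapters.

    An article is always written to the current chapter; once the chapter
    exceeds chapter_size bytes it is closed and the next one is started.
    Returns a list of lists of article names.
    """
    chapters, current, size = [], [], 0
    for name, html in sorted(articles):
        current.append(name)
        size += len(html.encode("utf-8"))
        if size > chapter_size:
            chapters.append(current)
            current, size = [], 0
    if current:
        chapters.append(current)
    return chapters
```

Because chapters are closed by size and not by name, the article names in each file depend entirely on how long the articles happen to be, which is why there is no simple mapping from a name to its datafile.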
Why... ?
... HTML dumps? - there are well-structured XML dumps!
The Mobipocket conversion tool needs HTML as the source for eBooks. HTML dumps just have to be stripped of the navigation menu and some other stuff, and voila: the article is suitable to be included in an eBook. For the Japanese encyclopedia, for example, this took 2.9 h (about 30 min of CPU time).
XML dumps have to be preprocessed - templates have to be resolved, and lists/tables/style elements/... have to be rendered to HTML. This would take at least ten times longer than the simple HTML stripping - and the process is more difficult to parallelize.
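To give an idea of what stripping an HTML page can look like, here is a small Python sketch built on the standard html.parser module. It is not the code used by the conversion scripts. It assumes that the parts to remove can be identified by their id attribute, and the ids used in the usage comment ("navigation", "footer") are hypothetical examples, not the ids found in the real dumps.

```python
from html.parser import HTMLParser

# Elements without an end tag; they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class SubtreeStripper(HTMLParser):
    """Re-emit HTML while dropping every element whose id is in skip_ids.

    Comments and doctype declarations are dropped as well.
    """

    def __init__(self, skip_ids):
        super().__init__(convert_charrefs=False)
        self.skip_ids = set(skip_ids)
        self.depth = 0  # > 0 while inside a dropped subtree
        self.out = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag not in VOID_TAGS:
                self.depth += 1
        elif dict(attrs).get("id") in self.skip_ids:
            if tag not in VOID_TAGS:
                self.depth = 1
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self.depth and dict(attrs).get("id") not in self.skip_ids:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth:
            if tag not in VOID_TAGS:
                self.depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.depth:
            self.out.append(f"&#{name};")


def strip_page(html, skip_ids):
    """Return html without the subtrees rooted at the given element ids."""
    parser = SubtreeStripper(skip_ids)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


# Hypothetical usage: strip_page(page_html, {"navigation", "footer"})
```

The sketch expects reasonably well-nested HTML; badly nested pages would need a more forgiving approach.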
You want something done differently?
The whole conversion process is fully automated. There's no way to e.g. exclude articles or include newer versions of articles. It's a conversion of a full HTML snapshot to an encyclopedia. However, if you've got improvement suggestions which can be implemented by changing the conversion process - like "Article headings are too big and waste space" or "Please include formulas" - please tell us.
Bugs and Un-nice-ities
- The English encyclopedia is more a proof of concept than a real release. It was processed using ~20 different versions of my conversion script and some things were done manually (which is bad, because humans aren't perfect).
- The most recent conversion tool from Mobipocket is unable to create an index over multiple files. I had to use its predecessor prcgen.exe instead. However, prcgen.exe seems to mess up the index and doesn't like Unicode characters in the index, which is why only ASCII characters are displayed correctly.
- prcgen.exe removes the index information from the datafiles, putting it into the main index file. However, files with their indexes removed are broken, which is why the original files have to be used. This adds some MB to the eBooks' memory requirement because the index has to be stored twice.
- Mobipocket indices can only hold a maximum of 255 different characters. This is enough for most Latin-script languages like German or English. But as soon as too many special characters are used, it's impossible to make an index. This is currently a problem in the French encyclopedia, which I fixed by removing all the symbolic links.
- Formulas which are contained in articles are not shown in older encyclopedias. Newer ones (like the 08/2007 German one) do contain formulas as images. The images are LaTeX based and taken from the respective Wikipedia.
- Very large tables like periodic tables are unreadable. I currently don't know how to fix that - it's a problem in the Mobipocket HTML renderer.
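Whether a set of index entries stays within the 255-character limit mentioned above can be checked before compiling. The Python sketch below is an illustration only: the function names are invented, and it assumes the limit counts the distinct characters that occur across all index entries.

```python
def index_alphabet(titles):
    """Return the set of distinct characters used in all index entries."""
    return set("".join(titles))


def fits_index(titles, limit=255):
    """True if the entries use at most `limit` different characters."""
    return len(index_alphabet(titles)) <= limit


# Hypothetical usage with a list of article titles and link names:
# if not fits_index(all_titles):
#     print(len(index_alphabet(all_titles)), "different characters - too many")
```

Under that assumption, dropping entries such as symbolic links shrinks the character set, which matches the workaround used for the French encyclopedia.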
Other eBooks
Besides multilingual encyclopedias, the WPMP project covers mobile knowledge in general. Here is a list of what we are planning and what we've accomplished so far:
- Mobile dictionaries
- Freedict: look for ebooks/various/freedict.mobi on a WPMP mirror.
- TU-Chemnitz: This one will be less economical because I will have to do a "search" for every contained word and present a linked list of the results (the Ding way).
- Wiktionary: Seems very complicated because I have to render pages from MediaWiki notation. However, someone did it before - it should be possible.
- Other databases
- IMDB: This one is tricky for several reasons:
- The source files are a mess. It is extremely difficult to parse them.
- There are legal problems (AFAIK it's not allowed to distribute the data).
- Making the eBook will use huge amounts of RAM or lots of time - I'd prefer the lots-of-RAM way :-) .
Links
Todo
- Remove relative links which are not intra-article
- Fix stupid math-image-url-bug
- Find bug in wine-call preventing ssh from returning
- Finish clustering getformulas and getimages
Big Changes
- Test everything with a small language version on a regular basis