Guilty secret: I've wanted to download Wikipedia for a really, really long time.
Every few months the Wikimedia Foundation bundles up different compilations of Wikipedia's content, including just current text, just abstracts, all revision text, etc., and releases each as a nice little download. By little, I mean the same way you'd refer to your pet elephant as your nice little friend. The full download of all pages and all revisions--just text, not images or other media--apparently expands to "multiple terabytes" of data, so, you know, time to clean out your pirated Kenny G albums and free up a little space. I myself opted for the comparatively minuscule current revision download, choosing the 10.3GB download that, when unzipped, is a tiny 46GB XML file.
Wait what? One XML file?
Oh did I not mention that? Yes, all ~14.75 million articles are stored in one giant file. This presents a few problems, namely, actually accessing the data within. There are very few editors that will even try to open a 46GB XML file, and none that will successfully do so, which means that when you're trying to parse it you're flying relatively blind, based only on the structure from smaller examples that you can obtain from Wikipedia. It's a good test of writing bulletproof code, because you literally can't tell what curveballs the data might throw at you until you start going through it. Luckily, Wikipedia is pretty good about structuring their data, and it's nicely enclosed in sensible XML tags. I'm only dealing with a single revision of each page, thankfully, so I only have to worry about one set of text tags. Likewise, the title tag neatly denotes the article title. Both, along with various other bits of metadata--article ID, author, etc.--are surrounded by page tags.
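To make that structure concrete, here is a minimal sketch of pulling pages out of such a file one at a time without loading the whole thing. It is an illustration, not the script from this post. The function name `iter_pages` is made up, and the sketch assumes that each `<page>` and `</page>` tag sits on its own line and that each `<title>` fits on a single line.

```python
import io
import re

# Assumption: the title element opens and closes on one line.
TITLE_RE = re.compile(r"<title>(.*?)</title>")

def iter_pages(lines):
    """Yield (title, page_xml) for each <page>...</page> block, one at a time."""
    buffer = None  # None means we are currently outside a <page> block
    for line in lines:
        if buffer is None:
            if "<page>" in line:
                buffer = [line]  # start collecting a new page
        else:
            buffer.append(line)
        if buffer is not None and "</page>" in line:
            page_xml = "".join(buffer)
            match = TITLE_RE.search(page_xml)
            yield (match.group(1) if match else None), page_xml
            buffer = None  # back outside, ready for the next page

# Tiny stand-in for the dump; a real run would pass an open file object.
sample = io.StringIO(
    "<mediawiki>\n"
    "  <page>\n"
    "    <title>Elephant</title>\n"
    "    <text>Large mammal.</text>\n"
    "  </page>\n"
    "  <page>\n"
    "    <title>Kenny G</title>\n"
    "    <text>Saxophonist.</text>\n"
    "  </page>\n"
    "</mediawiki>\n"
)
titles = [title for title, _ in iter_pages(sample)]
```

Only one page's worth of lines is held in memory at any moment, so the size of the full file doesn't matter. Counting articles is the same loop with a counter in place of the list.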
On the surface, then, it seemed like a relatively simple parsing problem. Tons of languages have libraries built specifically to parse XML, but I wanted to write my own code to do so. Why? Two reasons. First: I wasn't sure if the libraries would try to recreate the structure of a 46GB file in my memory. Second: I've already gone to the trouble of downloading and unzipping Wikipedia's data dump for fun. Do you think I'd start using existing resources now? Nah. If I'm going to undertake a useless project, I'm going to undertake a grassroots useless project.
After a little preliminary testing, I decided the first order of business would be to split up the file into manageable chunks. This in itself posed a tricky problem: no matter the solution, I'd have to balance a tradeoff. If I split the dump into more files, I'd be able to keep the size of each down, but would consequently have to deal with a larger number of files. On the other hand, if I opted to keep my filesystem neater, I'd be presented with the same problem I was trying to solve, namely, large and unwieldy files that take forever to search through and parse (just running line by line through the original to count the articles takes upwards of ten minutes).
I settled on (as I had to) a compromise; each article's content would be contained within an XML file corresponding to the first three characters of the title. I hacked together a little Python script that did just that, tested it, and set it in motion with eager expectations. Then I waited. It was creating files, that was for sure--there were a few thousand in a matter of seconds. But, silly me, I'd put in no way to track progress! I tweaked the script to say when it had processed a hundred files, then ran it again. Based on the time it took for a few hundred, I calculated an estimated eighty hours would elapse before it had processed the entire data dump. Which was a little longer than I was willing to wait, and plus, half the fun was in the optimization! At least I hoped, because otherwise I was making this out in my head to be a lot more fun than it actually would be. I assumed C++ would be much faster, because, you know, it's C++. Nope. First program took ten times as long as the Python script. I refined it to use the same parsing and processing algorithm as the script, and ran it again, hoping to see some massive difference in speed. Nope again! It still took ever so slightly longer than the script (somewhere on the order of a few more tenths of a second for the first few hundred articles).
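For reference, the three-character bucketing and the back-of-the-envelope time estimate from the Python half of this story can be sketched roughly as follows. This is a guess at the shape of such a script, not the original. The names `bucket_name`, `append_page`, and `estimate_hours` are hypothetical, and so are the lowercasing and the rule for characters that aren't safe in file names. The timing figure in the usage lines is invented purely to exercise the function.

```python
import os
import tempfile

def bucket_name(title):
    """Map a title to a file name built from its first three characters."""
    prefix = title[:3].lower()
    # Hypothetical sanitising rule: anything that is not a letter or digit
    # becomes "_" so the prefix is safe to use as a file name.
    safe = "".join(ch if ch.isalnum() else "_" for ch in prefix)
    return (safe or "_") + ".xml"

def append_page(out_dir, title, page_xml):
    """Append one page to its bucket file (open, append, close per page)."""
    path = os.path.join(out_dir, bucket_name(title))
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(page_xml)
    return path

def estimate_hours(done, elapsed_seconds, total):
    """Extrapolate total run time linearly from the pages processed so far."""
    return elapsed_seconds / done * total / 3600

# Usage with placeholder page bodies in a throwaway directory.
out_dir = tempfile.mkdtemp()
append_page(out_dir, "Elephant", "<page>first</page>\n")
append_page(out_dir, "Elevator", "<page>second</page>\n")
append_page(out_dir, "C++", "<page>third</page>\n")
buckets = sorted(os.listdir(out_dir))

# Made-up timing: 1000 pages in 10 seconds, 14.75 million pages total.
hours = estimate_hours(1000, 10.0, 14_750_000)
```

"Elephant" and "Elevator" land in the same file, which is the whole point of the compromise. A linear estimate like this assumes the processing rate stays constant for the whole run.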
I was crushed. The language I compiled and debugged as if it were my own child (I don't know what it means, work it out in your own head) had disappointed me. Nevertheless, I let it run for a while to see how it progressed, and noticed that as it continued, it ran slower and slower. Around the hundred thousand article mark, I realized it was taking upwards of seven or eight seconds to process the same number of articles it had spent only half a second on at the program start. I'm still working on ways to optimize the processing, but I'm pretty sure it has something to do with the fact that as it handles more and more articles, it's opening, appending onto, and closing files that are getting larger and larger, bringing my original estimate up to perhaps hundreds of hours. With this in mind, I might go the route of many articles, for while I don't relish the thought of 14.75 million files on a hard drive (that of my laptop or external), it might make things run faster. Another option is to just use a faster computer, which is an avenue I'm exploring.
Once I get all these articles separated out, you may ask, what am I planning on actually doing with this massive and disgustingly redundant store of information?
I have absolutely no idea yet.
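A quick aside on the slowdown before moving on: if the repeated open, append, close cycle really is the culprit (a suspicion, not something measured here), one thing to try is collecting pages in memory and writing them out in batches, so each bucket file gets opened once per batch instead of once per article. What follows is an untested idea in sketch form. The class `BufferedBucketWriter` and its `max_buffered_pages` parameter are hypothetical names, and the `opens` counter exists only to show the effect.

```python
import os
import tempfile
from collections import defaultdict

class BufferedBucketWriter:
    """Collect pages per bucket in memory and append them in batches."""

    def __init__(self, out_dir, max_buffered_pages=10_000):
        self.out_dir = out_dir
        self.max_buffered_pages = max_buffered_pages
        self.pending = defaultdict(list)  # bucket file name -> list of pages
        self.pending_count = 0
        self.opens = 0  # how many times a file was opened, for comparison

    def add(self, bucket, page_xml):
        self.pending[bucket].append(page_xml)
        self.pending_count += 1
        if self.pending_count >= self.max_buffered_pages:
            self.flush()

    def flush(self):
        # One open/append/close per bucket per batch, not per page.
        for bucket, pages in self.pending.items():
            path = os.path.join(self.out_dir, bucket)
            with open(path, "a", encoding="utf-8") as handle:
                handle.writelines(pages)
            self.opens += 1
        self.pending.clear()
        self.pending_count = 0

# Usage: four pages, batch size of three, two buckets.
out_dir = tempfile.mkdtemp()
writer = BufferedBucketWriter(out_dir, max_buffered_pages=3)
for bucket, page in [("ele.xml", "a\n"), ("ele.xml", "b\n"),
                     ("ken.xml", "c\n"), ("ele.xml", "d\n")]:
    writer.add(bucket, page)
writer.flush()  # write whatever is left at the end of the run
```

The batch size here counts pages rather than bytes, so a real run would need a limit that fits in memory. Whether this actually helps depends on where the time is going, which is still an open question.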
It's no big secret that I'm a huge fan of The Hitchhiker's Guide to the Galaxy, which is probably where the inspiration for the project came from in the first place. I'd like to make a simple program for starters that takes a search, returns a list of articles, and then, given an article, displays it in a nice way--something made easier by the fact that the text of the dump preserves all of the Wiki markup. If I have the resources, I'm tempted to dump it on my RasPi, get a cheap portable display, and make an actual mobile Wikipedia. Yes, it can be done with a data connection, as several friends have pointed out to me. No, that's not as much fun.
I'm putting together a slide deck about the project, and I'm nowhere near done. More to come soon.