TLDR; Pickle is slow, cPickle is faster, JSON is surprisingly fast, Marshal is fastest, but when using PyPy all fall before a humble text file parser.
I’ve been working on the Yandex Personalized Web Search challenge on kaggle, which requires me to read in a large amount of data stored as multi-line, tab-separated, records. The data is ragged, each record has sub-fields of variable length, and there are 34.5 million of them. There are too many records to load in memory, and starting out, I’m not quite sure what features will end up being important. To reduce the bottle-neck of reading the files from disk, I wanted to pre-process the data, and store it in some format which afforded me faster reading and writing off the disk, without going to a full blown database solution. I ended up testing the following formats:
- pickle (using both the pickle and cPickle modules), a module which can serialize just about any datas structure, but is only available for python. I’ve used this extensively to persistify derived data for work.
- JSON, which only supports a basic set of data-types (fine for this use case), but can be handled by a larger number of languages (irrelevant for this use case).
- Marshal, which is not really meant for general serialization, but I was curious about it’s internals. This is not really a recommended format, the documentation clearly warns “Details of the format are undocumented on purpose; it may change between Python versions.”
TL;DR: I talk about some text frequency analysis I did on the arxiv.org corpus using python, mysql, and R to identify trends and spot interesting new physics results.
In one of my previous posts, I mentioned some optimization I had done on a word-frequency anlysis tool. I thought I’d say a bit more here about the tool (buzzArxiv), which I’ve put together using Python and R to find articles that are creating a lot of ‘buzz’.
For those who don’t know, arxiv.org is an online repository of scientific papers, categorized by field (experimental particle physics, astrophysics, condensed matter, etc…). Most of the pre-print articles posted to the arxiv eventually also get submitted to journals, but it’s usually on the arxiv that the word about new work gets disseminated in the community. The thing is, there’s just gobs of new material on there every day. The experimental, pheonomonolgy, and theory particle physics mailings can each have a dozen or so articles per day.
TL;DR: There were some rookie mistakes lurking in some python code which I hunted down with profiling (using cProfile), reducing execution time by an order of magnitude.
This post is a little story about some profiling I did on python code which ultimately revealed that a poor choice of datastructures had a significant impact on code performance. The problem was in some python code which I use to look at trends in physics topics posted to arxiv.org (more on the output of the tool in another post). After pulling the data from arxiv, the script works out word frequency and identifies increased occurences. The parsing of the corpus was taking an excessively long time, my laptop fan would start whirling, and the system would slow to a crawl. Clearly, something very non-optimal was happening.
My first thought was that the issue was in a part of the code that handles the histograming of the occurences, but nothing obvious jumped out at me when I reviewed the code. Enter the python profilers. These things are gret, they’re dead simple to run, and the results are relatively easy to parse. Of the three profilers the python documentation recommends, cProfile is my favorite.