It’s All About the Datastructures — Stupid

TL;DR: Some rookie mistakes were lurking in some Python code. I hunted them down by profiling with cProfile and reduced execution time by an order of magnitude.

 

This post is a little story about some profiling I did on Python code, which ultimately revealed that a poor choice of data structures had a significant impact on performance. The problem was in a script I use to look at trends in physics topics posted to arxiv.org (more on the output of the tool in another post). After pulling the data from arXiv, the script works out word frequencies and identifies words whose occurrences have increased. Parsing the corpus was taking an excessively long time: my laptop fan would start whirring, and the system would slow to a crawl. Clearly, something very non-optimal was happening.

My first thought was that the issue was in the part of the code that handles the histogramming of the occurrences, but nothing obvious jumped out at me when I reviewed it. Enter the Python profilers. These things are great: they’re dead simple to run, and the results are relatively easy to parse. Of the three profilers the Python documentation recommends, cProfile is my favorite.
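To give a feel for how little ceremony cProfile needs, here is a minimal sketch. It is not the actual arXiv script: `count_words`, `profile_call`, and the sample documents are hypothetical stand-ins I made up for illustration. Only the `cProfile` and `pstats` calls come from the standard library.

```python
import cProfile
import io
import pstats
from collections import Counter


def count_words(docs):
    """Hypothetical stand-in for the corpus parser: count word frequencies."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def profile_call(func, *args):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first,
    # and show only the top 10 entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


# Made-up sample data, not real arXiv abstracts.
docs = ["Quantum gravity and quantum fields", "Dark matter halo"]
counts, report = profile_call(count_words, docs)
print(report)
```

For a whole script, the same kind of report is available from the command line with `python -m cProfile -s cumulative my_script.py`, where `my_script.py` is a placeholder name.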