buzzArxiv

TL;DR: I talk about some text frequency analysis I did on the arxiv.org corpus using Python, MySQL, and R to identify trends and spot interesting new physics results.

 

In one of my previous posts, I mentioned some optimization I had done on a word-frequency analysis tool. I thought I’d say a bit more here about the tool (buzzArxiv), which I’ve put together using Python and R to find articles that are creating a lot of ‘buzz’.

For those who don’t know, arxiv.org is an online repository of scientific papers, categorized by field (experimental particle physics, astrophysics, condensed matter, etc.). Most of the pre-print articles posted to the arxiv eventually also get submitted to journals, but it’s usually on the arxiv that word about new work gets disseminated in the community. The thing is, there are just gobs of new material on there every day. The experimental, phenomenology, and theory particle physics mailings can each have a dozen or so articles per day.


Making PHP and SHTML Play Nice

The CSS style and much of the layout of this site were stolen wholesale from http://www.freecsstemplates.org/. Laziness won out: I didn’t feel like designing the site from the ground up at the time, and these files, written using SHTML server-side includes, provided an easy springboard. Now, as I try to expand the site several months later, I find myself locked into a format whose ramifications I had not fully thought out.

 

So first, a word about the site implementation: SHTML provides an exceedingly simple server-side scripting language on top of HTML. Essentially, it gives you server-side includes (the ability to insert code from a different file before serving content to the user, which I use to display the same sidebar, footer, etc. on all of this site’s pages), simple variables and conditional directives, and not much else. This is brilliant if you’re interested in maintaining a site with shared content across a simple structure.

 

Now, WordPress, the popular content management system serving the text you are reading right now, is built on PHP. PHP is also a server-side scripting language, but orders of magnitude more complex (and powerful) than SHTML. Naively, integrating WordPress would require migrating all of my SHTML files to PHP. This is not an excruciatingly complicated process: rename a few files, change a few
<!--#include virtual="myfile.html" -->
into
<?php include 'myfile.html'; ?>,
and voilà, the site would be ready to use WordPress. But I’m ornery, and I don’t like change, and more importantly, I wanted to understand whether the two languages could be made to run side by side.

 

Short answer: no, they can’t. The processor that parses the SHTML or PHP code and generates the HTML pushed to the client is chosen based on the file extension, and you cannot modify the server behavior to run the two processors serially on the same file.

 

Hackish answer: yes, they can! The very feature of SHTML which originally motivated me to use it, the server-side include, can solve this problem too. The SHTML include directive does not simply dump text into the page. If the included file is a recognized script, the server runs the script and includes its output instead. Practically speaking, that means you can execute multiple PHP, Perl, or any other CGI scripts from the same file. The query string, which WordPress uses to determine which posts to display, can also be passed to the script by using some more server-side magic:
<!--#set var="blog_script" value="blog.php?$QUERY_STRING" -->
<!--#include virtual="$blog_script" -->

 

This may seem slightly roundabout, but as it turns out, it incurs little to no penalty in the rendering time for the page. In my case, a few extra files (blog.php, permalink.php) with snippets of PHP allow me to render entries written in WordPress without any other site redesign.

 

Conclusion: it would have been faster to migrate to PHP, but now I know a bit more about the technology serving this site. Active laziness wins again.

 

P.S.
I said that the performance of pages served using PHP and SHTML was essentially the same. This is perhaps not quite so true.

The above plot shows the retrieval time for two identical pages, one using SHTML to include a PHP snippet, the other using PHP natively. You can see that the SHTML page is slower on average (by about 12% actually), but that the spread easily covers this difference.
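For anyone who wants to repeat this kind of comparison, here is a minimal Python sketch of how it could be done. It is not the script behind the plot: the function names and the example URLs are made up for illustration, and reading "the spread covers the difference" as "the gap between the means is smaller than the larger sample standard deviation" is my assumption about what that phrase means in numbers.

```python
import statistics
import time
from urllib.request import urlopen


def fetch_url(url):
    """Return a zero-argument callable that downloads url and discards the body."""
    def fetch():
        with urlopen(url) as response:
            response.read()
    return fetch


def time_fetch(fetch, n=20):
    """Call fetch() n times and return each call's wall-clock duration in seconds."""
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        fetch()
        durations.append(time.perf_counter() - start)
    return durations


def compare(times_a, times_b):
    """Summarize two timing samples (each needs at least two measurements)."""
    mean_a = statistics.mean(times_a)
    mean_b = statistics.mean(times_b)
    spread = max(statistics.stdev(times_a), statistics.stdev(times_b))
    return {
        "mean_a": mean_a,
        "mean_b": mean_b,
        # Fractional slowdown of A relative to B, e.g. 0.12 means 12% slower.
        "slowdown": (mean_a - mean_b) / mean_b,
        # Assumed criterion: the difference is smaller than the larger spread.
        "within_spread": abs(mean_a - mean_b) < spread,
    }


# Example usage (hypothetical URLs, replace with two real, identical pages):
# shtml_times = time_fetch(fetch_url("http://example.com/page.shtml"))
# php_times = time_fetch(fetch_url("http://example.com/page.php"))
# print(compare(shtml_times, php_times))
```

Timing the whole request from the client side folds in network jitter, which is probably a good part of the spread, so more repetitions give a steadier average.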

Yes, this means I went through the effort of migrating one of my pages from SHTML to PHP, but that’s beside the point. 😉