Sunday 8 April 2012

bibtexbrowser... Music for Publication Lists (Part I)

A journey through the realms of boredom...

If your job involves writing academic papers, you have no real option but to maintain a publications list on your web page.

<rant>
To make things worse, in addition to an author's personal page, there is the institutional page, which also needs to be kept up to date.
Some are still stuck with good old ~geo/public_html pages and manual updates. Most organisations have taken a step forward and have implemented their own bespoke publication management systems ('institutional repositories'). Those systems usually share some common characteristics: i) a poor attempt at a catchy "SOME-ACRO" name, ii) they cost $£¥€ to develop and iii) they are quite rubbish...
I mean seriously people, import from bibtex must be the first feature to implement, yet whenever I've landed a new job I've had to spend hours in front of some appalling, counter-intuitive, pointy pointy clicky clicky web-based interface, manually entering my (short) publication list. And then management complains if the list is not kept up to date.
I can't even begin to imagine life for people with very long lists (although I suspect they also have the option to delegate the task). Sometimes I wonder if this is why the term 'Selected Papers' was invented in the first place...
</rant>

But back to our personal web page. Doing everything manually is obviously not an option; it's tedious, boring and error-prone, and it deprives you of all the features you could get by generating the page dynamically (sorting, searching, etc.).

I simply could not possibly be bothered...

First Paradigm: Generate offline, serve statically

One can write a set of scripts to generate static HTML content from bibliographic data. Whenever the data are updated, the content must be regenerated. This, however, does not necessarily have to be triggered manually by the user: it can be, for example, a cron job which runs every few hours. This is a lot less of a burden on your back-end than generating the page on the fly every time it is accessed, and it is actually a very good approach, since a publication list is updated far less often than it is accessed.
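To make the idea concrete, here is a minimal PHP sketch of that paradigm. All paths are hypothetical, and the bibtex 'parsing' is deliberately naive (a one-line regex rather than a proper parser), but it shows the shape of the thing:

```php
<?php
// regen.php -- a minimal sketch of the offline paradigm (all paths hypothetical).
// Run it from cron, e.g.:  0 */6 * * * php /home/me/regen.php
// The bibtex "parsing" below is deliberately naive; a real setup would use a
// proper parser instead of a regex.

$bib  = file_get_contents('/home/me/pubs.bib');
$html = "<html><body><h1>Publications</h1><ul>\n";

// grab the title field of every entry (works only for well-behaved files)
preg_match_all('/^\s*title\s*=\s*[{"](.+?)[}"],?\s*$/mi', $bib, $matches);
foreach ($matches[1] as $title) {
    $html .= '<li>' . htmlspecialchars($title) . "</li>\n";
}

$html .= "</ul></body></html>\n";

// write the static page; the web server then serves it as-is
file_put_contents('/var/www/html/publications.html', $html);
```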

This relies on having scripts which can write directly to the file system hosting the content. If your personal page is hosted on your own box or on some company or university server, this is often not a problem. However, if you are using some commercial hosting company, this approach is probably not an option.

Dynamic Generation, Take One

(or 'How to make your own life harder than it needs to be')

As discussed earlier, I skipped the manual update business altogether. I would have gone for the approach above but when I mentioned the words shell, cron job, script and make to my hosting provider techies (who happen to be mates), they laughed at me with passion and said "Just generate it dynamically". That was a couple of years ago (or four).

I then had a quick look around the internet. What I was looking for was a system that would read from bibtex files, generate a page and add various links to each entry (DOI, publisher URL, pre-print PDF if available, etc.). Unfortunately, I couldn't quite find what I was looking for. Most solutions were either overkill or lacked features, so I ended up knocking together my own PHP-based system. At the time I was having my first go at PHP, so, being rather clueless, I feared it would take me ages to write a bibtex parser.

I ended up with an intermediate solution. Here is how it worked, briefly:
  • As a starting point, I kept each paper in its own bibtex file.
  • Using the bibutils suite, I would generate bibliographic data in MODS format, one MODS file per bibtex file. This was done offline with GNU make. Remember: no shell access to the hosting machine, so this would take place on my own box.
  • The same build system would collate all the individual .bib files into a single bibtex DB.
  • I would then upload the updated .bib and MODS files to my server. The MODS format is XML, so it was very easy to parse in PHP and generate the pages (see the sketch after this list).
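For a flavour of the server-side half, here is a rough reconstruction (from memory, not the actual code) of the MODS parsing. Filenames are hypothetical; the element paths follow the MODS schema as emitted by bibutils' bib2xml, which wraps entries in a modsCollection:

```php
<?php
// parse_mods.php -- a rough sketch of how the MODS files were turned into HTML
// (filenames hypothetical; element paths per the MODS schema as produced by
// bibutils' bib2xml).

$ns  = 'http://www.loc.gov/mods/v3';
$doc = simplexml_load_file('mypaper.xml');
$doc->registerXPathNamespace('m', $ns);

foreach ($doc->xpath('//m:mods') as $mods) {
    $mods->registerXPathNamespace('m', $ns);

    $t     = $mods->xpath('.//m:titleInfo/m:title');
    $title = $t ? (string) $t[0] : '(untitled)';

    // bibutils splits each author into given/family nameParts; joining them
    // all with spaces is crude but readable enough for a list entry
    $authors = array();
    foreach ($mods->xpath('.//m:name/m:namePart') as $part) {
        $authors[] = (string) $part;
    }

    echo '<li>' . htmlspecialchars(implode(' ', $authors)) . '. '
       . '<em>' . htmlspecialchars($title) . "</em></li>\n";
}
```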
At the start I thought this was really cool but I quickly started stumbling. Every time I had to make a change, no matter how minor, I had to:
  • Run make
  • Identify which bib and MODS files were affected by the change and re-upload them to the server. I'd also have to upload the collated bib DB (which I would often forget).
A small typo or encoding error in the bib: make and upload. Paper got accepted: make and upload. Paper went live on IEEE Xplore and got a DOI: make and upload. To make matters even worse, my web hosting uses a point-and-click upload interface, so I couldn't even script the uploads; I had to do them by hand. It was still a lot better than manually maintained static pages, though.

It only took a couple of new papers, a few dozen typos and a few thousand clicks to realise that:

I simply could no longer be bothered...

Next Attempt at Dynamic Generation: bibtexbrowser

(or 'I wish I'd spotted this earlier'...)

Thursday the 5th of April 2012 was a day of revelation. The time had come again to update my publications list, but I'd had enough of make/upload cycles. I googled "bibtex php parser" and came across Martin Monperrus' bibtexbrowser. This excellent PHP script does exactly what I needed but hadn't found previously: you upload a bibtex file to the hosting box, and the script generates the publication list on the fly.

The paradigm is the same as my previous system's: 'PHP reads bibliography DB, PHP generates page'. However, it works straight from the bibtex file, without the need for an intermediate format, which reduces the maintenance overhead from 'annoying' to 'ridiculously trivial'.
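In case you're curious what that looks like in practice, usage goes something like this (mybib.bib is a placeholder, and the exact parameters are best checked against bibtexbrowser's own documentation):

```php
<?php
// Standalone use is just a URL:
//   http://example.com/bibtexbrowser.php?bib=mybib.bib&all
//
// Embedding it in your own page is (at least in the version I used)
// a matter of faking the GET parameters and including the script:

$_GET['bib'] = 'mybib.bib';   // the bibtex file sitting next to the script
$_GET['all'] = 1;             // show every entry
include('bibtexbrowser.php');
```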

It also has a few features that add even further value:
  • It can be used standalone, embedded in your own page, or as a library. The library functionality is rather awesome but undocumented at the moment; I only found out about it by reading a bunch of comments inside the script.
  • It is very easy to customise.
  • It adds Google Scholar metadata to your pages (and EPrints and Dublin Core, if you want), sketched below.
  • It generates COinS metadata so that software like Zotero or Mendeley can import directly from your page.
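To give an idea of what that metadata actually is, here is a hand-written sketch for a hypothetical entry. The meta tag names follow Google Scholar's inclusion guidelines and the span follows the COinS convention, but the exact markup bibtexbrowser emits may well differ:

```php
<?php
// A hypothetical entry, just to show the shape of the metadata --
// not necessarily what bibtexbrowser itself outputs.
$title  = 'A Hypothetical Paper';
$author = 'J. Doe';
$year   = '2012';
?>
<!-- Google Scholar-style meta tags (one citation_author per author) -->
<meta name="citation_title" content="<?php echo htmlspecialchars($title); ?>">
<meta name="citation_author" content="<?php echo htmlspecialchars($author); ?>">
<meta name="citation_publication_date" content="<?php echo $year; ?>">

<!-- A COinS span: an empty <span> whose title attribute carries an
     OpenURL ContextObject that Zotero/Mendeley know how to read -->
<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info:ofi/fmt:kev:mtx:journal&amp;rft.atitle=<?php echo urlencode($title); ?>&amp;rft.au=<?php echo urlencode($author); ?>&amp;rft.date=<?php echo $year; ?>"></span>
```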
I played around with it over the last couple of days. Not only did it make me put the previous system swiftly in the bin, but it also made me want to tell the world how good it is.

In the next part of this post, I am going to share my experiences from customising it.