Wednesday, December 10, 2014

Looking for a Literature Crunching Tool

Yesterday one of the librarians I follow on Twitter asked whether a specific technical tool exists. Here are the tweets:

I replied that I'm not in lib tech, but I think that such a tool could be written. I said that I suspect it will be a 90-10 problem: i.e., a tool that handles 90% of the cases can be created with a reasonable amount of effort, but if you really need that last 10%, it will get expensive.

I'm basing my opinion on the fact that I know Mr. Snarky wrote a similar tool at an old job, because I remember answering questions about scientific literature for him. However, he no longer has that code, it is not in the public domain, and it was focused on a subset of the biomedical literature, so it is not as broad as what @mchris4duke is after.

I promised to ask my bioinformatics and computational bio/chem friends, though, because I suspect people have written similar tools. So... does anyone have public domain code that does something like this? Or lessons learned from writing such a tool they want to share?

My only additional advice is that if she hires an undergrad to do this (she's currently at Stanford and is moving soon to MIT, so suitable undergrads should be available), she should keep an eye on that 10% problem, and be sure to keep the amount of time/money spent on the tool in line with the value delivered. When I've written or managed the writing of tools with the 90-10 problem, I've usually ended up stopping development somewhere in the 90-95% of cases handled range, and just handled the remaining cases by hand. Deciding when to stop tool development is a project-specific thing, though. It is all about the cost vs. the return vs. how hard it is to do the analysis of the remaining cases by hand.

So, does anyone have anything useful to add? Put it in the comments or tweet it in reply to the tweet I used to share this post.
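For anyone tempted to take a crack at this, here is a minimal sketch of the first stage such a tool would probably need: pulling DOIs out of free-text citations with a regular expression. This is just an illustration, not anyone's actual tool; the pattern below follows Crossref's published guidance for matching modern DOIs, and it catches the vast majority of real-world DOIs but not every edge case. In other words, the 90-10 problem starts right here.

```python
import re

# Regex based on Crossref's recommended pattern for modern DOIs.
# It matches most DOIs in the wild, but not all of them.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+\b')

def extract_dois(citation_text):
    """Return all DOI-like strings found in a block of citation text."""
    # Strip trailing punctuation that the character class tends to swallow.
    return [m.rstrip('.,;') for m in DOI_PATTERN.findall(citation_text)]

citations = """\
Smith J. et al. (2012) Some paper title. J. Examples 5:1-10. doi:10.1000/xyz123.
Jones K. (2014) Another paper, no DOI given, PMID 12345678.
"""

for doi in extract_dois(citations):
    print(doi)
```

Citations with no DOI at all (like the second one above) would fall through to a slower lookup step, or to handling by hand.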

(P.S., I am FINALLY about to notify the winner from this giveaway. So, if you don't hear from me by the end of the day, you didn't win. Sorry for the delay. Holiday season, blah blah blah.)


  1. Not automated or in bulk, but perhaps it solves a piece of the puzzle in that it allows input of a copied-and-pasted citation rather than formatted data. It is popular with UIUC users who have tried it.

  2. If the list of citations has DOIs, then you can submit them to the DOI resolver. The resolver will check your IP address and point you to where you can get the paper. If your IP subscribes to the journal, you will be sent to the appropriate page to download the article. If your IP doesn't subscribe, then you need to log in with your credentials for the journal (an individual subscriber account) or purchase the article.

    If you do this from a major research Univ. library, you should be able to get many if not most of your journal articles w/o extra payment.

    Like you, I'm a UC alumna with a lifetime alumni library account. (Well worth the $$$ when you graduate!) I can bring my laptop to any UC library and then download articles I want.

  3. I have EndNote (a reference manager), and you can give it a list of citations and it will find them (you link it to a library, such as a university library, that presumably has access to most of the journals) and download all the PDFs of the papers for you. I think Papers for Mac does the same thing. An off-the-shelf solution, even better than writing your own script!

  4. Tracy:

    She's right that this is a tractable problem, and it shouldn't be too terrible a script if a tool doesn't exist already. If she's at Stanford, there's also someone there in the library who could probably help her (I'll go tweet at her). Could she write a short blog post, or a guest post you can publish, so we can point people at it to answer her question?

    There's definitely a lot of interest in the library community in using these tools, and some developing resources on teaching them to librarians in particular. Also, I happen to be teaching a Data Carpentry workshop there in April if she's interested in learning more about doing some of the automation.
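To follow up on the DOI-resolver suggestion in the comments: once you have DOIs in hand, turning them into resolvable links is just string construction, and a script could then fetch each link from inside a subscribing network. A hedged sketch (the actual network fetch is shown only as a comment, since it needs network access and the right IP):

```python
from urllib.parse import quote

# The standard public DOI resolver.
DOI_RESOLVER = "https://doi.org/"

def doi_to_url(doi):
    """Build a resolver URL for a DOI. The resolver redirects to the
    publisher's landing page, where IP-based subscriptions apply."""
    # DOIs may contain characters that need percent-encoding in a URL path.
    return DOI_RESOLVER + quote(doi, safe="/")

print(doi_to_url("10.1000/xyz123"))

# To actually follow the redirect (requires network access):
#   import urllib.request
#   landing_page = urllib.request.urlopen(doi_to_url("10.1000/xyz123")).geturl()
```

Run from a campus network with broad journal subscriptions, as the commenter notes, the landing pages this resolves to would let you download most of the articles without extra payment.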


Sorry for the CAPTCHA, folks. The spammers were stealing too much of my time.