The Lazy Way To Download Journal Articles
2 years, 6 months ago
So imagine you have a load of PDFs you’d like to download from a website, but they all have the same filename, renaming them takes a while, and clicking on each one is getting very tiring. Yep, it’s time for a script …
It turns out Springerlink are offering a bunch of such articles for download on their website and I wanted to browse them all locally. I started out by retrieving a list of titles and URLs, one per line with the title first and the URL last, by running this in the browser console:
$('.journalArticle').each(function(index, object) {
  var base = ' http://www.springerlink.com';
  var title = $(this).find('.title a').text();
  var url = $(this).find('.pdf a').attr('href');
  console.log(title + base + url);
})
Each article is wrapped in a div with class="journalArticle" so we can just iterate through those and pull out the title and url (both of which are also conveniently marked with ‘title’ and ‘pdf’ classes respectively). The resulting list can be copied into a text file. Now all we need is a script to run through this file and do the downloading:
awk '{
  system("wget --user-agent=1337 " $NF)
  $NF = ""; NF--
  gsub(" ", "", $0)
  print $0
  system("mv fulltext.pdf " $0 ".pdf")
}' mylist.txt
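Before letting something like this loose on a long list, it can help to preview what it would do. Here is a rough dry-run sketch of the same awk logic with the system() calls swapped for print, so nothing is downloaded or renamed. The function name preview_awk and the sample title and URL are made up for illustration.

```shell
# Dry run of the awk approach: print the commands instead of running them.
# preview_awk and the sample line below are hypothetical, for illustration only.
preview_awk() {
  awk '{
    url = $NF            # last field on the line is the URL
    $NF = ""             # blank the URL field; awk rebuilds $0 from the rest
    gsub(" ", "", $0)    # strip every space from what remains (the title)
    print "wget --user-agent=1337 " url
    print "mv fulltext.pdf " $0 ".pdf"
  }'
}

printf '%s\n' 'A Sample Article Title http://www.springerlink.com/content/abc123/fulltext.pdf' | preview_awk
```

To preview the real list, feed it mylist.txt instead: preview_awk < mylist.txt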
Springerlink reject requests without a user agent header so I’ve added in a false one which they accept nicely. The script just calls wget on the last part of every line (the URL), then removes spaces from the rest of the line (the title) and renames the downloaded file to that. It’s pretty horrible so let’s have another go:
cat mylist.txt | \
  sed 's;\(.*\) \(.*\)\(\.[^.]*\)$;wget "\2\3" -O "\1\3";' | \
  bash -
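sed does a much better job of grabbing the arguments: \1 captures the title, while \2 and \3 capture the URL up to and from its final dot, so the output file is named after the title with the URL’s extension. Using wget’s -O option is a far easier way to achieve the rename. Job done.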
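If piping generated text straight into bash feels a bit risky (a title containing quotes or a $ would be interpreted by the shell), a plain read loop is one possible alternative. This is only a sketch, not the approach used above: it splits each line at the last space with parameter expansion instead of sed, keeps the spaces in the title, and always appends .pdf. The function name fetch_list, the DRY_RUN switch and the sample line are made up for illustration, and it reuses the fake user agent from the awk version on the assumption that it is still needed.

```shell
# Sketch: read "title url" lines and download each PDF under its title.
# fetch_list and DRY_RUN are hypothetical names invented for this example.
fetch_list() {
  while IFS= read -r line; do
    [ -n "$line" ] || continue      # skip blank lines
    url=${line##* }                 # everything after the last space
    title=${line% *}                # everything before the last space
    out="${title}.pdf"
    if [ "${DRY_RUN:-0}" = 1 ]; then
      # Dry run: only show the command that would be executed
      printf 'wget --user-agent=1337 -O "%s" "%s"\n' "$out" "$url"
    else
      wget --user-agent=1337 -O "$out" "$url"
    fi
  done
}

# Preview with a made-up line; nothing is downloaded while DRY_RUN=1.
printf '%s\n' 'A Sample Article Title http://www.springerlink.com/content/abc123/fulltext.pdf' | DRY_RUN=1 fetch_list
```

To do the real download, drop DRY_RUN=1 and feed it the list: fetch_list < mylist.txt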