Journal

By Steve Challis

The Lazy Way To Download Journal Articles

November 6th, 2010

  • awk
  • jquery
  • script
  • unix
  • wget

So imagine you have a load of PDFs you'd like to download from a website, but they all have the same filename, renaming them takes a while, and clicking on each one is getting very tiring. Yep, it's time for a script...

It turns out Springerlink are offering a bunch of such articles for download on their website and I wanted to browse them all locally. I started out by retrieving a list of titles and URLs in the format '[title] [url]\n' using jQuery and the following snippet of code:

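// Each article sits in a div with class "journalArticle"
$('.journalArticle').each(function() {
    // The leading space separates the title from the URL in the output
    var base = ' http://www.springerlink.com';
    var title = $(this).find('.title a').text();
    var url = $(this).find('.pdf a').attr('href');
    console.log(title + base + url);
});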

Each article is wrapped in a div with class="journalArticle", so we can just iterate through those and pull out the title and URL (conveniently marked with the 'title' and 'pdf' classes respectively). The resulting list can be copied into a text file. Now all we need is a script to run through this file and do the downloading:
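Before downloading anything, it may be worth checking that the copied list really has the '[title] [url]' shape on every line, since text pasted out of a browser console can pick up stray lines. The following is only a sketch: `check_list` is a hypothetical helper name, and it assumes every URL starts with http:// or https://.

```shell
# check_list FILE
# Prints every line that has no title or whose last field is not an
# http(s) URL, prefixed with the file name and line number.
# Returns 0 if every line looks fine, 1 otherwise.
check_list() {
    awk 'NF < 2 || $NF !~ /^https?:\/\// {
             print FILENAME ":" NR ": " $0
             bad = 1
         }
         END { exit bad + 0 }' "$1"
}
```

Running `check_list mylist.txt` prints nothing when the list is clean, so any output points at a line to fix by hand.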

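awk '{
    # The last field is the URL; fetch it with a dummy user agent.
    system("wget --user-agent=1337 \"" $NF "\"");
    # Drop the URL, then strip the spaces from what is left (the title).
    $NF = "";
    gsub(" ", "", $0);
    print $0;
    # wget saved the download as fulltext.pdf; rename it after the title.
    system("mv fulltext.pdf \"" $0 ".pdf\"")
}' mylist.txt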

Springerlink reject requests without a user agent header, so I've added a fake one, which they accept nicely. The script calls wget on the last part of every line (the URL), then removes the spaces from the remaining part of the line (the title) and renames the downloaded file to that name. It's pretty horrible, so let's have another go:
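To see what the awk field juggling described above produces without touching the network, a dry run along these lines might help. It is a sketch only: `preview_list` is a hypothetical name, and the URL path in the sample line is made up.

```shell
# preview_list [FILE...]
# Dry run: print the URL and the target filename for each line,
# without calling wget or mv. Reads stdin when no file is given.
preview_list() {
    awk '{
        url = $NF                # the last field is the URL
        $NF = ""                 # drop it from the line
        gsub(" ", "", $0)        # squeeze the title into one word
        print url " -> " $0 ".pdf"
    }' "$@"
}

# Sample line with a made-up URL path.
printf '%s\n' 'Some Article Title http://www.springerlink.com/content/abc123/fulltext.pdf' | preview_list
```

For the sample line this prints the URL followed by `-> SomeArticleTitle.pdf`, which is the name the downloaded file would end up with.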

sed 's;\(.*\) \(.*\)\(\.[^.]*\)$;wget --user-agent=1337 "\2\3" -O "\1\3";' mylist.txt | \
   bash -

Sed does a much better job of grabbing the arguments, and utilising the -O parameter of wget is a far easier way to achieve the rename. Job done.
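One caveat with piping generated text into bash is that a title containing quotes or a dollar sign gets interpreted by the shell. A plain read loop sidesteps that by never turning the title into shell source. This is a different technique from the sed one-liner, offered as a sketch: `fetch_all` is a hypothetical name, and it assumes each line has a space before the URL and that the URL ends in a file extension.

```shell
# fetch_all LIST [COMMAND...]
# Reads "[title] [url]" lines from LIST and runs
#   COMMAND URL -O "TITLE.EXT"
# for each one. COMMAND defaults to wget with the dummy user agent
# used earlier in the post.
fetch_all() {
    list=$1
    shift
    [ "$#" -gt 0 ] || set -- wget --user-agent=1337
    while IFS= read -r line; do
        [ -n "$line" ] || continue      # skip blank lines
        url=${line##* }                 # everything after the last space
        title=${line% *}                # everything before the last space
        ext=${url##*.}                  # extension taken from the URL
        # Keep the command from reading the rest of the list on stdin.
        "$@" "$url" -O "$title.$ext" < /dev/null
    done < "$list"
}

# Dry run: substitute echo for wget to see what would be executed.
# fetch_all mylist.txt echo
```

Unlike the awk version, this keeps the spaces in the title, which matches what the sed version does.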

Discussion

  • Dave Concannon 2 years, 6 months ago

    Nice approach, I'd usually use BeautifulSoup and a similar command line pipe for this sort of thing, but this is way less hassle.

  • Steve Challis 2 years, 6 months ago

    Dave, yeah, BeautifulSoup would be a good way to do it too. I've actually been looking for excuses to use standard Unix tools and have been surprised how damn useful they really are.

  • a 2 years, 6 months ago

    downthemall - a firefox plugin.

Unless otherwise noted, everything here is available under the Creative Commons Attribution-Share Alike 3.0 license. Sharing is fucking cool.
