Journal

By Steve Challis

Incremental Backups Using rsync

2 years, 6 months ago — 0 Comments — Permalink

  • backups
  • unix
  • rsync
  • hard-links

We can combine rsync and cp -al to create what appear to be multiple full backups of a filesystem without taking multiple disks’ worth of space.

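This is a neat and efficient way to back up data. It relies on the fact that rsync, by default, never overwrites a changed file in place: it writes the new version to a temporary file and renames it over the old one. Any hard links to the old inode, such as those in earlier snapshots, keep the old contents.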
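As a rough sketch of the idea, assuming GNU cp (for -l) and rsync are installed: the snapshot function and the backup.0/backup.1 directory names below are made up for illustration and are not taken from the full article.

```shell
# Hypothetical helper: keep two snapshots, backup.0 (newest) and backup.1 (previous).
# Assumes GNU cp and rsync; names and layout are illustrative only.
snapshot() {
    src=$1
    dest=$2
    mkdir -p "$dest"
    if [ -d "$dest/backup.0" ]; then
        rm -rf "$dest/backup.1"
        # Hard-link copy: a new directory tree whose files share inodes
        # with backup.0, so it takes almost no extra space.
        cp -al "$dest/backup.0" "$dest/backup.1"
    fi
    mkdir -p "$dest/backup.0"
    # rsync replaces each changed file with a freshly written one, so the
    # links in backup.1 still point at the old contents.
    rsync -a --delete "$src/" "$dest/backup.0/"
}
```

Calling snapshot /home/me /mnt/backup twice (both paths are placeholders) should leave two directories that each look like a full backup, while unchanged files are stored only once.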

The Lazy Way To Download Journal Articles

2 years, 6 months ago — 3 Comments — Permalink

  • awk
  • jquery
  • script
  • unix
  • wget

So imagine you have a load of PDFs you’d like to download from a website, but they all have the same filename, renaming them takes a while, and clicking on each one is getting very tiring. Yep, it’s time for a script …

It turns out Springerlink are offering a bunch of such articles for download on their website and I wanted to browse them all locally. I started out by retrieving a list of titles and urls in the format ‘[title] [url]\n’ using jQuery and the following snippet of code:

$('.journalArticle').each(function(index, object) {
    // The leading space separates the title from the url in each output line.
    var base = ' http://www.springerlink.com';
    var title = $(this).find('.title a').text();
    var url = $(this).find('.pdf a').attr('href');
    console.log(title + base + url);
});

Each article is wrapped in a div with class="journalArticle" so we can just iterate through those and pull out the title and url (both of which are also conveniently marked with ‘title’ and ‘pdf’ classes respectively). The resulting list can be copied into a text file. Now all we need is a script to run through this file and do the downloading:

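awk '{
    # The last field is the url; wget saves it as fulltext.pdf.
    system("wget --user-agent=1337 \"" $NF "\"");
    # Drop the url field so only the title is left.
    $NF=""; NF--;
    # Strip spaces from the title to get a filename.
    gsub(" ","",$0);
    print $0;
    system("mv fulltext.pdf \"" $0 ".pdf\"")
}' mylist.txt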

Springerlink reject requests without a user agent header, so I’ve added a false one, which they accept nicely. The script calls wget on the last field of every line (the url), then removes the spaces from the rest of the line (the title) and renames the downloaded file to that. It’s pretty horrible, so let’s have another go:

cat mylist.txt | \
   sed 's;\(.*\) \(.*\)\(\.[^.]*\)$;wget --user-agent=1337 "\2\3" -O "\1\3";' | \
   bash -

Sed does a much better job of grabbing the arguments, and utilising the -O parameter of wget is a far easier way to achieve the rename. Job done.
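For comparison, here is a sketch of the same split done with plain shell parameter expansion instead of sed. The download_list name is made up for illustration, and it assumes every line ends in a url with a file extension and that titles contain no slashes.

```shell
# Hypothetical helper: read "[title] [url]" lines and fetch each url
# straight into a file named after the title.
download_list() {
    list=$1
    while IFS= read -r line; do
        [ -n "$line" ] || continue      # skip blank lines
        url=${line##* }                 # everything after the last space
        title=${line% *}                # everything before the last space
        ext=${url##*.}                  # extension taken from the url
        # Strip spaces from the title, as the awk version does.
        name=$(printf '%s' "$title" | tr -d ' ')
        wget --user-agent=1337 -O "$name.$ext" "$url"
    done < "$list"
}
```

Running download_list mylist.txt should behave like the sed pipeline, except that nothing from the list is ever passed to bash as code, which is a bit safer if a title contains quotes or other shell characters.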


Powered by Mumblr – a basic Django tumblelog application that uses MongoDB with MongoEngine. Fork it on Github. Designed and developed by Harry Marr and Steve Challis.

Unless otherwise noted, everything here is available under the Creative Commons Attribution-Share Alike 3.0 license. Sharing is fucking cool.
