2 years, 6 months ago
So imagine you have a load of PDFs you’d like to download from a website, but they all have the same filename and it takes you a while to rename them, plus it’s getting very tiring clicking on them all. Yep, it’s time for a script …
It turns out SpringerLink are offering a bunch of such articles for download on their website and I wanted to browse them all locally. I started out by retrieving a list of titles and URLs in the format ‘[title] [url]\n’ using jQuery and the following snippet of code:
$('.journalArticle').each(function(index, object) {
    var base = 'http://www.springerlink.com';
    var title = $(this).find('.title a').text();
    var url = $(this).find('.pdf a').attr('href');
    // Emit one "[title] [url]" line per article
    console.log(title + ' ' + base + url);
});
Each article is wrapped in a div with class="journalArticle" so we can just iterate through those and pull out the title and url (both of which are also conveniently marked with ‘title’ and ‘pdf’ classes respectively). The resulting list can be copied into a text file. Now all we need is a script to run through this file and do the downloading:
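awk '{
    # Fetch the URL (the last field); the file arrives as fulltext.pdf
    system("wget --user-agent=1337 \"" $NF "\"");
    # Drop the URL, then strip spaces from what is left (the title)
    $NF=""; NF--;
    gsub(" ","",$0);
    print $0;
    system("mv fulltext.pdf \"" $0 ".pdf\"")
}' mylist.txt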
SpringerLink reject requests without a user agent header, so I’ve added in a fake one, which they accept nicely. The script calls wget on the last part of every line (the URL), removes the spaces from the remaining part of the line (the title), and moves the downloaded file to a file with this name. It’s pretty horrible, so let’s have another go:
cat mylist.txt | \
    sed 's;\(.*\) \(.*\)\(\.[^.]*\)$;wget --user-agent=1337 "\2\3" -O "\1\3";' | \
    bash -
Sed does a much better job of grabbing the arguments, and utilising the -O parameter of wget is a far easier way to achieve the rename. Job done.
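For comparison, here is a rough Python sketch of the same job, assuming the list file uses the ‘[title] [url]’ format described above. The function names (parse_line, download, fetch_all) are made up for illustration, and the user agent string is simply the same placeholder passed to wget earlier.

```python
import os
import urllib.request
from urllib.parse import urlparse


def parse_line(line):
    """Split a '[title] [url]' line into (url, output filename)."""
    # The URL contains no spaces, so it is everything after the last space
    title, url = line.rstrip('\n').rsplit(' ', 1)
    # Reuse the extension of the remote file (e.g. '.pdf') for the local name
    ext = os.path.splitext(urlparse(url).path)[1]
    return url, title + ext


def download(url, filename, user_agent='1337'):
    """Fetch url with an explicit User-Agent header and save it as filename."""
    request = urllib.request.Request(url, headers={'User-Agent': user_agent})
    with urllib.request.urlopen(request) as response, open(filename, 'wb') as out:
        out.write(response.read())


def fetch_all(list_path='mylist.txt'):
    """Download every article listed in list_path, skipping blank lines."""
    with open(list_path) as handle:
        for line in handle:
            if line.strip():
                download(*parse_line(line))
```

Calling fetch_all() does the downloading. Splitting on the last space plays the same role as the greedy first group in the sed expression, and a title containing a slash would still need sanitising before it is used as a filename.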
2 years, 6 months ago
I was recently presented with the problem of adding consistent prefixes to paragraphs of text whilst respecting the text width setting. For example, when responding to email you may wish to prefix text with the author’s initials, like so:
SC> Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do
SC> eiusmod tempor incididunt ut labore et dolore magna aliqua.
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum
dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non.
As a Vim user, I searched for a way to do this automatically and the best I came up with was to set Vim to use the GNU fmt program (available in the coreutils package for us Mac users stuck with lesser BSD equivalents) e.g. :se formatprg=fmt\ -p\ SC\>. The downside to this is having to specify the prefix manually. Emacs has the ability to automatically set the fill prefix and use that prefix to format paragraphs. Since you can write Vim plugins in Python, I came up with the following:
function! FillPrefix()
py << EOF
import vim
cur_line = vim.current.line
if not cur_line:
    prefix = ''  # Unset prefix if called on empty line
else:
    prefix = 'n:' + cur_line.split(' ')[0]
vim.command('set comments=%s' % prefix)
EOF
endfunction
This takes the first word from the line you called it on and sets that to the prefix. Now when you hit gq on a line with that prefix, it’ll be wrapped with a consistent prefix and text width. My colleague and fellow workflow enthusiast, Affan, came up with a more Emacs-like alternative which uses the cursor position to determine the prefix:
function! SetQuotePrefixFromCursor()
python << EOF
import vim
cursor_col = vim.current.window.cursor[1]
quote_prefix = vim.current.line[:cursor_col]
if quote_prefix:
    set_cmd = 'set comments=n:%s' % quote_prefix
else:
    set_cmd = 'set comments='  # Cancel quoting prefix
vim.command(set_cmd)
EOF
endfunction
I saved this as fillprefix.py in my .vim folder and bound the <leader>fp shortcut to it:
nmap <silent> <leader>fp :call SetQuotePrefixFromCursor()<CR>
As a side note, autocompletion (C-x C-o) came in handy when exploring the Vim Python module, although it oddly requires Emacs-style C-n and C-p to navigate.
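To see what a prefix-aware reflow amounts to outside the editor, here is a small standalone sketch using Python’s textwrap module. It is not how Vim implements gq; prefix_from_cursor and refill are hypothetical helper names, with the first mirroring the slice taken in the cursor-based function above.

```python
import textwrap


def prefix_from_cursor(line, cursor_col):
    """Return the text to the left of the cursor column."""
    return line[:cursor_col]


def refill(lines, prefix, width=72):
    """Strip prefix from each line, then rewrap with the prefix restored."""
    stripped = [l[len(prefix):] if l.startswith(prefix) else l for l in lines]
    # Collapse runs of whitespace so the paragraph rewraps cleanly
    body = ' '.join(' '.join(stripped).split())
    # The indent arguments count towards width, so no line exceeds it
    return textwrap.fill(body, width=width,
                         initial_indent=prefix, subsequent_indent=prefix)


quoted = [
    'SC> Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod',
    'SC> tempor incididunt ut labore et dolore magna aliqua.',
]
prefix = prefix_from_cursor(quoted[0], 4)
print(refill(quoted, prefix, width=40))
```

Every output line starts with the prefix and stays within the requested width, which is the behaviour the comments setting gives you in Vim.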
2 years, 6 months ago
Apparently the US Yellow Pages industry is a $15bn industry - 66% more than Hollywood! Fascinating.
Humorous comments on this article too …
2 years, 6 months ago
Interesting performance stats, and 12 billion documents is a fairly substantial amount of data.
This is the first time I’ve seen major use of GridFS too:
We now store our audio files in MongoDB’s GridFS. Previously we used a clustered file system so files could be read and written from multiple servers. This created a huge amount of complexity from the IT operations point of view, and it meant that system backups (database + audio data) could get out of sync. Now that they’re in Mongo, we can reach them anywhere in the data center with the same mongo driver, and backups are consistent across the system.
2 years, 7 months ago
Unfortunately this post is not an absurd nerdy joke about three buzzwords in a public house. Although I could tell one about sin(x), cos(x), and e^x …
MongoEngine v0.4
So, MongoEngine 0.4 has just been released and I’d like to write a bit about one new feature in particular. The latest and greatest release of MongoEngine includes support for MongoDB’s GridFS storage engine. GridFS is an exciting technology which allows the storage of files directly within a MongoDB database. This means your files get the benefits of replication and sharding, just like the rest of your data. Once the files are in the database, serving them via Nginx is simple with Mike Dirolf’s nginx-gridfs module.
This functionality is implemented as a field called FileField. This new field behaves as a file-like object, allowing for natural use with other Python code. Reading and writing data to this new field is as easy as reading and writing from a regular file:
from datetime import datetime
from mongoengine import Document, StringField, DateTimeField, FileField
class Painting(Document):
    artist = StringField()
    date = DateTimeField(default=datetime.now)
    photo = FileField()
    thumb = FileField()
my_painting = Painting(artist='Steve')
# Images are binary, so open them in 'rb' mode
my_painting.photo = open('my_painting.jpg', 'rb')
my_painting.thumb = open('my_painting_thumbnail.jpg', 'rb').read()
my_painting.save()
The great thing about these FileFields is that since they are file-like objects, we can pass them to (almost) anything that accepts files. In the above example I used a separate image as the thumbnail. I could actually just generate the thumbnail from the original photo using PIL and save directly to another FileField:
from PIL import Image
# Open and save original image (binary mode)
filename = 'my_painting.jpg'
photo = open(filename, 'rb')
my_painting.photo = photo
# Use original image to create thumbnail
pil_image = Image.open(filename)
# Image.LANCZOS is the current name for the older Image.ANTIALIAS filter
pil_image.thumbnail((80, 80), Image.LANCZOS)
# Stream new thumbnail into the thumb FileField
my_painting.thumb.new_file()
pil_image.save(my_painting.thumb, 'jpeg', quality=85)
my_painting.thumb.close()
my_painting.save()
# Remove the stored photo from GridFS
my_painting.photo.delete()
Deletion of files is just as you would expect. The delete() method can be called on a FileField to remove any stored object. It is important to note that the FileField actually only stores the ID of a file in a separate GridFS collection. This means that deleting a document with a defined FileField does not actually delete the file. You must be careful to delete any files in a document as above before deleting the document itself.
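One way to make that ordering hard to forget is a small helper that deletes the files first and the document second. This is only a sketch: delete_with_files is a hypothetical name, not part of MongoEngine, and it assumes nothing beyond the delete() methods on the FileField and on the document itself.

```python
def delete_with_files(document, file_fields):
    """Delete each stored file, then the document that referenced them."""
    for name in file_fields:
        # Each FileField only holds a reference, so remove the file itself
        getattr(document, name).delete()
    # With the files gone, the document can be removed safely
    document.delete()
```

For the painting above this would be called as delete_with_files(my_painting, ['photo', 'thumb']).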
The FileField also allows for storage of arbitrary metadata such as content_type or filename. The put() method allows for metadata to be stored using the same call as the file:
# Storage
my_painting.photo.put(photo, filename=filename,
                      content_type='image/jpeg')
# Retrieval (avoid naming a variable 'type', which shadows the builtin)
content_type = my_painting.photo.content_type
name = my_painting.photo.filename
Files can be replaced with the replace() method. This works just like the put() method so even metadata can (and should) be replaced:
another_painting = open('another_painting.png', 'rb')
my_painting.photo.replace(another_painting, content_type='image/png')
Integration with Django
Since many people will be using this functionality with Django, it was a natural extension to complement the FileField with a custom storage backend. It’s called GridFSStorage and works like this:
from mongoengine.django.storage import GridFSStorage
# Create a GridFS based filesystem
fs = GridFSStorage()
# Attempt to save a new file called hello.txt
filename = fs.save('hello.txt', 'Hello, World!')
Just like the default Django storage backends, the save() method will try to save your file with the specified filename, and if it can’t, the file will be saved under a different name, which is returned. For this reason, it is important to keep the returned filename and use it to refer to the saved file later on.
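The contract is easier to see with a toy stand-in. The class below is an in-memory illustration only, not GridFSStorage, and its renaming policy (appending _1, _2, … before the extension) is an assumption made for the example; the real backend may choose names differently.

```python
import os


class ToyStorage:
    """In-memory stand-in for a storage backend (illustration only)."""

    def __init__(self):
        self._files = {}

    def exists(self, name):
        return name in self._files

    def save(self, name, content):
        """Store content under name, or under a free variant of it."""
        root, ext = os.path.splitext(name)
        candidate, attempt = name, 0
        # Assumed policy: try name_1.ext, name_2.ext, ... until one is free
        while self.exists(candidate):
            attempt += 1
            candidate = '%s_%d%s' % (root, attempt, ext)
        self._files[candidate] = content
        return candidate

    def read(self, name):
        return self._files[name]


storage = ToyStorage()
first = storage.save('hello.txt', 'Hello, World!')
second = storage.save('hello.txt', 'Hello again!')
```

Here the second call cannot use hello.txt, so only the returned name points at the new content. Code that assumes the requested name would read the first file instead.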
GridFSStorage implements all of the current relevant calls in the Django File Storage API.
>>> fs.exists('hello.txt')
True
>>> fs.open('hello.txt').read()
'Hello, World!'
>>> fs.size('hello.txt')
13
>>> fs.url('hello.txt')
'http://your_media_url/hello.txt'
>>> fs.open('hello.txt').name
'hello.txt'
>>> fs.listdir()
([], [u'hello.txt'])
Serving GridFS files
So once you’ve got your files into MongoDB you’ll likely want to get them back out again as quickly as possible. There are a number of ways to do this but the simplest is to use Nginx with the nginx-gridfs module. Like all Nginx modules, this must be compiled in when Nginx is built. A simple configuration to serve files from the paintings_db database, looked up by filename, would go something like this:
location /gridfs/ {
    gridfs paintings_db field=filename type=string;
}
There are several benchmarks floating around that compare the different methods for serving GridFS files, but it really comes down to balancing simplicity against speed, and for most purposes I think the above will do nicely. If you have anything that is getting particularly high traffic then you’ll want to look into offloading some of the work to a dedicated CDN anyway.
Documentation
That’s pretty much all there is to the GridFS functionality in MongoEngine so far. You’ll find the documentation over at mongoengine.org and as always, the code is on Github for you to hack around with. I look forward to hearing what you do with this and to the improvements that will inevitably be submitted.
MongoEngine 0.4 also includes a bunch of other good stuff including a completely rewritten Q-objects implementation, geospatial support, and new queryset operators.
It’s on PyPI so you can upgrade with pip install -U mongoengine and try out some of these new features right now. Get it while it’s hot!