Tuning a text analyser
Feb. 12th, 2010 03:44 pm

A few months ago I wrote a Python script to count the words, sentences, paragraphs, and word frequencies in my MA thesis. Having written it in something of a hurry, I discovered when I returned to it yesterday that it wasn't terribly efficient and didn't scale well when run against larger samples of text. So I decided to take a look with the profiler to see if I couldn't improve things.
I took the raw program and profiled it using the largest chunk of text I had to hand, George Eliot's Middlemarch, as it happens. I noticed almost immediately that the program spent most of its time in the routine that updated the word frequency hash, and that this routine spent most of its time running a series of regular expressions to clean non-alphanumeric characters off each word before updating the hash. Suspecting that this wasn't very efficient, I decided to benchmark a few alternatives.
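For the curious, a minimal profiling run with the standard cProfile and pstats modules looks something like this (wordcount.py and the output file name are stand-ins, not the script's real names):

import pstats

# The profiling run itself happens in the shell, along the lines of:
#     python -m cProfile -o wordcount.prof wordcount.py middlemarch.txt

# Load the saved statistics and list the most expensive calls,
# sorted by cumulative time spent in each function.
stats = pstats.Stats("wordcount.prof")
stats.sort_stats("cumulative").print_stats(20)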
My original code looked something like the following:
import re

def case1(word):
    # Strip non-word characters from the start and then the end of the word.
    word = re.sub(r"^\W*", "", word)
    word = re.sub(r"\W*$", "", word)
    return word
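The cleanup itself ran inside the routine that updates the frequency hash. That routine isn't shown here, but a minimal sketch of the sort of loop involved, with illustrative names, looks like this:

def update_frequencies(freq, words):
    # Clean each word and bump its count in the frequency hash.
    for word in words:
        word = case1(word.lower())
        if word:
            freq[word] = freq.get(word, 0) + 1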
So I decided to replace the pair of substitutions with a single one:
def case2(word):
    return re.sub(r"^\W*(\w+?)\W*$", r"\1", word)
But I discovered that this decreased performance by 25 per cent. So I decided to try a simple regexp match instead:
def case3(word):
    m = re.match(r"^\W*(\w+?)\W*$", word)
    if m:
        return m.group(1)
    # Implicitly returns None when the word contains no word characters at all.
This gave a 160 per cent improvement on the original. But I realised I could do still better if I abandoned regexps altogether and replaced them with some plain string processing:
def case4(word):
    # strip() only removes the characters listed, so this is not quite the
    # same as \W*, but close enough for word counting.
    return word.strip(",.?!:;'\"")
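A note on the numbers: the original benchmark harness isn't shown here, but a minimal timeit comparison along these lines, assuming the four functions above live in the same script, is one way to get comparable figures (the sample words are made up):

import timeit

# A handful of made-up tokens of the sort the splitter produces.
words = ["middlemarch,", "(provincial", "life.", "dorothea", "brooke;", "--"]

for name in ("case1", "case2", "case3", "case4"):
    # Time each cleanup function over the same sample of words.
    seconds = timeit.timeit(
        stmt="for w in words: %s(w)" % name,
        setup="from __main__ import %s, words" % name,
        number=100000,
    )
    print("%s: %.3f seconds" % (name, seconds))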
Sure enough, this gave a 1200 per cent improvement over the original re.sub() code, which reduced the cost of the function to the point where a re.split() in the caller became one of the dominant costs. Once this was replaced with an almost equivalent set of string splits (sketched below), I found that the elapsed time of the program had been more than halved and that the problem had moved from being CPU bound to being I/O bound.

The moral of the story, then, is that although regular expressions are often useful, they can significantly limit performance and should be avoided in code where speed is likely to matter.
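For reference, the re.split() change mentioned above was essentially a move from a regular-expression split to the built-in string method. The actual pattern isn't shown here, so the whitespace split below is an assumption, and the function names are illustrative:

import re

def words_regex(line):
    # Assumed original approach: split each line on runs of whitespace.
    return re.split(r"\s+", line.strip())

def words_plain(line):
    # Replacement: str.split() with no argument already splits on runs of
    # whitespace and discards leading and trailing blanks, so it is almost
    # equivalent to the regexp version but much cheaper.
    return line.split()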