Tuning a text analyser
Feb. 12th, 2010 03:44 pm

I took the raw program and profiled it using the largest chunk of text I had to hand, George Eliot's Middlemarch, as it happens. I noticed almost immediately that the program spent most of its time in the routine that updated the word frequency hash, and that this routine in turn spent most of its time running a series of regular expressions to clean off any non-alphanumeric characters before updating the hash. Suspecting that this wasn't very efficient, I decided to benchmark a few alternatives.
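One way to get this sort of breakdown is a cProfile run like the following; the analyse() entry point and the file name are placeholders rather than the actual program:

import cProfile
import pstats

# analyse() and 'middlemarch.txt' stand in for the real entry point and corpus
cProfile.run("analyse('middlemarch.txt')", "analyse.prof")
stats = pstats.Stats("analyse.prof")
stats.sort_stats("cumulative").print_stats(10)   # show the ten most expensive calls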
My original code looked something like the following:
import re

def case1(word):
    # strip any leading and trailing non-word characters
    word = re.sub("^\\W*", "", word)
    word = re.sub("\\W*$", "", word)
    return word
So I decided to replace the pair of substitutions with a single one:
def case2(word):
    return re.sub("^\\W*(\\w+?)\\W*$", "\\1", word)
But I discovered that this decreased performance by 25 per cent. So I decided to try a simple regexp match instead:
def case3(word):
    m = re.match("^\\W*(\\w+?)\\W*$", word)
    if m:
        return m.group(1)
    return word  # fall back to the original word if the pattern doesn't match
This gave a 160 per cent improvement on the original. But I realised I could do still better if I abandoned regexps altogether and replaced them with some plain string processing:
def case4(word):
    return word.strip(",.?!:;'\"")
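A rough timeit harness along the following lines is one way to compare the four variants; the word list and repeat count here are only illustrative:

import timeit

words = ['"Middlemarch,"', "study", "of", "provincial", "life..."]

for name in ("case1", "case2", "case3", "case4"):
    elapsed = timeit.timeit(
        stmt="for w in words: %s(w)" % name,
        setup="from __main__ import words, case1, case2, case3, case4",
        number=100000,
    )
    print("%s: %.3fs" % (name, elapsed))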
Sure enough, the plain string version gave a 1200 per cent improvement over the original re.sub() code, reducing the cost of the function to the point where a re.split() in the caller became one of the dominant costs. Once that was replaced with an almost equivalent set of plain string splits (sketched at the end of this post), I found that the elapsed time of the program had more than halved and the problem had moved from being CPU bound to being IO bound.

The moral of the story, then, is that although regular expressions are often useful, they carry a significant performance cost and should be avoided in code where performance is likely to be important.
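As an aside, the caller's code doesn't appear above, so the re.split() to str.split() change can only be reconstructed; assuming the original split each line on runs of whitespace, it might have looked roughly like this:

import re

line = "It was the best of times, it was the worst of times"

words_regex = re.split("\\s+", line)   # original regex-based split
words_plain = line.split()             # str.split() with no argument also splits on
                                       # runs of whitespace: an almost equivalent
                                       # plain-string replacement
assert words_regex == words_plain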