Tuning a text analyser
Feb. 12th, 2010 03:44 pm

A few months ago I wrote a Python script to count the words, sentences, paragraphs, and word frequencies in my MA thesis. Having written it in something of a hurry, I discovered when I returned to it yesterday that it wasn't terribly efficient and didn't scale well when run against larger samples of text. So I decided to take a look with the profiler to see if I couldn't improve things.
I took the raw program and profiled it using the largest chunk of text I had to hand, George Eliot's Middlemarch, as it happens. I noticed almost immediately that the program spent most of its time in the routine that updated the word frequency hash, and that this routine spent most of its time running a series of regular expressions to clean non-alphanumeric characters off each word before updating the hash. Suspecting that this wasn't very efficient, I decided to benchmark a few alternatives.
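For the curious, a minimal profiling run with the standard cProfile and pstats modules looks something like this (wordcount.py and the output file name are stand-ins, not the script's real names):

import pstats

# The profiling run itself happens in the shell, along the lines of:
#     python -m cProfile -o wordcount.prof wordcount.py middlemarch.txt

# Load the saved statistics and list the most expensive calls,
# sorted by cumulative time spent in each function.
stats = pstats.Stats("wordcount.prof")
stats.sort_stats("cumulative").print_stats(20)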
My original code looked something like the following:
import re

def case1(word):
    # Strip non-word characters from the start and then the end of the word.
    word = re.sub(r"^\W*", "", word)
    word = re.sub(r"\W*$", "", word)
    return word
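The cleanup itself ran inside the routine that updates the frequency hash. That routine isn't shown here, but a minimal sketch of the sort of loop involved, with illustrative names, looks like this:

def update_frequencies(freq, words):
    # Clean each word and bump its count in the frequency hash.
    for word in words:
        word = case1(word.lower())
        if word:
            freq[word] = freq.get(word, 0) + 1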
So I decided to replace the pair of substitutions with a single one:
def case2(word):
    return re.sub(r"^\W*(\w+?)\W*$", r"\1", word)
But I discovered that this decreased performance by 25 per cent. So I decided to try a simple regexp match instead:
def case3(word):
    m = re.match(r"^\W*(\w+?)\W*$", word)
    if m:
        return m.group(1)
    # Implicitly returns None when the word contains no word characters at all.
This gave a 160 per cent improvement on the original. But I realised I could do still better if I abandoned regexps altogether and replaced them with some plain string processing:
def case4(word):
    # strip() only removes the characters listed, so this is not quite the
    # same as \W*, but close enough for word counting.
    return word.strip(",.?!:;'\"")
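A note on the numbers: the original benchmark harness isn't shown here, but a minimal timeit comparison along these lines, assuming the four functions above live in the same script, is one way to get comparable figures (the sample words are made up):

import timeit

# A handful of made-up tokens of the sort the splitter produces.
words = ["middlemarch,", "(provincial", "life.", "dorothea", "brooke;", "--"]

for name in ("case1", "case2", "case3", "case4"):
    # Time each cleanup function over the same sample of words.
    seconds = timeit.timeit(
        stmt="for w in words: %s(w)" % name,
        setup="from __main__ import %s, words" % name,
        number=100000,
    )
    print("%s: %.3f seconds" % (name, seconds))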
Sure enough, this gave a 1200 per cent improvement over the original re.sub() code, which reduced the cost of the function to the point where a re.split() in the caller became one of the dominant costs. Once this was replaced with an almost equivalent set of string splits (sketched below), I found that the elapsed time of the program had been more than halved and that the problem had moved from being CPU bound to being I/O bound.

The moral of the story, then, is that although regular expressions are often useful, they can significantly limit performance and should be avoided in code where speed is likely to matter.
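For reference, the re.split() change mentioned above was essentially a move from a regular-expression split to the built-in string method. The actual pattern isn't shown here, so the whitespace split below is an assumption, and the function names are illustrative:

import re

def words_regex(line):
    # Assumed original approach: split each line on runs of whitespace.
    return re.split(r"\s+", line.strip())

def words_plain(line):
    # Replacement: str.split() with no argument already splits on runs of
    # whitespace and discards leading and trailing blanks, so it is almost
    # equivalent to the regexp version but much cheaper.
    return line.split()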