Speeding up a logfile parser
Oct. 12th, 2012 04:56 pm
Looking into the generally sluggish turnaround of the batch system, I decided that the problem was probably down to long cycle times in the negotiator daemon's dispatch code, and that it was being exacerbated by full debug logging. To prove my hypothesis, I decided to write something to plot out the cycle times before and after changing the debug flags. However, while the script proved easy enough to write, the sizes of the log files meant that the performance simply wasn't good enough. So I sat down with the Python profiler and `timeit` to see if I couldn't hurry things along their way.
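Nothing fancy on the measuring side; roughly this sort of thing, with a hypothetical `parse_log` stand-in for the real parsing function, and an invented log file name and sample line:

```python
import cProfile
import timeit

def parse_log(path):
    # Stand-in for the real parsing function: scan the file line by line.
    with open(path) as handle:
        return sum(1 for line in handle if "Negotiation Cycle" in line)

# Profile one full run to see where the time actually goes,
# sorted by cumulative time (assumes the log file exists locally).
cProfile.run("parse_log('NegotiatorLog')", sort="cumulative")

# Time individual suspect operations in isolation.
line = "10/12/12 16:56:03 ---------- Started Negotiation Cycle ----------"
print(timeit.timeit(lambda: line.endswith("----------"), number=1000000))
print(timeit.timeit(lambda: "Negotiation Cycle" in line, number=1000000))
```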
The profiler showed that the bulk of the time was being spent in the log file parsing function. No great surprise there — the function was crunching through hundreds of gigs of data to pick out the cycle times. What was surprising was the relatively high cost of the `endswith()` string method. And because it was being called for almost every line of input, it was dominating the performance of the program. So, optimisation number one: reorder the string comparisons to try to break out of the matching early, and replace the `endswith()` with a less expensive `in`.
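In outline, the change amounted to something like this (the marker strings and function names are illustrative rather than the actual ones from the script):

```python
# Before: every line was run through a chain of endswith() checks,
# most of which could never match.
def classify_before(line):
    line = line.rstrip()
    if line.endswith("Started Negotiation Cycle ----------"):
        return "start"
    if line.endswith("Finished Negotiation Cycle ----------"):
        return "end"
    return None

# After: a single cheap 'in' test rejects the overwhelmingly common
# case (an ordinary debug line) before anything else runs, and the
# remaining checks also use 'in', which skips the attribute-lookup
# and method-call overhead of endswith().
def classify_after(line):
    if "Negotiation Cycle" not in line:
        return None
    if "Started" in line:
        return "start"
    if "Finished" in line:
        return "end"
    return None
```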
With that done, `dateutil.parser` had become more prominent in the profiler listing. I tried a couple of easy tricks, removing some unnecessary rounding operations and trimming back the length of the date string to be parsed, but extracting the date of each log entry was still far too expensive. So, optimisation number two: replace the generalised `dateutil.parser` routine with a custom bit of code, and incorporate caching to avoid re-parsing the line when the date change is less than the number of significant figures.
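A minimal sketch of the idea, assuming timestamps of the form `MM/DD/YY HH:MM:SS` at the head of each line (the format and names here are illustrative, not the actual ones): because the log only resolves to whole seconds, consecutive lines usually carry an identical timestamp string, so the previous answer can simply be reused.

```python
import calendar

_last_stamp = None  # raw timestamp string from the previous line
_last_value = None  # its already-computed epoch value

def parse_stamp(line):
    """Turn a leading 'MM/DD/YY HH:MM:SS' stamp into epoch seconds,
    reusing the previous result when the stamp hasn't changed."""
    global _last_stamp, _last_value
    stamp = line[:17]
    if stamp == _last_stamp:
        return _last_value
    # Fixed-position slicing: far cheaper than dateutil's general parser.
    value = calendar.timegm((
        2000 + int(stamp[6:8]),   # year
        int(stamp[0:2]),          # month
        int(stamp[3:5]),          # day
        int(stamp[9:11]),         # hour
        int(stamp[12:14]),        # minute
        int(stamp[15:17]),        # second
        0, 0, 0))
    _last_stamp, _last_value = stamp, value
    return value
```

The cache here is just the last line's stamp: since log lines arrive in time order, a one-entry cache catches nearly every repeat.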
As a result of the changes, the time required to run my testcase dropped to around an eighth of its original value, bringing the time taken to analyse a complete set of logs down to a mere 700 seconds. Then, when I finally plotted out the distilled data, I noticed that as soon as I disabled full debugging the cycle time dropped from around 20-30 seconds to 3-4 seconds and stayed stable. Excelsior!