Fast turnaround of fixes...
Oct. 9th, 2017 04:50 pm
Successful morning spent debugging a series of problems that we've been working on for at least the last three years. Every time we run a test, we get slightly closer to a solution; but every time we try, we uncover yet another glitch, making our approach to a solution somewhat asymptotic.
After carefully enabling debugging — something which, when switched on across the board, generates so much data that it causes everything to grind to a halt in a matter of minutes — and tracing a job, I noticed that the start of the job coincided with a periodic cleanup event. Checking the source code, I noticed the cleanup was using a negative match to determine what to remove, confirming my suspicion that either a race condition or a type mismatch was to blame.
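For anyone wondering why a negative match made me suspicious, here is a minimal sketch in Python. This is not the real code, which I can't reproduce here; the function name `stale_entries`, the job ids, and the data structures are all made up for illustration. It only shows how "remove everything that isn't known to be active" can go wrong in the two ways I suspected.

```python
def stale_entries(on_disk, active_jobs):
    """Negative match: anything NOT listed as active is marked for removal.

    Hypothetical stand-in for the cleanup logic, not the real implementation.
    """
    return [name for name in on_disk if name not in active_jobs]


# Race condition: job "42" has created its working directory but has not yet
# been registered as active when the periodic cleanup fires.
on_disk = ["41", "42"]
active_jobs = {"41"}
race_victims = stale_entries(on_disk, active_jobs)

# Type mismatch: the active list holds integer ids while the directory names
# are strings, so nothing ever matches and everything looks stale.
active_jobs_as_ints = {41, 42}
mismatch_victims = stale_entries(on_disk, active_jobs_as_ints)

print(race_victims)      # ['42']
print(mismatch_victims)  # ['41', '42']
```

A positive match (remove only what is explicitly known to be finished) would fail safe in both cases, which is presumably why the negative form stood out when I read the source.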
The person I was working with was in a chat session with the developer. They mentioned that we thought the failure was caused by a race with the periodic event but didn't provide any further details. Within seconds, they'd got a reply which effectively restated our hypothesis. Then, a few seconds after that, they got another message from the developer saying that they were in the process of producing a fix and could we please send the logs for confirmation.
It took longer to work out how to transfer the logs than it took to develop a first cut fix for the problem...