Blame yourself first

I periodically helped and consulted adjacent teams working on enterprise applications for a big international corporation. One day I was asked to help with a strange situation.

The team told me their story. They had re-checked themselves many times and were sure their code was correct, yet it did not work as expected.

I plunged into the analysis of the related code. It looked fine, and the stereotypical statement was floating in the air: "Our code is correct, the computer is just wrong".

I reminded the team of our frequently used rule: "Blame yourself first". It means that if we have not seen a technical issue before, it was most likely introduced by our recent change. So I began to narrow the scope by simplifying the code, removing the parts that were not necessary for the investigation, and testing various hypotheses along the way.
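The narrowing-down step can be sketched roughly as a greedy reduction loop: drop one part at a time and keep the smaller version whenever the problem still reproduces. This is only an illustration of the idea, not the code used in the investigation; the step names and the `still_fails` check are hypothetical stand-ins for "run the simplified code and see whether the issue is still there".

```python
from typing import Callable, List, Sequence


def minimize(parts: Sequence[str], still_fails: Callable[[List[str]], bool]) -> List[str]:
    """Greedily remove parts for as long as the failure keeps reproducing."""
    current = list(parts)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            # Try the scenario without part i.
            candidate = current[:i] + current[i + 1:]
            if still_fails(candidate):
                # The part was irrelevant to the failure: keep the smaller version.
                current = candidate
                changed = True
                break
    return current


# Hypothetical check: assume the failure needs only these two steps to show up.
def still_fails(parts: List[str]) -> bool:
    return "lock_rows" in parts and "update_totals" in parts


# Hypothetical list of steps in the original, non-simplified scenario.
steps = ["load_config", "lock_rows", "send_email", "update_totals", "write_audit_log"]
reduced = minimize(steps, still_fails)
print(reduced)
```

What remains at the end is the smallest scenario this loop can find that still shows the problem, which is the point where it becomes reasonable to ask whether the fault lies in our code at all.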

Once I reached the simplest state of the code, it was obvious that the code was fine and that this was the only way we could express our logic. Hence, I put the rule aside and started to think about the application’s environment and the third-party software involved.

Trust, but verify?

Normally, we treat existing software as a bug-free, perfectly working masterpiece. It is a kind of axiom set in software development, the foundation we stand on. It would be insane otherwise. But I had reached the point where it was time to assess the parts outside of our code. And the first suspect in this case was the MySQL database.

I prepared a very simple SQL scenario, based purely on the official mysql CLI and free of any business specifics and piles of existing code, so that it verified the behavior of MySQL itself. And, fortunately or unfortunately, it clearly showed a defect in the MySQL database: it produced a deadlock where, according to its specification and the SQL standard, it must not.

Fortunately or unfortunately?

I like to joke that we should be thankful when an issue is created by us. The idea comes from my experience, and the reasoning is this: even if we would prefer to blame others instead of ourselves, we are still the ones assigned to make things work. And it’s much easier and cheaper to fix our own code, which is well understood, than to scratch our heads over foreign code. It may be even worse when the code is not available, or when we expect some open source project to fix itself and release a new version overnight, while our deadlines are not their problem. I have seen real examples in my practice where a project brought in a custom Java runtime, a patched Spring, and other unwanted complexities to maintain because the business could not wait for the mainstream to make its move.

I submitted the bug report, but the team could not wait for the official fix and had to rethink the code to work around the defect. This somewhat degraded application performance, but fortunately it’s an enterprise app, not a worldwide B2C one, i.e. it’s more important to get the numbers right than to deliver content instantly to millions of users online. Besides, the project was planning to migrate to Oracle DB, so there was no need to invest many resources into MySQL.

Other "we"

Other teams, which work on our dependencies like databases and various libraries, also face the "Blame yourself first" rule. They are also humans and may introduce bugs. From a philosophical point of view, we may dig deeper, down to operating systems and hardware, which are made by humans too. Probably the only thing we can treat as bug-free is Nature itself, where Physics is the last layer to blame (hello bit flips and ECC memory), and only because we do not understand the world completely yet (if that’s even possible).

Conclusions

It is good practice to track the recent changes first, but sometimes we get stuck on a single hypothesis for too long and dig too deep in the wrong direction. We should remember to step back early enough and revise our attack vectors, especially if the deadlines are tight. And it’s okay to cast crazy ideas like "What if the DB fails, or even the OS?". Pragmatism and the particular context will guide us.

 
 

Copyright © Igor Ostapenko
(handmade content)