I recently came across this very interesting presentation and article about unit testing for distributed systems and felt they were worth sharing.
From the technical article...
almost all (92%) of the catastrophic system failures are the result of incorrect handling of non-fatal errors explicitly signaled in software.
and this one...
in 58% of the catastrophic failures, the underlying faults could easily have been detected through simple testing of error handling code.
This sentence really caught my attention...
In fact, in 35% of the catastrophic failures, the faults in the error handling code fall into three trivial patterns: (i) the error handler is simply empty or only contains a log printing statement, (ii) the error handler aborts the cluster on an overly-general exception, and (iii) the error handler contains expressions like “FIXME” or “TODO” in the comments.
If you are working on distributed systems and are wondering where to put effort in automated testing, it might be worth grabbing a coffee and spending some time on this.
At a minimum, it might get you thinking about scanning your code for the words FIXME or TODO in catch blocks and putting an end to that!
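As a starting point for that kind of scan, here is a minimal sketch in Python. It is my own illustration and not something taken from the article: it assumes Java-style source, and the names `catch_blocks`, `suspicious_handlers` and `SAMPLE` are made up for the example. It flags catch blocks that are empty or that mention TODO or FIXME. It does not try to spot handlers that only log, and the naive brace counting can be fooled by braces inside strings or comments, so treat the results as hints rather than a verdict.

```python
import re

# Start of a Java-style catch clause, up to and including its opening brace.
CATCH_RE = re.compile(r"\bcatch\s*\([^)]*\)\s*\{")
# TODO / FIXME markers as whole words.
MARKER_RE = re.compile(r"\b(?:TODO|FIXME)\b")
# Line comments and block comments, so a comment-only handler counts as empty.
COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)


def catch_blocks(source):
    """Yield (line_number, body) for each catch block found in source."""
    for match in CATCH_RE.finditer(source):
        depth = 1
        pos = match.end()
        # Walk forward counting braces until the catch block closes.
        while pos < len(source) and depth > 0:
            if source[pos] == "{":
                depth += 1
            elif source[pos] == "}":
                depth -= 1
            pos += 1
        if depth != 0:
            continue  # unbalanced braces, skip this match
        line_number = source.count("\n", 0, match.start()) + 1
        yield line_number, source[match.end():pos - 1]


def suspicious_handlers(source):
    """Return (line_number, reason) for handlers that look neglected."""
    findings = []
    for line_number, body in catch_blocks(source):
        if MARKER_RE.search(body):
            findings.append((line_number, "TODO/FIXME in handler"))
        elif not COMMENT_RE.sub("", body).strip():
            findings.append((line_number, "empty handler"))
    return findings


# Hypothetical snippet used only to exercise the scanner.
SAMPLE = """\
try {
    flush();
} catch (IOException e) {
    // TODO: handle this properly
}
try {
    close();
} catch (Exception e) {
}
try {
    sync();
} catch (IOException e) {
    log.error("sync failed", e);
    throw e;
}
"""

if __name__ == "__main__":
    for line_number, reason in suspicious_handlers(SAMPLE):
        print(f"line {line_number}: {reason}")
```

To run it over a real code base you would read each source file and pass its text to `suspicious_handlers`. A proper parser for your language would be more reliable than regular expressions, but even a rough pass like this can turn up handlers worth a second look.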
There's more that I could add, but it's probably best if you just read this article for yourself.