Sunday, November 2, 2014

Fault Intolerance

I recently read an article titled "Fault Intolerance" by Gerard Holzmann in the November issue of the IEEE magazine "Software" - the premise of the article is that rather than software that merely tolerates faults, we need to create software that is INtolerant of faults and aggressively defends itself against them.

On the way to his premise, he spoke about the push within academia to create software that can be formally proven to satisfy the requirements of a formal specification.  Now, I've always been a bit skeptical of the practicality of this objective, but Gerard gave me a bit of ammunition with the following observation:
"The most difficult part of a formal specification, and the part that's most often incomplete, is to state clearly what should happen under various error conditions... Especially for complex systems, it can be very hard to forsee all possible off-normal events that a system might encounter in real life...Reality can be remarkably creative when it comes to showing where our imagination falls short"  (And I would add end-users to the list...:-)
So, where does that bring us? "Fault intolerance leads to a defensive posture in software design in which nothing is taken for granted."

And finally, he throws down the gauntlet with a standard:
"For well written code, we should expect to see an assertion density of at least 2 percent, meaning that two out of every 100 executable statements are assertions"
The example the author uses throughout the article is software designed to operate semi-autonomous spacecraft such as Cassini (orbiting Saturn) and Voyager (which recently entered interstellar space).  Now, most of us are not creating software for such demanding environments, and we cannot afford the kind of double and triple redundancy that made these craft so reliable (the stamina of Voyager is mind-boggling - it has continued to operate, with no physical modification, for 37 years!  What earth-bound computer is still operating after 37 years?).

There was, however, a principle that I'd like to highlight and use as a metaphor: most of these systems are designed so that if a truly unexpected or unhandleable condition occurs, they revert to a safe mode.  Generally, this means making sure the solar cells are pointed at the sun so the batteries can be charged, and the antenna is pointed at Earth so the craft can receive further instructions.  It is important to note that the system does not just shut down - it continues to operate in a minimal way that allows for recovery.
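As a rough sketch of that "don't just shut down" idea (again in Python, with entirely hypothetical component names), the top level of such a system might look something like this: anything the normal control loop cannot handle drops the craft into a minimal loop that keeps only the recovery-critical functions alive.

    import logging
    import time

    def run_spacecraft(controller):
        """Top-level control loop: revert to safe mode instead of halting.

        'controller' is a hypothetical object with normal_step(),
        point_solar_panels_at_sun(), point_antenna_at_earth(), and
        check_for_ground_instructions() methods.
        """
        try:
            while True:
                controller.normal_step()              # full mission operations
        except Exception as fault:                    # a truly unexpected condition
            logging.critical("Unhandled fault, entering safe mode: %s", fault)
            while True:                               # safe mode: do NOT exit
                controller.point_solar_panels_at_sun()       # keep the batteries charged
                controller.point_antenna_at_earth()          # stay reachable from home
                controller.check_for_ground_instructions()   # wait for someone to help
                time.sleep(60)

The point of the structure is that the exception handler is not an exit ramp - it is a second, smaller program that can run indefinitely while it waits for help.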

So, what does all of this mean to those of us building commercial systems?  First, I will argue that the discipline of fault intolerance, and guidelines such as an assertion density of 2 percent, still apply.  While we may not be dealing with a billion-dollar spacecraft, we are dealing with demanding users who are trying to use our software to get a job done.  And while some of the errors we have to deal with come from those very users, many of them come from random events in the universe (gamma radiation, sunspots, the quirky configuration of an ancient single-sign-on service that interferes with our perfect software).  Our users expect our software to continue to operate, even if we have to go into safe mode to do so.

So, what does safe mode mean for a commercial system?  First, to the extent possible, continue to operate in a degraded mode.  Second, when you detect an extreme, unexpected input or situation, go into safe mode gracefully and safely (e.g., don't write random data to the most critical database in your system - you want to be able to continue to operate after the error condition is corrected).  And third, continue to operate at least enough to tell the user that something is wrong (and, in fact, tell them to contact their system administrator, who might not yet be aware of the problem).
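To make those three points concrete, here is one more sketch (Python again, hypothetical names throughout) of a request handler that degrades rather than dies: when its input fails a sanity check, it refuses to touch the primary database, serves a read-only response instead, and tells the user, in so many words, to contact their system administrator.

    import logging

    logger = logging.getLogger("safe_mode")

    def handle_update_request(request, primary_db, read_cache):
        """A made-up handler showing 'safe mode' in a commercial application."""
        if not looks_sane(request):
            # Second rule: keep the suspect request away from the critical database.
            logger.error("Rejected malformed update: %r", request)
            # First rule: keep operating, but in a degraded, read-only fashion.
            current = read_cache.get(getattr(request, "record_id", None))
            # Third rule: tell the user something is wrong, and whom to contact.
            return {"status": "degraded",
                    "data": current,
                    "message": "Your change could not be saved. The system is "
                               "running in safe mode; please contact your "
                               "system administrator."}
        primary_db.write(request.record_id, request.payload)   # the normal path
        return {"status": "ok"}

    def looks_sane(request):
        # Placeholder validation: whatever invariants the real system depends on.
        return (getattr(request, "record_id", None) is not None
                and getattr(request, "payload", None) is not None)

Nothing here is exotic - the discipline is simply in deciding, ahead of time, what the degraded path is and making sure it never touches the data you cannot afford to corrupt.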