Sunday, November 2, 2014

Fault Intolerance

I recently read an article titled "Fault Intolerance" by Gerard Holzmann in the November issue of IEEE Software.  The premise of the article is that rather than software that merely tolerates faults, we need to create software that is INtolerant of faults and aggressively defends itself against them.

On the way to his premise, he spoke about the push within academia to create software that can be formally proven to satisfy the requirements of a formal specification.  Now, I've always been a bit skeptical of the practicality of this objective, and Gerard gave me some ammunition with the following observation:
"The most difficult part of a formal specification, and the part that's most often incomplete, is to state clearly what should happen under various error conditions... Especially for complex systems, it can be very hard to forsee all possible off-normal events that a system might encounter in real life...Reality can be remarkably creative when it comes to showing where our imagination falls short"  (And I would add end-users to the list...:-)
So, where does that bring us? "Fault intolerance leads to a defensive posture in software design in which nothing is taken for granted."

And finally, he lays down the gauntlet with a standard:
"For well written code, we should expect to see an assertion density of at least 2 percent, meaning that two out of every 100 executable statements are assertions"
The example the author uses throughout the article is software designed to operate semi-autonomous spacecraft such as Cassini (orbiting Saturn) and Voyager (which recently entered interstellar space).  Now, most of us are not creating software for such demanding environments, and cannot afford the kind of double and triple redundancy that made these craft so reliable.  (The stamina of Voyager is mind-boggling: it has continued to operate with no physical modification for 37 years!  What earth-bound computer is still operating after 37 years?)

There was, however, a principle that I'd like to highlight and use as a metaphor: most of these systems are designed so that if a truly unexpected or unhandleable condition occurs, they revert to a safe mode.  Generally, this means making sure the solar cells are pointed at the sun so that the batteries can charge, and the antenna is pointed at earth so that the craft can receive further instructions.  It is important to note that the system does not just shut down: it continues to operate in a minimal way that allows for recovery.

So, what does all of this mean to those of us building commercial systems?  First, I will argue that the focus on fault intolerance, and guidelines such as an assertion density of 2 percent, still apply.  While we may not be dealing with a billion-dollar spacecraft, we are dealing with demanding users who are trying to use our software to get a job done.  And while some of the errors we have to deal with come from those very users, many of them come from random events in the universe (gamma radiation, sunspots, quirky configurations of an ancient single-sign-on service that interferes with our perfect software).  Our users expect our software to continue to operate, even if we have to go into safe mode to do so.

So, what does safe mode mean for a commercial system?  First, to the extent possible, continue to operate in degraded mode.  Second, when you detect an extreme unexpected input or situation, go into safe mode gracefully and safely (e.g., don't write random data to the most critical database in your system; you want to be able to continue to operate after the error condition is corrected).  And third, continue to operate at least enough to tell the user that something is wrong (and in fact, tell them to contact their system administrator, who might not yet be aware of the problem).
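To illustrate, here is a sketch of what that might look like; the health check, names, and messages are invented, not a prescription:

    import logging

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("app")

    class SafeModeError(Exception):
        """Raised when the system must degrade rather than risk further damage."""

    def process_normally(request: str) -> str:
        return f"processed: {request}"

    def handle_request(request: str, critical_db_healthy: bool) -> str:
        try:
            if not critical_db_healthy:
                # Don't write questionable data to the most critical database;
                # fail over to a mode we can recover from.
                raise SafeModeError("critical database failed its health check")
            return process_normally(request)
        except SafeModeError as err:
            # Safe mode: keep running in a minimal way that allows recovery,
            # and tell the user (and their administrator) that something is wrong.
            logger.error("entering safe mode: %s", err)
            return ("This feature is temporarily unavailable. "
                    "Please contact your system administrator.")

    # The system answers in both cases; it never just shuts down.
    print(handle_request("report", critical_db_healthy=True))
    print(handle_request("report", critical_db_healthy=False))

Note how the error path mirrors the spacecraft: it returns a useful answer rather than halting, so normal operation can resume once the underlying condition is corrected.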

Monday, March 10, 2014

Simplicity, or the art of NOT doing things


I happen to be particularly enamored of simplicity in art: the uncluttered style of oriental graphic art, the enigma of a Koan, the spare beauty of Haiku and the purity of a cappella voices.
Carrying that aesthetic to our chosen profession, we find an increasing focus on simplicity in software engineering.  One of the principles behind the Agile Manifesto reads “Simplicity – the art of maximizing the amount of work not done – is essential.”  The lean movement puts additional emphasis on this.  Finally, the admonition YAGNI (“You Aren’t Gonna Need It”) is at its core an appeal for simplicity.
So, why do we need all of this attention on a simple principle?  Because simplicity is HARD and doesn’t come naturally to most Western minds.
  • As software engineers, we’re paid to write code and we’re good at it, so our instincts (and/or our habits) are to write more of what we’re good at.
  • Our product owners have very long lists of features and requirements that will make our products richer and more competitive.

But the cost of complexity is insidious (and I’m focused on both aspects of the word: both evil and not easily noticed):
  • More code paths (one of the measures of complexity) represent more unit tests, more test cases, and more combinations and permutations.
  • More code paths result in higher entropy: that conditional execution that is so minor and obvious today will not be so obvious and will be forgotten 2 years from now when you’re doing maintenance.  The additional code paths greatly increase the chance that a bug fix or enhancement 2 years hence will break something.
  • When complexity finds its way into the user interface, it increases the power and flexibility of the application, but at the expense of usability.  At best, the users grumble a bit, at worst, they stop using the product.

So, let me offer a few recommendations:
At the user interface, find a way to collapse a large number of high-level choices into a smaller number.  For example, many text-processing applications have menu items for Search, Global Search, Find in Files, Replace, and Replace Globally.  In the end, those can all be handled by one Find function, keeping the top-level set of choices much smaller and more approachable.
In Axceler’s ControlPoint, we implemented Site-level permissions reports sorted by Site, Site-level permissions reports sorted by User, Comprehensive (detailed) permissions reports sorted by Site, and Comprehensive (detailed) permissions reports sorted by User.  With the clarity of hindsight, we would have benefited from collapsing these into a single high-level Permissions Report with a sort option and a level-of-detail option.  (Now, a reality check: at some point, pushing too many choices down into detailed options on a function makes that function suffer from the complexity bug, so don’t carry it too far!  At the more detailed level, however, you have more ways to manage complexity, such as “Advanced” options that reveal the complexity only when you need or want it.)
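As a sketch of that hindsight version (the names and options here are hypothetical, not ControlPoint’s actual API), the four reports collapse into one entry point with a sort option and a level-of-detail option:

    from enum import Enum

    class SortBy(Enum):
        SITE = "site"
        USER = "user"

    class Detail(Enum):
        SUMMARY = "summary"
        COMPREHENSIVE = "comprehensive"

    def permissions_report(sort_by: SortBy = SortBy.SITE,
                           detail: Detail = Detail.SUMMARY) -> str:
        # One top-level function replaces four near-duplicate menu items;
        # the variation lives in two small options instead of four code paths.
        return f"permissions report sorted by {sort_by.value}, {detail.value} detail"

    # All four original reports remain reachable...
    print(permissions_report(SortBy.USER, Detail.COMPREHENSIVE))
    # ...but the top-level surface the user sees is a single choice.
    print(permissions_report())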
When considering boundary conditions, error conditions, and special cases (these have a way of surfacing in bug reports, so the discussion tends to happen in that context), it is tempting to create special validation or special handling for each unique situation.  In some cases that is appropriate, but you should resist the temptation: often the long-term compounding effect of lots of little bits of complexity is greater than the incremental benefit of a uniquely tailored message or handler.  My last blog post actually provides a good example of this: Apple chose to handle both “too hot” and “too cold” conditions in the same boundary-condition handler.  I believe that this was an excellent choice that kept the logic simpler.  The only shortcoming (and a minor one at that) was that the message verbiage only acknowledged the more likely of the two possible conditions!
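To sketch that design choice (the thresholds here are made up, and Apple’s actual implementation isn’t public), a single range check can handle both off-normal temperatures, with only the message text distinguishing them:

    from typing import Optional

    OPERATING_RANGE_C = (0.0, 35.0)  # hypothetical safe operating range

    def check_temperature(celsius: float) -> Optional[str]:
        low, high = OPERATING_RANGE_C
        # One boundary handler covers both conditions, keeping the logic simple...
        if not (low <= celsius <= high):
            # ...while the verbiage can still acknowledge both possibilities.
            return ("Device needs to cool down before you can use it."
                    if celsius > high
                    else "Device needs to warm up before you can use it.")
        return None  # in range: nothing to report

    print(check_temperature(45.0))   # too hot
    print(check_temperature(-10.0))  # too cold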
In closing, I’d like to quote a man who bridged the gap between art and technology, Leonardo da Vinci: “Simplicity is the ultimate sophistication.”

Saturday, January 25, 2014

Error messages that don’t make sense

This time just an amusing observation…

As software engineers, we all struggle with the challenge of writing error messages that are meaningful and helpful, and from time to time, we all stumble into amusing blunders.  Even the king of good user interfaces, Apple, can fall victim:

I have an old iPhone 3 that I’ve been using essentially as an iPod, and I left it in my car.  One cold, sub-freezing morning recently I tried to start the iPod up, only to get the message:

Temperature
!
iPhone needs to cool down
before you can use it.

Of course, I ignored the instructions and warmed it up, and the message went away :-)


(I never knew the iPhone had a temperature sensor in it – apparently it’s not available to app developers…)