As a software engineer, my first reaction was "Wow - they had to get that exactly right." As a software development manager, my second reaction was to think about what lessons other, less critical software projects could take from that. There are at least three that come to mind:
1) Some things need more care than others.
That lesson can be applied at multiple levels - first, some applications are truly critical (the automatic defibrillator could kill you if it fails; the navigation system for the Space Shuttle had to be exactly right or the astronauts wouldn't get home again...), while in contrast, a report on which class of users prefers Kleenex over the store brand isn't going to kill anyone if it is off by 10%.
At another level, within an application, some functions are more critical than others. Windows XP used to (I guess it still does for anyone who's running it...) contain a neat defragmenter that started with a very colorful diagram showing you how badly fragmented the disk was, and then proceeded to move sectors around on disk to reduce the fragmentation. Now, if the report was wrong, frankly, most of us would never know, and probably wouldn't care a lot. On the other hand, if it lost a sector on disk during the defragmentation, that could corrupt a really critical file (my status report to my boss, the grant application for my foundation, my resume...). My hope and expectation is that the team creating that defragmenter spent a lot more time and care on the design, code reviews, and testing of the defragmentation aspects of the program than they did on the display.
2) Keep on Truckin'
(Google it, if you're not familiar with the phrase...)
Back to the defibrillator - if it encounters an error (let's say the battery level drops momentarily, or the user gets too close to a microwave that temporarily messes up the sensor that detects the heart's contractions), the defibrillator can't exactly display "Error, unexpected input, please restart" - it has to ignore the spurious input, keep running, and wait for the error condition to clear itself (and if possible, clear the condition).
So how does that apply to those of us who are more focused on enterprise software? It turns out that this is one of the hardest lessons for most application developers - the "Error, unexpected input" response is our first reaction, and it does have the advantage of alerting the user to a problem. But for highly available software (that is what you want to build, isn't it?) that's not the best response.
Consider a mail server; while it normally isn't as life-critical as a defibrillator, mail has become a business-critical function in today's world. If your mail server encounters an SMTP protocol error when Joe's Garage is trying to send a "your car is ready" message, you don't want the mail server to shut down and wait for Joe to get his act together. Instead, you want it to continue to listen for other email (the announcement that your company just won that million-dollar deal, for instance, or the support question from a customer). If the disk defragmenter encounters a bad sector on a write, you'd like it to mark it as bad, and try another one; or if it encounters a bad sector on a read, you want it to put as much of the file together as it can. If the SharePoint search indexer encounters a corrupted site, you want it to continue to crawl all of the other sites, so that you can search for *most* of the content of the farm.
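To make that concrete, here's a rough sketch in Python (chosen only for readability - it's not how any of these products are actually written) of an item-at-a-time loop where one bad site is logged and skipped instead of killing the whole crawl. The sites list and crawl_site function are placeholders I've invented for illustration:

    import logging

    def crawl_all_sites(sites, crawl_site):
        # Index every site we can; log and skip the ones that fail.
        # 'sites' and 'crawl_site' are hypothetical stand-ins for the real crawler.
        failures = []
        for site in sites:
            try:
                crawl_site(site)
            except Exception as exc:
                logging.warning("Skipping %s: %s", site, exc)  # one bad site...
                failures.append((site, exc))                   # ...doesn't stop the crawl
        return failures  # so you can report what was skipped at the end

The key design choice is that the try/except sits inside the loop, around a single item, rather than around the whole crawl.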
Now, this needs a bit of care, which begins to overlap with my next lesson. Consider the disk defragmenter - if it gets 100 bad sectors in a row, that's probably an indication that the disk or the controller has gone bad in a big way - continuing to try to defragment the disk could very possibly make things a lot worse. If the mail server gets 20 errors in a row trying to send a message to Joe's Garage ("I'll pick it up at 6:00 pm"), it's probably a waste of time trying the 21st time.
In short, if your "Keep on Truckin'" response is to retry an operation, you normally want an upper bound on the retries, or you could spend more time on the retries than on productive work, which in the end would defeat the "Keep on Truckin'" objective. While you're at it, let your customer override your default number of retries (because someone is going to want a different threshold than the rest of your customers).
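As a sketch of what that might look like (again in Python, with an invented send function and made-up defaults - your own numbers will differ):

    import time

    DEFAULT_MAX_RETRIES = 5  # a sensible default, but let the customer override it

    def send_with_retries(send, message, max_retries=DEFAULT_MAX_RETRIES, delay_seconds=30):
        # Try to deliver 'message', giving up after max_retries attempts.
        # 'send' is a hypothetical delivery function that raises on failure.
        for attempt in range(1, max_retries + 1):
            try:
                send(message)
                return True
            except Exception as exc:
                if attempt == max_retries:
                    # Don't keep truckin' forever: record the failure and move on.
                    print(f"Giving up on {message!r} after {attempt} attempts: {exc}")
                    return False
                time.sleep(delay_seconds)  # wait a bit before the next attempt

Exposing max_retries (and the delay) as parameters is exactly the kind of override your customers will eventually ask for.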
3) "First do no harm"
(While those words are not actually in the Hippocratic oath, the spirit of it is there...) In more mundane terms, we need to consider the failure mode of our software.
Continuing with the defibrillator example: consider what it should do when the battery level begins to drop, or when there is a power surge. You don't want the defibrillator to go into defib mode just because the inputs are out of range; you'd rather it go into a quiet mode until things return to normal.
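In code, the shape of that decision looks something like this toy sketch (every name and threshold here is invented, and a real device is obviously far more sophisticated):

    def next_mode(battery_ok, reading, needs_shock, valid_range=(0.5, 2.0)):
        # If anything about the inputs looks wrong, do nothing until they're sane again.
        low, high = valid_range
        if not battery_ok or not (low <= reading <= high):
            return "quiet"  # the safe failure mode: wait, don't act
        return "defib" if needs_shock else "monitor"

The point isn't the thresholds; it's that every out-of-range branch leads to the do-nothing mode.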
So, what does this mean for enterprise software? I'll use an example from Axceler's ControlPoint - the product includes a function to move SharePoint sites. It does this by exporting a copy of the site, importing that copy to the new location, and then, as an optionally delayed operation, deleting the source site. First of all, if the import of the copy of the site to the destination has an error, we do not want to delete the source site. Similarly, the Move function has an option to take a separate backup before the move - if the backup has an error, then again, we do not want to delete the source site. In all cases, the response to errors (the failure mode) is to do nothing. While this is fairly obvious when stated this way, it can be easy to miss if you don't think about it while building the function.
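Here is a rough Python-flavored sketch of that ordering (all of these function names are placeholders, not ControlPoint's actual API):

    def move_site(source, destination, export, import_site, backup, schedule_delete, take_backup=False):
        # Move a site with "do nothing" as the failure mode.
        package = export(source)            # 1. copy the source site
        if take_backup:
            backup(source)                  # 2. optional separate backup
        import_site(package, destination)   # 3. recreate the site at the destination
        # Only if every step above succeeded do we even *schedule* the delete;
        # if anything raised an error, we never get here and the source is untouched.
        schedule_delete(source)

Note that the delete is scheduled rather than executed immediately, which is where the next point comes in.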
Finally, at the functional design level, there are some less obvious things that can be done. We implemented the delayed delete of the source partly to provide a validation and recovery option: if a problem occurred with the destination, even a subtle one that the code could not detect, then the user has the time to notice, and has the option to cancel the delete and revert to the original site.