Wednesday, November 6, 2013

Granularity

In my last post, one of the recommendations I made was to "Keep on Truckin'", in other words keep the application running and accomplishing as much as can be done, even in the face of failures and bad inputs in the environment.  In this article, I want to explore that a bit more, and specifically talk about the granularity of the response to errors.

But first, a side observation: Exception handing is one of the more wonderful things that Bjarne Stroustrup incorporated into C++ and that were adopted by derivative languages.  While there is an overhead associated with Try..catch blocks, I would argue that the value of extensive and fine-grained use of exception handlers is well worth the relatively small cost.  Fine-grained exception handling is a good thing.

Which brings us back to granularity: my advice is to trap, handle, and recover from errors in the environment at the smallest practicable granularity.  So, what does that mean?  Of course it is hard to say in general ("practical" is so subjective!), but I would say that the trapping of errors and/or the location of the try...catch blocks should be at least at the level of the lowest-level item included explicitly in an analysis, or acted upon by an action.

Some examples might help:
  • In Outlook, suppose that I've got a .pst file for my email from the 1980's (of course, I converted it from my ccMail client...), and I'm searching for email from Bill Gates.  Suppose the report of the standardization committee on the Basic language was somehow corrupted.  If the exception handling is at the level of the .pst file, I might not find any messages from Bill, but if the exception handling is at the level of the item, I'd miss the committee report, but I might see the "Thanks for your valuable suggestion"  (No, I never got one of those, but if I did, you can be sure I'd hang onto it :-)
  • In ControlPoint, we implemented a Comprehensive Permissions Report that can report on permissions right down to the individual list item level - in this case, if we are unable to read the permissions on an individual item, we do our best to report on the permissions of the site, the lists, and any other items that we are able to read.
Note that I've expressed this in terms of exception handling, but remember that we still have API methods that report information in return codes, and/or in return values (as in, a null object pointer may be a null result) - which brings me to a pet peeve: there is no good reason to have an "object reference not set to an instance of an object" error - if your code reports that error, you probably weren't checking a return value.  And a recommendation to code reviewers: this is one of the things you should check for.

OK, so you catch an exception or detect an null object reference at an appropriate level - now what do you do?  It is worth keeping in mind that you actually have three audiences that need to know something about what happened: the end user who initiated the action, the IT staff that needs to know something went wrong, and the developer and/or support team responsible for the application - each of those wants something slightly different:
 User
Of course, you owe it to the user to tell him/her that something went wrong, and that the report s/he asked for is incomplete or that an action wasn't carried out as fully as the user expected.  If the report or action has a status message, a completion message like "Completed with errors" is a good balance - you tell the user that results are available, but they are not 100%...
 
But how do you communicate what is missing, e.g. which sites are not included in the tally of storage, which sites are missing the list of running workflows, etc.  For reports, the best solution is to either replace the object name with a string like *** Error *** if you are unable to fetch the name of the object, or append a string to the name if you have the name, but not an important property value.  Note that you want to make sure that the string is searchable - in a 3,000 page report, you can't rely on manual inspection to pick out the items with the errors - help the computer to do what computers do best!

In actions, if the action has some form of task audit report (always a good idea!), then simply list the objects that weren't processed because of errors in the task audit report

IT Staff / Developer / Support
While a lot of technical detail will cause most end users to glaze over, most IT staff can tolerate technical detail, even if they don't fully understand it.  In other words, you can probably combine the information you want to communicate to the IT staff with the information you want to communicate back to yourself (as a quick preview of a later post - remember, what happens if you take a support call: *you* are one of the audiences you're writing this information for!).
 
We'll talk about what information to capture and where in a later post, but for now, think about the information that the IT staff needs - I will mention a few things to be sure you include:
  • What action was the user in the middle of (remember, the log file may contain information about a lot of different user's activities - clarifying the action can help to correlate the log entry to the user's action)
  • What object did the error occur on?  (If the user ran a report on the entire farm, s/he won't know which site was the corrupt one - helping the IT manager find that site can enable them to do something about it.
  • The message returned from the API - often this contains exactly the information that the IT person needs to know.  (E.g. to illustrate with a common problem - if the password changes on a service account, often the error message will tell the IT person exactly what the problem is (maybe even without calling vendor support!)
Note that there is a real dichotomy between the information you present to the user, and the information needed by IT - a couple of notes here: 
  • There are a small number of customers in highly secure environments (curiously, these tend to be in financial industries, rather then military or security agencies) for whom display of API-generated messages to end users is a security risk (since the API message may expose information about the system).  You may need to accommodate those users.  On the other hand, displaying messages from the API directly to users can often help debugging, both in QA, and when support is online with the customer and can watch what is happening on the screen, so it is very tempting.  In ControlPoint, we introduced a customer-settable option that specified whether error details should be displayed to the user. 
  • If you want to display detailed technical information to the user, it would be best to prefix it with something like "Please provide the following information to your help desk or IT department" - that gives the end user permission to not to try to understand it.
  • While it can be useful to display information about what source data gave rise to the error (e.g. the name of the site that the report couldn't report on), be careful that you don't inadvertently expose information.  If the user asked for the storage consumed by the HR site collection, telling the user that the site \HR\Layoffs\AlfredENeumann is inaccessible might expose information the user shouldn't know.
  • In this vein, the model adopted by SharePoint starting in version 2010 of presenting the user with a sparse message containing a correlation ID works very nicely - the message seen by the user has very little information (maybe even too little), but the correlation ID allows the IT department to find the exact error message in the ULS logs - the only thing they missed was to warn the user that they'd actually need the correlation ID before they dismiss the dialog...

Monday, November 4, 2013

Lessons from medical device software

A number of years ago, I worked with someone whose husband had an automatic, implantable defibrillator to address heart problems.

As a software engineer, my first reaction was "Wow - they had to get that exactly right".  As a Software Development manager, my second reaction was to think about what lessons other less critical software projects could take from that.  There are at least three that come to mind:

1) Some things need more care than others.

That lesson can be applied at multiple levels - first, some applications are really critical (the automatic defibrillator could kill you if it fails; the navigation system for the Space Shuttle had to be exactly right or the astronauts wouldn't get home again...), but in contrast, a report on what class of users prefer Kleenex over the store brand isn't going to kill anyone if it is off by 10%.

At another level, within an application, some functions are more critical than others.  Windows XP used to (I guess it still does for anyone who's running it...) contain a neat defragmenter that started with a very colorful diagram showing you how badly fragmented the disk is, and then proceeded to move sectors around on disk to reduce the fragmentation.  Now, if the report was wrong, frankly, most of us would never know, and probably wouldn't care a lot.  On the other hand if it lost a sector on disk during the defragmentation, that could corrupt a really critical file (my status report to my bosss, the grant application for my foundation, my resume...).  My hope and expectation is that the team creating that defragmenter spent a lot more time and care on the design, code reviews and testing of the defragmenation aspects of the program than they did on the display.

2) Keep on Truckin'

(Google it, if you're not familiar with the phrase...)

Back to the defibrillator - if it encounters an error (lets say the battery level drops momentarily, or the user gets too close to a microwave that temporarily messes up the sensor that detects the heart's contractions), the defibrillator can't exactly display "Error, unexpected input, please restart" - it has to ignore the spurious input, keep running, and wait for the error condition to clear itself (and if possible, clear the condition).

So how does that apply to those of us who are more focused on enterprise software.  It turns out that this is one of the hardest lessons for most application developers - the "Error, unexpected input" response is our first reaction, and it does have the advantage of alerting the user to a problem.  But for highly available software (that is what you want to build, isn't it?) that's not the best response.

Consider a mail server; while it normally isn't as life-critical as a defibrillator, mail has become a business critical function in today's world.  If your mail server encounters an SMTP protocol error when Joe's Garage is trying to send a "your car is ready" message, you don't want the mail server to shut down and wait for Joe to get his act together.  Instead, you want it to continue to listen for other email (the announcement that your company just won that million dollar deal, for instance. or the support question from a customer).  If the disk defragmenter encounters a bad sector on a write, you'd like it to mark it as bad, and try another one; or if it encounters a bad sector on a read, you want it to put as much of the file together as it can.  If the SharePoint search indexer encounters a corrupted site, you want it to continue to crawl all of the other sites, so that you can search for *most* of the content of the farm.

Now, this needs a bit of care, which begins to overlap with my next lesson.  Consider the disk defragmenter - if it gets 100 bad sectors in a row, that's probably an indication that the disk or the controller has gone bad in a big way - continuing to try to defragment the disk could very possibly make things a lot worse.  If the mail server gets 20 errors in a row trying to send a message to Joe's Garage ("I'll pick it up at 6:00 pm"), it's probably a waste of time trying the 21st time.

In short, if your "Keep on Truckin'" response is to retry an operation, you normally want an upper bound on the retries, or you could spend more time on the retries than on productive work, which in the end would defeat the "Keep on Truckin'" objective.  While your at it, let your customer override your default number of retries (because someone is going to want a different threshold than the rest of your customers.)

3) "First do no harm"

(While those words are not actually in the Hippocratic oath, the spirit of it is there...)  In more mundane terms, we need to consider the failure mode of our software.

Continuing with the example of the defibrillator - consider what the defibrillator should do when the battery level begins to drop or even if there is a power surge - you don't want the defibrillator to go into defib mode just because the inputs are out of range - you'd rather it go into quiet mode until things become normal.

So, what does this mean for enterprise software?  I'll use an example from Axceler's ControlPoint - the product includes a function to move SharePoint sites - it does this by exporting a copy of the site, importing that copy to the new location, and then as an optionally delayed operation, deleting the source site.  First of all, if the import of the copy of the file to the destination has an error, we do not want to delete the source file.  Similarly, the Move function has an option to take a separate backup before the move - if the backup has an error, then again, we do not want to delete the source file.  In all cases, the response to errors (the failure mode) is to do nothing.  While this is fairly obvious when stated this way, it can be easy to miss if you don't think about it while building the function.

Finally, at the functional design level, there are some less obvious things that can be done.  We implemented the delayed delete of the source partly to provide a validation and recovery option: if a problem occured with the destination, even a subtle one that the code could not detect, then the user has the the time to notice, and has the option to cancel the delete and revert back to the original site.

Who am I, and why this blog

I've been building software, generally enterprise-level COTS (Commercial Off The Shelf) software for longer than I care to admit.  I've done it in some very large organizations (Wang Laboratories, and Digital Equipment Corporation) and some very small, 20 person organizations.  I've had the chance to work with some brilliant people who showed me what to do right, and some folks who showed me things NOT to do.  As I have tried to impart that learning to my teams, I tend to fall back on two things: stories and metaphors (and in the end, they're not all that different).

So this blog is an opportunity to capture some of those stories, and occasionally metaphors, in the hopes that some folks will find it useful.