Stories: Granularity

In my last post, one of the recommendations I made was to "Keep on Truckin'", in other words keep the application running and accomplishing as much as can be done, even in the face of failures and bad inputs in the environment. In this article, I want to explore that a bit more, and specifically talk about the granularity of the response to errors.

But first, a side observation: Exception handing is one of the more wonderful things that Bjarne Stroustrup incorporated into C++ and that were adopted by derivative languages. While there is an overhead associated with Try..catch blocks, I would argue that the value of extensive and fine-grained use of exception handlers is well worth the relatively small cost. Fine-grained exception handling is a good thing.

Which brings us back to granularity: my advice is to trap, handle, and recover from errors in the environment at the smallest practicable granularity. So, what does that mean? Of course it is hard to say in general ("practical" is so subjective!), but I would say that the trapping of errors and/or the location of the try...catch blocks should be at least at the level of the lowest-level item included explicitly in an analysis, or acted upon by an action.

Some examples might help:

In Outlook, suppose that I've got a .pst file for my email from the 1980's (of course, I converted it from my ccMail client...), and I'm searching for email from Bill Gates. Suppose the report of the standardization committee on the Basic language was somehow corrupted. If the exception handling is at the level of the .pst file, I might not find any messages from Bill, but if the exception handling is at the level of the item, I'd miss the committee report, but I might see the "Thanks for your valuable suggestion" (No, I never got one of those, but if I did, you can be sure I'd hang onto it :-)
In ControlPoint, we implemented a Comprehensive Permissions Report that can report on permissions right down to the individual list item level - in this case, if we are unable to read the permissions on an individual item, we do our best to report on the permissions of the site, the lists, and any other items that we are able to read.

Note that I've expressed this in terms of exception handling, but remember that we still have API methods that report information in return codes, and/or in return values (as in, a null object pointer may be a null result) - which brings me to a pet peeve: there is no good reason to have an "object reference not set to an instance of an object" error - if your code reports that error, you probably weren't checking a return value. And a recommendation to code reviewers: this is one of the things you should check for.

OK, so you catch an exception or detect an null object reference at an appropriate level - now what do you do? It is worth keeping in mind that you actually have three audiences that need to know something about what happened: the end user who initiated the action, the IT staff that needs to know something went wrong, and the developer and/or support team responsible for the application - each of those wants something slightly different:
User

Of course, you owe it to the user to tell him/her that something went wrong, and that the report s/he asked for is incomplete or that an action wasn't carried out as fully as the user expected. If the report or action has a status message, a completion message like "Completed with errors" is a good balance - you tell the user that results are available, but they are not 100%...

But how do you communicate what is missing, e.g. which sites are not included in the tally of storage, which sites are missing the list of running workflows, etc. For reports, the best solution is to either replace the object name with a string like *** Error *** if you are unable to fetch the name of the object, or append a string to the name if you have the name, but not an important property value. Note that you want to make sure that the string is searchable - in a 3,000 page report, you can't rely on manual inspection to pick out the items with the errors - help the computer to do what computers do best!

In actions, if the action has some form of task audit report (always a good idea!), then simply list the objects that weren't processed because of errors in the task audit report

IT Staff / Developer / Support

While a lot of technical detail will cause most end users to glaze over, most IT staff can tolerate technical detail, even if they don't fully understand it. In other words, you can probably combine the information you want to communicate to the IT staff with the information you want to communicate back to yourself (as a quick preview of a later post - remember, what happens if you take a support call: *you* are one of the audiences you're writing this information for!).

We'll talk about what information to capture and where in a later post, but for now, think about the information that the IT staff needs - I will mention a few things to be sure you include:

What action was the user in the middle of (remember, the log file may contain information about a lot of different user's activities - clarifying the action can help to correlate the log entry to the user's action)
What object did the error occur on? (If the user ran a report on the entire farm, s/he won't know which site was the corrupt one - helping the IT manager find that site can enable them to do something about it.
The message returned from the API - often this contains exactly the information that the IT person needs to know. (E.g. to illustrate with a common problem - if the password changes on a service account, often the error message will tell the IT person exactly what the problem is (maybe even without calling vendor support!)

Note that there is a real dichotomy between the information you present to the user, and the information needed by IT - a couple of notes here:

There are a small number of customers in highly secure environments (curiously, these tend to be in financial industries, rather then military or security agencies) for whom display of API-generated messages to end users is a security risk (since the API message may expose information about the system). You may need to accommodate those users. On the other hand, displaying messages from the API directly to users can often help debugging, both in QA, and when support is online with the customer and can watch what is happening on the screen, so it is very tempting. In ControlPoint, we introduced a customer-settable option that specified whether error details should be displayed to the user.
If you want to display detailed technical information to the user, it would be best to prefix it with something like "Please provide the following information to your help desk or IT department" - that gives the end user permission to not to try to understand it.
While it can be useful to display information about what source data gave rise to the error (e.g. the name of the site that the report couldn't report on), be careful that you don't inadvertently expose information. If the user asked for the storage consumed by the HR site collection, telling the user that the site \HR\Layoffs\AlfredENeumann is inaccessible might expose information the user shouldn't know.
In this vein, the model adopted by SharePoint starting in version 2010 of presenting the user with a sparse message containing a correlation ID works very nicely - the message seen by the user has very little information (maybe even too little), but the correlation ID allows the IT department to find the exact error message in the ULS logs - the only thing they missed was to warn the user that they'd actually need the correlation ID before they dismiss the dialog...

Stories

Wednesday, November 6, 2013

Granularity

No comments:

Post a Comment