Stories: November 2013

Monday, November 25, 2013

Heisenbugs

Before Heisenberg started cooking meth, his namesake Heisenberg was a physicist known for the Uncertainty Principle, which says that you cannot precisely know both the location and the momentum of a sub-atomic particle at the same time. This is related to the scientific principle called the Observer Effect, which states that the process of observing a system changes the system under observation. (This holds for systems ranging from sub-atomic particles to sociological studies to reality TV.)

So, what does this have to do with software? While the name Heisenbug is very tongue-in-cheek, it describes a very real problem that occurs occasionally, and is actually related to the observer effect: a Heisenbug is a bug that disappears as soon as you start to analyze it. (In my mind, it is the second most annoying type of bug after the type that occurs only on premises at the million dollar customer who is so secure that they cannot even send log files.)

Some examples of Heisenbugs that I've seen:

A number of years ago at Digital Equipment Corporation we created a highly scalable, heavily multi-threaded mail server that supported both X.400 and SMTP/MIME. Somewhere in the code was a subtle threading problem that we presume had to do with locking between threads - it generated an error at only one customer (a major Air Force base). So, to find it, we did what any software engineering group does - we added a lot of logging to tell us exactly where the error was occurring, and what led up to it. The problem went away, we presume because the logging changed the timing of the locking just enough to avoid the race condition. After months of trying to fine-tune the logging to get the information we needed without changing the behavior, we gave up and just ran the system with the logging enabled, and once a day asked the engineer on site to delete the log files...
More recently, at Axceler, we were fighting with a problem at one of those million-dollar customers, and it was the only place the problem manifested itself. Again, we added more logging (since there was no option of running a debugger on site...), and again, the problem disappeared. In this case, we conjecture that the observer effect came from the fact that the debugging code, which was dumping out information about the SharePoint farm, may have initialized or reset the state of the SharePoint API, thus providing a valid return on something that the core application logic needed.

So, what can we do about Heisenbugs? Unfortunately, there often is no good answer. Often the solution is an example of the messy side of software engineering in practice. More often than we would like, we have bits of logic in our production systems that are there just because they work and we don't always know exactly why. While we look at them and say "yuck" (or to use a different metaphor, they "smell"), the cost to figure out what is really going on may not make economic sense (do you want to satisfy your scientific curiosity, or would you rather implement a new feature?). Or if the problem occurs only at a customer site, the process of figuring out why may be just too annoying to the customer. Just as doctors will sometimes just treat a rash with cortisone without fully knowing what caused it, software engineers occasionally treat the symptom without explaining the cause...

Tuesday, November 12, 2013

Support

Yes, the topic of support covers a lot of ground – for this post, I’d like to focus on the things we should do as developers to be ready for support.

First, a reminder: we are imperfect beings in a messy and imperfect world. That means that the wonderful, perfect application you are in the process of creating is going to fail (unless it never gets used, but of course, that’s a different problem…). To invoke a bit of Murphy’s law, it is going to fail at the worst possible time (probably just before the end of the last sprint of the next release), in the hands of your most critical customer, and in the most subtle and obscure way that can’t be recreated in-house. And of course, because you wrote that code, they’re going to ask you to figure out what’s wrong. So, what’s your strategy? You should be selfish, and set yourself up to handle that call as quickly and as effectively as possible when it comes in. The good news – your selfish objectives match those of your employer, and the end-user!

Let’s remember some of the challenges you and your compatriots in the front-line support team face:

Very often, the complaint from the customer takes the form of “it didn’t work” (to which you want to ask, what DID it do?), “the action never completed”, “I got an error message” (did you write it down or take a screen shot?).

Often the user doesn’t remember what they were doing or even when (to be fair, they weren’t expecting to have problems, so weren’t paying careful attention…).

In some cases (e.g. for national security agencies, or some military installations) you don’t get to watch the error, and/or they need to “sanitize” log files before sending them.

When the error is happening on a production system, usually your options for experimentation are limited; and in some cases, you need to wait as much as a couple of weeks for a deployment window to install patches or even diagnostic code.

And finally, every customer’s configuration is a bit different, and some of those differences create unique situations that can be hard to even notice. (Examples: the customer that had 300 Active Directory Domains – for most actions, this was actually fine, until we needed to find a user that was no longer in his original domain – checking all 300 domains took a while… Or the customer that blocked access to some branches of the Active Directory Hierarchy, but only for some accounts.)

When thinking about how to anticipate and respond to issues, keep in mind that you have a number of audiences that need to know about the error:

The end user who was trying to do something – you owe him or her some information that the action or analysis didn’t go fully as planned. You also need to be careful not to give him/her too much information (the last post mostly talks about this, so I won’t get into this communication in this post).

The IT staff needs more information to help the user with his/her problem, and/or to fix an underlying problem (a corrupt SharePoint site, a missing web part, missing permissions, …).

The operations staff needs to know about some kinds of errors (if the SQL server is down, or network connections aren’t being made, or your application has irretrievably crashed, they need to take action). (Note that in some environments, the operations staff are separate from the IT staff, sometimes in a different location, and sometimes working for a different vendor. One example of this is Microsoft Online Dedicated SharePoint environments – the operations staff are Microsoft data center people, but what I’m calling the IT staff above are customer employees (or contracted employees) with limited access to the operations staff.)

You! (or your fellow developers) who may be called upon to diagnose and work around the problem.

Where do you capture this information?

For the operations staff, often they are using big expensive monitoring software (such as Microsoft’s System Center) to alert them when something needs their attention. Generally, the best way to make information available to these tools is to place information into the Windows Event log. In general, you want to create a separate source for your application, and unique event IDs for each distinct event – this allows the operations staff to create customized rules for alerting on specific events (they might decide that some events are worth waking people up at 3am for, and others can wait for business hours). Also note that in general, you only want to write events that need attention by the operations staff (don’t flood them with more routine functional errors). That suggests that you want to be able to make a distinction between kinds of errors based on the audience that needs to see them – more on this below.

For the customer-oriented IT staff, and for the developers, information could be written to tables in the database, or text files. While there are benefits for each, my own preference is for plain old text files. They are easy to purge, easy to compress, easy to send through email, and there is an assortment of tools for searching and filtering them (remember, grep isn't just for Unix weenies, and even Excel can be used effectively for sorting and filtering). Two recommendations on text files: use an infrastructure that allows you to roll the files over based on dates and/or size (to keep them from growing indefinitely), and give the customer the option to place them on a non-default drive (some environments prohibit writing log files to the C drive, because of concerns they could consume all of the disk space on a critical drive). In a Microsoft SharePoint Online Dedicated environment, you have to write to ULS logs, which have some advantages (automatic space management, and some nice viewing applications) but some disadvantages (they can be VERY noisy and large…).
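
As a rough illustration of the two recommendations on text files, here is a minimal Python sketch using the standard library's RotatingFileHandler (size-based rollover; TimedRotatingFileHandler is the date-based counterpart). The function name, logger name, file name and default sizes are assumptions made up for the example, not anything from ControlPoint.

```python
import logging
import logging.handlers
from pathlib import Path


def build_file_logger(log_dir, max_bytes=5_000_000, backup_count=10):
    """Return a logger that writes size-limited, rolling text files.

    log_dir is meant to be customer-configurable, so the files can be
    placed on a drive other than the system drive.
    """
    directory = Path(log_dir)
    directory.mkdir(parents=True, exist_ok=True)

    handler = logging.handlers.RotatingFileHandler(
        directory / "app.log",      # hypothetical file name
        maxBytes=max_bytes,         # roll over when the file reaches this size
        backupCount=backup_count,   # keep this many old files, drop the rest
        encoding="utf-8",
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )

    logger = logging.getLogger("myapp.files")  # hypothetical logger name
    logger.handlers.clear()         # avoid duplicate handlers on repeat calls
    logger.addHandler(handler)
    logger.setLevel(logging.ERROR)  # capture only errors by default
    logger.propagate = False
    return logger
```

Because old files are dropped once backup_count is reached, the total disk space used stays bounded at roughly max_bytes times (backup_count + 1).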

OK, what should you put into the log entries for the IT staff and yourself?

First of all, you want the ability to select the amount and type of detail, which gives you, support, and the customer the freedom to deal with different situations without overwhelming *everyone* with too much information. By default, you normally will capture only errors, but for troubleshooting situations, both SharePoint itself and ControlPoint adopted a two-dimensional model of settings – the first dimension selects the functional area of the application that needs investigation (e.g. active directory interactions, or dealing with metadata, or database interactions), and the second dimension chooses the level of detail. (For ControlPoint, we supported None, Error, Warning, Information, Verbose.) I would suggest sub-dividing the Error level into two categories (Infrastructure Error and Application Error) – this lets you distinguish the messages that need to go to the event viewer for consumption by the operations team (can’t connect to the SQL server) from the errors that merely elaborate on a report that didn’t run to completion. The important distinction here is who needs to see the error (and therefore where it needs to be written). So, the customer could choose to write infrastructure errors to the event log, but not application errors.
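
To make the two-dimensional model a little more concrete, here is a small Python sketch of one way it could be arranged. It is not how SharePoint or ControlPoint implement it; the class, the area names, the event IDs and the two in-memory lists (standing in for the Windows Event log and the text file) are all assumptions for illustration.

```python
from enum import IntEnum


class Level(IntEnum):
    """Detail levels, least to most verbose, with Error split in two."""
    NONE = 0
    INFRASTRUCTURE_ERROR = 1  # operations must act (cannot reach SQL server)
    APPLICATION_ERROR = 2     # a report or action did not fully complete
    WARNING = 3
    INFORMATION = 4
    VERBOSE = 5


class TroubleshootingLog:
    def __init__(self, default_level=Level.APPLICATION_ERROR):
        self.default_level = default_level
        self.area_levels = {}  # first dimension: functional area -> level
        self.event_log = []    # stand-in for the Windows Event log
        self.text_log = []     # stand-in for the rolling text file

    def set_level(self, area, level):
        """Second dimension: how much detail to capture for one area."""
        self.area_levels[area] = level

    def write(self, area, level, event_id, message):
        threshold = self.area_levels.get(area, self.default_level)
        if level == Level.NONE or level > threshold:
            return False       # more detail than was requested for this area
        entry = (area, level.name, event_id, message)
        self.text_log.append(entry)
        if level == Level.INFRASTRUCTURE_ERROR:
            self.event_log.append(entry)  # only what operations must see
        return True
```

With this arrangement, support can turn one area up to Verbose (for example, log.set_level("active_directory", Level.VERBOSE)) while every other area keeps writing errors only.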

Finally, when you write an entry to the log file, what does it need to include? The underlying principle is “whatever you’re going to need to sort out the problem”. (Of course, when you’re writing the code, you don’t know what the problem will be, or you’d fix it before it happened…)

I would include at least the following:

Date and Time (use the server’s time, not UTC – it will be a lot easier to find entries)

If it is possible, what action the user was engaged in (this is important for two reasons – first, the user may not even remember what s/he was doing, and second, many different users might be using the system at the same time).

If it is possible, the name of the user (again, if multiple users are using the system, this can be useful to distinguish one user’s activity from another’s)

If multiple processes may be in use, the process ID, and if multiple threads may be used, the thread ID (similar to the user ID, this can be useful to distinguish one user’s activity from another)

As noted above, a unique code for each application-level message (error, warning, information, etc.) – this is important for external monitoring tools, and can also be useful for filtering entries while analyzing a log file.

When an error is returned or an exception thrown from an API call, the error number and/or the message returned.

The stack trace (this can tell you a lot about what the application was doing at the time of the error.)

When appropriate, the actual data being passed to the API or other method. (So, for example, instead of merely reporting an error trying to instantiate a site, list or item object, provide the exact URL of the site, list or item – this can help to expose either subtle logic flaws, or configuration errors. Let’s say that the user configured a URL without a trailing slash when it was needed – the absence of the slash could show up in the data passed to the SPSite constructor. Or if an active directory group name is shown as a long number, it could expose the fact that the code failed to translate the group’s SID into a name.)
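
The list above can be sketched with Python's standard logging module, which already supplies the local timestamp, process ID, thread ID and stack trace; the user, action and event code are passed in per entry. The logger name, the event ID 4012, the action name and the open_site function are hypothetical stand-ins invented for this example.

```python
import io
import logging

FORMAT = (
    "%(asctime)s pid=%(process)d tid=%(thread)d "
    "user=%(user)s action=%(action)s event=%(event_id)s "
    "%(levelname)s %(message)s"
)


def make_logger(stream):
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(FORMAT))  # asctime is local time
    logger = logging.getLogger("myapp.support")      # hypothetical name
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def open_site(url):
    """Hypothetical stand-in for an API call that rejects a malformed URL."""
    if not url.endswith("/"):
        raise ValueError("site not found")
    return url


def run_report(logger, user, url):
    context = {"user": user, "action": "StorageReport", "event_id": 4012}
    try:
        return open_site(url)
    except ValueError as exc:
        # logger.exception appends the stack trace; the message carries the
        # API's own error text and the exact data that was passed in.
        logger.exception(
            "open_site failed: %s (url=%r)", exc, url, extra=context
        )
        return None
```

Calling run_report(make_logger(io.StringIO()), "jsmith", "http://portal/sites/hr") writes a single entry in which the missing trailing slash is visible in the logged URL.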

Wednesday, November 6, 2013

Granularity

In my last post, one of the recommendations I made was to "Keep on Truckin'", in other words keep the application running and accomplishing as much as can be done, even in the face of failures and bad inputs in the environment. In this article, I want to explore that a bit more, and specifically talk about the granularity of the response to errors.

But first, a side observation: Exception handling is one of the more wonderful things that Bjarne Stroustrup incorporated into C++ and that was adopted by derivative languages. While there is an overhead associated with try...catch blocks, I would argue that the value of extensive and fine-grained use of exception handlers is well worth the relatively small cost. Fine-grained exception handling is a good thing.

Which brings us back to granularity: my advice is to trap, handle, and recover from errors in the environment at the smallest practicable granularity. So, what does that mean? Of course it is hard to say in general ("practicable" is so subjective!), but I would say that the trapping of errors and/or the location of the try...catch blocks should be at least at the level of the lowest-level item included explicitly in an analysis, or acted upon by an action.

Some examples might help:

In Outlook, suppose that I've got a .pst file for my email from the 1980's (of course, I converted it from my cc:Mail client...), and I'm searching for email from Bill Gates. Suppose the report of the standardization committee on the Basic language was somehow corrupted. If the exception handling is at the level of the .pst file, I might not find any messages from Bill, but if the exception handling is at the level of the item, I'd miss the committee report, but I might see the "Thanks for your valuable suggestion" message (No, I never got one of those, but if I did, you can be sure I'd hang onto it :-)
In ControlPoint, we implemented a Comprehensive Permissions Report that can report on permissions right down to the individual list item level - in this case, if we are unable to read the permissions on an individual item, we do our best to report on the permissions of the site, the lists, and any other items that we are able to read.
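
The difference between the two granularities in the mailbox example can be sketched in a few lines of Python. The item dictionaries, the CorruptItemError type and the read_sender function are invented for the illustration; they are not Outlook or ControlPoint APIs.

```python
class CorruptItemError(Exception):
    """Hypothetical error raised when one item cannot be read."""


def read_sender(item):
    """Hypothetical per-item read; a corrupt item raises instead."""
    if item.get("corrupt"):
        raise CorruptItemError(item["id"])
    return item["sender"]


def search_coarse(items, sender):
    """One handler around the whole container: one bad item loses it all."""
    try:
        return [i["id"] for i in items if read_sender(i) == sender]
    except CorruptItemError:
        return []


def search_fine(items, sender):
    """One handler per item: skip the bad item, keep the rest, remember it."""
    matches, failed = [], []
    for item in items:
        try:
            if read_sender(item) == sender:
                matches.append(item["id"])
        except CorruptItemError:
            failed.append(item["id"])  # report these back to the user later
    return matches, failed
```

With one corrupt message among four, search_coarse comes back empty, while search_fine still returns the readable matches along with the ID of the item it had to skip.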

Note that I've expressed this in terms of exception handling, but remember that we still have API methods that report information in return codes, and/or in return values (as in, a null object pointer may be a null result) - which brings me to a pet peeve: there is no good reason to have an "object reference not set to an instance of an object" error - if your code reports that error, you probably weren't checking a return value. And a recommendation to code reviewers: this is one of the things you should check for.

OK, so you catch an exception or detect a null object reference at an appropriate level - now what do you do? It is worth keeping in mind that you actually have three audiences that need to know something about what happened: the end user who initiated the action, the IT staff that needs to know something went wrong, and the developer and/or support team responsible for the application - each of those wants something slightly different:

User

Of course, you owe it to the user to tell him/her that something went wrong, and that the report s/he asked for is incomplete or that an action wasn't carried out as fully as the user expected. If the report or action has a status message, a completion message like "Completed with errors" is a good balance - you tell the user that results are available, but they are not 100%...

But how do you communicate what is missing, e.g. which sites are not included in the tally of storage, which sites are missing the list of running workflows, etc.? For reports, the best solution is to either replace the object name with a string like *** Error *** if you are unable to fetch the name of the object, or append a string to the name if you have the name, but not an important property value. Note that you want to make sure that the string is searchable - in a 3,000 page report, you can't rely on manual inspection to pick out the items with the errors - help the computer to do what computers do best!

In actions, if the action has some form of task audit report (always a good idea!), then simply list the objects that weren't processed because of errors in the task audit report.
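
Here is a minimal Python sketch of the report side of this, assuming a storage report with one row per site. The build_report function and its get_name and get_storage_mb callables are hypothetical, and catching Exception broadly is a simplification for the example.

```python
ERROR_MARKER = "*** Error ***"  # one fixed string, so it is easy to search


def build_report(sites, get_name, get_storage_mb):
    """Return (status, rows); the two getters are hypothetical callables."""
    rows, had_errors = [], False
    for site in sites:
        try:
            name = get_name(site)
        except Exception:
            rows.append((ERROR_MARKER, None))  # no name available at all
            had_errors = True
            continue
        try:
            rows.append((name, get_storage_mb(site)))
        except Exception:
            # The name is known but the value is missing: append the marker.
            rows.append((f"{name} {ERROR_MARKER}", None))
            had_errors = True
    status = "Completed with errors" if had_errors else "Completed"
    return status, rows
```

Because every failed row contains the same marker string, a reader of a very long report can find all of them with a single search.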

IT Staff / Developer / Support

While a lot of technical detail will cause most end users to glaze over, most IT staff can tolerate technical detail, even if they don't fully understand it. In other words, you can probably combine the information you want to communicate to the IT staff with the information you want to communicate back to yourself (as a quick preview of a later post - remember, what happens if you take a support call: *you* are one of the audiences you're writing this information for!).

We'll talk about what information to capture and where in a later post, but for now, think about the information that the IT staff needs - I will mention a few things to be sure you include:

What action was the user in the middle of? (Remember, the log file may contain information about a lot of different users' activities - clarifying the action can help to correlate the log entry to the user's action.)
What object did the error occur on? (If the user ran a report on the entire farm, s/he won't know which site was the corrupt one - helping the IT manager find that site can enable them to do something about it.)
The message returned from the API - often this contains exactly the information that the IT person needs to know. (E.g. to illustrate with a common problem - if the password changes on a service account, often the error message will tell the IT person exactly what the problem is, maybe even without calling vendor support!)

Note that there is a real dichotomy between the information you present to the user, and the information needed by IT - a couple of notes here:

There are a small number of customers in highly secure environments (curiously, these tend to be in financial industries, rather than military or security agencies) for whom display of API-generated messages to end users is a security risk (since the API message may expose information about the system). You may need to accommodate those users. On the other hand, displaying messages from the API directly to users can often help debugging, both in QA, and when support is online with the customer and can watch what is happening on the screen, so it is very tempting. In ControlPoint, we introduced a customer-settable option that specified whether error details should be displayed to the user.
If you want to display detailed technical information to the user, it would be best to prefix it with something like "Please provide the following information to your help desk or IT department" - that gives the end user permission not to try to understand it.
While it can be useful to display information about what source data gave rise to the error (e.g. the name of the site that the report couldn't report on), be careful that you don't inadvertently expose information. If the user asked for the storage consumed by the HR site collection, telling the user that the site \HR\Layoffs\AlfredENeumann is inaccessible might expose information the user shouldn't know.
In this vein, the model adopted by SharePoint starting in version 2010 of presenting the user with a sparse message containing a correlation ID works very nicely - the message seen by the user has very little information (maybe even too little), but the correlation ID allows the IT department to find the exact error message in the ULS logs - the only thing they missed was to warn the user that they'd actually need the correlation ID before they dismiss the dialog...
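
A rough Python sketch of how these notes could fit together: full detail goes to the log under a correlation ID, the user sees a sparse message that asks them to keep the ID, and a customer-settable flag controls whether the API message is shown as well. The function name, the show_details parameter and the message wording are assumptions for the example, not SharePoint's or ControlPoint's actual behavior.

```python
import uuid


def report_failure(exc, log_lines, show_details=False):
    """Log full detail for IT; return the text that is shown to the user.

    show_details is a hypothetical customer-settable option; log_lines is a
    plain list standing in for the real log.
    """
    correlation_id = str(uuid.uuid4())
    log_lines.append(f"{correlation_id} {type(exc).__name__}: {exc}")

    message = (
        "Something went wrong. Please note this correlation ID before "
        f"closing this message: {correlation_id}"
    )
    if show_details:
        message += (
            "\nPlease provide the following information to your help desk "
            f"or IT department: {exc}"
        )
    return correlation_id, message
```

With show_details left off, nothing from the API message (including any site name it might contain) reaches the user, yet the IT staff can still find the exact entry by searching the log for the ID.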

Monday, November 4, 2013

Lessons from medical device software

A number of years ago, I worked with someone whose husband had an automatic, implantable defibrillator to address heart problems.

As a software engineer, my first reaction was "Wow - they had to get that exactly right". As a Software Development manager, my second reaction was to think about what lessons other less critical software projects could take from that. There are at least three that come to mind:

1) Some things need more care than others.

That lesson can be applied at multiple levels - first, some applications are really critical (the automatic defibrillator could kill you if it fails; the navigation system for the Space Shuttle had to be exactly right or the astronauts wouldn't get home again...), but in contrast, a report on what class of users prefer Kleenex over the store brand isn't going to kill anyone if it is off by 10%.

At another level, within an application, some functions are more critical than others. Windows XP used to (I guess it still does for anyone who's running it...) contain a neat defragmenter that started with a very colorful diagram showing you how badly fragmented the disk was, and then proceeded to move sectors around on disk to reduce the fragmentation. Now, if the report was wrong, frankly, most of us would never know, and probably wouldn't care a lot. On the other hand if it lost a sector on disk during the defragmentation, that could corrupt a really critical file (my status report to my boss, the grant application for my foundation, my resume...). My hope and expectation is that the team creating that defragmenter spent a lot more time and care on the design, code reviews and testing of the defragmentation aspects of the program than they did on the display.

2) Keep on Truckin'

(Google it, if you're not familiar with the phrase...)

Back to the defibrillator - if it encounters an error (let's say the battery level drops momentarily, or the user gets too close to a microwave that temporarily messes up the sensor that detects the heart's contractions), the defibrillator can't exactly display "Error, unexpected input, please restart" - it has to ignore the spurious input, keep running, and wait for the error condition to clear itself (and if possible, clear the condition).

So how does that apply to those of us who are more focused on enterprise software? It turns out that this is one of the hardest lessons for most application developers - the "Error, unexpected input" response is our first reaction, and it does have the advantage of alerting the user to a problem. But for highly available software (that is what you want to build, isn't it?) that's not the best response.

Consider a mail server; while it normally isn't as life-critical as a defibrillator, mail has become a business critical function in today's world. If your mail server encounters an SMTP protocol error when Joe's Garage is trying to send a "your car is ready" message, you don't want the mail server to shut down and wait for Joe to get his act together. Instead, you want it to continue to listen for other email (the announcement that your company just won that million dollar deal, for instance, or the support question from a customer). If the disk defragmenter encounters a bad sector on a write, you'd like it to mark it as bad, and try another one; or if it encounters a bad sector on a read, you want it to put as much of the file together as it can. If the SharePoint search indexer encounters a corrupted site, you want it to continue to crawl all of the other sites, so that you can search for *most* of the content of the farm.

Now, this needs a bit of care, which begins to overlap with my next lesson. Consider the disk defragmenter - if it gets 100 bad sectors in a row, that's probably an indication that the disk or the controller has gone bad in a big way - continuing to try to defragment the disk could very possibly make things a lot worse. If the mail server gets 20 errors in a row trying to send a message to Joe's Garage ("I'll pick it up at 6:00 pm"), it's probably a waste of time trying the 21st time.

In short, if your "Keep on Truckin'" response is to retry an operation, you normally want an upper bound on the retries, or you could spend more time on the retries than on productive work, which in the end would defeat the "Keep on Truckin'" objective. While you're at it, let your customer override your default number of retries (because someone is going to want a different threshold than the rest of your customers.)
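
A bounded retry can be sketched in Python as follows. The with_retries name, the default of three retries and the on_failure callback are assumptions for illustration; a real system would probably also wait between attempts and catch narrower exception types.

```python
def with_retries(operation, max_retries=3, on_failure=None):
    """Call operation(); retry on failure, but at most max_retries times.

    max_retries is the default the customer should be able to override.
    Returns (ok, result_or_last_error, attempts).
    """
    attempts = 0
    last_error = None
    while attempts <= max_retries:  # the first try plus up to max_retries
        attempts += 1
        try:
            return True, operation(), attempts
        except Exception as exc:    # simplification: catch everything
            last_error = exc
            if on_failure is not None:
                on_failure(attempts, exc)  # e.g. write a log entry
    return False, last_error, attempts
```

Once the bound is reached, the caller gets the last error back and can move on to the next message or sector instead of retrying forever.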

3) "First do no harm"

(While those words are not actually in the Hippocratic oath, the spirit of it is there...) In more mundane terms, we need to consider the failure mode of our software.

Continuing with the example of the defibrillator - consider what the defibrillator should do when the battery level begins to drop or even if there is a power surge - you don't want the defibrillator to go into defib mode just because the inputs are out of range - you'd rather it go into quiet mode until things become normal.

So, what does this mean for enterprise software? I'll use an example from Axceler's ControlPoint - the product includes a function to move SharePoint sites - it does this by exporting a copy of the site, importing that copy to the new location, and then as an optionally delayed operation, deleting the source site. First of all, if the import of the copy of the site to the destination has an error, we do not want to delete the source site. Similarly, the Move function has an option to take a separate backup before the move - if the backup has an error, then again, we do not want to delete the source site. In all cases, the response to errors (the failure mode) is to do nothing. While this is fairly obvious when stated this way, it can be easy to miss if you don't think about it while building the function.

Finally, at the functional design level, there are some less obvious things that can be done. We implemented the delayed delete of the source partly to provide a validation and recovery option: if a problem occurred with the destination, even a subtle one that the code could not detect, then the user has the time to notice, and has the option to cancel the delete and revert back to the original site.
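
The "do nothing on error" failure mode and the delayed, cancellable delete can be sketched in Python like this. It is not ControlPoint's implementation; the function names, the state strings and the export, import_copy, backup and delete callables are all hypothetical.

```python
def move_site(source, destination, export, import_copy, backup=None):
    """Return "pending_delete" only if every step succeeded.

    export, import_copy and backup are hypothetical callables that raise on
    error. The source is never deleted here; that is a separate, later step.
    """
    try:
        if backup is not None:
            backup(source)
        package = export(source)
        import_copy(package, destination)
    except Exception:
        return "source_kept"  # failure mode: do nothing to the source
    return "pending_delete"


def finish_delete(state, delete, cancelled=False):
    """Run later; the user may cancel after inspecting the destination."""
    if state == "pending_delete" and not cancelled:
        delete()
        return True
    return False
```

The delete can only run from the "pending_delete" state, so an error in the backup, the export or the import leaves the source untouched by construction.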

Who am I, and why this blog

I've been building software, generally enterprise-level COTS (Commercial Off The Shelf) software, for longer than I care to admit. I've done it in some very large organizations (Wang Laboratories, and Digital Equipment Corporation) and some very small, 20-person organizations. I've had the chance to work with some brilliant people who showed me what to do right, and some folks who showed me things NOT to do. As I have tried to impart that learning to my teams, I tend to fall back on two things: stories and metaphors (and in the end, they're not all that different).

So this blog is an opportunity to capture some of those stories, and occasionally metaphors, in the hopes that some folks will find it useful.