Stories: Heisenbugs

Before Heisenberg started cooking meth, his namesake Werner Heisenberg was a physicist known for the Uncertainty Principle, which says that you cannot simultaneously know both the exact position and the exact momentum of a sub-atomic particle. This is related to (though not the same as) the scientific principle called the Observer Effect, which states that the process of observing a system changes the system under observation. (This holds for systems ranging from sub-atomic particles to sociological studies to reality TV.)

So, what does this have to do with software? While the name Heisenbug is very tongue-in-cheek, it describes a very real problem that occurs occasionally, and is actually related to the observer effect: a Heisenbug is a bug that disappears as soon as you start to analyze it. (In my mind, it is the second most annoying type of bug, after the type that occurs only on premises at the million-dollar customer who is so secure that they cannot even send log files.)

Some examples of Heisenbugs that I've seen:

A number of years ago at Digital Equipment Corporation we created a highly scalable, heavily multi-threaded mail server that supported both X.400 and SMTP/MIME. Somewhere in the code was a subtle threading problem that we presume had to do with locking between threads - it generated an error at only one customer (a major Air Force base). So, to find it, we did what any software engineering group does - we added a lot of logging to tell us exactly where the error was occurring, and what led up to it. The problem went away, we presume because the logging changed the timing of the locking just enough to avoid the race condition. After months of trying to fine-tune the logging to get the information we needed without changing the behavior, we gave up and just ran the system with the logging enabled, and once a day asked the engineer on site to delete the log files...
More recently, at Axcler, we were fighting with a problem at one of those million-dollar customers, and it was the only place the problem manifested itself. Again, we added more logging (since there was no option of running a debugger on site...), and again, the problem disappeared. In this case, we conjecture that the observer effect came from the fact that the debugging code, which was dumping out information about the SharePoint farm, may have initialized or reset the state of the SharePoint API, thus providing a valid return on something that the core application logic needed.
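
To make that second conjecture a bit more concrete, here is a minimal Python sketch of how diagnostic code can hide a bug by initializing state as a side effect. Everything in it is hypothetical: FarmClient, connect, describe, site_count and run are names made up for illustration, and none of this is the real SharePoint API or the code we actually shipped. It only shows the general shape of the problem.

```python
import logging

logger = logging.getLogger("farm_demo")


class FarmClient:
    """Hypothetical stand-in for a stateful API; not a real SharePoint class."""

    def __init__(self):
        self._sites = None  # populated only after connect() has run

    def connect(self):
        # Simulates the initialization step the core logic forgot to perform.
        self._sites = ["intranet", "teams", "archive"]

    def describe(self):
        # Diagnostic helper: connects as a side effect so it has data to report.
        if self._sites is None:
            self.connect()
        return "farm with %d sites" % len(self._sites)

    def site_count(self):
        # Core logic: silently assumes someone has already connected.
        if self._sites is None:
            raise RuntimeError("API not initialized")
        return len(self._sites)


def run(debug):
    client = FarmClient()
    if debug:
        # The "observer": describe() is evaluated before logger.debug() is
        # called, so the state is repaired even if the message is never shown.
        logger.debug(client.describe())
    try:
        return client.site_count()
    except RuntimeError as exc:
        return "failed: %s" % exc
```

In this sketch, run(False) reports the failure while run(True) returns a site count, even though the only difference between the two is a line of logging. The real fix would be for the core logic to call connect() itself, but that is only obvious once you know the diagnostic call is what is doing the initialization.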

So, what can we do about Heisenbugs? Unfortunately, there often is no good answer. Often the solution is an example of the messy side of software engineering in practice. More often than we would like, we have bits of logic in our production systems that are there just because they work, and we don't always know exactly why. While we look at them and say "yuck" (or to use a different metaphor, they "smell"), the cost to figure out what is really going on may not make economic sense (do you want to satisfy your scientific curiosity, or would you rather implement a new feature?). Or if the problem occurs only at a customer site, the process of figuring out why may be just too annoying to the customer. Just as doctors will sometimes treat a rash with cortisone without fully knowing what caused it, software engineers occasionally treat the symptom without explaining the cause...

Monday, November 25, 2013
