So, what does this have to do with software? While the name Heisenbug is very tongue-in-cheek, it describes a very real problem that occurs occasionally, and is actually related to the observer effect: a Heisenbug is a bug that disappears as soon as you start to analyze it. (In my mind, it is the second most annoying type of bug, after the type that occurs only on premises at the million-dollar customer who is so secure that they cannot even send log files.)
Some examples of Heisenbugs that I've seen:
- A number of years ago at Digital Equipment Corporation we created a highly scalable, heavily multi-threaded mail server that supported both X.400 and SMTP/MIME. Somewhere in the code was a subtle threading problem that we presume had to do with locking between threads - it generated an error at only one customer (a major Air Force base). So, to find it, we did what any software engineering group does - we added a lot of logging to tell us exactly where the error was occurring, and what led up to it. The problem went away, we presume because the logging changed the timing of the locking just enough to avoid the race condition. After months of trying to fine-tune the logging to get the information we needed without changing the behavior, we gave up and just ran the system with the logging enabled, and once a day asked the engineer on site to delete the log files...
- More recently, at Axcler, we were fighting with a problem at one of those million-dollar customers, and it was the only place the problem manifested itself. Again, we added more logging (since there was no option of running a debugger on site...), and again, the problem disappeared. In this case, we conjecture that the observer effect came from the debugging code itself: in dumping out information about the SharePoint farm, it may have initialized or reset the state of the SharePoint API, thus providing a valid return on something that the core application logic needed.
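To make the second example a bit more concrete, here is a minimal Python sketch of how diagnostic code can hide a bug by initializing state as a side effect. This is not the actual Axcler code and not the SharePoint API: `FakeFarm`, `describe`, `server_count`, and `run` are all hypothetical names invented for illustration, and the real cause at the customer site remains conjecture.

```python
class FakeFarm:
    """Hypothetical stand-in for a remote API with lazily loaded state."""

    def __init__(self):
        self._servers = None  # nothing is loaded until _load() is called

    def _load(self):
        if self._servers is None:
            self._servers = ["web01", "web02"]  # made-up server names

    def describe(self):
        """Diagnostic helper: loads the farm state as a side effect."""
        self._load()
        return "farm servers: " + ", ".join(self._servers)

    def server_count(self):
        """Core logic with a latent bug: it never calls _load() itself."""
        if self._servers is None:
            raise RuntimeError("farm state not initialized")
        return len(self._servers)


def run(debug_logging):
    """Run the core logic, optionally dumping diagnostics first."""
    farm = FakeFarm()
    log = []
    if debug_logging:
        log.append(farm.describe())  # observing the farm changes its state
    try:
        return farm.server_count()
    except RuntimeError as exc:
        return "error: " + str(exc)


if __name__ == "__main__":
    print("logging off:", run(False))
    print("logging on: ", run(True))
```

With logging off, the core logic hits the uninitialized state and fails; with logging on, the diagnostic dump quietly loads the state first and the failure never appears. The bug is still there in both cases, which is exactly what makes this kind of Heisenbug so hard to pin down.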