Stories

Fault Intolerance

2014-11-02T17:38:00.000-08:00

I recently read an article titled "Fault Intolerance" by Gerard Holzman, in the November issue of the IEEE magazine "Software" - the premise of the article is that more than software that tolerates faults, we need to create software that is INtolerant of faults and aggressively defends itself against faults.

On the way to his premise, he spoke about the push within academia to create software that can be formally proven to successfully satisfy the requirements of a formal specification. Now, I've always been a bit skeptical of the practicality of this objective, but Gerard gave me a bit of ammunition in this with the following observation:

"The most difficult part of a formal specification, and the part that's most often incomplete, is to state clearly what should happen under various error conditions... Especially for complex systems, it can be very hard to forsee all possible off-normal events that a system might encounter in real life...Reality can be remarkably creative when it comes to showing where our imagination falls short" (And I would add end-users to the list...:-)

So, where does that bring us? "Fault intolerance leads to a defensive posture in software design in which nothing is taken for granted."

And finally, he lays down the gauntlet with a standard:

"For well written code, we should expect to see an assertion density of at least 2 percent, meaning that two out of every 100 executable statements are assertions"

The example the author uses throughout the article is software designed to operate semi-autonomous spacecraft such as the Cassini spacecraft (orbiting Saturn) and the Voyager spacecraft (which recently entered interstellar space). Now, most of us are not creating software for such demanding environments, and cannot afford the kind of double and triple redundancy that made these craft so reliable (the stamina of the Voyager is mind-boggling - to continue to operate with no physical modification for 37 years! What earth-bound computer is still operating after 37 years?).

There was, however, a principle that I'd like to highlight and use as a metaphor: most of these systems are designed so that if a truly unique or unhandleable condition occurs, they revert to a safe mode - generally, this means to make sure the solar cells are pointed to the sun so that the batteries can be charged, and the antenna is pointed to earth so that they can get further instructions. It is important to note that the system does not just shut down - it continues to operate in a minimal way that allows for recovery.

So, what does all of this mean to those of us building commercial systems? First, I will argue that the focus of fault intolerance, and such guidelines as an assertion density of 2 percent, still apply. While we may not be dealing with a billion-dollar spacecraft, we are dealing with demanding users who are trying to use our software to get a job done. And while some of the errors we have to deal with come from those very users, many of them come from random events in the universe (gamma radiation, sun spots, quirky configurations of an ancient single-sign-on service that interferes with our perfect software.) Our users expect our software to continue to operate, even if we have to go into safe mode to do so.

So, what does safe mode mean for a commercial system? First, to the extent possible, continue to operate in degraded mode. Second, when you detect an extreme unexpected input or situation, go into safe mode gracefully and safely (e.g. don't write random data to the most critical database in your system - you want to be able to continue to operate after the error conditon is corrected). And third, continue to operate at least enough to tell the user that something is wrong (and in fact, tell them to contact their system administrator, who might not yet be aware of the problem.)

Simplicity, or the art of NOT doing things

2014-03-10T13:27:00.001-07:00

I happen to be particularly enamored of simplicity in art: the uncluttered style of oriental graphic art, the enigma of a Koan, the spare beauty of Haiku and the purity of a cappella voices.

Carrying that aesthetic to our chosen profession, we find an increasing focus on simplicity in software engineering. One of the core principles that is part of the Agile Manifesto reads “Simplicity – the art of maximizing the amount of work not done – is essential”. The lean movement puts additional emphasis on this. Finally, the admonition YAGNI (“You Aren’t Going to Need It”) is at its core an appeal for simplicity.

So, why do we need all of this attention on a simple principle? Because Simplicity is HARD and doesn’t come naturally to most western minds.

As software engineers, we’re paid to write code and we’re good at it, so our instincts (and/or our habits) are to write more of what we’re good at.
Our product owners have very long lists of features and requirements that will make our products richer and more competitive.

But the cost of complexity is insidious (and I’m focused on both aspects of the word: both evil and not easily noticed):

More code paths (one of the measures of complexity) represent more unit tests, more test cases, and more combinations and permutations.
More code paths result in higher entropy: that conditional execution that is so minor and obvious today will not be so obvious and will be forgotten 2 years from now when you’re doing maintenance. The additional code paths greatly increase the chance that the bug fix or enhancement 2 years hence will break something
When complexity finds its way into the user interface, it increases the power and flexibility of the application, but at the expense of usability. At best, the users grumble a bit, at worst, they stop using the product.

So, let me offer a few recommendations:

At the user interface, find a way to collapse a large number of high-level choices into a smaller number. For example, many text processing applications will have menu items for Search, Global Search, Find in Files, Replace, Replace Globally. In the end, those can all be handled by one Find function, keeping the top-level set of choices much smaller and more approachable.

In Axceler’s ControlPoint, we implemented Site-level permissions reports sorted by Site, Site-level permissions reports sorted by User, Comprehensive (detailed) permissions reports sorted by Site, and Comprehensive (detailed) permissions reports sorted by User. With the clarity of hindsight, these would have benefitted from collapsing these into a single high-level Permissions Report, with a sort option, and a level of detail option. (Now, a reality check on this: at some point, pushing too many choices down into detailed options on a function makes that function suffer from the complexity bug, so don’t carry it too far! However, at the more detailed level, you have more choices to manage complexity through techniques such as “Advanced” options so you only see the complexity when you need or want it).

When considering boundary conditions, error conditions, and special cases (these have a way of surfacing in bug reports, so the discussion tends to happen in that context), it is tempting to create special validation or special handling for each unique situation. Now, in some cases, that is appropriate, but you should resist the temptation: often the long term compounding effect of lots of little bits of complexity is greater than the incremental benefit of a uniquely tailored message or handler. My last blog post actually provides a good example of this. Apple chose to handle both “too hot” and “too cold” conditions in the same boundary condition handler. I believe that this was an excellent choice that kept the logic simpler. The only shortcoming (and a minor one at that) was that the message verbiage only acknowledged the most likely of the two possible conditions!

In closing, I’d like to quote a man who bridged the gap between art and technology (namely Leonardo daVinci) “Simplicity is the ultimate sophistication”.

Error messages that don’t make sense

2014-01-25T08:46:00.001-08:00

This time just an amusing observation…

As software engineers, we all struggle with the challenge of writing error messages that are meaningful and helpful, and from time to time, we all stumble into amusing blunders. Even the king of good user interfaces, Apple, can fall victim:

I have an old iPhone 3 that I’ve been using essentially as an iPod, and I left it in my car. One cold, sub-freezing morning recently I tried to start the iPod up, only to get the message:

Temperature

iPhone needs to cool down

before you can use it.

Of course, I ignored the instructions and warmed it up, and the message went away J

(I never knew the iPhone had a temperature sensor in it – apparently it’s not available to app developers…)

How to approach support issues

2013-12-28T19:34:00.000-08:00

Note that this post addresses only some of the issues that arise in large, complex, and long-running support incidents; it generally focuses on the process more than the specific content of a particular issue. From that process perspective, I’ve seen support issues struggle with three broad aspects of the issue:

You get a large amount of information, and you don’t know what bits of information are pertinent.
You don’t get one or several bits of information that you asked the customer for.
The incident drags on interminably, and never seems to get to closure.

Too much information

At one level, you’d rather have too much information than too little. That’s why, when generating log files, we tend to turn on every class of tracing that could possibly have a bearing on the issue. For now, I won’t talk about the process of extracting meaning from huge and noisy log files (there’s a recorded session on that in the training directory at Axceler/Metalogix). Rather, I’m talking about elements of behavior or the environment that might or might not be significant. Two examples:

We were working on a problem with a job that worked properly when run interactively, but not when run as a scheduled job. We noticed that every time the scheduler woke up, it would run through the “is there anything waiting to run” assessment several times. We weren’t sure why, and didn’t know whether to first on trying to figure that out, or start with other aspects of the problem.
Another customer was having uncharacteristically slow performance in a task that looked for permissions granted to users who were no longer in Active Directory. In the course of discussions, it became clear that the customer had many hundreds of active directory domains. We weren’t sure whether that might be related, and an incidental observation.

The art of support is figuring out what bits of the environment, or which unexpected behaviors are worth spending time to explore, and which should be ignored. As you make that judgment, consider a couple of issues:

Initially, think about what makes logical sense. Using the first example above, do you expect that having the scheduler run several times should have anything to do with what looked like a permission problem? If the behavior or element of the environment doesn’t feel like it should be related, then temporarily ignore it and look elsewhere. (But of course, be prepared to come back to this if other factors get ruled out – remember Sherlock Holme’s aphorism: “Once you have eliminated everything that is impossible, whatever is left, however improbable, has to be the explanation”). [ Historical note: we decided that the multiple runs of the scheduler was not the source of the problem. In the end, we determined that it was a relatively low impact configuration problem that had no impact on the significant issue we were trying to solve. ]

Also, ask yourself, if that behavior had anything to do with the problem you are exploring, would you have seen any other symptoms? In the second example, if the number of active directory domains had anything to do with the performance of the problem report, would we have seen this at other customers, or would we see some kind of impact on other reports at this customer. Often this line of reasoning can lead to other questions (e.g. Do we have any other customers with hundreds of Active Directory Domains? How long does it take to display the members of an active directory group in a different domain? Perhaps we should turn on the logging for the code that accesses active directory.) [ Historical note: the number of domains turned out to be the issue – connecting to one domain controller has a modest cost, but multiplied by 300 it adds up significantly. ]

There are two big risks for this line of reasoning:

Beware of asking the customer too many questions – in a long-running support call, you run the risk of exhausting the customer’s good will – save that good will for the most important questions.
Also, as noted below, be very careful to notice whether you are going down a rat hole.

You don’t get all of the information you ask for

The first recommendation is to make sure you really need the information you asked for. Note that in some cases, the customer may not have answered as a simple oversight, but in many cases, the customer has not answered because they don’t know the answer and/or don’t know how to get the answer. Insisting on the answer to a question can use up good will, so before asking again, think about whether it was important or whether it was an “if you happen to know…” question.

If it was important, then ask the question again, but perhaps try to phrase it differently (maybe they didn’t understand the question) and/or tell them how to answer (check this setting, look in this log file, send me a screen shot of this screen…)

Sometimes explaining why you need the information can motivate them to spending a bit of extra effort to find the answer. Note that to some extent, explaining why you are asking will help to engage the customer, help to build a relationship and reduce the erosion of good will.

Finally, sometimes it can help to explain what you are looking for in general terms, and then ask them if they know anything that would shed any light on the problem. For example, in the first example above, we could ask “Can you tell if the timer job itself is running repeatedly”, “Are you aware of any other timer jobs that run repeatedly”. (Note that this approach tends to be a long-shot, but every now and then, a Hail Mary pass connects and wins the ball game J )

The incident drags on and doesn’t reach a conclusion.

First, let me note that when you’re in the middle of a long-running support call, you tend to be focused on the last answer you got, and the next thing to try – it can be very hard to notice that the call has gotten bogged down. This is where it is very good if the support team has an automatic rule for escalating issues that remained open for some period of time – use that escalation to think about whether you should be doing something differently.

Second, note that it is VERY easy for a long-running support case to go down a rat-hole without even noticing it – you ask for a bit of information that would be nice, but not essential, and if the customer has a problem getting that, then finding that tidbit of information can turn into a project of its own, even though it may not be the most important expenditure of time, effort and customer’s good will.

So, when a case drags on, periodically, it is useful to regroup. Do this at least figuratively, but very often literally bringing the players together in a room. In that regrouping, ask the following questions (and write the answers down):

What do we know (include in this things that you think are unimportant, but you’re not 100% sure of the significance of.)
What don’t we know that we wish we did (and don’t limit this to things you think you can get the answer to – as you brainstorm, someone may think of a way to answer the question.)
What have we tried (or questions we’ve asked) and what were the results/answers. (Among other things you want to avoid going over the same ground again with the customer; this step is particularly important if new people have joined the incident team
What haven’t we tried yet (in a sense, this is a variation of what we don’t know…)

Treat this process of collecting information as a sort of brainstorming session. Part of the value of this is to notice:

Things that might be significant, that you hadn’t seen before
Things that you haven’t tried yet (or questions you haven’t asked), but should.
That you have gone down a rathole, or are focused on an aspect of the problem that isn’t the most fruitful avenue.

BTW: at the end of this, you may be inclined to ask the customer a question you’ve asked before, either because you didn’t trust the answer, or because the answer wasn’t complete. As mentioned above, you should avoid this if possible, as it tends to use up good will, but if it is necessary, it can help if you acknowledge you’re repeating the question: “I know we asked this before, but we want to double-check that we heard correctly”, or “... we wanted to double check one aspect of …”

How do we think?

2013-12-02T08:02:00.001-08:00

No, I'm not talking about how our neurons communicate. Nor am I talking about what we think about (there are plenty of blogs on that topic). Rather, I want to discuss the mental process of moving from an elementary idea or an ill-understood problem to a well-structured proposal or solution. Note that I think that we can and should use a very similar process in at least two somewhat different circumstances:

Elaborating on a very high-level, briefly expressed, requirement into a more complete definition that is sufficient for implementation.

Understanding the details and underpinnings of a bug sufficiently that you can define what to do about it.

Even though this blog is directed at engineers, I suggest that we would do well to keep the principles of the scientific method in mind. What do I mean by that? Well, when we look at the incredible simplicity and beauty of formulas like F=ma, E=IR, E=mc**2, it is very tempting to think that these laws sprang fully formed from the brains of the authors. As brilliant as Newton, Ohm, and Einstein were, the reality is much messier than that. First of all, in the broad sweep of history, it took generations to get past the idea of Earth, Air, Fire, and Water to get to the point where we could even contemplate F=ma. (Newton said, "if I have seen far, it is because I stood on the shoulders of giants"). At the individual level, even those geniuses took many years to fully understand what was going on and to fully distill that understanding to a simple and coherent statement of truth.

So, my first message: give yourself permission to do what Newton and Einstein did - i.e. give yourself permission not to completely understand the problem at the beginning of your exploration.

OK, so what does that mean in practice? Just as most modern software development life cycle methodologies build upon the principle of iterative refinement, our thinking process should do the same. Don't try to come to a conclusion too quickly - give yourself permission to more completely understand the problem space, and to get an idea of what is going on and what you even need to observe, before you try to formulate a rigorous experiment and/or to formulate a coherent proposal.

In the language of science and mathematics, you need to do some somewhat more unstructured playing and observing before you can:

Design the experiment (determine what steps/operations you need to perform).

Decide what the independent variables are (these are the things that you want to vary in the process of the experiment - you are trying to determine what the impact of these variables is).

Decide what the dependent variables are (these are the things that you want to make careful observations of during the experiment.

In the end, come to a conclusion about causes and effects and what to do about them.

I've been a bit abstract so far, so let me give some examples, which I will draw from ControlPoint development.

Analyzing a bug

We had 2 or 3 customers complain that ControlPoint was marking many (but not all) of the SharePoint users as having logged on recently, even though they had not. (As an interesting aside - Active Directory maintains this information, and these customers were using this to identify inactive accounts that might need to be shut down – ControlPoint’s behavior was interfering with that...)

Of course, the first step was to confirm that the application really is causing this (that it wasn't something else in the environment - cause and effect can be a slippery thing at times.). We were able to recreate the problem, although not consistently, so we were comfortable that we were at least triggering it. However we didn't yet know why it happened on one server but not another.

So, the next question was whether the application was actually doing this explicitly. Logically, this seemed unlikely - doing this explicitly would mean that we would need to have the credentials for all of those accounts, which we didn't have. And, empirically, we did not find any code that obviously did a logon.

This meant that this was probably a side effect of something else we were doing, but what was that (and was there anything we could do about it?)

At about this time we realized (partly through logic and partly through observation) that anytime that active directory recorded that the user had logged in recently, the event viewer recorded a login event in the security section. This gave us a much more useful (in the sense of being more immediate) way of observing the effect - we had now identified our dependent variable for experimentation. (Note that at this point, we are still exploring, and aren't ready for a rigorous experiment yet, because we haven't identified the independent variables, and don't yet have a hypothesis to test.). This allowed us to confirm something we suspected, that the problem was occurring somewhere in our nightly discovery process. But what part of discovery, and what specific action was causing the effect?

We observed that from the perspective of ControlPoint, there are a number of different user types, that we do different things with - there is the user who is running ControlPoint, there are users who are authorized to run ControlPoint, as a subclass of those, there are business admins, and finally, there are the ordinary users of SharePoint who have been given rights to SharePoint (but not ControlPoint). So, we formed the hypothesis that the type of user may have been what was different between cases where we observed the problem and didn't observe the problem - we now had the independent variable for our experiment, and we were now ready to graduate from relatively unstructured experimenting and observation to create a carefully structured experiment.

The experiment took the following shape: create brand new accounts (since in SharePoint existing accounts can be treated somewhat differently by SharePoint) with the following characteristics:

Farm admin who is also a ControlPoint (ordinary) admin

Farm admin who is also a ControlPoint business admin

Site Collection admin who is a ControlPoint (ordinary) admin

Site Collection admin who is a ControlPoint business admin

User with Full Control who is a ControlPoint business admin

User with Full Control who has no rights to ControlPoint

Having distinct and separate accounts, allowed us to clearly identify the impact of the type of account – in other words, the account type was an independent variable.

We ran discovery, and then observed which accounts triggered a login event in the event viewer log. In the end, this allowed us to determine that it was any user with rights to use ControlPoint, which in turn allowed us to narrow it down to a WindowsIdentity system call, which we were using to determine what groups the user was in so that we could determine what rights that user might be getting from those groups. (It appears that the operating system is doing an implicit impersonation of the user, which in turn amounts to a login...)

Armed with that knowledge, we were able to come up a different mechanism to get the list of groups, and thereby avoid the impersonation/login for each of the users.

Note that the process above unfolded over the course of a couple of weeks – clarity does not come in a flash (even Archimedes’ moment of “Eureka” followed a lot of thinking!)

As a side note: there is another non-technical process that you should use here: give the problem some thought, explore the details and the alternatives, and then intentionally set it aside, ideally at least overnight. Your brain has a remarkable background processor that works on problems while your attention is elsewhere – when you come back to the problem, it is often a lot clearer than when you set it aside.

Responding to requirements

Consider the requirement “We need to be able to duplicate workflows”. As developers, our first objective is to elaborate this into enough detail that we can know what we need to build, that we can ensure that we have the same idea of what is needed as the product owner, and that we can fairly accurately estimate the effort for this. Normally, that means that we need to understand workflows enough to give the product owner some useful background, to ask intelligent targeting questions, and ultimately to propose a set of functionality that delivers useful value to the customer (the goal of agile, of course) while providing value to the company.

Unless you are already an expert in SharePoint workflow, getting to that understanding requires some time exploring, reading, and experimenting – the result of that may be the understanding that the following factors affect the duplicate functionality:

Version of SharePoint (2007, 2010, 2013), and compatibility among versions (e.g. 2013 supports 2010 style workflows, but also supports an entirely new workflow architecture)

Types of workflow (Out of the Box, individually defined, reusable)

The elements of the workflow, i.e. the definition, the association with a particular library, the instances of the workflow, the history list, the task list

Versioning of the workflow definition (i.e. an older instance might still be running with an older version of the workflow rules, and a newer instance running with a revised set of workflow rules)

Given this increasing understanding of workflows, this exercise is not going to culminate in a hypothesis and a rigorous experiment, but it does lead to a step of increasing rigor in understanding and questions. So, at this point, we are ready to more completely articulate what resources/artifacts are involved in a workflow, and what resources/artifacts are shared among different workflows, and therefore need consideration when duplicating workflows. We’re now prepared to have a discussion with the product owner of whether the initial implementation might be limited to 2010 and 2007 style workflows. And we can think about what it means to duplicate a workflow that involves resources that are shared with other workflows.

What is common among those examples?

Recognizing that you won’t always understand the problem space up front

Recognizing that your understanding will improve iteratively

Setting a goal of increasing the precision of your understanding with each iteration.

Heisenbugs

2013-11-25T03:53:00.003-08:00

Before Heisenberg started cooking meth, his namesake Heisenberg was a physicist known for the Uncertainty Principle, which says that you cannot know both the location and the momentum of a sub-atomic particle. This is related to the scientific principle called the Observer Effect, which states that the process of observing a system changes the system under observation. (This holds for systems ranging from sub-atomic particles to sociological studies to reality TV).

So, what does this have to do with software? While the name Heisenbug is very tongue-in-cheek, it describes a very real problem that occurs occasionally, and is actually related to the observer effect: a Heisenbug is a bug that disappears as soon as soon as you start to analyze it. (In my mind, it is the second most annoying type of bug after the type that occurs only on premises at the million dollar customer who is so secure that they cannot even send log files.)

Some examples of Heisenbugs that I've seen:

A number of years ago at Digital Equipment Corporation we created a highly scalable, heavily multi-threaded mail server that supported both x.400 and SMTP/Mime. Somewhere in the code was a subtle threading problem that we presume had to do with locking between threads - it generated an error at only one customer (a major Air Force base). So, to find itm, we did what any software engineering group does - we added a lot of logging to tell us exactly where the error was occurring, and what lead up to it. The problem went away, we presume because the logging changed the timing of the locking just enough to avoid the race condition. After months of trying to fine tune the logging to get the information we needed without changing the behavior, we gave up and just ran the system with the logging enabled, and once a day asked the engineer on site to delete the log files...
More recently, at Axcler, we were fighting with a problem at one of those million-dollar customers and it was the only place the problem manifested itself. Again, we added more logging (since there was no option of running a debugger on site...), and again, the problem disappeared. In this case, we conjecture that the observer effect came from the fact that the debugging code, which was dumping out information about the SarePoint farm, may have initialized or reset the state of the SharePoint API, thus providing a valid return on something that the core application logic needed.

So, what can we do about Hiesenbugs? Unfortunately, there often is no good answer. Often the solution is an example of the messy side of software engineering in practice. More often than we would like, we have bits of logic in our production systems that are there just because they work and we don't always know exactly why. While we look at them and say "yuck" (or to use a different metaphor, they "smell"), the cost to figure out what is really going on may not make economic sense (do you want to satisfy your scientific curiosity, or would you rather implement a new feature?). Or if the problem occurs only at a customer site, the process of figuring out why may be just too annoying to the customer. Just as doctors will sometimes just treat a rash with cortisone without fully knowing what caused it, software engineers occasionally treat the symptom without explaining the cause...

Support

2013-11-12T13:46:00.003-08:00

Yes, the topic of support covers a lot of ground – for this post, I’d like to focus on the things we should do as developers to be ready for support.

First, a reminder: we are imperfect beings in a messy and imperfect world. That means that the wonderful, perfect application you are in the process of creating is going to fail, (unless it never gets used, but of course, that’s a different problem…). To invoke a bit of Murphy’s law, it is going to fail at the worst possible time (probably just before the end of the last sprint of the next release), in the hands of your most critical customer, and in the most subtle and obscure way that can’t be recreated in-house. And of course, because you wrote that code, they’re going to ask you to figure out what’s wrong. So, what’s your strategy? You should be selfish, and set yourself up to handle that call as quickly and as effectively as possible when it comes in. The good news – your selfish objectives match those of your employer, and the end-user!

Let’s remember some of the challenges you and your compatriots in the front-line support team face:

Very often, the complaint from the customer takes the form of “it didn’t work” (to which you want to ask, what DID it do?), “the action never completed”, “I got an error message” (did you write it down or take a screen shot?). Often the user doesn’t remember what they were doing or even when (to be fair, they weren’t expecting to have problems, so weren’t paying careful attention…). In some cases (e.g. for national security agencies, or some military installations) you don’t get to watch the error, and/or they need to “sanitize” log files before sending them. When the error is happening on a production system, usually your options for experimentation are limited; and in some cases, you need to wait as much as a couple of weeks for a deployment window to install patches or even diagnostic code. And finally, every customer’s configuration is a bit different, and some of those differences create unique situations that can be hard to even notice. (Examples: the customer that had 300 Active Directory Domains – for most actions, this was actually fine, until we needed to find a user that was no longer in his original domain –checking all 300 domains took a while… Or the customer that blocked access to some branches of the Active Directory Hierarchy, but only for some accounts.)

When thinking about how to anticipate and respond to issues, keep in mind that you have a number of audiences that need to know about the error:

The end user who was trying to do something – you owe him or her some information that the action or analysis didn’t go fully as planned. You also need to be careful not to give him/her too much information (see the last post mostly talks about this, so I won’t get into this communication in this post).

The IT staff needs more information to help the user with his/her problem, and/or who needs to fix an underlying problem (a corrupt SharePoint site, a missing web part, missing permissions, …)

The operations staff needs to know about some kinds of errors (if the SQL server is down, or network connections aren’t being made, or your application has irretrievably crashed, they need to take action. (Note that in some environments, the operations staff are separate from the IT staff, sometimes in a different location, and sometimes working for a different vendor. One example of this is Microsoft Online Dedicated SharePoint environments – the operations staff are Microsoft data center people, but what I’m calling the IT staff above are customer employees (or contracted employees) with limited access to the operations staff.)

You! (or your fellow developers) who may be called upon to diagnose and work around the problem.

Where do you capture this information?

For the operations staff, often they are using big expensive monitoring software (such as Microsoft’s System Center) to alert them when something needs their attention. Generally, the best way to make information available to these tools is to place information into the Windows Event log. In general, you want to create a separate source for your application, and unique event IDs for each distinct event – this allows the operations staff to create customized rules for alerting on specific events (they might decide that some events are worth waking people up at 3am for, and others can wait for business hours). Also note that in general, you only want to write events that need attention by the operations staff (don’t flood them with more routine functional errors). That suggests that you want to be able to make a distinction between kinds of errors based on the audience that needs to see them – more on this below).

For the customer-oriented IT staff, and for the developers, information could be written to tables in the database, or text files. While there are benefits for each, my own preference is for plain old text files. They are easy to purge, easy to compress, easy to send through email, and there are an assortment of tools for searching and filtering them (remember, grep isn't just for Unix weenies, and even Excel can be used effectively for sorting and filtering). Two recommendations on text files: use an infrastructure that allows you to roll the files over based on dates and/or size (to keep them from growing indefinitely), and give the customer the option to place them on a non-default drive (some environments prohibit writing log files to the C drive, because of concerns they could consume all of the disk space on a critical drive.) In a Microsoft SharePoint Online Dedicated environment, you have to write to ULS logs, which has some advantages (automatic space management, and some nice viewing applications) but some disadvantages (they can be VERY noisy and large…)

OK, what should you put into the log entries for the IT staff and yourself

First of all, you want to give yourself the ability to select the amount and type of detail to give you and support and the customer the freedom to deal with different situations without overwhelming *everyone* with too much information. By default, you normally will capture only errors, but for troubleshooting situations, both SharePoint itself and ControlPoint adopted a two-dimensional model of settings – the first dimension selects the functional area of the application that needs investigation (e.g. active directory interactions, or dealing with metadata, or database interactions), and the second dimension chooses the level of detail (For ControlPoint, we supported None, Error, Warning, Information, Verbose – I would suggest sub-dividing the Error into two categories (Infrastructure Error, and Application Error – this lets you distinguish between the messages that need to go to the event viewer for consumption by the operations team (can’t connect to the SQL server), from the errors that merely elaborate on a report that didn’t run to completion. The important distinction here is who needs to see the error (and therefore where it needs to be written). So, the customer could choose to write infrastructure errors to the event log, but not application errors.

Finally, when you write an entry to the log file, what does it need to include? The underlying principle is “whatever you’re going to need to sort out the problem”. (Of course, when you’re writing the code, you don’t know what the problem will be, or you’d fix it before it happened…)

I would include at least the following:

Date and Time (use the server’s time, not UTC – it will be a lot easier to find entries)

If it is possible, what action the user was engaged in (this is important for two reasons – first, the user many not even remember what s/he was doing, and many different users might be using the system at the same time).

If it is possible, the name of the user (again, if multiple users are using the system, this can be useful to distinguish one users’s activity from another’s)

If multiple processes may be in use, the process ID, and if multiple threads may be used, the thread ID (similar to the user ID, this can be useful to distinguish one user’s activity from another)

As noted above, a unique code for each application-level message (error, warning, information, etc.) – this is important both for external monitoring tools, but can be useful for filtering entries while analyzing a log file.

When an error is returned or an exception thrown from an API call, the error number and/or the message returned.

The stack trace (this can tell you a lot about what the application was doing at the time of the error.)

When appropriate, the actual data being passed to the API or other method. (So, for example, instead of merely reporting an error trying to instantiate a site, list or item object, provide the exact URL of the site, list or item – this can help to expose either subtle logic flaws, or configuration errors. Let’s say that the user configured a URL without a trailing slash when it was needed – the absence of the slash could show up in the data passed to the SPSite constructor. Or if an active directory group name is shown as a long number, it could expose the fact that the code failed to translate the group’s SID into a name.)

Granularity

2013-11-06T19:53:00.003-08:00

In my last post, one of the recommendations I made was to "Keep on Truckin'", in other words keep the application running and accomplishing as much as can be done, even in the face of failures and bad inputs in the environment. In this article, I want to explore that a bit more, and specifically talk about the granularity of the response to errors.

But first, a side observation: Exception handing is one of the more wonderful things that Bjarne Stroustrup incorporated into C++ and that were adopted by derivative languages. While there is an overhead associated with Try..catch blocks, I would argue that the value of extensive and fine-grained use of exception handlers is well worth the relatively small cost. Fine-grained exception handling is a good thing.

Which brings us back to granularity: my advice is to trap, handle, and recover from errors in the environment at the smallest practicable granularity. So, what does that mean? Of course it is hard to say in general ("practical" is so subjective!), but I would say that the trapping of errors and/or the location of the try...catch blocks should be at least at the level of the lowest-level item included explicitly in an analysis, or acted upon by an action.

Some examples might help:

In Outlook, suppose that I've got a .pst file for my email from the 1980's (of course, I converted it from my ccMail client...), and I'm searching for email from Bill Gates. Suppose the report of the standardization committee on the Basic language was somehow corrupted. If the exception handling is at the level of the .pst file, I might not find any messages from Bill, but if the exception handling is at the level of the item, I'd miss the committee report, but I might see the "Thanks for your valuable suggestion" (No, I never got one of those, but if I did, you can be sure I'd hang onto it :-)
In ControlPoint, we implemented a Comprehensive Permissions Report that can report on permissions right down to the individual list item level - in this case, if we are unable to read the permissions on an individual item, we do our best to report on the permissions of the site, the lists, and any other items that we are able to read.

Note that I've expressed this in terms of exception handling, but remember that we still have API methods that report information in return codes, and/or in return values (as in, a null object pointer may be a null result) - which brings me to a pet peeve: there is no good reason to have an "object reference not set to an instance of an object" error - if your code reports that error, you probably weren't checking a return value. And a recommendation to code reviewers: this is one of the things you should check for.

OK, so you catch an exception or detect an null object reference at an appropriate level - now what do you do? It is worth keeping in mind that you actually have three audiences that need to know something about what happened: the end user who initiated the action, the IT staff that needs to know something went wrong, and the developer and/or support team responsible for the application - each of those wants something slightly different:
User

Of course, you owe it to the user to tell him/her that something went wrong, and that the report s/he asked for is incomplete or that an action wasn't carried out as fully as the user expected. If the report or action has a status message, a completion message like "Completed with errors" is a good balance - you tell the user that results are available, but they are not 100%...

But how do you communicate what is missing, e.g. which sites are not included in the tally of storage, which sites are missing the list of running workflows, etc. For reports, the best solution is to either replace the object name with a string like *** Error *** if you are unable to fetch the name of the object, or append a string to the name if you have the name, but not an important property value. Note that you want to make sure that the string is searchable - in a 3,000 page report, you can't rely on manual inspection to pick out the items with the errors - help the computer to do what computers do best!

In actions, if the action has some form of task audit report (always a good idea!), then simply list the objects that weren't processed because of errors in the task audit report

IT Staff / Developer / Support

While a lot of technical detail will cause most end users to glaze over, most IT staff can tolerate technical detail, even if they don't fully understand it. In other words, you can probably combine the information you want to communicate to the IT staff with the information you want to communicate back to yourself (as a quick preview of a later post - remember, what happens if you take a support call: *you* are one of the audiences you're writing this information for!).

We'll talk about what information to capture and where in a later post, but for now, think about the information that the IT staff needs - I will mention a few things to be sure you include:

What action was the user in the middle of (remember, the log file may contain information about a lot of different user's activities - clarifying the action can help to correlate the log entry to the user's action)
What object did the error occur on? (If the user ran a report on the entire farm, s/he won't know which site was the corrupt one - helping the IT manager find that site can enable them to do something about it.
The message returned from the API - often this contains exactly the information that the IT person needs to know. (E.g. to illustrate with a common problem - if the password changes on a service account, often the error message will tell the IT person exactly what the problem is (maybe even without calling vendor support!)

Note that there is a real dichotomy between the information you present to the user, and the information needed by IT - a couple of notes here:

There are a small number of customers in highly secure environments (curiously, these tend to be in financial industries, rather then military or security agencies) for whom display of API-generated messages to end users is a security risk (since the API message may expose information about the system). You may need to accommodate those users. On the other hand, displaying messages from the API directly to users can often help debugging, both in QA, and when support is online with the customer and can watch what is happening on the screen, so it is very tempting. In ControlPoint, we introduced a customer-settable option that specified whether error details should be displayed to the user.
If you want to display detailed technical information to the user, it would be best to prefix it with something like "Please provide the following information to your help desk or IT department" - that gives the end user permission to not to try to understand it.
While it can be useful to display information about what source data gave rise to the error (e.g. the name of the site that the report couldn't report on), be careful that you don't inadvertently expose information. If the user asked for the storage consumed by the HR site collection, telling the user that the site \HR\Layoffs\AlfredENeumann is inaccessible might expose information the user shouldn't know.
In this vein, the model adopted by SharePoint starting in version 2010 of presenting the user with a sparse message containing a correlation ID works very nicely - the message seen by the user has very little information (maybe even too little), but the correlation ID allows the IT department to find the exact error message in the ULS logs - the only thing they missed was to warn the user that they'd actually need the correlation ID before they dismiss the dialog...

Lessons from medical device software

2013-11-04T07:48:00.001-08:00

A number of years ago, I worked with someone whose husband had an automatic, implantable defibrillator to address heart problems.

As a software engineer, my first reaction was "Wow - they had to get that exactly right". As a Software Development manager, my second reaction was to think about what lessons other less critical software projects could take from that. There are at least three that come to mind:

1) Some things need more care than others.

That lesson can be applied at multiple levels - first, some applications are really critical (the automatic defibrillator could kill you if it fails; the navigation system for the Space Shuttle had to be exactly right or the astronauts wouldn't get home again...), but in contrast, a report on what class of users prefer Kleenex over the store brand isn't going to kill anyone if it is off by 10%.

At another level, within an application, some functions are more critical than others. Windows XP used to (I guess it still does for anyone who's running it...) contain a neat defragmenter that started with a very colorful diagram showing you how badly fragmented the disk is, and then proceeded to move sectors around on disk to reduce the fragmentation. Now, if the report was wrong, frankly, most of us would never know, and probably wouldn't care a lot. On the other hand if it lost a sector on disk during the defragmentation, that could corrupt a really critical file (my status report to my bosss, the grant application for my foundation, my resume...). My hope and expectation is that the team creating that defragmenter spent a lot more time and care on the design, code reviews and testing of the defragmenation aspects of the program than they did on the display.

2) Keep on Truckin'

(Google it, if you're not familiar with the phrase...)

Back to the defibrillator - if it encounters an error (lets say the battery level drops momentarily, or the user gets too close to a microwave that temporarily messes up the sensor that detects the heart's contractions), the defibrillator can't exactly display "Error, unexpected input, please restart" - it has to ignore the spurious input, keep running, and wait for the error condition to clear itself (and if possible, clear the condition).

So how does that apply to those of us who are more focused on enterprise software. It turns out that this is one of the hardest lessons for most application developers - the "Error, unexpected input" response is our first reaction, and it does have the advantage of alerting the user to a problem. But for highly available software (that is what you want to build, isn't it?) that's not the best response.

Consider a mail server; while it normally isn't as life-critical as a defibrillator, mail has become a business critical function in today's world. If your mail server encounters an SMTP protocol error when Joe's Garage is trying to send a "your car is ready" message, you don't want the mail server to shut down and wait for Joe to get his act together. Instead, you want it to continue to listen for other email (the announcement that your company just won that million dollar deal, for instance. or the support question from a customer). If the disk defragmenter encounters a bad sector on a write, you'd like it to mark it as bad, and try another one; or if it encounters a bad sector on a read, you want it to put as much of the file together as it can. If the SharePoint search indexer encounters a corrupted site, you want it to continue to crawl all of the other sites, so that you can search for *most* of the content of the farm.

Now, this needs a bit of care, which begins to overlap with my next lesson. Consider the disk defragmenter - if it gets 100 bad sectors in a row, that's probably an indication that the disk or the controller has gone bad in a big way - continuing to try to defragment the disk could very possibly make things a lot worse. If the mail server gets 20 errors in a row trying to send a message to Joe's Garage ("I'll pick it up at 6:00 pm"), it's probably a waste of time trying the 21st time.

In short, if your "Keep on Truckin'" response is to retry an operation, you normally want an upper bound on the retries, or you could spend more time on the retries than on productive work, which in the end would defeat the "Keep on Truckin'" objective. While your at it, let your customer override your default number of retries (because someone is going to want a different threshold than the rest of your customers.)

3) "First do no harm"

(While those words are not actually in the Hippocratic oath, the spirit of it is there...) In more mundane terms, we need to consider the failure mode of our software.

Continuing with the example of the defibrillator - consider what the defibrillator should do when the battery level begins to drop or even if there is a power surge - you don't want the defibrillator to go into defib mode just because the inputs are out of range - you'd rather it go into quiet mode until things become normal.

So, what does this mean for enterprise software? I'll use an example from Axceler's ControlPoint - the product includes a function to move SharePoint sites - it does this by exporting a copy of the site, importing that copy to the new location, and then as an optionally delayed operation, deleting the source site. First of all, if the import of the copy of the file to the destination has an error, we do not want to delete the source file. Similarly, the Move function has an option to take a separate backup before the move - if the backup has an error, then again, we do not want to delete the source file. In all cases, the response to errors (the failure mode) is to do nothing. While this is fairly obvious when stated this way, it can be easy to miss if you don't think about it while building the function.

Finally, at the functional design level, there are some less obvious things that can be done. We implemented the delayed delete of the source partly to provide a validation and recovery option: if a problem occured with the destination, even a subtle one that the code could not detect, then the user has the the time to notice, and has the option to cancel the delete and revert back to the original site.

Who am I, and why this blog

2013-11-04T07:08:00.003-08:00

I've been building software, generally enterprise-level COTS (Commercial Off The Shelf) software for longer than I care to admit. I've done it in some very large organizations (Wang Laboratories, and Digital Equipment Corporation) and some very small, 20 person organizations. I've had the chance to work with some brilliant people who showed me what to do right, and some folks who showed me things NOT to do. As I have tried to impart that learning to my teams, I tend to fall back on two things: stories and metaphors (and in the end, they're not all that different).

So this blog is an opportunity to capture some of those stories, and occasionally metaphors, in the hopes that some folks will find it useful.