Note that this post addresses only some of the issues that arise in large, complex, long-running support incidents; it generally focuses on the process more than the specific content of any particular issue. From that process perspective, I’ve seen support teams struggle with three broad aspects:
- You get a large amount of information, and you don’t know what bits of information are pertinent.
- You don’t get some of the information you asked the customer for.
- The incident drags on interminably, and never seems to get to closure.
Too much information
At one level, you’d rather have too much information than too little. That’s why, when generating log files, we tend to turn on every class of tracing that could possibly have a bearing on the issue. For now, I won’t talk about the process of extracting meaning from huge and noisy log files (there’s a recorded session on that in the training directory at Axceler/Metalogix). Rather, I’m talking about elements of behavior or the environment that might or might not be significant. Two examples:
- We were working on a problem with a job that worked properly when run interactively, but not when run as a scheduled job. We noticed that every time the scheduler woke up, it would run through the “is there anything waiting to run” assessment several times. We weren’t sure why, and didn’t know whether to focus first on trying to figure that out, or to start with other aspects of the problem.
- Another customer was having uncharacteristically slow performance in a task that looked for permissions granted to users who were no longer in Active Directory. In the course of discussions, it became clear that the customer had many hundreds of Active Directory domains. We weren’t sure whether that might be related, or merely an incidental observation.
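As an aside, the “turn on every class of tracing that could possibly have a bearing on the issue” step mentioned above can be sketched with Python’s standard logging module. This is only an illustration; the subsystem logger names (“app.scheduler”, “app.security”, “app.directory”) are hypothetical, not names from any real product.

```python
# Sketch: open the floodgates on every subsystem that could plausibly be
# involved, while leaving everything else quiet. Logger names are hypothetical.
import logging

logging.basicConfig(
    level=logging.WARNING,  # quiet by default
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

# When gathering diagnostics, enable verbose tracing for every candidate
# subsystem -- better too much information than too little.
for name in ("app.scheduler", "app.security", "app.directory"):
    logging.getLogger(name).setLevel(logging.DEBUG)

logging.getLogger("app.scheduler").debug("waking up to check for pending jobs")
```

The price of this approach is the huge, noisy log file discussed next; the benefit is that you rarely have to go back to the customer for a second capture.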
The art of support is figuring out which bits of the environment, or which unexpected behaviors, are worth spending time to explore, and which should be ignored. As you make that judgment, consider a couple of issues:
Initially, think about what makes logical sense. Using the first example above, do you expect that having the scheduler run its assessment several times should have anything to do with what looked like a permission problem? If the behavior or element of the environment doesn’t feel like it should be related, then temporarily ignore it and look elsewhere. (But of course, be prepared to come back to it if other factors get ruled out – remember Sherlock Holmes’s aphorism: “When you have eliminated the impossible, whatever remains, however improbable, must be the truth”.) [ Historical note: we decided that the multiple runs of the scheduler were not the source of the problem. In the end, we determined that it was a relatively low-impact configuration problem that had no bearing on the significant issue we were trying to solve. ]
Also, ask yourself: if that behavior had anything to do with the problem you are exploring, would you have seen any other symptoms? In the second example, if the number of Active Directory domains had anything to do with the performance of the report in question, would we have seen this at other customers, or would we see some kind of impact on other reports at this customer? Often this line of reasoning can lead to other questions (e.g., do we have any other customers with hundreds of Active Directory domains? How long does it take to display the members of an Active Directory group in a different domain? Perhaps we should turn on the logging for the code that accesses Active Directory.) [ Historical note: the number of domains turned out to be the issue – connecting to one domain controller has a modest cost, but multiplied by 300 it adds up significantly. ]
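The arithmetic behind that historical note is worth making explicit: a per-domain cost that is invisible for one domain dominates the run when it is paid hundreds of times. The 50 ms figure below is an assumed, illustrative number, not a measurement from the actual incident.

```python
# Hypothetical illustration: a modest per-domain connection cost, multiplied
# by the number of domains. The 50 ms cost is an assumption for illustration.

def total_lookup_seconds(num_domains: int, per_domain_cost_ms: float = 50.0) -> float:
    """Total time spent just reaching one domain controller per domain."""
    return num_domains * per_domain_cost_ms / 1000.0

# One domain: barely noticeable overhead.
print(total_lookup_seconds(1))    # 0.05 seconds
# Three hundred domains: the same "modest" cost becomes 15 seconds of pure
# connection overhead, before the report does any real work.
print(total_lookup_seconds(300))  # 15.0 seconds
```

This kind of back-of-the-envelope multiplication is often enough to decide whether an environmental oddity deserves further investigation.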
There are two big risks with this line of reasoning:
- Beware of asking the customer too many questions – in a long-running support call, you run the risk of exhausting the customer’s good will – save that good will for the most important questions.
- Also, as noted below, be very careful to notice whether you are going down a rat hole.
You don’t get all of the information you ask for
The first recommendation is to make sure you really need the information you asked for. Note that in some cases the customer may have skipped a question through simple oversight, but in many cases the customer has not answered because they don’t know the answer and/or don’t know how to get it. Insisting on an answer can use up good will, so before asking again, think about whether the question was important or merely an “if you happen to know…” question.
If it was important, then ask the question again, but perhaps try to phrase it differently (maybe they didn’t understand the question) and/or tell them how to answer (check this setting, look in this log file, send me a screenshot of this screen…).
Sometimes explaining why you need the information can motivate them to spend a bit of extra effort to find the answer. Note that to some extent, explaining why you are asking will also help to engage the customer, help to build a relationship, and reduce the erosion of good will.
Finally, sometimes it can help to explain what you are looking for in general terms, and then ask them whether they know anything that would shed light on the problem. For example, in the first example above, we could ask “Can you tell if the timer job itself is running repeatedly?” or “Are you aware of any other timer jobs that run repeatedly?” (Note that this approach tends to be a long shot, but every now and then a Hail Mary pass connects and wins the ball game.)
The incident drags on and doesn’t reach a conclusion
First, let me note that when you’re in the middle of a long-running support call, you tend to be focused on the last answer you got and the next thing to try – it can be very hard to notice that the call has gotten bogged down. This is where it is very helpful if the support team has an automatic rule for escalating issues that remain open beyond some period of time – use that escalation as a prompt to think about whether you should be doing something differently.
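An age-based escalation rule of this kind can be sketched in a few lines. Everything here is hypothetical – the incident record shape and the 14-day threshold are illustrative assumptions, not a description of any particular ticketing system.

```python
# Minimal sketch of an automatic age-based escalation check, assuming a
# hypothetical incident record with an opened date and an escalation flag.
# The 14-day threshold is an illustrative choice, not a recommendation.
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Incident:
    opened: date
    escalated: bool = False

def needs_escalation(incident: Incident, today: date, max_age_days: int = 14) -> bool:
    """Flag incidents that have stayed open past the age threshold."""
    age = today - incident.opened
    return (not incident.escalated) and age > timedelta(days=max_age_days)

inc = Incident(opened=date(2024, 1, 1))
print(needs_escalation(inc, today=date(2024, 1, 20)))  # True: 19 days old
print(needs_escalation(inc, today=date(2024, 1, 10)))  # False: only 9 days
```

The specific mechanism matters less than having one: the point is that something other than the engineer’s own attention notices when a case has been open too long.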
Second, note that it is VERY easy for a long-running support case to go down a rat hole without anyone even noticing – you ask for a bit of information that would be nice but not essential, and if the customer has trouble getting it, then finding that tidbit of information can turn into a project of its own, even though it may not be the most important expenditure of time, effort, and the customer’s good will.
So, when a case drags on, it is useful to regroup periodically. Do this at least figuratively, and very often literally, by bringing the players together in a room. In that regrouping, ask the following questions (and write the answers down):
- What do we know? (Include things that you suspect are unimportant, but whose significance you’re not 100% sure of.)
- What don’t we know that we wish we did? (Don’t limit this to things you think you can get the answer to – as you brainstorm, someone may think of a way to answer the question.)
- What have we tried (or what questions have we asked), and what were the results/answers? (Among other things, you want to avoid going over the same ground again with the customer; this step is particularly important if new people have joined the incident team.)
- What haven’t we tried yet? (In a sense, this is a variation of what we don’t know…)
Treat this process of collecting information as a sort of brainstorming session. Part of the value of this is to notice:
- Things that might be significant, that you hadn’t seen before
- Things that you haven’t tried yet (or questions you haven’t asked), but should.
- That you have gone down a rat hole, or are focused on an aspect of the problem that isn’t the most fruitful avenue.
BTW: at the end of this, you may be inclined to ask the customer a question you’ve asked before, either because you didn’t trust the answer or because the answer wasn’t complete. As mentioned above, you should avoid this if possible, as it tends to use up good will, but if it is necessary, it can help to acknowledge that you’re repeating the question: “I know we asked this before, but we want to double-check that we heard correctly”, or “… we wanted to double-check one aspect of …”