Note that this post addresses only some of the issues that
arise in large, complex, and long-running support incidents; it generally focuses on the process more than
the specific content of a particular issue.
From that process perspective, I’ve seen support incidents struggle in three
broad ways:
- You get a large amount of information, and you don’t know what bits of information are pertinent.
- You don’t get one or more of the bits of information that you asked the customer for.
- The incident drags on interminably, and never seems to get to closure.
Too much information
At one level, you’d rather have too much information than
too little. That’s why, when generating
log files, we tend to turn on every class of tracing that could possibly have a
bearing on the issue. For now, I won’t
talk about the process of extracting meaning from huge and noisy log files
(there’s a recorded session on that in the training directory at
Axceler/Metalogix). Rather, I’m talking
about elements of behavior or the environment that might or might not be
significant. Two examples:
- We were working on a problem with a job that worked properly when run interactively, but not when run as a scheduled job. We noticed that every time the scheduler woke up, it would run through the “is there anything waiting to run” assessment several times. We weren’t sure why, and didn’t know whether to focus first on figuring that out, or to start with other aspects of the problem.
- Another customer was seeing uncharacteristically slow performance in a task that looked for permissions granted to users who were no longer in Active Directory. In the course of discussions, it became clear that the customer had many hundreds of Active Directory domains. We weren’t sure whether that might be related, or was just an incidental observation.
The art of support is figuring out which
bits of the environment, or which unexpected behaviors, are worth spending time
to explore, and which should be ignored.
As you make that judgment, consider a couple of issues:
Initially, think about what makes logical sense. Using the first example above, do you expect
that having the scheduler run several times should have anything to do with
what looked like a permission problem?
If the behavior or element of the environment doesn’t feel like it
should be related, then temporarily ignore it and look elsewhere. (But of course, be prepared to come back to
it if other factors get ruled out – remember Sherlock Holmes’s aphorism: “Once
you have eliminated everything that is impossible, whatever is left, however
improbable, has to be the explanation.”)
[ Historical note: we decided that the multiple runs of the scheduler
were not the source of the problem. In
the end, we determined that they were caused by a relatively low-impact configuration
problem that had no bearing on the significant issue we were trying to solve. ]
Also, ask yourself: if that behavior had anything to do with
the problem you are exploring, would you have seen any other symptoms? In the second example, if the number of Active
Directory domains had anything to do with the performance of the problem
report, would we have seen this at other customers, or would we see some kind
of impact on other reports at this customer?
Often this line of reasoning can lead to other questions (e.g. Do we
have any other customers with hundreds of Active Directory domains? How long does it take to display the members
of an Active Directory group in a different domain? Perhaps we should turn on the logging for the
code that accesses Active Directory.)
[ Historical note: the number of domains turned out to be the issue – connecting to
one domain controller has a modest cost, but multiplied by 300 it adds up
significantly. ]
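To make the arithmetic in that historical note concrete, here is a back-of-the-envelope sketch. Only the count of 300 comes from the note; the 2-second cost per connection is an invented figure for illustration, not a measurement from the incident, and the sketch assumes the connections happen one after another.

```python
def total_connection_seconds(domain_count: int, seconds_per_connection: float) -> float:
    """Total connect time if each domain controller is contacted once, in sequence."""
    return domain_count * seconds_per_connection


# Hypothetical per-connection cost; the real figure is not given in this post.
ASSUMED_SECONDS_PER_CONNECTION = 2.0

one_domain = total_connection_seconds(1, ASSUMED_SECONDS_PER_CONNECTION)
many_domains = total_connection_seconds(300, ASSUMED_SECONDS_PER_CONNECTION)

print(f"1 domain: {one_domain:.0f} s")
print(f"300 domains: {many_domains / 60:.0f} min")
```

Under those assumptions, a cost nobody would notice for a single domain becomes minutes of waiting, which is why an apparently incidental detail can deserve a second look.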
There are two big risks for this line of reasoning:
- Beware of asking the customer too many questions – in a long-running support call, you run the risk of exhausting the customer’s good will – save that good will for the most important questions.
- Also, as noted below, be very careful to notice whether you are going down a rat hole.
You don’t get all of the information you ask for
The first recommendation is to make sure you really need the
information you asked for. In
some cases, the customer may have failed to answer through simple oversight, but in
many cases, the customer has not answered because they don’t know
the answer and/or don’t know how to get the answer. Insisting on the answer to a question can use
up good will, so before asking again, think about whether it was important or
whether it was an “if you happen to know…” question.
If it was important, then ask the question
again, but perhaps try to phrase it differently (maybe they didn’t understand
the question) and/or tell them how to answer (check this setting, look in this
log file, send me a screen shot of this screen…).
Sometimes explaining why you need the information can
motivate them to spend a bit of extra effort to find the answer. To some extent, explaining why you
are asking will also help to engage the customer, build a relationship, and
reduce the erosion of good will.
Finally, sometimes it can help to explain what you are
looking for in general terms, and then ask them if they know anything that
would shed any light on the problem. For
example, in the first example above, we could ask “Can you tell if the timer
job itself is running repeatedly?” or “Are you aware of any other timer jobs that
run repeatedly?” (Note that this
approach tends to be a long shot, but every now and then, a Hail Mary pass
connects and wins the ball game.)
The incident drags on and doesn’t reach a conclusion
First, let me note that when you’re in the middle of a
long-running support call, you tend to be focused on the last answer you got,
and the next thing to try – it can be very
hard to notice that the call has gotten bogged down. This is where it helps if the support
team has an automatic rule for escalating issues that have remained open for some
period of time – use that escalation to think about whether you should be doing
something differently.
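As one way to picture such a rule, the sketch below flags any case that has been open longer than a threshold. The `Case` record, its field names, the 14-day threshold, and the sample dates are all hypothetical; substitute whatever your ticketing system actually exposes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical threshold; pick whatever period suits your team.
ESCALATION_AGE = timedelta(days=14)


@dataclass
class Case:
    case_id: str
    opened_at: datetime
    closed: bool = False


def cases_to_escalate(cases, now):
    """Return the open cases whose age exceeds the escalation threshold."""
    return [c for c in cases if not c.closed and now - c.opened_at > ESCALATION_AGE]


# Made-up sample data for illustration only.
now = datetime(2024, 3, 1)
cases = [
    Case("A-1", datetime(2024, 1, 10)),               # open for weeks: flagged
    Case("A-2", datetime(2024, 2, 25)),               # opened recently: not flagged
    Case("A-3", datetime(2024, 1, 5), closed=True),   # old but closed: not flagged
]
flagged = [c.case_id for c in cases_to_escalate(cases, now)]
print(flagged)
```

The point of making the rule automatic is that it does not depend on the person closest to the case noticing that it has stalled.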
Second, note that it is VERY easy for a long-running support
case to go down a rat hole without anyone noticing – you ask for a bit of
information that would be nice, but not essential, and if the customer has a
problem getting it, then finding that tidbit of information can turn into a
project of its own, even though it may not be the most important expenditure of
time, effort and the customer’s good will.
So, when a case drags on, it is useful to
regroup periodically. Do this at least figuratively,
but very often literally, by bringing the players together in a room. In that regrouping, ask the following
questions (and write the answers down):
- What do we know? (Include things that you think are unimportant, but whose significance you’re not 100% sure of.)
- What don’t we know that we wish we did? (Don’t limit this to things you think you can get the answer to – as you brainstorm, someone may think of a way to answer the question.)
- What have we tried (or what questions have we asked), and what were the results or answers? (Among other things, you want to avoid going over the same ground again with the customer; this step is particularly important if new people have joined the incident team.)
- What haven’t we tried yet? (In a sense, this is a variation of what we don’t know…)
Treat this process of collecting information as a sort of
brainstorming session. Part of the value
of this is to notice:
- Things that might be significant, that you hadn’t seen before.
- Things that you haven’t tried yet (or questions you haven’t asked), but should.
- That you have gone down a rat hole, or are focused on an aspect of the problem that isn’t the most fruitful avenue.
BTW: at the end of this, you may be inclined to ask the
customer a question you’ve asked before, either because you didn’t trust the
answer, or because the answer wasn’t complete.
As mentioned above, you should avoid this if possible, as it tends to
use up good will, but if it is necessary, it can help if you acknowledge you’re
repeating the question: “I know we asked this before, but we want to
double-check that we heard correctly”, or “... we wanted to double-check one
aspect of …”.