Stories: December 2013

Note that this post addresses only some of the issues that arise in large, complex, and long-running support incidents; it generally focuses on the process more than the specific content of a particular issue. From that process perspective, I’ve seen support issues struggle with three broad aspects of the issue:

You get a large amount of information, and you don’t know what bits of information are pertinent.
You don’t get one or several bits of information that you asked the customer for.
The incident drags on interminably, and never seems to get to closure.

Too much information

At one level, you’d rather have too much information than too little. That’s why, when generating log files, we tend to turn on every class of tracing that could possibly have a bearing on the issue. For now, I won’t talk about the process of extracting meaning from huge and noisy log files (there’s a recorded session on that in the training directory at Axceler/Metalogix). Rather, I’m talking about elements of behavior or the environment that might or might not be significant. Two examples:

We were working on a problem with a job that worked properly when run interactively, but not when run as a scheduled job. We noticed that every time the scheduler woke up, it would run through the “is there anything waiting to run” assessment several times. We weren’t sure why, and didn’t know whether to focus first on figuring that out, or to start with other aspects of the problem.
Another customer was having uncharacteristically slow performance in a task that looked for permissions granted to users who were no longer in Active Directory. In the course of discussions, it became clear that the customer had many hundreds of Active Directory domains. We weren’t sure whether that might be related, or was just an incidental observation.

The art of support is figuring out which bits of the environment, or which unexpected behaviors, are worth spending time to explore, and which should be ignored. As you make that judgment, consider a couple of issues:

Initially, think about what makes logical sense. Using the first example above, do you expect that having the scheduler run several times should have anything to do with what looked like a permission problem? If the behavior or element of the environment doesn’t feel like it should be related, then temporarily ignore it and look elsewhere. (But of course, be prepared to come back to this if other factors get ruled out – remember Sherlock Holmes’s aphorism: “Once you have eliminated everything that is impossible, whatever is left, however improbable, has to be the explanation”.) [ Historical note: we decided that the multiple runs of the scheduler were not the source of the problem. In the end, we determined that it was a relatively low-impact configuration problem that had no bearing on the significant issue we were trying to solve. ]

Also, ask yourself, if that behavior had anything to do with the problem you are exploring, would you have seen any other symptoms? In the second example, if the number of Active Directory domains had anything to do with the performance of the problem report, would we have seen this at other customers, or would we see some kind of impact on other reports at this customer? Often this line of reasoning can lead to other questions (e.g. Do we have any other customers with hundreds of Active Directory domains? How long does it take to display the members of an Active Directory group in a different domain? Perhaps we should turn on the logging for the code that accesses Active Directory.) [ Historical note: the number of domains turned out to be the issue – connecting to one domain controller has a modest cost, but multiplied by 300 it adds up significantly. ]
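The arithmetic behind that historical note can be sketched as follows. This is only an illustration: the count of roughly 300 domains comes from the case, but the per-connection cost of half a second is an invented number, not a measurement from the incident, and the function name is hypothetical.

```python
def total_connection_cost(domains: int, seconds_per_connection: float) -> float:
    """Total time spent connecting when each domain is contacted once, in sequence."""
    return domains * seconds_per_connection


# Hypothetical per-connection cost; not measured in the actual incident.
ASSUMED_SECONDS_PER_CONNECTION = 0.5

for domains in (1, 10, 300):
    total = total_connection_cost(domains, ASSUMED_SECONDS_PER_CONNECTION)
    print(f"{domains:>3} domains: {total:.1f} s spent connecting")
```

A cost that is invisible with one domain becomes minutes of waiting with hundreds, which is why the domain count deserved a second look.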

There are two big risks for this line of reasoning:

Beware of asking the customer too many questions – in a long-running support call, you run the risk of exhausting the customer’s good will – save that good will for the most important questions.
Also, as noted below, be very careful to notice whether you are going down a rat hole.

You don’t get all of the information you ask for

The first recommendation is to make sure you really need the information you asked for. Note that in some cases, the customer may not have answered as a simple oversight, but in many cases, the customer has not answered because they don’t know the answer and/or don’t know how to get the answer. Insisting on the answer to a question can use up good will, so before asking again, think about whether it was important or whether it was an “if you happen to know…” question.

If it was important, then ask the question again, but perhaps try to phrase it differently (maybe they didn’t understand the question) and/or tell them how to answer (check this setting, look in this log file, send me a screen shot of this screen…)

Sometimes explaining why you need the information can motivate them to spend a bit of extra effort to find the answer. Note that to some extent, explaining why you are asking will help to engage the customer, help to build a relationship and reduce the erosion of good will.

Finally, sometimes it can help to explain what you are looking for in general terms, and then ask them if they know anything that would shed any light on the problem. For example, in the first example above, we could ask “Can you tell if the timer job itself is running repeatedly?” or “Are you aware of any other timer jobs that run repeatedly?” (Note that this approach tends to be a long shot, but every now and then, a Hail Mary pass connects and wins the ball game ☺)

The incident drags on and doesn’t reach a conclusion.

First, let me note that when you’re in the middle of a long-running support call, you tend to be focused on the last answer you got, and the next thing to try – it can be very hard to notice that the call has gotten bogged down. This is where it is very good if the support team has an automatic rule for escalating issues that remained open for some period of time – use that escalation to think about whether you should be doing something differently.
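An automatic escalation rule of that kind can be very simple. The following is a minimal sketch, not a description of any particular ticketing system: the `Case` class, the case identifiers, and the 14-day threshold are all assumptions made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical threshold; pick whatever suits your team.
ESCALATION_AGE = timedelta(days=14)


@dataclass
class Case:
    case_id: str
    opened: datetime
    closed: bool = False


def cases_to_escalate(cases, now, max_age=ESCALATION_AGE):
    """Return the open cases that have been open longer than max_age."""
    return [c for c in cases if not c.closed and now - c.opened > max_age]


# Invented sample data for illustration only.
now = datetime(2013, 12, 28)
cases = [
    Case("A-1", datetime(2013, 12, 1)),
    Case("A-2", datetime(2013, 12, 20)),
    Case("A-3", datetime(2013, 11, 1), closed=True),
]

for case in cases_to_escalate(cases, now):
    age_days = (now - case.opened).days
    print(f"Case {case.case_id} has been open {age_days} days: time to regroup")
```

The point of the rule is not the threshold itself but the forced pause: it prompts someone to step back from the last answer and ask whether the approach should change.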

Second, note that it is VERY easy for a long-running support case to go down a rat-hole without even noticing it – you ask for a bit of information that would be nice, but not essential, and if the customer has a problem getting that, then finding that tidbit of information can turn into a project of its own, even though it may not be the most important expenditure of time, effort and customer’s good will.

So, when a case drags on, it is useful to regroup periodically. Do this at least figuratively, but very often literally, by bringing the players together in a room. In that regrouping, ask the following questions (and write the answers down):

What do we know (include in this things that you think are unimportant, but you’re not 100% sure of the significance of.)
What don’t we know that we wish we did (and don’t limit this to things you think you can get the answer to – as you brainstorm, someone may think of a way to answer the question.)
What have we tried (or questions we’ve asked) and what were the results/answers. (Among other things you want to avoid going over the same ground again with the customer; this step is particularly important if new people have joined the incident team.)
What haven’t we tried yet (in a sense, this is a variation of what we don’t know…)

Treat this process of collecting information as a sort of brainstorming session. Part of the value of this is to notice:

Things that might be significant, that you hadn’t seen before
Things that you haven’t tried yet (or questions you haven’t asked), but should.
That you have gone down a rat hole, or are focused on an aspect of the problem that isn’t the most fruitful avenue.

BTW: at the end of this, you may be inclined to ask the customer a question you’ve asked before, either because you didn’t trust the answer, or because the answer wasn’t complete. As mentioned above, you should avoid this if possible, as it tends to use up good will, but if it is necessary, it can help if you acknowledge you’re repeating the question: “I know we asked this before, but we want to double-check that we heard correctly”, or “... we wanted to double check one aspect of …”

No, I'm not talking about how our neurons communicate. Nor am I talking about what we think about (there are plenty of blogs on that topic). Rather, I want to discuss the mental process of moving from an elementary idea or an ill-understood problem to a well-structured proposal or solution. Note that I think that we can and should use a very similar process in at least two somewhat different circumstances:

Elaborating on a very high-level, briefly expressed, requirement into a more complete definition that is sufficient for implementation.

Understanding the details and underpinnings of a bug sufficiently that you can define what to do about it.

Even though this blog is directed at engineers, I suggest that we would do well to keep the principles of the scientific method in mind. What do I mean by that? Well, when we look at the incredible simplicity and beauty of formulas like F=ma, E=IR, E=mc², it is very tempting to think that these laws sprang fully formed from the brains of the authors. As brilliant as Newton, Ohm, and Einstein were, the reality is much messier than that. First of all, in the broad sweep of history, it took generations to get past the idea of Earth, Air, Fire, and Water to get to the point where we could even contemplate F=ma. (Newton said, "If I have seen further it is by standing on the shoulders of giants".) At the individual level, even those geniuses took many years to fully understand what was going on and to fully distill that understanding to a simple and coherent statement of truth.

So, my first message: give yourself permission to do what Newton and Einstein did - i.e. give yourself permission not to completely understand the problem at the beginning of your exploration.

OK, so what does that mean in practice? Just as most modern software development life cycle methodologies build upon the principle of iterative refinement, our thinking process should do the same. Don't try to come to a conclusion too quickly - give yourself permission to more completely understand the problem space, and to get an idea of what is going on and what you even need to observe, before you try to formulate a rigorous experiment and/or to formulate a coherent proposal.

In the language of science and mathematics, you need to do some somewhat more unstructured playing and observing before you can:

Design the experiment (determine what steps/operations you need to perform).

Decide what the independent variables are (these are the things that you want to vary in the process of the experiment - you are trying to determine what the impact of these variables is).

Decide what the dependent variables are (these are the things that you want to make careful observations of during the experiment).

In the end, come to a conclusion about causes and effects and what to do about them.

I've been a bit abstract so far, so let me give some examples, which I will draw from ControlPoint development.

Analyzing a bug

We had 2 or 3 customers complain that ControlPoint was marking many (but not all) of the SharePoint users as having logged on recently, even though they had not. (As an interesting aside - Active Directory maintains this information, and these customers were using this to identify inactive accounts that might need to be shut down – ControlPoint’s behavior was interfering with that...)

Of course, the first step was to confirm that the application really was causing this (that it wasn't something else in the environment - cause and effect can be a slippery thing at times). We were able to recreate the problem, although not consistently, so we were comfortable that we were at least triggering it. However, we didn't yet know why it happened on one server but not another.

So, the next question was whether the application was actually doing this explicitly. Logically, this seemed unlikely - doing this explicitly would mean that we would need to have the credentials for all of those accounts, which we didn't have. And, empirically, we did not find any code that obviously did a logon.

This meant that this was probably a side effect of something else we were doing, but what was that (and was there anything we could do about it?)

At about this time we realized (partly through logic and partly through observation) that any time that Active Directory recorded that the user had logged in recently, the event viewer recorded a login event in the security section. This gave us a much more useful (in the sense of being more immediate) way of observing the effect - we had now identified our dependent variable for experimentation. (Note that at this point, we are still exploring, and aren't ready for a rigorous experiment yet, because we haven't identified the independent variables, and don't yet have a hypothesis to test.) This allowed us to confirm something we suspected, that the problem was occurring somewhere in our nightly discovery process. But what part of discovery, and what specific action was causing the effect?

We observed that from the perspective of ControlPoint, there are a number of different user types, that we do different things with - there is the user who is running ControlPoint, there are users who are authorized to run ControlPoint, as a subclass of those, there are business admins, and finally, there are the ordinary users of SharePoint who have been given rights to SharePoint (but not ControlPoint). So, we formed the hypothesis that the type of user may have been what was different between cases where we observed the problem and didn't observe the problem - we now had the independent variable for our experiment, and we were now ready to graduate from relatively unstructured experimenting and observation to create a carefully structured experiment.

The experiment took the following shape: create brand new accounts (since existing accounts can be treated somewhat differently by SharePoint) with the following characteristics:

Farm admin who is also a ControlPoint (ordinary) admin

Farm admin who is also a ControlPoint business admin

Site Collection admin who is a ControlPoint (ordinary) admin

Site Collection admin who is a ControlPoint business admin

User with Full Control who is a ControlPoint business admin

User with Full Control who has no rights to ControlPoint

Having distinct and separate accounts allowed us to clearly identify the impact of the type of account – in other words, the account type was an independent variable.

We ran discovery, and then observed which accounts triggered a login event in the event viewer log. In the end, this allowed us to determine that it was any user with rights to use ControlPoint, which in turn allowed us to narrow it down to a WindowsIdentity system call, which we were using to determine what groups the user was in so that we could determine what rights that user might be getting from those groups. (It appears that the operating system is doing an implicit impersonation of the user, which in turn amounts to a login...)
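The reasoning in that experiment can be sketched in a few lines. The account rows below are the six test accounts listed above, split into two candidate independent variables (SharePoint role and ControlPoint role). The third column is the dependent variable; its values are reconstructed from the conclusion stated above rather than copied from the real event log, and the function names are hypothetical.

```python
from collections import defaultdict

# (SharePoint role, ControlPoint role, login event observed?)
# The last column is reconstructed from the stated conclusion,
# not taken from the actual event viewer log.
observations = [
    ("farm admin", "admin", True),
    ("farm admin", "business admin", True),
    ("site collection admin", "admin", True),
    ("site collection admin", "business admin", True),
    ("full control", "business admin", True),
    ("full control", "none", False),
]


def outcomes_by_value(rows, column):
    """Group the observed outcomes by the value of one independent variable."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[column]].add(row[2])
    return dict(groups)


def explains_outcome(rows, column):
    """True if every value of the variable always gives the same outcome,
    and the outcome differs between at least two values."""
    groups = outcomes_by_value(rows, column)
    consistent = all(len(outcomes) == 1 for outcomes in groups.values())
    varies = len(set().union(*groups.values())) > 1
    return consistent and varies


print("SharePoint role explains logins:  ", explains_outcome(observations, 0))
print("ControlPoint role explains logins:", explains_outcome(observations, 1))
```

With these rows, the SharePoint role does not separate the cases (a Full Control user appears on both sides), while the ControlPoint role does, which is what pointed the investigation at code that runs only for users with ControlPoint rights.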

Armed with that knowledge, we were able to come up with a different mechanism to get the list of groups, and thereby avoid the impersonation/login for each of the users.

Note that the process above unfolded over the course of a couple of weeks – clarity does not come in a flash (even Archimedes’ moment of “Eureka” followed a lot of thinking!)

As a side note: there is another non-technical process that you should use here: give the problem some thought, explore the details and the alternatives, and then intentionally set it aside, ideally at least overnight. Your brain has a remarkable background processor that works on problems while your attention is elsewhere – when you come back to the problem, it is often a lot clearer than when you set it aside.

Responding to requirements

Consider the requirement “We need to be able to duplicate workflows”. As developers, our first objective is to elaborate this into enough detail that we can know what we need to build, that we can ensure that we have the same idea of what is needed as the product owner, and that we can fairly accurately estimate the effort for this. Normally, that means that we need to understand workflows enough to give the product owner some useful background, to ask intelligent, targeted questions, and ultimately to propose a set of functionality that delivers useful value to the customer (the goal of agile, of course) while providing value to the company.

Unless you are already an expert in SharePoint workflow, getting to that understanding requires some time exploring, reading, and experimenting – the result of that may be the understanding that the following factors affect the duplicate functionality:

Version of SharePoint (2007, 2010, 2013), and compatibility among versions (e.g. 2013 supports 2010 style workflows, but also supports an entirely new workflow architecture)

Types of workflow (Out of the Box, individually defined, reusable)

The elements of the workflow, i.e. the definition, the association with a particular library, the instances of the workflow, the history list, the task list

Versioning of the workflow definition (i.e. an older instance might still be running with an older version of the workflow rules, and a newer instance running with a revised set of workflow rules)

Given this increasing understanding of workflows, this exercise is not going to culminate in a hypothesis and a rigorous experiment, but it does lead to a step of increasing rigor in understanding and questions. So, at this point, we are ready to more completely articulate what resources/artifacts are involved in a workflow, and what resources/artifacts are shared among different workflows, and therefore need consideration when duplicating workflows. We’re now prepared to have a discussion with the product owner of whether the initial implementation might be limited to 2010 and 2007 style workflows. And we can think about what it means to duplicate a workflow that involves resources that are shared with other workflows.
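One way to make the "shared resources" question concrete is to write down which artifacts each workflow uses and look for overlaps. The sketch below is purely illustrative: the artifact kinds (definition, history list, task list) come from the list above, but the workflow names, list names, and the `shared_artifacts` function are invented for the example and do not reflect the SharePoint object model.

```python
from collections import defaultdict

# Hypothetical workflows and the artifacts each one uses.
workflows = {
    "Approve invoice": {
        "definition: Approve invoice",
        "history list: Workflow History",
        "task list: Tasks",
    },
    "Review contract": {
        "definition: Review contract",
        "history list: Workflow History",
        "task list: Contract Tasks",
    },
}


def shared_artifacts(workflows):
    """Map each artifact used by more than one workflow to the workflows that use it."""
    users = defaultdict(set)
    for name, artifacts in workflows.items():
        for artifact in artifacts:
            users[artifact].add(name)
    return {artifact: names for artifact, names in users.items() if len(names) > 1}


for artifact, names in shared_artifacts(workflows).items():
    print(f"{artifact} is shared by {sorted(names)}")
```

Any artifact that shows up here is one where "duplicate the workflow" needs a decision: copy the shared artifact, point the duplicate at the existing one, or ask the product owner.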

What is common among those examples?

Recognizing that you won’t always understand the problem space up front

Recognizing that your understanding will improve iteratively

Setting a goal of increasing the precision of your understanding with each iteration.

Stories

Saturday, December 28, 2013

How to approach support issues

Monday, December 2, 2013

How do we think?