Yes, the topic of support covers a lot of ground – for this
post, I’d like to focus on the things we should do as developers to be ready
for support.
First, a reminder: we are imperfect beings in a messy and
imperfect world. That means that the
wonderful, perfect application you are in the process of creating is going to
fail (unless it never gets used, but of course, that’s a different problem…). To invoke a bit of Murphy’s law, it is going
to fail at the worst possible time (probably just before the end of the last
sprint of the next release), in the hands of your most critical customer, and
in the most subtle and obscure way that can’t be recreated in-house. And of course, because you wrote that code,
they’re going to ask you to figure out what’s wrong. So, what’s your strategy? You should be selfish, and set yourself up to
handle that call as quickly and as effectively as possible when it comes in. The good news – your selfish objectives match
those of your employer, and the end-user!
Let’s remember some of the challenges you and your
compatriots in the front-line support team face:
Very often, the complaint from the customer takes the form
of “it didn’t work” (to which you want to ask, what DID it do?), “the action
never completed”, “I got an error message” (did you write it down or take a
screen shot?). Often the user doesn’t remember
what they were doing or even when (to be fair, they weren’t expecting to have
problems, so weren’t paying careful attention…). In some cases (e.g. for national security
agencies, or some military installations) you don’t get to watch the error,
and/or they need to “sanitize” log files before sending them. When the error is happening on a production
system, usually your options for experimentation are limited; and in some
cases, you need to wait as much as a couple of weeks for a deployment window to
install patches or even diagnostic code.
And finally, every customer’s configuration is a bit different, and some
of those differences create unique situations that can be hard to even
notice. (Examples: the customer that had
300 Active Directory Domains – for most actions, this was actually fine, until
we needed to find a user that was no longer in his original domain – checking
all 300 domains took a while… Or the
customer that blocked access to some branches of the Active Directory
Hierarchy, but only for some accounts.)
When thinking about how to anticipate and respond to issues,
keep in mind that you have a number of audiences that need to know about the
error:
- The end user who was trying to do something – you owe him or her some information that the action or analysis didn’t go fully as planned. You also need to be careful not to give him/her too much information (the last post mostly talks about this, so I won’t get into that communication here).
- The IT staff needs more information to help the user with his/her problem, and/or to fix an underlying problem (a corrupt SharePoint site, a missing web part, missing permissions, …)
- The operations staff needs to know about some kinds of errors (if the SQL server is down, or network connections aren’t being made, or your application has irretrievably crashed, they need to take action). Note that in some environments, the operations staff are separate from the IT staff, sometimes in a different location, and sometimes working for a different vendor. One example of this is Microsoft Online Dedicated SharePoint environments – the operations staff are Microsoft data center people, but what I’m calling the IT staff above are customer employees (or contracted employees) with limited access to the operations staff.
- You! (or your fellow developers) who may be called upon to diagnose and work around the problem.
Where do you capture this information?
For the operations staff, often they are using big expensive
monitoring software (such as Microsoft’s System Center) to alert them when
something needs their attention. Generally, the best way to make information
available to these tools is to place information into the Windows Event
log. In general, you want to create a
separate source for your application, and unique event IDs for each distinct
event – this allows the operations staff to create customized rules for
alerting on specific events (they might decide that some events are worth
waking people up at 3am for, and others can wait for business hours). Also note that in general, you only want to
write events that need attention by the operations staff (don’t flood them with
more routine functional errors). That suggests
that you want to be able to make a distinction between kinds of errors based on
the audience that needs to see them (more on this below).
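To make the "separate source, unique event IDs, only what operations needs" idea concrete, here is a minimal Python sketch. Everything named in it is my own invention for illustration: the source name `MyApplication`, the event IDs, and the two plain lists standing in for the Windows Event log and a text log file. It also assumes that operations events get copied into the text log too, which is a design choice rather than a rule.

```python
from dataclasses import dataclass
from enum import Enum


class Audience(Enum):
    OPERATIONS = "operations"   # needs action from the operations staff
    SUPPORT = "support"         # IT staff and developers reading log files


@dataclass(frozen=True)
class EventDef:
    event_id: int       # unique, stable ID that monitoring rules can match on
    audience: Audience
    summary: str


# Hypothetical catalog: the IDs and descriptions are illustrative only.
SQL_UNREACHABLE = EventDef(1001, Audience.OPERATIONS, "Cannot connect to the SQL server")
REPORT_INCOMPLETE = EventDef(2001, Audience.SUPPORT, "Report did not run to completion")

EVENT_SOURCE = "MyApplication"  # assumed name for the application's own event source


def route(event, detail, event_log, text_log):
    """Write every event to the text log; copy only operations events to the event log."""
    line = f"{EVENT_SOURCE} [{event.event_id}] {event.summary}: {detail}"
    text_log.append(line)
    if event.audience is Audience.OPERATIONS:
        event_log.append(line)


event_log, text_log = [], []   # stand-ins for the Windows Event log and a log file
route(SQL_UNREACHABLE, "timeout after 30s", event_log, text_log)
route(REPORT_INCOMPLETE, "3 of 40 sites skipped", event_log, text_log)
```

Keeping the catalog in one place means the list of event IDs can be handed to the operations staff as-is when they write their alerting rules.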
For the customer-oriented IT staff, and for the developers,
information could be written to tables in the database, or text files. While there are benefits for each, my own preference
is for plain old text files. They are
easy to purge, easy to compress, easy to send through email, and there is an
assortment of tools for searching and filtering them (remember, grep isn't just
for Unix weenies, and even Excel can be used effectively for sorting and
filtering). Two recommendations on text
files: use an infrastructure that allows you to roll the files over based on
dates and/or size (to keep them from growing indefinitely), and give the
customer the option to place them on a non-default drive (some environments
prohibit writing log files to the C drive, because of concerns they could
consume all of the disk space on a critical drive.) In a Microsoft SharePoint Online Dedicated
environment, you have to write to ULS logs, which have some advantages
(automatic space management, and some nice viewing applications) but some
disadvantages (they can be VERY noisy and large…)
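As a rough illustration of the two text-file recommendations (roll the files over, and let the customer pick the location), here is a sketch using Python's standard `logging.handlers.RotatingFileHandler`. The logger name, file name, and size limits are assumptions of mine; in a real deployment the directory would come from customer configuration rather than a temporary folder.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def build_file_logger(log_dir, max_bytes=1_000_000, backups=5):
    """Return a logger that writes to log_dir and rolls over at max_bytes."""
    os.makedirs(log_dir, exist_ok=True)
    logger = logging.getLogger("myapp.support")   # hypothetical logger name
    logger.setLevel(logging.ERROR)                # by default, capture only errors
    logger.propagate = False
    handler = RotatingFileHandler(
        os.path.join(log_dir, "myapp.log"),
        maxBytes=max_bytes,
        backupCount=backups,     # older files beyond this count are deleted
        encoding="utf-8",
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


# A temporary directory is used here so the sketch runs anywhere; a customer
# would typically point this at a non-system drive.
log_dir = tempfile.mkdtemp()
logger = build_file_logger(log_dir, max_bytes=500, backups=3)
for i in range(50):
    logger.error("sample error number %d", i)
```

For rollover by date rather than size, the same module offers `TimedRotatingFileHandler`, which can be swapped in with a `when` argument such as `"midnight"`.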
OK, what should you put into the log entries for the IT
staff and yourself?
First of all, you want the ability to select the amount and type of detail
that gets logged, so that you, support, and the customer have the freedom to
deal with different situations without overwhelming *everyone*
with too much information. By default,
you normally will capture only errors, but for troubleshooting situations, both
SharePoint itself and ControlPoint adopted a two-dimensional model of settings –
the first dimension selects the functional area of the application that needs
investigation (e.g. active directory interactions, or dealing with metadata, or
database interactions), and the second dimension chooses the level of detail
(for ControlPoint, we supported None, Error, Warning, Information, Verbose). I would
suggest sub-dividing Error into two categories, Infrastructure Error and Application
Error: this lets you distinguish the messages that need to go to the event
viewer for consumption by the operations team (can’t connect to the SQL
server) from the errors that merely elaborate on a report that didn’t run to
completion. The important distinction
here is who needs to see the error (and therefore where it needs to be written). So, the customer could choose to write
infrastructure errors to the event log, but not application errors.
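Here is one way the two-dimensional settings could look in code. This is a sketch under my own assumptions, not how SharePoint or ControlPoint implement it: the area names are made up, and I have assumed an ordering in which infrastructure errors sit below application errors, so that they are the last thing to be switched off.

```python
from enum import IntEnum


class Level(IntEnum):
    NONE = 0
    INFRASTRUCTURE_ERROR = 1
    APPLICATION_ERROR = 2
    WARNING = 3
    INFORMATION = 4
    VERBOSE = 5


# Hypothetical functional areas and per-area settings.
DEFAULT_LEVEL = Level.APPLICATION_ERROR   # by default, capture only errors
settings = {
    "active_directory": Level.VERBOSE,    # the area under investigation
    "database": Level.WARNING,
}


def should_log(area, level):
    """True when a message at `level` is enabled for the given functional area."""
    threshold = settings.get(area, DEFAULT_LEVEL)
    return level != Level.NONE and level <= threshold


def goes_to_event_log(level):
    """Only infrastructure errors are meant for the operations team."""
    return level == Level.INFRASTRUCTURE_ERROR
```

With these settings, `should_log("active_directory", Level.VERBOSE)` is true while `should_log("metadata", Level.WARNING)` is false, so turning up the detail for one area does not flood the log with chatter from the others.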
Finally, when you write an entry to the log file, what does
it need to include? The underlying
principle is “whatever you’re going to need to sort out the problem”. (Of course, when you’re writing the code, you
don’t know what the problem will be, or you’d fix it before it happened…)
I would include at least the following:
- Date and Time (use the server’s time, not UTC – it will be a lot easier to find entries)
- If it is possible, what action the user was engaged in (this is important for two reasons – first, the user may not even remember what s/he was doing, and second, many different users might be using the system at the same time).
- If it is possible, the name of the user (again, if multiple users are using the system, this can be useful to distinguish one user’s activity from another’s)
- If multiple processes may be in use, the process ID, and if multiple threads may be used, the thread ID (similar to the user ID, this can be useful to distinguish one user’s activity from another)
- As noted above, a unique code for each application-level message (error, warning, information, etc.) – this is important both for external monitoring tools and for filtering entries while analyzing a log file.
- When an error is returned or an exception thrown from an API call, the error number and/or the message returned.
- The stack trace (this can tell you a lot about what the application was doing at the time of the error.)
- When appropriate, the actual data being passed to the API or other method. (So, for example, instead of merely reporting an error trying to instantiate a site, list or item object, provide the exact URL of the site, list or item – this can help to expose either subtle logic flaws, or configuration errors. Let’s say that the user configured a URL without a trailing slash when it was needed – the absence of the slash could show up in the data passed to the SPSite constructor. Or if an active directory group name is shown as a long number, it could expose the fact that the code failed to translate the group’s SID into a name.)
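To show how the items in this list can end up in a single log entry, here is a small sketch using Python's standard `logging` module. The `open_site` function, the message code `APP-2001`, the user name, and the URL are all hypothetical, and an in-memory stream stands in for the log file. The timestamp uses the module's default, which is the machine's local time.

```python
import io
import logging

FORMAT = (
    "%(asctime)s pid=%(process)d tid=%(thread)d user=%(user)s "
    "action=%(action)s code=%(code)s %(levelname)s %(message)s"
)

stream = io.StringIO()  # stand-in for the log file
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(FORMAT))
log = logging.getLogger("myapp.entries")   # hypothetical logger name
log.setLevel(logging.INFO)
log.propagate = False
log.addHandler(handler)


def open_site(url):
    """Hypothetical API call that rejects URLs without a trailing slash."""
    if not url.endswith("/"):
        raise ValueError("site not found")
    return url


def run_report(user, url):
    # User, action, and message code travel with the entry via `extra`.
    context = {"user": user, "action": "run_report", "code": "APP-2001"}
    try:
        open_site(url)
    except ValueError as exc:
        # log.exception appends the stack trace; %r shows the exact data passed in.
        log.exception("open_site failed: %s (url=%r)", exc, url, extra=context)


run_report("CONTOSO\\jsmith", "http://server/sites/hr")
entry = stream.getvalue()
```

Because the URL is logged exactly as it was passed in, the missing trailing slash is visible in the entry itself, which is the kind of clue that is otherwise very hard to get out of a customer over the phone.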