Tuesday, November 12, 2013

Support


Yes, the topic of support covers a lot of ground – for this post, I’d like to focus on the things we should do as developers to be ready for support.

First, a reminder: we are imperfect beings in a messy and imperfect world.  That means that the wonderful, perfect application you are in the process of creating is going to fail, (unless it never gets used, but of course, that’s a different problem…).  To invoke a bit of Murphy’s law, it is going to fail at the worst possible time (probably just before the end of the last sprint of the next release), in the hands of your most critical customer, and in the most subtle and obscure way that can’t be recreated in-house.  And of course, because you wrote that code, they’re going to ask you to figure out what’s wrong.  So, what’s your strategy?  You should be selfish, and set yourself up to handle that call as quickly and as effectively as possible when it comes in.  The good news – your selfish objectives match those of your employer, and the end-user!

Let’s remember some of the challenges you and your compatriots in the front-line support team face:

Very often, the complaint from the customer takes the form of “it didn’t work” (to which you want to ask, what DID it do?), “the action never completed”, “I got an error message” (did you write it down or take a screen shot?).  Often the user doesn’t remember what they were doing or even when (to be fair, they weren’t expecting to have problems, so weren’t paying careful attention…).  In some cases (e.g. for national security agencies, or some military installations) you don’t get to watch the error, and/or they need to “sanitize” log files before sending them.  When the error is happening on a production system, usually your options for experimentation are limited; and in some cases, you need to wait as much as a couple of weeks for a deployment window to install patches or even diagnostic code.  And finally, every customer’s configuration is a bit different, and some of those differences create unique situations that can be hard to even notice.  (Examples: the customer that had 300 Active Directory Domains – for most actions, this was actually fine, until we needed to find a user that was no longer in his original domain –checking all 300 domains took a while…  Or the customer that blocked access to some branches of the Active Directory Hierarchy, but only for some accounts.) 

When thinking about how to anticipate and respond to issues, keep in mind that you have a number of audiences that need to know about the error:

  • The end user who was trying to do something – you owe him or her some information that the action or analysis didn’t go fully as planned.  You also need to be careful not to give him/her too much information (see the last post mostly talks about this, so I won’t get into this communication in this post).
  • The IT staff needs more information to help the user with his/her problem, and/or who needs to fix an underlying problem (a corrupt SharePoint site, a missing web part, missing permissions, …)
  • The operations staff needs to know about some kinds of errors (if the SQL server is down, or network connections aren’t being made, or your application has irretrievably crashed, they need to take action.  (Note that in some environments, the operations staff are separate from the IT staff, sometimes in a different location, and sometimes working for a different vendor.  One example of this is Microsoft Online Dedicated SharePoint environments – the operations staff are Microsoft data center people, but what I’m calling the IT staff above are customer employees (or contracted employees) with limited access to the operations staff.)
  • You!  (or your fellow developers) who may be called upon to diagnose and work around the problem.
Where do you capture this information?

For the operations staff, often they are using big expensive monitoring software (such as Microsoft’s System Center) to alert them when something needs their attention.  Generally, the best way to make information available to these tools is to place information into the Windows Event log.  In general, you want to create a separate source for your application, and unique event IDs for each distinct event – this allows the operations staff to create customized rules for alerting on specific events (they might decide that some events are worth waking people up at 3am for, and others can wait for business hours).  Also note that in general, you only want to write events that need attention by the operations staff (don’t flood them with more routine functional errors).  That suggests that you want to be able to make a distinction between kinds of errors based on the audience that needs to see them – more on this below).

For the customer-oriented IT staff, and for the developers, information could be written to tables in the database, or text files.  While there are benefits for each, my own preference is for plain old text files.  They are easy to purge, easy to compress, easy to send through email, and there are an assortment of tools for searching and filtering them (remember, grep isn't just for Unix weenies, and even Excel can be used effectively for sorting and filtering).  Two recommendations on text files: use an infrastructure that allows you to roll the files over based on dates and/or size (to keep them from growing indefinitely), and give the customer the option to place them on a non-default drive (some environments prohibit writing log files to the C drive, because of concerns they could consume all of the disk space on a critical drive.)  In a Microsoft SharePoint Online Dedicated environment, you have to write to ULS logs, which has some advantages (automatic space management, and some nice viewing applications) but some disadvantages (they can be VERY noisy and large…)

OK, what should you put into the log entries for the IT staff and yourself

First of all, you want to give yourself the ability to select the amount and type of detail to give you and support and the customer the freedom to deal with different situations without overwhelming *everyone* with too much information.  By default, you normally will capture only errors, but for troubleshooting situations, both SharePoint itself and ControlPoint adopted a two-dimensional model of settings – the first dimension selects the functional area of the application that needs investigation (e.g. active directory interactions, or dealing with metadata, or database interactions), and the second dimension chooses the level of detail (For ControlPoint, we supported None, Error, Warning, Information, Verbose – I would suggest sub-dividing the Error into two categories (Infrastructure Error, and Application Error – this lets you distinguish between the messages that need to go to the event viewer for consumption by the operations team (can’t connect to the SQL server), from the errors that merely elaborate on a report that didn’t run to completion.  The important distinction here is who needs to see the error (and therefore where it needs to be written).  So, the customer could choose to write infrastructure errors to the event log, but not application errors.

Finally, when you write an entry to the log file, what does it need to include?  The underlying principle is “whatever you’re going to need to sort out the problem”.  (Of course, when you’re writing the code, you don’t know what the problem will be, or you’d fix it before it happened…) 

I would include at least the following:

  • Date and Time (use the server’s time, not UTC – it will be a lot easier to find entries)
  • If it is possible, what action the user was engaged in (this is important for two reasons – first, the user many not even remember what s/he was doing, and many different users might be using the system at the same time).
  • If it is possible, the name of the user (again, if multiple users are using the system, this can be useful to distinguish one users’s activity from another’s)
  • If multiple processes may be in use, the process ID, and if multiple threads may be used, the thread ID (similar to the user ID, this can be useful to distinguish one user’s activity from another)
  • As noted above, a unique code for each application-level message (error, warning, information, etc.) – this is important both for external monitoring tools, but can be useful for filtering entries while analyzing a log file.
  • When an error is returned or an exception thrown from an API call, the error number and/or the message returned.
  • The stack trace (this can tell you a lot about what the application was doing at the time of the error.)
  • When appropriate, the actual data being passed to the API or other method.  (So, for example, instead of merely reporting an error trying to instantiate a site, list or item object, provide the exact URL of the site, list or item – this can help to expose either subtle logic flaws, or configuration errors.  Let’s say that the user configured a URL without a trailing slash when it was needed – the absence of the slash could show up in the data passed to the SPSite constructor.  Or if an active directory group name is shown as a long number, it could expose the fact that the code failed to translate the group’s SID into a name.)

No comments:

Post a Comment