tag:blogger.com,1999:blog-61673408577528185392024-02-20T18:00:12.732-08:00StoriesAnonymoushttp://www.blogger.com/profile/11745338992886103222noreply@blogger.comBlogger10125tag:blogger.com,1999:blog-6167340857752818539.post-75114697004417157082014-11-02T17:38:00.000-08:002014-11-02T17:38:03.348-08:00Fault Intolerance
I recently read an article titled "Fault Intolerance" by Gerard Holzmann, in the November issue of the IEEE magazine "Software". The premise of the article is that, more than software that tolerates faults, we need to create software that is <b>IN</b>tolerant of faults and aggressively defends itself against them.<br />
<br />
On the way to his premise, he spoke about the push within academia to create software that can be formally proven to satisfy the requirements of a formal specification. Now, I've always been a bit skeptical of the practicality of this objective, but Gerard gave me a bit of ammunition with the following observation:<br />
<blockquote class="tr_bq">
"The most difficult part of a formal specification, and the part that's most often incomplete, is to state clearly what should happen under various error conditions... Especially for complex systems, it can be very hard to foresee all possible off-normal events that a system might encounter in real life... Reality can be remarkably creative when it comes to showing where our imagination falls short" (And I would add end-users to the list...:-)</blockquote>
So, where does that bring us? "Fault intolerance leads to a defensive posture in software design in which nothing is taken for granted."<br />
<br />
And finally, he lays down the gauntlet with a standard: <br />
<blockquote class="tr_bq">
"For well written code, we should expect to see an assertion density of at least 2 percent, meaning that two out of every 100 executable statements are assertions"</blockquote>
The example the author uses throughout the article is software designed to operate semi-autonomous spacecraft such as the Cassini spacecraft (orbiting Saturn) and the Voyager spacecraft (which recently entered interstellar space). Now, most of us are not creating software for such demanding environments, and cannot afford the kind of double and triple redundancy that made these craft so reliable (the stamina of the Voyager is mind-boggling - to continue to operate with no physical modification for 37 years! What earth-bound computer is still operating after 37 years?).<br />
<br />
There was, however, a principle that I'd like to highlight and use as a metaphor: most of these systems are designed so that if a truly unique or unhandleable condition occurs, they revert to a safe mode. Generally, this means making sure the solar cells are pointed at the sun so that the batteries can be charged, and the antenna is pointed at Earth so that the craft can receive further instructions. It is important to note that the system does not just shut down - it continues to operate in a minimal way that allows for recovery.<br />
<br />
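To make the safe-mode principle concrete, here is a minimal Python sketch. The class and method names (Service, handle, recover) and the range check are illustrative assumptions of mine, not anything taken from the article or from real flight software. The assertions state what the code refuses to take for granted; when one fails, the service stops writing data but keeps answering.<br />
<br />

```python
class Service:
    """Toy service that drops into a safe mode instead of crashing."""

    def __init__(self):
        self.safe_mode = False
        self.records = []  # stands in for the critical database

    def handle(self, amount):
        if self.safe_mode:
            # Minimal operation: no writes, but the user still gets an answer.
            return "Service is in safe mode; please contact your administrator."
        try:
            # Assertions spell out what we refuse to take for granted.
            assert isinstance(amount, int), "amount must be an integer"
            assert 0 < amount <= 10_000, "amount out of expected range"
            self.records.append(amount)
            return "ok"
        except AssertionError as exc:
            # Unexpected input: stop writing, keep running, report the problem.
            self.safe_mode = True
            return f"Unexpected condition ({exc}); entering safe mode."

    def recover(self):
        """Called by an operator once the underlying problem is corrected."""
        self.safe_mode = False
```

Note that Python strips assert statements when run with the -O flag, so production code that relies on this pattern would need explicit checks instead.<br />
<br />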
So, what does all of this mean to those of us building commercial systems? First, I will argue that the focus of fault intolerance, and such guidelines as an assertion density of 2 percent, still apply. While we may not be dealing with a billion-dollar spacecraft, we are dealing with demanding users who are trying to use our software to get a job done. And while some of the errors we have to deal with come from those very users, many of them come from random events in the universe (gamma radiation, sun spots, quirky configurations of an ancient single-sign-on service that interfere with our perfect software). Our users expect our software to continue to operate, even if we have to go into safe mode to do so. <br />
<br />
So, what does safe mode mean for a commercial system? First, to the extent possible, continue to operate in a degraded mode. Second, when you detect an extreme unexpected input or situation, go into safe mode gracefully and safely (e.g. don't write random data to the most critical database in your system - you want to be able to continue to operate after the error condition is corrected). And third, continue to operate at least enough to tell the user that something is wrong (and in fact, tell them to contact their system administrator, who might not yet be aware of the problem).
Anonymoushttp://www.blogger.com/profile/11745338992886103222noreply@blogger.com0tag:blogger.com,1999:blog-6167340857752818539.post-78168957387808124442014-03-10T13:27:00.001-07:002014-03-10T13:31:48.848-07:00Simplicity, or the art of NOT doing things<br />
<div style="line-height: normal; margin: 0in 0in 8pt;">
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">I happen to be particularly
enamored of simplicity in art: the uncluttered style of oriental graphic art,
the enigma of a Koan, the spare beauty of Haiku and the purity of a cappella
voices.</span></div>
<div style="line-height: normal; margin: 0in 0in 8pt;">
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">Carrying that aesthetic to our
chosen profession, we find an increasing focus on simplicity in software
engineering.<span style="mso-spacerun: yes;"> </span>One of the core principles
that is part of the <a href="http://agilemanifesto.org/" target="_blank">Agile Manifesto</a> reads “Simplicity – the art of maximizing
the amount of work not done – is essential”.<span style="mso-spacerun: yes;">
</span>The lean movement puts additional emphasis on this. Finally, the admonition YAGNI (“You Aren’t Going to
Need It”) is at its core an appeal for simplicity.</span></div>
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">So, why do we need all of this attention on a simple principle?<span style="mso-spacerun: yes;"> </span>Because Simplicity is HARD and doesn’t come
naturally to most western minds.</span><br />
<div class="MsoNormal">
</div>
<ul>
<li><span style="font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif;">As software engineers, we’re paid to write code and we’re
good at it, so our instincts (and/or our habits) are to write more of what we’re
good at.</span></li>
<li><span style="font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif; font-size: 11pt; line-height: 107%;">Our product owners have very long lists of
features and requirements that will make our products richer and more competitive.</span></li>
</ul>
<br />
<div style="line-height: normal; margin: 0in 0in 8pt;">
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">But the cost of complexity is
insidious (and I’m focused on both aspects of the word: both evil and not
easily noticed):</span><br />
<div class="MsoNormal">
</div>
<ul>
<li><span style="font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif;">More code paths (one of the measures of complexity) represent
more unit tests, more test cases, and more combinations and permutations.</span></li>
<li><span style="font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif;">More code paths result in higher entropy: that conditional
execution that is so minor and obvious today will not be so obvious and will be
forgotten 2 years from now when you’re doing maintenance.</span><span style="font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif;"> </span><span style="font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif;">The additional code paths greatly increase
the chance that the bug fix or enhancement 2 years hence will break something.</span></li>
<li><span style="font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif;">When complexity finds its way into the user interface, it
increases the power and flexibility of the application, but at the expense of
usability.</span><span style="font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif;"> At best, the users grumble a
bit; at worst, they stop using the product.</span></li>
</ul>
<br />
<div class="MsoNormal">
<span style="font-size: 11pt;"><span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">So, let me offer a few
recommendations:</span></span></div>
</div>
<div style="line-height: normal; margin: 0in 0in 8pt;">
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">At the user interface, find a way
to collapse a large number of high-level choices into a smaller number.<span style="mso-spacerun: yes;"> </span>For example, many text processing applications
will have menu items for Search, Global Search, Find in Files, Replace, Replace
Globally.<span style="mso-spacerun: yes;"> </span>In the end, those can all be
handled by one Find function, keeping the top-level set of choices much
smaller and more approachable.</span></div>
<div style="line-height: normal; margin: 0in 0in 8pt;">
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">In Axceler’s ControlPoint, we
implemented Site-level permissions reports sorted by Site, Site-level
permissions reports sorted by User, Comprehensive (detailed) permissions
reports sorted by Site, and Comprehensive (detailed) permissions reports sorted by User.<span style="mso-spacerun: yes;"> </span>With the clarity of hindsight, we would have
done better to collapse these into a single high-level Permissions Report
with a sort option and a level-of-detail option.<span style="mso-spacerun: yes;"> </span>(Now, a reality check on this: at some
point, pushing too many choices down into detailed options on a function makes
that function suffer from the complexity bug, so don’t carry it too far!<span style="mso-spacerun: yes;"> </span>However, at the more detailed level,
you have more choices to manage complexity through techniques such as “Advanced” options so you
only see the complexity when you need or want it).</span></div>
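<div style="line-height: normal; margin: 0in 0in 8pt;">
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">As a rough illustration of that collapse, the four reports can be expressed as one function with two options. This is a hypothetical Python sketch: the Permission fields, the function name, and the option values are my own assumptions for illustration and are not ControlPoint's actual code or API.</span></div>

```python
from dataclasses import dataclass


@dataclass
class Permission:
    site: str
    user: str
    level: str       # e.g. "Read" or "Full Control"
    inherited: bool  # extra detail shown only in the comprehensive view


def permissions_report(perms, sort_by="site", detail="summary"):
    """One report with a sort option and a level-of-detail option."""
    if sort_by not in ("site", "user"):
        raise ValueError("sort_by must be 'site' or 'user'")
    if detail not in ("summary", "comprehensive"):
        raise ValueError("detail must be 'summary' or 'comprehensive'")

    other = "user" if sort_by == "site" else "site"
    ordered = sorted(perms, key=lambda p: (getattr(p, sort_by), getattr(p, other)))

    rows = []
    for p in ordered:
        row = f"{getattr(p, sort_by)} | {getattr(p, other)} | {p.level}"
        if detail == "comprehensive":
            row += " | inherited" if p.inherited else " | direct"
        rows.append(row)
    return rows
```

<div style="line-height: normal; margin: 0in 0in 8pt;">
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">The user sees one entry point, and the two options only matter when the defaults are not what they want.</span></div>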
<div style="line-height: normal; margin: 0in 0in 8pt;">
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">When considering boundary
conditions, error conditions, and special cases (these have a way of surfacing
in bug reports, so the discussion tends to happen in that context), it is
tempting to create special validation or special handling for each unique
situation.<span style="mso-spacerun: yes;"> </span>Now, in some cases, that is
appropriate, but in general you should resist the temptation: often the long-term
compounding effect of lots of little bits of complexity is greater than the
incremental benefit of a uniquely tailored message or handler.<span style="mso-spacerun: yes;"> </span>My last blog post actually provides a good
example of this.<span style="mso-spacerun: yes;"> </span>Apple chose to handle
both “too hot” and “too cold” conditions in the same boundary condition
handler.<span style="mso-spacerun: yes;"> </span>I believe that this was an excellent
choice that kept the logic simpler.<span style="mso-spacerun: yes;"> </span>The
only shortcoming (and a minor one at that) was that the message verbiage only
acknowledged the most likely of the two possible conditions!</span></div>
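<div style="line-height: normal; margin: 0in 0in 8pt;">
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">A single boundary handler of this kind might look like the following Python sketch. The function name, the thresholds, and the wording are assumptions chosen for illustration; I have no knowledge of how Apple actually implemented it. The point is that one check and one message cover both directions, provided the message is worded so that it is true in either case.</span></div>

```python
def temperature_warning(celsius, low=0.0, high=35.0):
    """Return a warning if the reading is outside [low, high], else None.

    The default limits are illustrative values, not a vendor specification.
    """
    if low <= celsius <= high:
        return None
    # One handler, one message that holds for both "too hot" and "too cold".
    return ("Temperature: the device needs to return to its normal "
            "operating range before you can use it.")
```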
<div style="line-height: normal; margin: 0in 0in 8pt;">
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">In closing, I’d like to quote a
man who bridged the gap between art and technology, namely Leonardo da Vinci: “Simplicity
is the ultimate sophistication”.</span></div>
Anonymoushttp://www.blogger.com/profile/11745338992886103222noreply@blogger.com0tag:blogger.com,1999:blog-6167340857752818539.post-21563519707898717312014-01-25T08:46:00.001-08:002014-01-25T08:46:28.050-08:00Error messages that don’t make sense<div class="MsoNormal">
This time just an amusing observation…<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
As software engineers, we all struggle with the challenge of
writing error messages that are meaningful and helpful, and from time to time,
we all stumble into amusing blunders.
Even the king of good user interfaces, Apple, can fall victim:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
I have an old iPhone 3 that I’ve been using essentially as an
iPod, and I left it in my car. One cold,
sub-freezing morning recently I tried to start the iPod up, only to get the
message:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal" style="text-align: center;">
<span style="font-size: x-large;">Temperature</span><o:p></o:p></div>
<div class="MsoNormal" style="text-align: center;">
<span style="font-size: x-large;">!</span><o:p></o:p></div>
<div class="MsoNormal" style="text-align: center;">
iPhone needs to cool down<o:p></o:p></div>
<div class="MsoNormal" style="text-align: center;">
before you can use it.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Of course, I ignored the instructions and warmed it up, and
the message went away :-)</div>
<div class="MsoNormal">
<span style="font-family: Wingdings; mso-ascii-font-family: Calibri; mso-ascii-theme-font: minor-latin; mso-char-type: symbol; mso-hansi-font-family: Calibri; mso-hansi-theme-font: minor-latin; mso-symbol-font-family: Wingdings;"><br /></span></div>
<br />
<div class="MsoNormal">
<i>(I never knew the iPhone had a
temperature sensor in it – apparently it’s not available to app developers…)</i><o:p></o:p></div>
Anonymoushttp://www.blogger.com/profile/11745338992886103222noreply@blogger.com0tag:blogger.com,1999:blog-6167340857752818539.post-28294946175015455532013-12-28T19:34:00.000-08:002013-12-28T19:34:44.589-08:00How to approach support issues<div class="MsoNormal">
Note that this post addresses only some of the issues that
arise in large, complex, and long-running support incidents; it generally focuses on the process more than
the specific content of a particular issue.
From that process perspective, I’ve seen support issues struggle with three
broad aspects of the issue:<o:p></o:p></div>
<div class="MsoListBulletCxSpFirst">
</div>
<ul>
<li>You get a large amount of information, and you
don’t know what bits of information are pertinent.</li>
<li>You <b><i>don’t</i></b> get one or several bits of
information that you asked the customer for.</li>
<li>The incident drags on interminably, and never
seems to get to closure.</li>
</ul>
<br />
<div class="MsoNormal">
<b>Too much information</b></div>
<div class="MsoNormal">
At one level, you’d rather have too much information than
too little. That’s why, when generating
log files, we tend to turn on every class of tracing that could possibly have a
bearing on the issue. For now, I won’t
talk about the process of extracting meaning from huge and noisy log files
(there’s a recorded session on that in the training directory at
Axceler/Metalogix). Rather, I’m talking
about elements of behavior or the environment that might or might not be
significant. Two examples:<o:p></o:p></div>
<div class="MsoListBulletCxSpFirst">
</div>
<ul>
<li>We were working on a problem with a job that
worked properly when run interactively, but not when run as a scheduled
job. We noticed that every time the
scheduler woke up, it would run through the “is there anything waiting to run”
assessment several times. We weren’t
sure why, and didn’t know whether to focus first on trying to figure that out, or start
with other aspects of the problem.</li>
<li>Another customer was having uncharacteristically
slow performance in a task that looked for permissions granted to users who
were no longer in Active Directory. In
the course of discussions, it became clear that the customer had many hundreds
of Active Directory domains. We weren’t
sure whether that might be related, or was merely an incidental observation.</li>
</ul>
<br />
<div class="MsoNormal">
The <b><i>art</i></b> of support is figuring out which
bits of the environment, and which unexpected behaviors, are worth spending time
to explore, and which should be ignored.
As you make that judgment, consider a couple of issues:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Initially, think about what makes logical sense. Using the first example above, do you expect
that having the scheduler run several times should have anything to do with
what looked like a permission problem?
If the behavior or element of the environment doesn’t feel like it
should be related, then temporarily ignore it and look elsewhere. (But of course, be prepared to come back to
this if other factors get ruled out – remember Sherlock Holmes’s aphorism: “Once
you have eliminated everything that is impossible, whatever is left, however
improbable, has to be the explanation”).
[ Historical note: we decided that the multiple runs of the scheduler
were not the source of the problem. In
the end, we determined that it was a relatively low-impact configuration
problem that had no bearing on the significant issue we were trying to solve. ]</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Also, ask yourself, if that behavior had anything to do with
the problem you are exploring, would you have seen any other symptoms? In the second example, if the number of Active
Directory domains had anything to do with the performance of the problem
report, would we have seen this at other customers, or would we see some kind
of impact on other reports at this customer?
Often this line of reasoning can lead to other questions (e.g. Do we
have any other customers with hundreds of Active Directory Domains? How long does it take to display the members
of an active directory group in a different domain? Perhaps we should turn on the logging for the
code that accesses active directory.) [
Historical note: the number of domains turned out to be the issue – connecting to
one domain controller has a modest cost, but multiplied by 300 it adds up
significantly. ]<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
There are two big risks for this line of reasoning:<o:p></o:p></div>
<div class="MsoListBulletCxSpFirst">
</div>
<ul>
<li>Beware of asking the customer too many questions
– in a long-running support call, you run the risk of exhausting the customer’s
good will – save that good will for the most important questions. </li>
<li>Also, as noted below, be very careful to notice
whether you are going down a rat hole.</li>
</ul>
<br />
<div class="MsoNormal">
<b>You don’t get all of
the information you ask for<o:p></o:p></b></div>
<div class="MsoNormal">
The first recommendation is to make sure you really need the
information you asked for. Note that in
some cases, the customer may not have answered as a simple oversight, but in
many cases, the customer has not answered because they don’t <b><i>know</i></b>
the answer and/or don’t know how to get the answer. Insisting on the answer to a question can use
up good will, so before asking again, think about whether it was important or
whether it was an “if you happen to know…” question.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
If it <b><i>was</i></b> important, then ask the question
again, but perhaps try to phrase it differently (maybe they didn’t understand
the question) and/or tell them how to answer (check this setting, look in this
log file, send me a screen shot of this screen…)<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Sometimes explaining <b><i>why</i></b> you need the information can
motivate them to spend a bit of extra effort to find the answer. Note that to some extent, explaining why you
are asking will help to engage the customer, help to build a relationship and
reduce the erosion of good will.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Finally, sometimes it can help to explain what you are
looking for in general terms, and then ask them if they know anything that
would shed any light on the problem. For
example, in the first scenario above, we could ask “Can you tell if the timer
job itself is running repeatedly?” or “Are you aware of any other timer jobs that
run repeatedly?” (Note that this
approach tends to be a long shot, but every now and then, a Hail Mary pass
connects and wins the ball game :-) )</div>
<div class="MsoNormal">
<b><br /></b></div>
<div class="MsoNormal">
<b>The incident drags on
and doesn’t reach a conclusion.<o:p></o:p></b></div>
<div class="MsoNormal">
First, let me note that when you’re in the middle of a
long-running support call, you tend to be focused on the last answer you got,
and the next thing to try – it can be <b>very</b>
hard to notice that the call has gotten bogged down. This is where it helps a great deal if the support
team has an automatic rule for escalating issues that have remained open for some
period of time – use that escalation to think about whether you should be doing
something differently.<o:p></o:p></div>
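<div class="MsoNormal">
An automatic escalation rule does not need to be elaborate. The Python sketch below is a hypothetical example: the function name, the dictionary of case IDs to open dates, and the 14-day threshold are all assumptions for illustration, not a description of any particular ticketing system.</div>

```python
from datetime import datetime, timedelta


def cases_to_escalate(cases, now, max_open=timedelta(days=14)):
    """Return the IDs of cases open longer than max_open, oldest first.

    cases maps a case ID to the datetime at which the case was opened.
    """
    overdue = [(opened, case_id)
               for case_id, opened in cases.items()
               if now - opened > max_open]
    return [case_id for opened, case_id in sorted(overdue)]
```

<div class="MsoNormal">
Running something like this on a schedule gives the team a regular prompt to step back, rather than relying on someone in the middle of the incident to notice.</div>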
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Second, note that it is VERY easy for a long-running support
case to go down a rat hole without anyone noticing – you ask for a bit of
information that would be nice, but not essential, and if the customer has a
problem getting that, then finding that tidbit of information can turn into a
project of its own, even though it may not be the most important expenditure of
time, effort and customer’s good will.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
So, when a case drags on, it is useful to
regroup periodically. Do this at least figuratively,
but very often it is worth literally bringing the players together in a room. In that regrouping, ask the following
questions (and write the answers down):<o:p></o:p></div>
<div class="MsoListParagraphCxSpFirst" style="mso-list: l1 level1 lfo3; text-indent: -.25in;">
</div>
<ul>
<li><span style="text-indent: -0.25in;">What </span><b style="text-indent: -0.25in;"><i>do</i></b><span style="text-indent: -0.25in;"> we know? (Include in this list things
that you </span><b style="text-indent: -0.25in;"><i>think</i></b><span style="text-indent: -0.25in;"> are unimportant, but whose significance you’re not 100% sure
of.)</span></li>
<li><span style="text-indent: -0.25in;">What </span><b style="text-indent: -0.25in;"><i>don’t</i></b><span style="text-indent: -0.25in;"> we know that we wish we did?
(Don’t limit this to things you think you can get the answer to – as you
brainstorm, someone may think of a way to answer the question.)</span></li>
<li><span style="text-indent: -0.25in;">What </span><b style="text-indent: -0.25in;"><i>have</i></b><span style="text-indent: -0.25in;"> we tried (or what questions have we
asked), and what were the results/answers?</span><span style="text-indent: -0.25in;">
</span><span style="text-indent: -0.25in;">(Among other things, you want to avoid going over the same ground again with
the customer; this step is particularly important if new people have joined the
incident team.)</span></li>
<li><span style="text-indent: -0.25in;">What </span><b style="text-indent: -0.25in;"><i>haven’t</i></b><span style="text-indent: -0.25in;"> we tried yet? (In a sense,
this is a variation of what we don’t know…)</span></li>
</ul>
<br />
<div class="MsoNormal">
Treat this process of collecting information as a sort of
brainstorming session. Part of the value
of this is to notice:<o:p></o:p></div>
<div class="MsoListParagraphCxSpFirst" style="mso-list: l2 level1 lfo2; text-indent: -.25in;">
</div>
<ul>
<li><span style="text-indent: -0.25in;">Things that might be significant, that you hadn’t
seen before</span></li>
<li><span style="text-indent: -0.25in;">Things that you haven’t tried yet (or questions
you haven’t asked), but should.</span></li>
<li><span style="text-indent: -0.25in;">That you have gone down a rat hole, or are
focused on an aspect of the problem that isn’t the most fruitful avenue.</span></li>
</ul>
<br />
<br />
<div class="MsoNormal">
BTW: at the end of this, you may be inclined to ask the
customer a question you’ve asked before, either because you didn’t trust the
answer, or because the answer wasn’t complete.
As mentioned above, you should avoid this if possible, as it tends to
use up good will, but if it is necessary, it can help if you acknowledge you’re
repeating the question: “I know we asked this before, but we want to
double-check that we heard correctly”, or “... we wanted to double check one
aspect of …”<o:p></o:p></div>
Anonymoushttp://www.blogger.com/profile/11745338992886103222noreply@blogger.com0tag:blogger.com,1999:blog-6167340857752818539.post-43713919507633250072013-12-02T08:02:00.001-08:002013-12-02T08:02:37.401-08:00How do we think?
<br />
<div class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt;">
<span style="font-family: Calibri;"><span style="color: black; mso-bidi-font-family: Tahoma; mso-fareast-font-family: "Times New Roman";">No, I'm not talking about how our neurons communicate.<span style="mso-spacerun: yes;"> </span>Nor am I talking about <strong>what</strong> we think about
(there are plenty of blogs on that topic).<span style="mso-spacerun: yes;">
</span>Rather, I want to discuss the mental <strong>process</strong> of moving from an
elementary idea or an ill-understood problem to a well-structured proposal or
solution.<span style="mso-spacerun: yes;"> </span>Note that I think that we can
and should use a very similar process in at least two somewhat different circumstances:</span></span></div>
<ul>
<li><div class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt;">
<span style="font-family: Calibri;"><span style="color: black; mso-bidi-font-family: Tahoma; mso-fareast-font-family: "Times New Roman";">Elaborating on a very high-level, briefly expressed, requirement into a more complete definition that is sufficient for implementation.</span></span></div>
</li>
</ul>
<ul>
<li><div class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt;">
<span style="font-family: Calibri;"><span style="color: black; mso-bidi-font-family: Tahoma; mso-fareast-font-family: "Times New Roman";">Understanding the details and underpinnings of a bug sufficiently that you can define what to do about it.</span><span style="mso-bidi-font-family: "Times New Roman"; mso-fareast-font-family: "Times New Roman";"><o:p></o:p></span></span></div>
</li>
</ul>
<span style="font-family: Calibri;"><span style="color: black; mso-bidi-font-family: Tahoma; mso-fareast-font-family: "Times New Roman";">Even though this blog is directed at engineers, I suggest
that we would do well to keep the principles of the scientific method in mind.<span style="mso-spacerun: yes;"> </span>What do I mean by that?<span style="mso-spacerun: yes;"> </span>Well, when we look at the incredible
simplicity and beauty of formulas like F=ma, E=IR, E=mc<sup>2</sup>, it is very tempting
to think that these laws sprang fully formed from the brains of the authors.<span style="mso-spacerun: yes;"> </span>As brilliant as Newton, Ohm, and Einstein
were, the reality is much messier than that.<span style="mso-spacerun: yes;">
</span>First of all, in the broad sweep of history, it took generations to get
past the idea of Earth, Air, Fire, and Water to get to the point where we could
even contemplate F=ma.<span style="mso-spacerun: yes;"> </span><span style="mso-spacerun: yes;"> </span>(Newton said, "if I have seen far, it is
because I stood on the shoulders of giants").<span style="mso-spacerun: yes;"> </span>At the individual level, even those geniuses
took many years to fully understand what was going on and to fully distill that
understanding to a simple and coherent statement of truth.</span><span style="mso-bidi-font-family: "Times New Roman"; mso-fareast-font-family: "Times New Roman";"><o:p></o:p></span></span><br />
<br />
<div class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt;">
<span style="font-family: Calibri;"><span style="color: black; mso-bidi-font-family: Tahoma; mso-fareast-font-family: "Times New Roman";">So, my first message: give yourself permission to do what
Newton and Einstein did - i.e. give yourself permission not to completely
understand the problem at the beginning of your exploration.</span><span style="mso-bidi-font-family: "Times New Roman"; mso-fareast-font-family: "Times New Roman";"><o:p></o:p></span></span></div>
<br />
<div class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt;">
<span style="font-family: Calibri;"><span style="color: black; mso-bidi-font-family: Tahoma; mso-fareast-font-family: "Times New Roman";">OK, so what does that mean in practice?<span style="mso-spacerun: yes;"> </span>Just as most modern software development life
cycle methodologies build upon the principle of iterative refinement, our
thinking process should do the same.<span style="mso-spacerun: yes;"> </span>Don't
try to come to a conclusion too quickly - give yourself permission to more
completely understand the problem space, and to get an idea of what is going on
and what you even need to observe, before you try to formulate a rigorous
experiment and/or to formulate a coherent proposal.</span><span style="mso-bidi-font-family: "Times New Roman"; mso-fareast-font-family: "Times New Roman";"><o:p></o:p></span></span></div>
<br />
<div class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt;">
<span style="font-family: Calibri;"><span style="color: black; mso-bidi-font-family: Tahoma; mso-fareast-font-family: "Times New Roman";">In the language of science and mathematics, you need to do
some relatively unstructured playing and observing before you can:</span></span></div>
<ul>
<li><div class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt;">
<span style="font-family: Calibri;"><span style="color: black; mso-bidi-font-family: Tahoma; mso-fareast-font-family: "Times New Roman";">Design the experiment (determine what steps/operations you need to perform).</span></span></div>
</li>
</ul>
<ul>
<li><div class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt;">
<span style="font-family: Calibri;"><span style="color: black; mso-bidi-font-family: Tahoma; mso-fareast-font-family: "Times New Roman";">Decide what the independent variables are (these are the things that you want to vary in the process of the experiment - you are trying to determine what the impact of these variables is).</span></span></div>
</li>
</ul>
<ul>
<li><div class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt;">
<span style="font-family: Calibri;"><span style="color: black; mso-bidi-font-family: Tahoma; mso-fareast-font-family: "Times New Roman";">Decide what the dependent variables are (these are the things that you want to make careful observations of during the experiment).</span></span></div>
</li>
</ul>
<ul>
<li><div class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt;">
<span style="font-family: Calibri;"><span style="color: black; mso-bidi-font-family: Tahoma; mso-fareast-font-family: "Times New Roman";">In the end, come to a conclusion about causes and effects and what to do about them.</span><span style="mso-bidi-font-family: "Times New Roman"; mso-fareast-font-family: "Times New Roman";"><o:p></o:p></span></span></div>
</li>
</ul>
<span style="font-family: Calibri;"><span style="color: black; mso-bidi-font-family: Tahoma; mso-fareast-font-family: "Times New Roman";">I've been a bit abstract so far, so let me give some
examples, which I will draw from ControlPoint development.</span><span style="mso-bidi-font-family: "Times New Roman"; mso-fareast-font-family: "Times New Roman";"><o:p></o:p></span></span><br />
<br />
<div class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt;">
<span style="font-family: Calibri;"><b style="mso-bidi-font-weight: normal;"><span style="color: black; font-size: 12pt; mso-bidi-font-family: Tahoma; mso-fareast-font-family: "Times New Roman";">Analyzing a bug</span></b><b style="mso-bidi-font-weight: normal;"><span style="font-size: 12pt; mso-bidi-font-family: "Times New Roman"; mso-fareast-font-family: "Times New Roman";"><o:p></o:p></span></b></span></div>
<br />
<div class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt;">
<span style="font-family: Calibri;"><span style="color: black; mso-bidi-font-family: Tahoma; mso-fareast-font-family: "Times New Roman";">We had 2 or 3 customers complain that ControlPoint was
marking many (but not all) of the SharePoint users as having logged on
recently, even though they had not.<span style="mso-spacerun: yes;"> </span>(As
an interesting aside - Active Directory maintains this information, and these
customers were using this to identify inactive accounts that might need to be
shut down – ControlPoint’s behavior was interfering with that...)</span><span style="mso-bidi-font-family: "Times New Roman"; mso-fareast-font-family: "Times New Roman";"><o:p></o:p></span></span></div>
<br />
<div class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt;">
<span style="font-family: Calibri;"><span style="color: black; mso-bidi-font-family: Tahoma; mso-fareast-font-family: "Times New Roman";">Of course, the first step was to confirm that the
application really was causing this (that it wasn't something else in the
environment - cause and effect can be a slippery thing at times). We were able
to recreate the problem, although not consistently, so we were comfortable that
we were at least triggering it. However,
we didn't yet know why it happened on one server but not another.</span><span style="mso-bidi-font-family: "Times New Roman"; mso-fareast-font-family: "Times New Roman";"><o:p></o:p></span></span></div>
<br />
<div class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt;">
<span style="font-family: Calibri;"><span style="color: black; mso-bidi-font-family: Tahoma; mso-fareast-font-family: "Times New Roman";">So, the next question was whether the application was
actually doing this explicitly.<span style="mso-spacerun: yes;">
</span>Logically, this seemed unlikely - doing this explicitly would mean that
we would need to have the credentials for all of those accounts, which we
didn't have.<span style="mso-spacerun: yes;"> </span>And, empirically, we did
not find any code that obviously did a logon.</span><span style="mso-bidi-font-family: "Times New Roman"; mso-fareast-font-family: "Times New Roman";"><o:p></o:p></span></span></div>
<br />
<div class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt;">
<span style="mso-bidi-font-family: "Times New Roman"; mso-fareast-font-family: "Times New Roman";"><o:p><span style="font-family: Calibri;"> </span></o:p></span><span style="font-family: Calibri;"><span style="color: black; mso-bidi-font-family: Tahoma; mso-fareast-font-family: "Times New Roman";">This meant that this was probably a side effect of
something else we were doing, but what was that (and was there anything we
could do about it?)</span><span style="mso-bidi-font-family: "Times New Roman"; mso-fareast-font-family: "Times New Roman";"><o:p></o:p></span></span></div>
<br />
<div class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt;">
<span style="mso-bidi-font-family: "Times New Roman"; mso-fareast-font-family: "Times New Roman";"><o:p><span style="font-family: Calibri;"> </span></o:p></span><span style="font-family: Calibri;"><span style="color: black; mso-bidi-font-family: Tahoma; mso-fareast-font-family: "Times New Roman";">At about this time we realized (partly through logic and
partly through observation) that any time Active Directory recorded that
the user had logged in recently, the event viewer recorded a login event in the
security section. This gave us a much
more useful (in the sense of being more immediate) way of observing the effect
- we had now identified our dependent variable for experimentation. (Note that at this point, we are still
exploring, and aren't ready for a rigorous experiment yet, because we haven't
identified the independent variables, and don't yet have a hypothesis to
test.) This allowed us to confirm
something we suspected, that the problem was occurring somewhere in our nightly
discovery process. But what part of
discovery, and what specific action was causing the effect?</span><span style="mso-bidi-font-family: "Times New Roman"; mso-fareast-font-family: "Times New Roman";"><o:p></o:p></span></span></div>
<br />
<div class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt;">
<span style="mso-bidi-font-family: "Times New Roman"; mso-fareast-font-family: "Times New Roman";"><o:p><span style="font-family: Calibri;"> </span></o:p></span><span style="font-family: Calibri;"><span style="color: black; mso-bidi-font-family: Tahoma; mso-fareast-font-family: "Times New Roman";">We observed that from the perspective of ControlPoint,
there are a number of different user types that we do different things with:
the user who is running ControlPoint; users who are
<strong>authorized</strong> to run ControlPoint; as a subclass of those, <strong>business
admins</strong>; and finally, the <strong>ordinary users of SharePoint</strong> who have been
given rights to SharePoint (but not ControlPoint). So, we formed the hypothesis that the type of
user may have been what was different between the cases where we observed the
problem and those where we didn't - we now had the independent variable
for our experiment, and we were now ready to graduate from relatively
unstructured experimenting and observation to a carefully structured
experiment.</span><span style="mso-bidi-font-family: "Times New Roman"; mso-fareast-font-family: "Times New Roman";"><o:p></o:p></span></span></div>
<br />
<div class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt;">
<span style="font-family: Calibri;"><span style="color: black; mso-bidi-font-family: Tahoma; mso-fareast-font-family: "Times New Roman";">The experiment took the following shape: create brand new
accounts (since SharePoint can treat existing accounts somewhat
differently) with the following characteristics:</span></span></div>
<ul>
<li><div class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt;">
<span style="font-family: Calibri;"><span style="color: black; mso-bidi-font-family: Tahoma; mso-fareast-font-family: "Times New Roman";">Farm admin who is also a ControlPoint (ordinary) admin</span></span></div>
</li>
</ul>
<ul>
<li><div class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt;">
<span style="font-family: Calibri;"><span style="color: black; mso-bidi-font-family: Tahoma; mso-fareast-font-family: "Times New Roman";">Farm admin who is also a ControlPoint business admin</span></span></div>
</li>
</ul>
<ul>
<li><div class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt;">
<span style="font-family: Calibri;"><span style="color: black; mso-bidi-font-family: Tahoma; mso-fareast-font-family: "Times New Roman";">Site Collection admin who is a ControlPoint (ordinary) admin</span></span></div>
</li>
</ul>
<ul>
<li><div class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt;">
<span style="font-family: Calibri;"><span style="color: black; mso-bidi-font-family: Tahoma; mso-fareast-font-family: "Times New Roman";">Site Collection admin who is a ControlPoint business admin</span></span></div>
</li>
</ul>
<ul>
<li><div class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt;">
<span style="font-family: Calibri;"><span style="color: black; mso-bidi-font-family: Tahoma; mso-fareast-font-family: "Times New Roman";">User with Full Control who is a ControlPoint business admin</span></span></div>
</li>
</ul>
<ul>
<li><div class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt;">
<span style="font-family: Calibri;"><span style="color: black; mso-bidi-font-family: Tahoma; mso-fareast-font-family: "Times New Roman";">User with Full Control who has no rights to ControlPoint </span><span style="mso-bidi-font-family: "Times New Roman"; mso-fareast-font-family: "Times New Roman";"><o:p></o:p></span></span></div>
</li>
</ul>
<span style="mso-bidi-font-family: "Times New Roman"; mso-fareast-font-family: "Times New Roman";"><span style="font-family: Calibri;">Having distinct and separate accounts allowed us to clearly
identify the impact of the type of account – in other words, the account type
was an independent variable. <o:p></o:p></span></span><br />
<br />
<div class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt;">
<span style="mso-bidi-font-family: "Times New Roman"; mso-fareast-font-family: "Times New Roman";"><o:p><span style="font-family: Calibri;"> </span></o:p></span><span style="font-family: Calibri;"><span style="color: black; mso-bidi-font-family: Tahoma; mso-fareast-font-family: "Times New Roman";">We ran discovery, and then observed which accounts
triggered a login event in the event viewer log.<span style="mso-spacerun: yes;"> </span>In the end, this allowed us to determine that
it was any user with rights to use ControlPoint, which in turn allowed us to
narrow it down to a WindowsIdentity system call, which we were using to determine
what groups the user was in so that we could determine what rights that user
might be getting from those groups. (It appears that the operating system is
doing an implicit impersonation of the user, which in turn amounts to a
login...)</span><span style="mso-bidi-font-family: "Times New Roman"; mso-fareast-font-family: "Times New Roman";"><o:p></o:p></span></span></div>
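<br />
<div class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt;">
<span style="font-family: Calibri;">The shape of that experiment can be sketched in a few lines of Python. This is only an illustration: the account names, the role strings and the candidate hypotheses are made up for the example, and the "observed" set is reconstructed from the conclusion described above (every account with ControlPoint rights showed a login event), not copied from any real event log. The idea is to check each candidate explanation against every account and keep only the ones that match the observations exactly.</span></div>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TestAccount:
    name: str               # hypothetical account name
    sharepoint_role: str    # independent variable 1
    controlpoint_role: str  # independent variable 2: "admin", "business admin" or "none"


# The six brand new accounts from the list above.
ACCOUNTS = [
    TestAccount("acct1", "farm admin", "admin"),
    TestAccount("acct2", "farm admin", "business admin"),
    TestAccount("acct3", "site collection admin", "admin"),
    TestAccount("acct4", "site collection admin", "business admin"),
    TestAccount("acct5", "full control", "business admin"),
    TestAccount("acct6", "full control", "none"),
]

# Dependent variable: which accounts showed a login event after discovery ran.
# Assumed values, reconstructed from the outcome described in the post.
observed = {"acct1", "acct2", "acct3", "acct4", "acct5"}


def explains(accounts, logged_on, predicate):
    """True if the predicate holds for exactly the accounts that logged on."""
    return all(predicate(a) == (a.name in logged_on) for a in accounts)


# Candidate explanations (hypothetical wording).
hypotheses = {
    "is a farm admin": lambda a: a.sharepoint_role == "farm admin",
    "is a business admin": lambda a: a.controlpoint_role == "business admin",
    "has any ControlPoint rights": lambda a: a.controlpoint_role != "none",
}

consistent = [name for name, p in hypotheses.items()
              if explains(ACCOUNTS, observed, p)]
print(consistent)
```

<div class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt;">
<span style="font-family: Calibri;">Only one hypothesis survives because the accounts were built so that each role combination appears separately; with overlapping accounts, several explanations would have fit equally well.</span></div>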
<br />
<div class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt;">
<span style="color: black; mso-bidi-font-family: Tahoma; mso-fareast-font-family: "Times New Roman";"><span style="font-family: Calibri;">Armed with that knowledge, we were able to come up with a
different mechanism to get the list of groups, and thereby avoid the
impersonation/login for each of the users.<o:p></o:p></span></span></div>
<br />
<div class="MsoNormal" style="line-height: normal; margin: 0in 0in 0pt;">
<span style="color: black; mso-bidi-font-family: Tahoma; mso-fareast-font-family: "Times New Roman";"><span style="font-family: Calibri;">Note that the process above unfolded over the course of a
couple of weeks – clarity does not come in a flash (even Archimedes’ moment of
“Eureka” followed a lot of thinking!)<o:p></o:p></span></span></div>
<span style="font-family: Calibri;"><span style="color: black; mso-bidi-font-family: Tahoma; mso-fareast-font-family: "Times New Roman";"></span></span><br />
<span style="font-family: Calibri;"><span style="color: black; mso-bidi-font-family: Tahoma; mso-fareast-font-family: "Times New Roman";">As a side note: there is another non-technical process that
you should use here: give the problem some thought, explore the details and the
alternatives, and then intentionally set it aside, ideally at least
overnight.<span style="mso-spacerun: yes;"> </span>Your brain has a remarkable
background processor that works on problems while your attention is elsewhere –
when you come back to the problem, it is often a lot clearer than when you set
it aside.</span><span style="mso-bidi-font-family: "Times New Roman"; mso-fareast-font-family: "Times New Roman";"><o:p></o:p></span></span><br />
<b style="mso-bidi-font-weight: normal;"><span style="font-size: 12pt; line-height: 107%;"><span style="font-family: Calibri;"></span></span></b><br />
<b style="mso-bidi-font-weight: normal;"><span style="font-size: 12pt; line-height: 107%;"><span style="font-family: Calibri;">Responding to requirements<o:p></o:p></span></span></b><br />
<br />
<div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">Consider the requirement “We need to be able to duplicate
workflows”.<span style="mso-spacerun: yes;"> </span>As developers, our first objective
is to elaborate this into enough detail that we can know what we need to build,
that we can ensure that we have the same idea of what is needed as the product
owner, and that we can fairly accurately estimate the effort for this.<span style="mso-spacerun: yes;"> </span>Normally, that means that we need to
understand workflows enough to give the product owner some useful background,
to ask intelligent, targeted questions, and ultimately to propose a set of
functionality that delivers useful value to the customer (the goal of agile, of
course) while providing value to the company.<o:p></o:p></span></div>
<br />
<div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">Unless you are already an expert in SharePoint workflow, getting
to that understanding requires some time exploring, reading, and experimenting –
the result of that may be the understanding that the following factors affect
the duplicate functionality:</span></div>
<ul>
<li><div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">Version of SharePoint (2007, 2010, 2013), and compatibility among versions (e.g. 2013 supports 2010 style workflows, but also supports an entirely new workflow architecture)</span></div>
</li>
</ul>
<ul>
<li><div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">Types of workflow (Out of the Box, individually defined, reusable)</span></div>
</li>
</ul>
<ul>
<li><div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">The elements of the workflow, i.e. the definition, the association with a particular library, the instances of the workflow, the history list, the task list</span></div>
</li>
</ul>
<ul>
<li><div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">Versioning of the workflow definition (i.e. an older instance might still be running with an older version of the workflow rules, and a newer instance running with a revised set of workflow rules)<o:p></o:p></span></div>
</li>
</ul>
<span style="font-family: Calibri;">Given this increasing understanding of workflows, this
exercise is not going to culminate in a hypothesis and a rigorous experiment,
but it does lead to a step of increasing rigor in understanding and questions.<span style="mso-spacerun: yes;"> </span>So, at this point, we are ready to more
completely articulate what resources/artifacts are involved in a workflow, and
what resources/artifacts are shared among different workflows, and therefore need
consideration when duplicating workflows.<span style="mso-spacerun: yes;">
</span>We’re now prepared to have a discussion with the product owner of
whether the initial implementation might be limited to 2010 and 2007 style
workflows.<span style="mso-spacerun: yes;"> </span>And we can think about what
it means to duplicate a workflow that involves resources that are shared with
other workflows.<o:p></o:p></span><br />
<br />
<div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">What is common among those examples?</span></div>
<ul>
<li><div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">Recognizing that you won’t always understand the problem space up front</span></div>
</li>
</ul>
<ul>
<li><div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">Recognizing that your understanding will improve iteratively</span></div>
</li>
</ul>
<ul>
<li><div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">Setting a goal of increasing the precision of your understanding with each iteration.<o:p></o:p></span></div>
</li>
</ul>
Anonymoushttp://www.blogger.com/profile/11745338992886103222noreply@blogger.com0tag:blogger.com,1999:blog-6167340857752818539.post-59287206136279603372013-11-25T03:53:00.003-08:002013-11-25T03:53:46.969-08:00HeisenbugsBefore Heisenberg started cooking meth, his namesake Heisenberg was a physicist known for the Uncertainty Principle, which says that you cannot know both the location and the momentum of a sub-atomic particle. This is related to the scientific principle called the Observer Effect, which states that the process of observing a system changes the system under observation. (This holds for systems ranging from sub-atomic particles to sociological studies to reality TV).<br />
<br />
So, what does this have to do with software? While the <strong><em>name</em></strong> Heisenbug is very tongue-in-cheek, it describes a very real problem that occurs occasionally, and is actually related to the observer effect: a Heisenbug is a bug that disappears as soon as you start to analyze it. (In my mind, it is the second most annoying type of bug after the type that occurs only on premises at the million dollar customer who is so secure that they cannot even send log files.)<br />
<br />
Some examples of Heisenbugs that I've seen:<br />
<ul>
<li>A number of years ago at Digital Equipment Corporation we created a highly scalable, heavily multi-threaded mail server that supported both X.400 and SMTP/MIME. Somewhere in the code was a subtle threading problem that we presume had to do with locking between threads - it generated an error at only one customer (a major Air Force base). So, to find it, we did what any software engineering group does - we added a lot of logging to tell us exactly where the error was occurring, and what led up to it. The problem went away, we presume because the logging changed the timing of the locking just enough to avoid the race condition. After months of trying to fine tune the logging to get the information we needed without changing the behavior, we gave up and just ran the system with the logging enabled, and once a day asked the engineer on site to delete the log files...</li>
<li>More recently, at Axcler, we were fighting with a problem at one of those million-dollar customers and it was the only place the problem manifested itself. Again, we added more logging (since there was no option of running a debugger on site...), and again, the problem disappeared. In this case, we conjecture that the observer effect came from the fact that the debugging code, which was dumping out information about the SharePoint farm, may have initialized or reset the state of the SharePoint API, thus providing a valid return on something that the core application logic needed.</li>
</ul>
So, what can we do about Heisenbugs? Unfortunately, there often is no good answer. Often the solution is an example of the messy side of software engineering in practice. More often than we would like, we have bits of logic in our production systems that are there just because they work and we don't always know exactly why. While we look at them and say "yuck" (or to use a different metaphor, they "smell"), the cost to figure out what is really going on may not make economic sense (do you want to satisfy your scientific curiosity, or would you rather implement a new feature?). Or if the problem occurs only at a customer site, the process of figuring out <strong><em>why</em></strong> may be just too annoying to the customer. Just as doctors will sometimes just treat a rash with cortisone without fully knowing what caused it, software engineers occasionally treat the symptom without explaining the cause...<br />
<div>
</div>
Anonymoushttp://www.blogger.com/profile/11745338992886103222noreply@blogger.com0tag:blogger.com,1999:blog-6167340857752818539.post-48940716923828726252013-11-12T13:46:00.003-08:002013-11-12T13:59:08.912-08:00Support<br />
<div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">Yes, the topic of support covers a lot of ground – for this
post, I’d like to focus on the things we should do as developers to be ready
for support.<o:p></o:p></span></div>
<br />
<div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">First, a reminder: we are imperfect beings in a messy and
imperfect world.<span style="mso-spacerun: yes;"> </span>That means that the
wonderful, perfect application you are in the process of creating is going to
fail, (unless it never gets used, but of course, that’s a different problem…).<span style="mso-spacerun: yes;"> </span>To invoke a bit of Murphy’s law, it is going
to fail at the worst possible time (probably just before the end of the last
sprint of the next release), in the hands of your most critical customer, and
in the most subtle and obscure way that can’t be recreated in-house.<span style="mso-spacerun: yes;"> </span>And of course, because you wrote that code,
they’re going to ask you to figure out what’s wrong.<span style="mso-spacerun: yes;"> </span>So, what’s your strategy?<span style="mso-spacerun: yes;"> </span>You should be selfish, and set yourself up to
handle that call as quickly and as effectively as possible when it comes in.<span style="mso-spacerun: yes;"> </span>The good news – your selfish objectives match
those of your employer, and the end-user!<o:p></o:p></span></div>
<br />
<div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">Let’s remember some of the challenges you and your
compatriots in the front-line support team face:<o:p></o:p></span></div>
<br />
<div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">Very often, the complaint from the customer takes the form
of “it didn’t work” (to which you want to ask, what DID it do?), “the action
never completed”, “I got an error message” (did you write it down or take a
screen shot?).<span style="mso-spacerun: yes;"> </span>Often the user doesn’t remember
what they were doing or even when (to be fair, they weren’t expecting to have
problems, so weren’t paying careful attention…).<span style="mso-spacerun: yes;"> </span>In some cases (e.g. for national security
agencies, or some military installations) you don’t get to watch the error,
and/or they need to “sanitize” log files before sending them.<span style="mso-spacerun: yes;"> </span>When the error is happening on a production
system, usually your options for experimentation are limited; and in some
cases, you need to wait as much as a couple of weeks for a deployment window to
install patches or even diagnostic code.<span style="mso-spacerun: yes;">
</span>And finally, every customer’s configuration is a bit different, and some
of those differences create unique situations that can be hard to even
notice.<span style="mso-spacerun: yes;"> </span>(Examples: the customer that had
300 Active Directory Domains – for most actions, this was actually fine, until
we needed to find a user that was no longer in his original domain – checking
all 300 domains took a while…<span style="mso-spacerun: yes;"> </span>Or the
customer that blocked access to some branches of the Active Directory
Hierarchy, but only for some accounts.)<span style="mso-spacerun: yes;"> </span><o:p></o:p></span></div>
<br />
<div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">When thinking about how to anticipate and respond to issues,
keep in mind that you have a number of audiences that need to know about the
error:<o:p></o:p></span></div>
<br />
<ul>
<li><div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;"><b style="mso-bidi-font-weight: normal;">The end user </b>who was trying to do something – you owe him or her some information that the action or analysis didn’t go fully as planned. You also need to be careful not to give him/her too much information (the last post mostly talks about this, so I won’t get into this communication in this post).</span></div>
</li>
</ul>
<ul>
<li><div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;"><strong>The IT staff</strong>, who need more information to help the user with his/her problem, and/or who need to fix an underlying problem (a corrupt SharePoint site, a missing web part, missing permissions, …)</span></div>
</li>
</ul>
<ul>
<li><div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;"><b style="mso-bidi-font-weight: normal;">The operations</b> staff needs to know about some kinds of errors (if the SQL server is down, or network connections aren’t being made, or your application has irretrievably crashed, they need to take action). (Note that in some environments, the operations staff are separate from the IT staff, sometimes in a different location, and sometimes working for a different vendor. One example of this is Microsoft Online Dedicated SharePoint environments – the operations staff are Microsoft data center people, but what I’m calling the IT staff above are customer employees (or contracted employees) with limited access to the operations staff.)</span></div>
</li>
</ul>
<ul>
<li><div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;"><b style="mso-bidi-font-weight: normal;">You!</b><span style="mso-spacerun: yes;"> </span>(or your fellow developers) who may be called upon to diagnose and work around the problem.</span></div>
</li>
</ul>
<div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">Where do you capture this information?<o:p></o:p></span></div>
<br />
<div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">For the operations staff, often they are using big expensive
monitoring software (such as Microsoft’s System Center) to alert them when
something needs their attention. <span style="mso-spacerun: yes;"> </span>Generally, the best way to make information
available to these tools is to place information into the Windows Event
log.<span style="mso-spacerun: yes;"> </span>In general, you want to create a
separate source for your application, and unique event IDs for each distinct
event – this allows the operations staff to create customized rules for
alerting on specific events (they might decide that some events are worth
waking people up at 3am for, and others can wait for business hours).<span style="mso-spacerun: yes;"> </span>Also note that in general, you only want to
write events that need attention by the operations staff (don’t flood them with
more routine functional errors).<span style="mso-spacerun: yes;"> </span>That suggests
that you want to be able to make a distinction between kinds of errors based on
the audience that needs to see them – more on this below.<o:p></o:p></span></div>
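<br />
<div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">As a rough sketch of the "one source, unique event IDs" idea, here is a small Python registry. The source name, the event names and the ID numbers are all made up for illustration; this is not ControlPoint's actual list. The sketch only formats the record. Actually writing it to the Windows Event log is left out (on Windows, Python's <code>logging.handlers.NTEventLogHandler</code> is one option, but it needs the Win32 extensions installed).</span></div>

```python
from enum import IntEnum, unique

# Hypothetical event source name registered for the application.
EVENT_SOURCE = "ExampleApp"


@unique  # raises ValueError at import time if two events share an ID
class OpsEvent(IntEnum):
    """Events that need attention from the operations staff (illustrative)."""
    SQL_SERVER_UNREACHABLE = 1001
    NETWORK_CONNECTION_FAILED = 1002
    APPLICATION_CRASHED = 1003


def format_ops_event(event, detail):
    """Build the text of one event record; the ID is what alert rules key on."""
    return f"source={EVENT_SOURCE} id={int(event)} name={event.name} detail={detail}"


line = format_ops_event(OpsEvent.SQL_SERVER_UNREACHABLE, "connection timed out")
print(line)
```

<div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">Keeping the IDs in one place makes it hard to reuse a number by accident, and gives the operations staff a single list to build their alerting rules from.</span></div>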
<br />
<div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">For the customer-oriented IT staff, and for the developers,
information could be written to tables in the database, or text files.<span style="mso-spacerun: yes;"> </span>While there are benefits for each, my own preference
is for plain old text files.<span style="mso-spacerun: yes;"> </span>They are
easy to purge, easy to compress, easy to send through email, and there are an
assortment of tools for searching and filtering them (remember, grep isn't just
for Unix weenies, and even Excel can be used effectively for sorting and
filtering).<span style="mso-spacerun: yes;"> </span>Two recommendations on text
files: use an infrastructure that allows you to roll the files over based on
dates and/or size (to keep them from growing indefinitely), and give the
customer the option to place them on a non-default drive (some environments
prohibit writing log files to the C drive, because of concerns they could
consume all of the disk space on a critical drive.)<span style="mso-spacerun: yes;"> </span>In a Microsoft SharePoint Online Dedicated
environment, you have to write to ULS logs, which have some advantages
(automatic space management, and some nice viewing applications) but some
disadvantages (they can be VERY noisy and large…)<o:p></o:p></span></div>
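<br />
<div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">To make the two text-file recommendations concrete, here is a minimal sketch using Python's standard <code>logging</code> module. The logger name, file name and size limits are assumptions chosen so the rollover is easy to see, and a temporary directory stands in for the customer-selected drive. <code>RotatingFileHandler</code> rolls over by size; the standard library's <code>TimedRotatingFileHandler</code> does the same by date.</span></div>

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def build_file_logger(log_dir, max_bytes=1_000_000, backups=5):
    """Create a logger that writes to log_dir and rolls over by size."""
    os.makedirs(log_dir, exist_ok=True)  # log_dir is configurable, not hard-coded to C:
    handler = RotatingFileHandler(
        os.path.join(log_dir, "app.log"), maxBytes=max_bytes, backupCount=backups
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger = logging.getLogger("example.filelog")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # keep these entries out of the root logger
    return logger


# Tiny limits so that a handful of entries is enough to force rollovers.
log_dir = tempfile.mkdtemp()
logger = build_file_logger(log_dir, max_bytes=200, backups=2)
for i in range(20):
    logger.info("discovery step %d finished", i)

files = sorted(os.listdir(log_dir))
print(files)
```

<div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">With <code>backups=2</code> the directory never holds more than three files, so the total space used stays bounded no matter how long the application runs.</span></div>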
<br />
<div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">OK, what should you put into the log entries for the IT
staff and yourself?<o:p></o:p></span></div>
<br />
<div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">First of all, you want the ability to
select the amount and type of detail captured; this gives you, support, and the customer
the freedom to deal with different situations without overwhelming *everyone*
with too much information. By default,
you normally will capture only errors, but for troubleshooting situations, both
SharePoint itself and ControlPoint adopted a two-dimensional model of settings –
the first dimension selects the functional area of the application that needs
investigation (e.g. Active Directory interactions, or dealing with metadata, or
database interactions), and the second dimension chooses the level of detail.
For ControlPoint, we supported None, Error, Warning, Information, and Verbose. I would
suggest sub-dividing Error into two categories (Infrastructure Error and Application
Error) – this lets you distinguish the messages that need to go to the event
viewer for consumption by the operations team (can’t connect to the SQL
server) from the errors that merely elaborate on a report that didn’t run to
completion. The important distinction
here is who needs to see the error (and therefore where it needs to be written).
So, the customer could choose to write
infrastructure errors to the event log, but not application errors.<o:p></o:p></span></div>
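<br />
<div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">Here is one way the two-dimensional model could look in code. This is a sketch under my own assumptions, not how ControlPoint or SharePoint actually implement it: the area names, the <code>destinations</code> function and the "file" / "event_log" labels are all hypothetical. The first dimension is the functional area, the second is the level of detail, and the infrastructure/application split decides who gets to see an error.</span></div>

```python
from enum import IntEnum


class Level(IntEnum):
    """Second dimension: level of detail, from least to most."""
    NONE = 0
    ERROR = 1
    WARNING = 2
    INFORMATION = 3
    VERBOSE = 4


# The suggested split of Error by audience.
INFRASTRUCTURE = "infrastructure"  # operations staff need to see it
APPLICATION = "application"        # IT staff and developers only

# First dimension: functional area -> configured level (illustrative values).
settings = {
    "active_directory": Level.VERBOSE,  # area under investigation
    "metadata": Level.ERROR,
    "database": Level.ERROR,
}


def destinations(area, level, error_kind=None, default=Level.ERROR):
    """Return where a message should be written: 'file' and/or 'event_log'."""
    configured = settings.get(area, default)
    if level == Level.NONE or level > configured:
        return []  # more detailed than this area is configured to capture
    targets = ["file"]
    if level == Level.ERROR and error_kind == INFRASTRUCTURE:
        targets.append("event_log")
    return targets


print(destinations("database", Level.ERROR, INFRASTRUCTURE))
print(destinations("metadata", Level.VERBOSE))
```

<div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">Turning up the detail for one area (here, Active Directory) leaves the other areas quiet, which is the point of having the first dimension at all.</span></div>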
<br />
<div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">Finally, when you write an entry to the log file, what does
it need to include?<span style="mso-spacerun: yes;"> </span>The underlying
principle is “whatever you’re going to need to sort out the problem”.<span style="mso-spacerun: yes;"> </span>(Of course, when you’re writing the code, you
don’t know what the problem will be, or you’d fix it before it happened…)<span style="mso-spacerun: yes;"> </span><o:p></o:p></span></div>
<br />
<div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">I would include at least the following:<o:p></o:p></span></div>
<br />
<ul>
<li><div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">Date and Time (use the server’s time, not UTC – it will be a lot easier to find entries)</span></div>
</li>
</ul>
<ul>
<li><div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">If it is possible, what action the user was engaged in (this is important for two reasons – first, the user may not even remember what s/he was doing, and second, many different users might be using the system at the same time).</span></div>
</li>
</ul>
<ul>
<li><div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">If it is possible, the name of the user (again, if multiple users are using the system, this can be useful to distinguish one user’s activity from another’s)</span></div>
</li>
</ul>
<ul>
<li><div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">If multiple processes may be in use, the process ID, and if multiple threads may be used, the thread ID (similar to the user ID, this can be useful to distinguish one user’s activity from another’s)</span></div>
</li>
</ul>
<ul>
<li><div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">As noted above, a unique code for each application-level message (error, warning, information, etc.) – this is important for external monitoring tools, and can also be useful for filtering entries while analyzing a log file.</span></div>
</li>
</ul>
<ul>
<li><div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">When an error is returned or an exception thrown from an API call, the error number and/or the message returned.</span></div>
</li>
</ul>
<ul>
<li><div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">The stack trace (this can tell you a lot about what the application was doing at the time of the error.)</span></div>
</li>
</ul>
<ul>
<li><div class="MsoNormal" style="margin: 0in 0in 8pt;">
<span style="font-family: Calibri;">When appropriate, the actual data being passed to the API or other method. (So, for example, instead of merely reporting an error trying to instantiate a site, list or item object, provide the exact URL of the site, list or item – this can help to expose either subtle logic flaws, or configuration errors. Let’s say that the user configured a URL without a trailing slash when it was needed – the absence of the slash could show up in the data passed to the SPSite constructor. Or if an Active Directory group name is shown as a long number, it could expose the fact that the code failed to translate the group’s SID into a name.)</span></div>
</li>
</ul>
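<br />
As a rough illustration of the list above, the following Python sketch gathers those fields into one log entry. The function name, the field names and the message code format are hypothetical; the point is only that each item on the list has a place in the entry.<br />

```python
import os
import threading
import traceback
from datetime import datetime


def build_log_entry(code, action, user, exc=None, data=None):
    """Collect the fields suggested above into a single dictionary."""
    entry = {
        # Server's local time rather than UTC, as suggested above.
        "time": datetime.now().isoformat(timespec="seconds"),
        "action": action,                  # what the user was doing
        "user": user,                      # who was doing it
        "pid": os.getpid(),                # process ID
        "thread": threading.get_ident(),   # thread ID
        "code": code,                      # unique code for this message
        "data": data,                      # the actual data passed to the API
    }
    if exc is not None:
        # The message returned by the failing call, plus the stack trace.
        entry["api_message"] = f"{type(exc).__name__}: {exc}"
        entry["stack"] = "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        )
    return entry
```

A caller would typically build the entry inside the exception handler, passing the exact URL or other input as data, so that something like a missing trailing slash is visible in the log.<br />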
<br />
<b>Granularity</b> (2013-11-06)<br />
<br />
In my last post, one of the recommendations I made was to "Keep on Truckin'": in other words, keep the application running and accomplishing as much as can be done, even in the face of failures and bad inputs in the environment. In this article, I want to explore that a bit more, and specifically talk about the granularity of the response to errors.<br />
<br />
But first, a side observation: Exception handling is one of the more wonderful things that Bjarne Stroustrup incorporated into C++ and that was adopted by derivative languages. While there is an overhead associated with try...catch blocks, I would argue that the value of extensive and fine-grained use of exception handlers is well worth the relatively small cost. Fine-grained exception handling is a good thing.<br />
<br />
Which brings us back to granularity: my advice is to trap, handle, and recover from errors in the environment at the smallest practicable granularity. So, what does that mean? Of course it is hard to say in general ("practicable" is so subjective!), but I would say that the trapping of errors and/or the location of the try...catch blocks should be at least at the level of the lowest-level item included explicitly in an analysis, or acted upon by an action.<br />
<br />
Some examples might help:<br />
<ul>
<li>In Outlook, suppose that I've got a .pst file for my email from the 1980s (of course, I converted it from my ccMail client...), and I'm searching for email from Bill Gates. Suppose the report of the standardization committee on the Basic language was somehow corrupted. If the exception handling is at the level of the .pst file, I might not find <strong><em>any</em></strong> messages from Bill, but if the exception handling is at the level of the item, I'd miss the committee report, but I might still see the "Thanks for your valuable suggestion" message. (No, I never got one of those, but if I did, you can be sure I'd hang onto it :-) </li>
<li>In ControlPoint, we implemented a Comprehensive Permissions Report that can report on permissions right down to the individual list item level - in this case, if we are unable to read the permissions on an individual item, we do our best to report on the permissions of the site, the lists, and any other items that we are able to read.</li>
</ul>
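The difference between the two granularities in the examples above can be sketched in a few lines of Python. This is only an illustration: the function and parameter names are hypothetical, and a real mail client or permissions report would be far more involved. The key point is that the try block sits inside the loop, around the individual item, rather than around the whole search.<br />

```python
def search_items(items, is_match):
    """Search item by item; one unreadable item costs us only that item."""
    found = []
    unreadable = []
    for item in items:
        try:
            if is_match(item):
                found.append(item)
        except Exception as exc:
            # Remember what failed so it can be reported, then keep going.
            unreadable.append((item, exc))
    return found, unreadable
```

Had the try block wrapped the whole loop instead, the first corrupted item would have ended the search and hidden every later match.<br />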
Note that I've expressed this in terms of exception handling, but remember that we still have API methods that report information in return codes, and/or in return values (as in, a null object pointer may be a null result) - which brings me to a pet peeve: there is no <strong><em>good</em></strong> reason to have an "object reference not set to an instance of an object" error - if your code reports that error, you probably weren't checking a return value. And a recommendation to code reviewers: this is one of the things you should check for.<br />
<br />
OK, so you catch an exception or detect a null object reference at an appropriate level - now what do you do? It is worth keeping in mind that you actually have three audiences that need to know something about what happened: the end user who initiated the action, the IT staff that needs to know something went wrong, and the developer and/or support team responsible for the application - each of those wants something slightly different:<br />
<strong><u>User</u></strong><div>
Of course, you owe it to the user to tell him/her that something went wrong, and that the report s/he asked for is incomplete or that an action wasn't carried out as fully as the user expected. If the report or action has a status message, a completion message like "Completed with errors" is a good balance - you tell the user that results are available, but they are not 100%...</div>
<div>
</div>
But how do you communicate what is missing, e.g. which sites are not included in the tally of storage, which sites are missing the list of running workflows, etc.? For reports, the best solution is to either replace the object name with a string like *** Error *** if you are unable to fetch the name of the object, or append a string to the name if you have the name, but not an important property value. Note that you want to make sure that the string is searchable - in a 3,000 page report, you can't rely on manual inspection to pick out the items with the errors - help the computer to do what computers do best!<br />
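<br />
Here is a small Python sketch of that marker idea, assuming a storage report with one row per site. The helper names (get_name, get_storage_mb) and the row format are hypothetical; only the "*** Error ***" string and the "Completed with errors" status come from the discussion above.<br />

```python
ERROR_MARKER = "*** Error ***"   # one fixed, searchable string


def build_report(sites, get_name, get_storage_mb):
    """Return (status, rows); rows with problems carry ERROR_MARKER."""
    rows = []
    had_errors = False
    for site in sites:
        try:
            name = get_name(site)
        except Exception:
            # No name available: the marker stands in for the object name.
            rows.append(ERROR_MARKER)
            had_errors = True
            continue
        try:
            rows.append(f"{name}: {get_storage_mb(site)} MB")
        except Exception:
            # Name is known but a property is missing: append the marker.
            rows.append(f"{name} {ERROR_MARKER}")
            had_errors = True
    status = "Completed with errors" if had_errors else "Completed"
    return status, rows
```

Because every problem row contains the same marker, a plain text search of the finished report finds all of them, however long the report is.<br />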
<br />
In actions, if the action has some form of task audit report (always a good idea!), then simply list in that report the objects that weren't processed because of errors.<br />
<br />
<strong><u>IT Staff / Developer / Support</u></strong><div>
While a lot of technical detail will cause most end users to glaze over, most IT staff can tolerate technical detail, even if they don't fully understand it. In other words, you can probably combine the information you want to communicate to the IT staff with the information you want to communicate back to yourself (as a quick preview of a later post - remember what happens if you take a support call: *you* are one of the audiences you're writing this information for!).</div>
<div>
</div>
We'll talk about what information to capture and where in a later post, but for now, think about the information that the IT staff needs - I will mention a few things to be sure you include:<ul>
<li>What action was the user in the middle of? (Remember, the log file may contain information about a lot of different users' activities - clarifying the action can help to correlate the log entry to the user's action.)</li>
<li>What object did the error occur on? (If the user ran a report on the entire farm, s/he won't know which site was the corrupt one - helping the IT manager find that site can enable them to do something about it.)</li>
<li>The message returned from the API - often this contains exactly the information that the IT person needs to know. (To illustrate with a common problem: if the password changes on a service account, often the error message will tell the IT person exactly what the problem is, maybe even without calling vendor support!)</li>
</ul>
Note that there is a real dichotomy between the information you present to the user, and the information needed by IT - a couple of notes here: <ul>
<li>There are a small number of customers in highly secure environments (curiously, these tend to be in financial industries, rather than military or security agencies) for whom display of API-generated messages to end users is a security risk (since the API message may expose information about the system). You may need to accommodate those users. On the other hand, displaying messages from the API directly to users can often help debugging, both in QA, and when support is online with the customer and can watch what is happening on the screen, so it is very tempting. In ControlPoint, we introduced a customer-settable option that specified whether error details should be displayed to the user. </li>
<li>If you want to display detailed technical information to the user, it would be best to prefix it with something like "Please provide the following information to your help desk or IT department" - that gives the end user permission not to try to understand it.</li>
<li>While it can be useful to display information about what source data gave rise to the error (e.g. the name of the site that the report couldn't report on), be careful that you don't inadvertently expose information. If the user asked for the storage consumed by the HR site collection, telling the user that the site \HR\Layoffs\AlfredENeumann is inaccessible might expose information the user shouldn't know.</li>
<li>In this vein, the model adopted by SharePoint starting in version 2010 of presenting the user with a sparse message containing a correlation ID works very nicely - the message seen by the user has very little information (maybe even too little), but the correlation ID allows the IT department to find the exact error message in the ULS logs - the only thing they missed was to warn the user that they'd actually need the correlation ID before they dismiss the dialog...</li>
</ul>
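The split between what the user sees and what IT sees can be sketched as follows. This is a hypothetical Python illustration, not SharePoint's or ControlPoint's implementation: the function name, the show_details flag and the in-memory list standing in for the log are all assumptions.<br />

```python
import uuid


def report_failure(exc, log_entries, show_details=False):
    """Log the full error; give the user a sparse message plus a correlation ID."""
    correlation_id = str(uuid.uuid4())
    # Full technical detail goes to the log, keyed by the correlation ID.
    log_entries.append({
        "correlation_id": correlation_id,
        "error": f"{type(exc).__name__}: {exc}",
    })
    message = (
        "Something went wrong. Please provide the following information "
        f"to your help desk or IT department: {correlation_id}"
    )
    # Customer-settable option: only show API details when explicitly allowed.
    if show_details:
        message += f"\nDetails: {exc}"
    return message
```

With show_details left off, nothing from the underlying error (such as a site path) reaches the screen, yet IT can still find the exact entry by searching the log for the ID.<br />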
<br />
<b>Lessons from medical device software</b> (2013-11-04)<br />
<br />
A number of years ago, I worked with someone whose husband had an automatic, implantable defibrillator to address heart problems.<br />
<br />
As a software engineer, my first reaction was "Wow - they had to get that <b><i>exactly </i></b>right". As a Software Development manager, my second reaction was to think about what lessons other less critical software projects could take from that. There are at least three that come to mind:<br />
<br />
<b>1) Some things need more care than others.</b><br />
<br />
That lesson can be applied at multiple levels - first, some applications are really critical (the automatic defibrillator could kill you if it fails; the navigation system for the Space Shuttle had to be exactly right or the astronauts wouldn't get home again...), but in contrast, a report on what class of users prefer Kleenex over the store brand isn't going to kill anyone if it is off by 10%.<br />
<br />
At another level, within an application, some <b><i>functions </i></b>are more critical than others. Windows XP used to (I guess it still does for anyone who's running it...) contain a neat defragmenter that started with a very colorful diagram showing you how badly fragmented the disk is, and then proceeded to move sectors around on disk to reduce the fragmentation. Now, if the report was wrong, frankly, most of us would never know, and probably wouldn't care a lot. On the other hand, if it lost a sector on disk during the defragmentation, that could corrupt a really critical file (my status report to my boss, the grant application for my foundation, my resume...). My hope and expectation is that the team creating that defragmenter spent a lot more time and care on the design, code reviews and testing of the defragmentation aspects of the program than they did on the display.<br />
<br />
<b>2) Keep on Truckin'</b><br />
<br />
(Google it, if you're not familiar with the phrase...)<br />
<br />
Back to the defibrillator - if it encounters an error (let's say the battery level drops momentarily, or the user gets too close to a microwave that temporarily messes up the sensor that detects the heart's contractions), the defibrillator can't exactly display "Error, unexpected input, please restart" - it has to ignore the spurious input, keep running, and wait for the error condition to clear itself (and if possible, clear the condition).<br />
<br />
So how does that apply to those of us who are more focused on enterprise software? It turns out that this is one of the hardest lessons for most application developers - the "Error, unexpected input" response is our first reaction, and it does have the advantage of alerting the user to a problem. But for highly available software (that is what you want to build, isn't it?) that's not the best response.<br />
<br />
Consider a mail server; while it normally isn't as life-critical as a defibrillator, mail has become a business critical function in today's world. If your mail server encounters an SMTP protocol error when Joe's Garage is trying to send a "your car is ready" message, you don't want the mail server to shut down and wait for Joe to get his act together. Instead, you want it to continue to listen for other email (the announcement that your company just won that million dollar deal, for instance, or the support question from a customer). If the disk defragmenter encounters a bad sector on a write, you'd like it to mark it as bad, and try another one; or if it encounters a bad sector on a read, you want it to put as much of the file together as it can. If the SharePoint search indexer encounters a corrupted site, you want it to continue to crawl all of the other sites, so that you can search for *most* of the content of the farm.<br />
<br />
Now, this needs a bit of care, which begins to overlap with my next lesson. Consider the disk defragmenter - if it gets 100 bad sectors in a row, that's probably an indication that the disk or the controller has gone bad in a big way - continuing to try to defragment the disk could very possibly make things a lot worse. If the mail server gets 20 errors in a row trying to send a message to Joe's Garage ("I'll pick it up at 6:00 pm"), it's probably a waste of time trying the 21st time.<br />
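<br />
One way to express that caution in code is to keep going past isolated failures but stop after a long unbroken run of them. The Python sketch below is only an illustration under that assumption: the function name, the handle callback and the default threshold of 100 are hypothetical.<br />

```python
def process_all(items, handle, max_consecutive_failures=100):
    """Keep going past isolated failures, but stop after a long run of them.

    Returns (done, failed, aborted).
    """
    done, failed = [], []
    streak = 0   # failures in a row since the last success
    for item in items:
        try:
            handle(item)
            done.append(item)
            streak = 0
        except Exception:
            failed.append(item)
            streak += 1
            if streak >= max_consecutive_failures:
                # Something is badly wrong; continuing may make it worse.
                return done, failed, True
    return done, failed, False
```

Scattered failures never trip the threshold because each success resets the streak; only an uninterrupted run does.<br />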
<br />
In short, if your "Keep on Truckin'" response is to retry an operation, you normally want an upper bound on the retries, or you could spend more time on the retries than on productive work, which in the end would defeat the "Keep on Truckin'" objective. While you're at it, let your customer override your default number of retries (because someone is going to want a different threshold than the rest of your customers.)<br />
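<br />
A bounded retry with a customer-overridable limit might look like the Python sketch below. The names and the default of 3 retries are hypothetical, and a real implementation would probably also wait between attempts and retry only on errors known to be transient.<br />

```python
DEFAULT_MAX_RETRIES = 3   # hypothetical default; the customer can override it


def with_retries(operation, max_retries=DEFAULT_MAX_RETRIES):
    """Run operation(); on failure, retry at most max_retries more times."""
    last_error = None
    for _attempt in range(1 + max_retries):
        try:
            return operation()
        except Exception as exc:
            last_error = exc   # remember the failure and try again
    # Upper bound reached: give up and report the last failure.
    raise last_error
```

The caller decides what happens once the bound is reached, typically logging the failure for this one item and moving on to the next.<br />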
<br />
<b>3) "First do no harm"</b><br />
<br />
(While those words are not actually in the Hippocratic oath, the spirit of it is there...) In more mundane terms, we need to consider the failure mode of our software.<br />
<br />
Continuing with the example of the defibrillator - consider what the defibrillator should do when the battery level begins to drop or even if there is a power surge - you don't want the defibrillator to go into defib mode just because the inputs are out of range - you'd rather it go into quiet mode until things become normal.<br />
<br />
So, what does this mean for enterprise software? I'll use an example from Axceler's ControlPoint - the product includes a function to move SharePoint sites - it does this by exporting a copy of the site, importing that copy to the new location, and then as an optionally delayed operation, deleting the source site. First of all, if the import of the copy of the site to the destination has an error, we do not want to delete the source site. Similarly, the Move function has an option to take a separate backup before the move - if the backup has an error, then again, we do not want to delete the source site. In all cases, the response to errors (the failure mode) is to do nothing. While this is fairly obvious when stated this way, it can be easy to miss if you don't think about it while building the function.<br />
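<br />
The ordering described above can be sketched in Python as follows. This is not ControlPoint's code: the function and the callbacks it takes (export_site, import_site, schedule_delete, backup) are hypothetical stand-ins for the real operations. What matters is that the delete is only ever scheduled after every earlier step has succeeded.<br />

```python
def move_site(source, destination, export_site, import_site,
              schedule_delete, backup=None):
    """Move a site; on any error, the failure mode is to leave the source alone."""
    try:
        if backup is not None:
            backup(source)                 # optional separate backup first
        package = export_site(source)      # copy out of the source
        import_site(package, destination)  # copy into the new location
    except Exception as exc:
        # Backup, export or import failed: do nothing further.
        return f"Move failed; source left untouched ({exc})"
    # Reached only if every step above succeeded.
    schedule_delete(source)
    return "Move completed; source scheduled for deletion"
```

Scheduling the delete, rather than deleting immediately, leaves room for the delayed-delete option: the user can still cancel if the destination turns out to have a problem the code could not detect.<br />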
<br />
Finally, at the functional design level, there are some less obvious things that can be done. We implemented the delayed delete of the source partly to provide a validation and recovery option: if a problem occurred with the destination, even a subtle one that the code could not detect, then the user has the time to notice, and has the option to cancel the delete and revert back to the original site.
<div>
<br /></div>
<br />
<b>Who am I, and why this blog</b> (2013-11-04)<br />
<br />
I've been building software, generally enterprise-level COTS (Commercial Off The Shelf) software, for longer than I care to admit. I've done it in some very large organizations (Wang Laboratories, and Digital Equipment Corporation) and some very small, 20-person organizations. I've had the chance to work with some brilliant people who showed me what to do right, and some folks who showed me things NOT to do. As I have tried to impart that learning to my teams, I tend to fall back on two things: stories and metaphors (and in the end, they're not all that different). <br />
<br />
So this blog is an opportunity to capture some of those stories, and occasionally metaphors, in the hopes that some folks will find them useful.