Mike Conigliaro

A Scientific Method for Troubleshooting

I think there’s a new heavyweight champion in the world of Mike’s IT-related pet peeves. It’s called “just trying a bunch of random things until my problem goes away.” This is in no way related to troubleshooting, which I will define as “uncovering the root cause of an issue and then resolving it deliberately.” While there are many different techniques for troubleshooting specific problems, I’m going to attempt to show how the scientific method can provide a common framework for more effective troubleshooting.

Step 1: Describe the problem

This is the easy part. Find out what the problem is and reproduce it. That second part is important, because the typical user can’t always be trusted to know what they’re doing. By reproducing the problem yourself, you can rule out user error and verify that there actually is a problem.
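As a rough illustration of what "reproduce it" can look like when the problem is scriptable, here is a minimal Python sketch. The `reproduce` helper and the `flaky_check` function are hypothetical names I made up for this example; in practice the check would be whatever action the user says is failing.

```python
def reproduce(check, attempts=5):
    """Run a check several times and return the failures that were observed."""
    failures = []
    for i in range(attempts):
        try:
            check()
        except Exception as exc:
            # Record which attempt failed and how, instead of guessing later.
            failures.append((i, repr(exc)))
    return failures


# Hypothetical stand-in for the reported problem: fails on every second call.
calls = {"n": 0}

def flaky_check():
    calls["n"] += 1
    if calls["n"] % 2 == 0:
        raise ConnectionError("connection reset")


failures = reproduce(flaky_check, attempts=4)
print(f"{len(failures)} of 4 attempts failed")
```

An empty list suggests user error or a problem you haven't described accurately yet. A mix of passes and failures tells you the problem is intermittent, which is worth knowing before you move on.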

Step 2: Gather and analyze data

This is the part everyone likes to skip, and it is the real subject of this rant. Step two requires direct observation in order to find out exactly what is happening. How you do this depends very much on the problem at hand, but it involves things like log and packet analysis (which may very well require a third-party tool not included in the base OS). In any case, please don’t just take a wild guess about what’s causing the problem and then proceed down the path of random experimentation. I don’t know for sure how people acquire this bad habit, but I have a strong suspicion that it comes from working in Microsoft Land, where the computers have personalities, rebooting is the inexplicable fix for everything, and a sea of GUIs makes it easy for the novice sysadmin to miss what’s really going on.

Part of the problem in Microsoft Land is that proper troubleshooting tends to be a lot more difficult than it needs to be. For example, there is a severe lack of useful diagnostic tools included within the Windows OS itself. Why are the Windows support tools, resource kit tools, and IIS diagnostic tools still separate downloads? The same question can be asked about the Sysinternals Suite (which Microsoft has owned for several years now). Why are they still shipping obsolete utilities instead of newer replacements (e.g. nslookup, which was obsoleted by dig many, many years ago)? And lastly, why does Microsoft constantly try to hide any information that could be useful for troubleshooting? Anyone who has ever had to view the message headers on an email in Outlook knows exactly what I’m talking about here, but I digress.
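Getting back to the point: direct observation doesn't have to be fancy. Here is a minimal Python sketch of the kind of log analysis I mean, which counts the most frequent warnings and errors. The log format (`<timestamp> <LEVEL> <message>`), the sample lines, and the `summarize_errors` name are all assumptions for illustration; real logs will need a different pattern.

```python
import re
from collections import Counter

# Assumed log format: "<timestamp> <LEVEL> <message>"
LINE_RE = re.compile(r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$")


def summarize_errors(lines, levels=("ERROR", "WARN")):
    """Count messages at the given levels, most frequent first."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("level") in levels:
            counts[(match.group("level"), match.group("msg"))] += 1
    return counts.most_common()


# Made-up sample data standing in for a real log file.
sample = [
    "2024-01-01T10:00:00 INFO service started",
    "2024-01-01T10:00:05 ERROR connection refused to db:5432",
    "2024-01-01T10:00:06 WARN retrying in 1s",
    "2024-01-01T10:00:07 ERROR connection refused to db:5432",
]

for (level, msg), count in summarize_errors(sample):
    print(count, level, msg)
```

The output is a list of facts about what the system is actually doing, which is exactly what you need before you start theorizing about why.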

Step 3: Form a hypothesis

It isn’t until you figure out what’s happening that you can address the question of why it’s happening. Step three is where you use the information you gathered in step two to determine a logical course of action. Remember that a hypothesis beginning with “maybe” or “I think” with little or no direct evidence to back it up is often a dead giveaway for someone who doesn’t know what the hell they’re talking about.

Step 4: Test your hypothesis

Perform your planned course of action.

Step 5: Analyze results and draw conclusions

Check to see if the issue is resolved. If not, revert your changes and go back to step three. When drawing conclusions, ask yourself how this problem occurred in the first place. Was your most recent fix permanent or just a temporary band-aid? If the fix was temporary, make sure you schedule a time to implement a permanent fix.
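Steps three through five form a loop, and it can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `try_hypotheses` function, the `config` dictionary standing in for the system under test, and the two example hypotheses are all hypothetical.

```python
def try_hypotheses(hypotheses, problem_reproduces):
    """Test hypotheses one at a time, reverting any change that does not help.

    Returns the name of the hypothesis whose fix resolved the problem, or None.
    """
    for name, apply_fix, revert_fix in hypotheses:
        apply_fix()
        if not problem_reproduces():
            return name   # resolved: keep this change
        revert_fix()      # not resolved: undo it before trying the next one
    return None


# Hypothetical system state; the resolver address is the real culprit here.
config = {"resolver": "10.0.0.1", "mtu": 1500}


def problem_reproduces():
    return config["resolver"] != "10.0.0.53"


hypotheses = [
    ("mtu too large",
     lambda: config.update(mtu=1400),
     lambda: config.update(mtu=1500)),
    ("wrong resolver",
     lambda: config.update(resolver="10.0.0.53"),
     lambda: config.update(resolver="10.0.0.1")),
]

resolved_by = try_hypotheses(hypotheses, problem_reproduces)
print(resolved_by)
print(config)
```

The important design choice is the revert: the failed MTU change is undone before the next hypothesis is tested, so the system ends up with only the one change that actually fixed the problem.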
