Chapter 3. Analyzing performance bottlenecks 79
Draft Document for Review May 4, 2007 11:35 am 4285ch03.fm
The fact that the problem can be reproduced enables you to see and understand it better.
Document the sequence of actions that are necessary to reproduce the problem:
What are the steps to reproduce the problem?
Knowing the steps may help you reproduce the same problem on a different machine
under the same conditions. If this works, it gives you the opportunity to use a machine
in a test environment and removes the chance of crashing the production server.
Is it an intermittent problem?
If the problem is intermittent, the first thing to do is to gather information and find a way
to move the problem into the reproducible category. The goal here is to have a scenario
that makes the problem happen on command.
Does it occur at certain times of the day or certain days of the week?
This might help you determine what is causing the problem. It may occur when
everyone arrives for work or returns from lunch. Look for ways to change the timing
(that is, make it happen less or more often); if there are ways to do so, the problem
becomes a reproducible one.
Is it unusual?
If the problem falls into the non-reproducible category, you may be tempted to conclude
that it is the result of extraordinary conditions and classify it as fixed. In practice,
however, there is a high probability that it will happen again.
A good procedure to troubleshoot a hard-to-reproduce problem is to perform general
maintenance on the server: reboot, or bring the machine up to date on drivers and
patches.
• When did the problem start? Was it gradual or did it occur very quickly?
If the performance issue appeared gradually, then it is likely to be a sizing issue; if it
appeared overnight, then the problem could be caused by a change made to the server or
peripherals.
• Have any changes been made to the server (minor or major) or are there any changes in
the way clients are using the server?
Did the customer alter something on the server or peripherals to cause the problem? Is
there a log of all network changes available?
Demands could change based on business changes, which could affect the load on
servers and network systems.
• Are there any other servers or hardware components involved?
• Are any logs available?
• What is the priority of the problem? When does it have to be fixed?
Does it have to be fixed in the next few minutes, or in days? You may have some time to
fix it, or it may already be time to operate in panic mode.
How massive is the problem?
What is the related cost of that problem?
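The question of whether a problem occurs at certain times of the day or on certain days of the week can often be answered from incident timestamps that are already on hand, such as help desk tickets or log entries. The following is a minimal sketch in Python, not a prescribed tool. The function name bucket_incidents, the timestamp format, and the sample data are all assumptions made for illustration. It counts incidents per weekday and hour so that clusters (for example, around the start of the workday or after lunch) stand out.

```python
from collections import Counter
from datetime import datetime

# Fixed weekday names so the result does not depend on the system locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def bucket_incidents(timestamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Count incidents per (weekday name, hour of day) bucket.

    `timestamps` is an iterable of strings in the (assumed) format `fmt`.
    """
    counts = Counter()
    for ts in timestamps:
        when = datetime.strptime(ts, fmt)
        counts[(WEEKDAYS[when.weekday()], when.hour)] += 1
    return counts


# Hypothetical incident times, for illustration only.
incidents = [
    "2007-04-30 09:02:11",
    "2007-04-30 13:05:40",
    "2007-05-01 09:04:53",
    "2007-05-02 09:01:07",
    "2007-05-02 13:03:29",
]

by_bucket = bucket_incidents(incidents)

# Collapse the weekday dimension to see which hours dominate.
by_hour = Counter()
for (day, hour), n in by_bucket.items():
    by_hour[hour] += n

for hour, n in by_hour.most_common():
    print(f"{hour:02d}:00  {n} incident(s)")
```

If most incidents fall into one or two buckets, that time window is a natural place to schedule closer monitoring or a deliberate attempt to reproduce the problem.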