IBM REDP-4285-00

Chapter 3. Analyzing performance bottlenecks

Draft Document for Review May 4, 2007 11:35 am 4285ch03.fm

The fact that the problem can be reproduced enables you to see and understand it better.

Document the sequence of actions that are necessary to reproduce the problem:

– What are the steps to reproduce the problem?

Knowing the steps may help you reproduce the same problem on a different machine under the same conditions. If this works, you can investigate on a machine in a test environment and avoid the risk of crashing the production server.

– Is it an intermittent problem?

If the problem is intermittent, the first thing to do is to gather information and find a way to move the problem into the reproducible category. The goal is a scenario that makes the problem happen on command.

– Does it occur at certain times of the day or certain days of the week?

This might help you determine what is causing the problem. It may occur when everyone arrives for work or returns from lunch. Look for ways to change the timing (that is, make it happen less or more often); if there are ways to do so, the problem becomes a reproducible one.

– Is it unusual?

If the problem falls into the non-reproducible category, you may be tempted to conclude that it is the result of extraordinary conditions and classify it as fixed. In practice, there is a high probability that it will happen again.

A good procedure for troubleshooting a hard-to-reproduce problem is to perform general maintenance on the server: reboot, or bring the machine up to date on drivers and patches.
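The timing questions above can be checked against recorded incident times. The following is a minimal sketch, not a prescribed procedure: it assumes you have collected incident timestamps (from user reports or monitoring alerts) as ISO 8601 strings. The sample data and the names `incident_times`, `WEEKDAYS`, and `tally_incidents` are hypothetical.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical incident timestamps (ISO 8601, local time); replace them
# with times taken from user reports or monitoring alerts.
incident_times = [
    "2007-04-02T09:05:00",
    "2007-04-02T13:10:00",
    "2007-04-03T09:12:00",
    "2007-04-04T09:02:00",
    "2007-04-05T13:05:00",
]

def tally_incidents(timestamps):
    """Count incidents per hour of day and per weekday name."""
    by_hour = Counter()
    by_weekday = Counter()
    for text in timestamps:
        moment = datetime.fromisoformat(text)
        by_hour[moment.hour] += 1
        by_weekday[WEEKDAYS[moment.weekday()]] += 1
    return by_hour, by_weekday

by_hour, by_weekday = tally_incidents(incident_times)

# Print the busiest hours and days first.
for hour, count in by_hour.most_common():
    print(f"{hour:02d}:00  {count} incident(s)")
for day, count in by_weekday.most_common():
    print(f"{day:<9}  {count} incident(s)")
```

If most incidents cluster around one or two hours, as in the sample data, that cluster suggests when to schedule monitoring or a reproduction attempt.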

• When did the problem start? Was it gradual or did it occur very quickly?

If the performance issue appeared gradually, then it is likely to be a sizing issue; if it appeared overnight, then the problem could be caused by a change made to the server or peripherals.

• Have any changes been made to the server (minor or major) or are there any changes in the way clients are using the server?

Did the customer alter something on the server or peripherals that could have caused the problem? Is there a log of all network changes available?

Business changes can alter how the server is used, which in turn affects the demands on servers and network systems.

• Are there any other servers or hardware components involved?

• Are any logs available?

• What is the priority of the problem? When does it have to be fixed?

– Does it have to be fixed in the next few minutes, or in days? You may have some time to fix it, or it may already be time to operate in panic mode.

– How massive is the problem?

– What is the related cost of that problem?
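For the earlier question of whether the problem started gradually or very quickly, historical measurements can support the answer. The sketch below is one simple heuristic, offered as an illustration only: it assumes evenly spaced samples of a single metric (for example, daily average response time) and calls the onset sudden when one step accounts for at least half of the total rise. The function `classify_onset`, the `jump_share` threshold, and the sample values are all hypothetical.

```python
def classify_onset(samples, jump_share=0.5):
    """Label a metric's rise as 'gradual', 'sudden', or 'no increase'.

    samples: evenly spaced measurements, oldest first.
    jump_share: assumed threshold; if a single step accounts for at
    least this fraction of the total rise, the onset is called sudden.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    total_rise = samples[-1] - samples[0]
    if total_rise <= 0:
        return "no increase"
    # Differences between each pair of consecutive samples.
    steps = [later - earlier for earlier, later in zip(samples, samples[1:])]
    if max(steps) >= jump_share * total_rise:
        return "sudden"
    return "gradual"

# Hypothetical daily average response times in milliseconds.
slow_growth = [100, 105, 111, 118, 124, 131, 140]
overnight_jump = [100, 101, 99, 102, 180, 182, 181]

print(classify_onset(slow_growth))     # gradual
print(classify_onset(overnight_jump))  # sudden
```

A gradual result points toward a sizing issue, and a sudden one toward a recent change. With only a few samples the heuristic is unreliable, so treat the label as a prompt for further questions, not as a diagnosis.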