How do we diagnose and fix system performance problems?

When working with production systems, it is common to encounter performance problems that must be analyzed and solved. These problems are usually complex and hard to reproduce: they occur intermittently and may depend on specific circumstances. When a few days pass without an incident, we may even start to hope that the problem will not come back.

What should we do to attack these problems?

We usually have two complementary approaches: attacking the problem directly or carrying out proactive tasks.
To attack the problem directly, we must follow a method in which the various actors (engineers, developers, the DBA, and the people in charge of the operating system, communications, the web server, etc.) work together to obtain the information needed to solve the problem.

One of the first things to do is to set aside the user’s subjective impressions and work with measured response times.

To this end, we can look at the system logs, the application logs, etc. If none of this is available, we can set up a virtual user (an automated script) that executes a specific test case and records how long each execution takes over, for example, a whole day. The purpose is to identify the hours of the day when the system slows down and to see how slow it becomes. A basic set of monitoring indicators is then defined for each component in the system. This monitoring must not be intrusive, so that it does not make response times worse.
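As an illustration, a virtual user of this kind could be sketched in Python as follows. This is only a minimal sketch under stated assumptions: `test_case` is a hypothetical callable that performs one run of the scenario being measured, and the function names are invented for this example.

```python
import statistics
import time
from collections import defaultdict
from datetime import datetime


def time_once(test_case):
    """Run test_case once; return (start timestamp, elapsed seconds)."""
    started_at = datetime.now()
    t0 = time.perf_counter()
    test_case()  # hypothetical callable: one execution of the scenario
    return started_at, time.perf_counter() - t0


def run_virtual_user(test_case, interval_s, iterations):
    """Execute test_case repeatedly, pausing interval_s seconds between runs."""
    samples = []
    for _ in range(iterations):
        samples.append(time_once(test_case))
        time.sleep(interval_s)
    return samples


def summarize_by_hour(samples):
    """Group samples by hour of day; return {hour: (count, median, max)}."""
    by_hour = defaultdict(list)
    for started_at, elapsed in samples:
        by_hour[started_at.hour].append(elapsed)
    return {
        hour: (len(values), statistics.median(values), max(values))
        for hour, values in sorted(by_hour.items())
    }
```

Running `run_virtual_user(test_case, interval_s=300, iterations=288)` would take one sample every five minutes for roughly a day, and `summarize_by_hour` then shows at which hours the median and worst-case times rise. A long interval keeps the script itself from adding noticeable load.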

Once the information on times and indicators is available, we can start the analysis, whose purpose is to identify the component that is the bottleneck.
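One possible first step in that analysis, shown here only as a hedged sketch, is to check which indicators move together with the response times. The code below assumes that response times and indicators were sampled at the same instants. The indicator names and the `rank_indicators` helper are hypothetical, and a high correlation is only a hint about where to look, not proof of a bottleneck.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if one is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_indicators(response_times, indicators):
    """Sort indicators by the strength of their correlation with response time."""
    scores = {
        name: pearson(values, response_times)
        for name, values in indicators.items()
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)
```

For example, calling `rank_indicators(times, {"db_cpu": [...], "web_memory": [...]})` returns the indicators ordered from most to least correlated, which suggests the component to examine first.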

Alternatively, the more proactive approach is to run a performance project in advance, where we simulate a specific load in an automated manner and then analyze the system in a way similar to that described above. The advantage of this approach is that a testing environment gives us more freedom to make significant changes to the system in a consistent and repeatable manner. We can also simulate more load than the system actually receives (for instance, if additional users are going to be added to the system).
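A very small load simulation along these lines could look like the sketch below, which runs several virtual users concurrently with Python's standard thread pool. It is an illustration under assumptions, not a replacement for a dedicated load testing tool: `test_case` is again a hypothetical callable for one execution of the scenario, and `run_load` is a name invented for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def virtual_user(test_case, iterations):
    """One simulated user: run test_case several times and time each run."""
    timings = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        test_case()  # hypothetical callable: one execution of the scenario
        timings.append(time.perf_counter() - t0)
    return timings


def run_load(test_case, users, iterations):
    """Run `users` virtual users concurrently; return all recorded timings."""
    with ThreadPoolExecutor(max_workers=users) as pool:
        futures = [
            pool.submit(virtual_user, test_case, iterations)
            for _ in range(users)
        ]
        return [t for future in futures for t in future.result()]
```

Increasing `users` beyond the number of real users is how this sketch would simulate more load than the system currently receives.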

Another important proactive action is to measure times when we execute functional tests, and to collect some performance indicators (memory, CPU use, etc.), so that the functional testing process can reveal some of the “non-functional” problems.
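As a sketch of how functional tests might record times, the decorator below measures the wall-clock and CPU time of a test function and fails the test if it exceeds a time budget. The decorator name `measured` and the `max_seconds` parameter are assumptions made for this example. It should work with any test runner that calls plain functions, although that has not been verified against a specific framework here.

```python
import functools
import time


def measured(max_seconds):
    """Fail the wrapped test if its wall-clock time exceeds max_seconds."""
    def decorator(test_func):
        @functools.wraps(test_func)
        def wrapper(*args, **kwargs):
            wall_start = time.perf_counter()
            cpu_start = time.process_time()  # CPU time of the whole process
            result = test_func(*args, **kwargs)
            wrapper.last_elapsed = time.perf_counter() - wall_start
            wrapper.last_cpu = time.process_time() - cpu_start
            assert wrapper.last_elapsed <= max_seconds, (
                f"{test_func.__name__} took {wrapper.last_elapsed:.3f}s, "
                f"budget is {max_seconds:.3f}s"
            )
            return result
        return wrapper
    return decorator
```

A functional test decorated with `@measured(max_seconds=2.0)` keeps checking functional behavior as before, and additionally fails when the operation becomes slower than the chosen budget.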

These are very challenging projects, which constitute the type of project that the Abstracta team enjoys the most!