Tune your performance with graphs

Often when I come onto a project to help with performance problems, I am presented with statistics gathered about transaction timings. Most often, I will be presented with an average, sometimes with a more complete statistical picture which includes min/max/median/stddev.

Unfortunately, these numbers seldom tell me anything actionable which can be used to analyze the system for performance improvements. This is why the first thing I typically do is set these statistics aside and instead ask for the raw metrics data.

Once I have the data, I plot it using Excel. Why? Because the old cliche about a picture being worth a thousand words is right on. More specifically, in the context of performance tuning, plotting the data can often reveal patterns and behaviors which cannot easily be communicated with statistical summaries. Understanding and correcting these patterns can often lead to much improved performance.

Take, for example, the response graph below. The average response is 959, but this doesn’t tell us about the interesting part: there are essentially two strata of performance times, the second clustered around the 2000-3000 range. This information is masked by the average. It turns out that in the system I was looking at there appears to be an operating system bug which causes these spikes.

[Figure: tune_performance_with_graphs1]
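To see how an average can hide two strata like these, here is a minimal sketch in Python. The data is synthetic and made up purely for illustration (it is not the data behind the graph), and the helper names `bucket_counts` and `text_histogram` are hypothetical. It prints the mean next to a crude text histogram so you can compare the two views.

```python
import random
import statistics
from collections import Counter


def bucket_counts(samples, width):
    """Count samples per bucket of the given width, keyed by bucket lower bound."""
    return Counter(int(s // width) * width for s in samples)


def text_histogram(samples, width, scale=10):
    """Render one line per non-empty bucket; each '#' stands for `scale` samples."""
    counts = bucket_counts(samples, width)
    lines = []
    for low in sorted(counts):
        bar = "#" * max(1, counts[low] // scale)
        lines.append(f"{low:>6}-{low + width - 1:<6} {bar} ({counts[low]})")
    return "\n".join(lines)


# Synthetic, made-up response times: a fast stratum and a slow stratum.
rng = random.Random(42)
fast = [rng.uniform(300, 700) for _ in range(800)]
slow = [rng.uniform(2000, 3000) for _ in range(200)]
timings = fast + slow

average = statistics.mean(timings)
print(f"mean = {average:.0f}")
print(text_histogram(timings, width=500))
```

In this made-up sample the mean falls in the gap between the two clusters, so it describes a response time that no request actually had. The histogram shows both clusters at a glance.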

Another example is this graph. Once again, typical statistical measurements are not very useful for seeing what is going on.

[Figure: tune_performance_with_graphs2]

Looking at the graph, we can see several interesting characteristics. First, you can see that the response times seem to increase in stages. Second, blowing up the plot to look at the details of the datapoints, you can see that the data points are stacking up, each subsequent point higher than the previous until a certain threshold, at which point the response drops and starts rising again. This is probably some type of queueing behavior in the system.

[Figure: tune_performance_with_graphs3]
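Once you have spotted a climb-and-drop pattern like this by eye, a few lines of code can help confirm it across the whole data set. The following is a rough sketch, not a definitive detector: the function names and the `drop_ratio` threshold are assumptions of mine, and the data is synthetic. It marks every point where the response falls to less than half of the previous one, then measures how long each climb lasted.

```python
def find_resets(samples, drop_ratio=0.5):
    """Return indices where a response falls below drop_ratio times the previous one."""
    return [
        i
        for i in range(1, len(samples))
        if samples[i] < samples[i - 1] * drop_ratio
    ]


def climb_lengths(samples, drop_ratio=0.5):
    """Return the number of points in each run between consecutive resets."""
    bounds = [0] + find_resets(samples, drop_ratio) + [len(samples)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


# Synthetic data: the response climbs by 10 per request and resets every 8 requests.
sawtooth = [10 + 10 * (i % 8) for i in range(32)]

print(find_resets(sawtooth))    # [8, 16, 24]
print(climb_lengths(sawtooth))  # [8, 8, 8, 8]
```

If the climbs in real data turn out to have a consistent length, that is a hint (not proof) that something of fixed size, such as a queue or a batch, is filling up and draining.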

So in summary: don’t rely on statistics alone. They will often mask interesting patterns which can clue you in to performance problems. Plot your performance data instead and then drill into it, or zoom out. Manipulate it, looking for unusual patterns. You will be surprised what you can find.

Some additional considerations

* Often the amount of data will overwhelm Excel. If you can use a statistical analysis package like SAS or SPSS, it can make your life a lot easier. If this is not an option, you can roll up your performance data (e.g. average for each second), but be aware that the more you average the data, the greater the chance that you are masking interesting patterns.

* I will often add a moving average trendline to the plot. This gives you a visual cue of how the response times trend over time while still showing where the individual data points fall.

* You may ask yourself why the statistical measurements are not as effective. I think there are several reasons. First, statistics typically try to match the data to some type of known distribution (Poisson, Normal, and so on). Often, performance data does not fit any of those, so standard statistical measurements are often warped or even meaningless. Second, it is important to remember that unlike many of the systems that statisticians examine, the performance data is not a fundamental characteristic of the system. As developers we can use the metrics to change the system to get the performance data points we are looking for.
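If you need to do the roll-up or the moving average outside of Excel, both can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the data points are invented, the input is assumed to be `(timestamp_in_seconds, response_time)` pairs, and the function names are hypothetical. The roll-up keeps the min and max for each bucket alongside the mean, which is one way to reduce the masking effect of averaging. The moving average is a simple trailing one.

```python
from collections import defaultdict, deque
from statistics import mean


def roll_up(points, bucket_seconds=1.0):
    """Summarise (timestamp, response) pairs per time bucket.

    Keeps min and max next to the mean so that spikes stay visible.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[int(ts // bucket_seconds)].append(value)
    return {
        b: {"count": len(v), "min": min(v), "mean": mean(v), "max": max(v)}
        for b, v in sorted(buckets.items())
    }


def moving_average(values, window):
    """Trailing moving average; early points average over what is available so far."""
    recent = deque(maxlen=window)
    out = []
    for v in values:
        recent.append(v)
        out.append(sum(recent) / len(recent))
    return out


# Invented sample: one slow request hidden among fast ones.
points = [(0.1, 100), (0.5, 120), (0.9, 2500), (1.2, 110), (1.7, 130)]

summary = roll_up(points)
trend = moving_average([value for _, value in points], window=3)

for bucket, stats in summary.items():
    print(bucket, stats)
print(trend)
```

In this invented sample, the mean for the first second is pulled up by a single slow request. The `max` column is what tells you it was one spike rather than a generally slow second.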