Given the message logging engine is running Passed
When a production peak hour of messages is sent to the message logger at 5 times production throughput Passed
Then 95% of the messages should be logged within 30s Passed
And 100% of the messages should be logged within 60s Failed
And 100% of the messages should be logged successfully Skipped

Without seeing any of the technical details, it is pretty clear what this test is doing and what the part of the test that failed actually means.

For many years we have seen performance testing done by others with very little meaning and no real statistical analysis.

We have seen unclear performance testing requirements, load applied that bore very little resemblance to production load, lots of effort spent measuring irrelevant metrics, and almost meaningless charts of CPU and memory usage.

We were excited when we were asked to try our hand at it, because we knew we could efficiently create something of real value.

First, we needed clear performance requirements. No-one knew how the message logger should perform, so we proposed some requirements, based on its business purpose.

The message logger captures messages sent by the client’s integration platform and stores them in a database.

Operational staff query the message database to help them respond to incidents and diagnose problems with their integration platform. This process is critical to the smooth running of that platform.

In order for the support staff to do their jobs efficiently, messages needed to be stored in the database within 30 seconds 95 per cent of the time, within 60 seconds 100 per cent of the time, and every message needed to be stored successfully.

This information from staff gave us the acceptance criteria, which we put into our Cucumber-JVM feature file.
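
As a rough sketch only (not our actual code), step definitions for these criteria might look something like the following with a recent Cucumber-JVM. PerformanceTestRun is a hypothetical helper standing in for the Grinder run and results analysis described below.

import io.cucumber.java.en.Given;
import io.cucumber.java.en.Then;
import io.cucumber.java.en.When;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

public class MessageLoggerPerformanceSteps {

    // Hypothetical helper that starts The Grinder run and exposes the analysed results.
    private final PerformanceTestRun run = new PerformanceTestRun();

    @Given("the message logging engine is running")
    public void theMessageLoggingEngineIsRunning() {
        assertTrue("message logger is not running", run.messageLoggerIsRunning());
    }

    @When("a production peak hour of messages is sent to the message logger at {int} times production throughput")
    public void aPeakHourIsReplayed(int multiplier) throws Exception {
        run.replayPeakHour(multiplier); // kicks off The Grinder and waits for it to finish
    }

    @Then("{int}% of the messages should be logged within {int}s")
    public void messagesShouldBeLoggedWithin(int percentile, int seconds) {
        assertTrue(run.percentileMillis(percentile) <= seconds * 1000L);
    }

    @Then("100% of the messages should be logged successfully")
    public void allMessagesShouldBeLoggedSuccessfully() {
        assertEquals(0, run.errorCount());
    }
}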

Next, we needed messages that were a good representation of a peak hour in production.

The client was using TIBCO Rendezvous (RV) as their message platform, so we created a little RV program that would listen for RV messages for slightly more than an hour and write them to file along with a timestamp.
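
A minimal sketch of such a capture program, assuming the standard TIBCO Rendezvous Java API (com.tibco.tibrv), is shown below. The subject, transport parameters, file format and class name are illustrative, and using TibrvMsg.getAsBytes() for serialisation is an assumption to verify against your RV version.

import com.tibco.tibrv.Tibrv;
import com.tibco.tibrv.TibrvListener;
import com.tibco.tibrv.TibrvMsg;
import com.tibco.tibrv.TibrvMsgCallback;
import com.tibco.tibrv.TibrvRvdTransport;
import java.io.DataOutputStream;
import java.io.FileOutputStream;

public class MessageCapture implements TibrvMsgCallback {

    private final DataOutputStream out;

    public MessageCapture(DataOutputStream out) {
        this.out = out;
    }

    public void onMsg(TibrvListener listener, TibrvMsg msg) {
        try {
            // Assumption: getAsBytes() serialises the whole message for later replay.
            byte[] bytes = msg.getAsBytes();
            out.writeLong(System.currentTimeMillis()); // timestamp drives the replay pacing
            out.writeInt(bytes.length);
            out.write(bytes);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        Tibrv.open(Tibrv.IMPL_NATIVE);
        // Illustrative service/network/daemon parameters and subject.
        TibrvRvdTransport transport = new TibrvRvdTransport("7500", ";239.1.1.1", "tcp:7500");
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream("capture.dat"))) {
            TibrvListener listener = new TibrvListener(Tibrv.defaultQueue(), new MessageCapture(out), transport, "INTEGRATION.>", null);

            long stopAt = System.currentTimeMillis() + 65 * 60 * 1000; // slightly more than an hour
            while (System.currentTimeMillis() < stopAt) {
                Tibrv.defaultQueue().timedDispatch(1.0); // dispatch callbacks, one-second timeout
            }
        }
        Tibrv.close();
    }
}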

We ran the program during a period that was historically the busiest hour in production.

We then created a Grinder script to replay the messages at a specified multiplier of the production throughput (e.g. 1/10th of production load, five times production load, etc).
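
Stripped of the Grinder plumbing, the pacing idea amounts to something like the Java sketch below (our actual implementation was a Grinder script; RecordedMessage and send() are placeholders for the capture format and the RV publishing code).

import java.util.List;

public class Replayer {

    // Hypothetical holder for one captured message: its serialised payload and capture timestamp.
    public record RecordedMessage(byte[] payload, long timestampMillis) {}

    public void replay(List<RecordedMessage> capture, double multiplier) throws InterruptedException {
        long captureStart = capture.get(0).timestampMillis();
        long replayStart = System.currentTimeMillis();

        for (RecordedMessage message : capture) {
            // The message's original offset from the start of the capture, compressed
            // (or stretched) by the load multiplier.
            long offset = (long) ((message.timestampMillis() - captureStart) / multiplier);
            long wait = (replayStart + offset) - System.currentTimeMillis();
            if (wait > 0) {
                Thread.sleep(wait);
            }
            send(message);
        }
    }

    private void send(RecordedMessage message) {
        // Placeholder: publish the message on RV (and, in our case, wait for DELIVERY.COMPLETE).
    }
}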

Since the messages are fire-and-forget, we needed a way to record their processing time.

The client was using RV's Certified Message delivery (RVCM), which provides a "DELIVERY.COMPLETE" advisory when all the RVCM subscribers have confirmed the message. The message logger we were testing would confirm the messages after they were successfully stored in the database.

So, our script would send a message and wait for the "DELIVERY.COMPLETE".  The Grinder would record this delay as the response time.

We used many workers and threads working in parallel to maintain our desired throughput.
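
The fan-out is configured in grinder.properties, the same file the workaround below points at; something along these lines, with purely illustrative names and numbers:

# build/grinder.properties
grinder.script = replay_peak_hour.py
# number of worker processes
grinder.processes = 4
# threads per worker process
grinder.threads = 25
# each thread replays its share of the captured hour once
grinder.runs = 1
grinder.useConsole = false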

Calling net.grinder.Grinder.main from a Cucumber-JVM step does not work, because net.grinder.Grinder.main calls System.exit at the end of the run, killing our Cucumber-JVM test. However, we cannot call net.grinder.Grinder's constructor or run method directly, because they are private.

This small workaround addressed these problems.

import net.grinder.Grinder;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.lang.reflect.Constructor;
import java.lang.reflect.Method;
...
final Logger logger = LoggerFactory.getLogger("agent");

// Grinder's constructor is private, so open it up via reflection and hand it our
// properties file, just as its main method would.
Constructor<Grinder> constructor = Grinder.class.getDeclaredConstructor(String[].class, Logger.class);
constructor.setAccessible(true);
Grinder grinder = constructor.newInstance(new String[] { "build/grinder.properties" }, logger);

// run() is also private; invoking it directly avoids main()'s call to System.exit.
Method run = Grinder.class.getDeclaredMethod("run");
run.setAccessible(true);
run.invoke(grinder);

With this workaround in place, the "When a production peak hour …" step simply kicked off The Grinder and waited for it to finish.

With The Grinder run complete, it was time to analyse the results.

The Grinder gives you basic summary statistics, like mean test time and test time standard deviation, but they are unlikely to be the statistics you are after.

Instead, we used OpenCSV to read all of the test times from The Grinder's data log files and Apache Commons Math to give us the 95th and 100th percentiles. If the 95th percentile was greater than 30 seconds, that step would fail; likewise, if the 100th percentile was greater than 60 seconds, that step would fail.

Finally, if there were messages which didn't get a "DELIVERY.COMPLETE" advisory, our Grinder script would treat that as an error, which would also be recorded in the data log file. If there were any errors, our last step would fail.
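
A rough sketch of that analysis is below; the class name, data log path and column positions are assumptions based on The Grinder's default data log layout (Thread, Run, Test, Start time, Test time, Errors), so check them against your own logs.

import com.opencsv.CSVReader;
import org.apache.commons.math3.stat.descriptive.rank.Percentile;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

public class GrinderResultsAnalyser {

    private static final int TEST_TIME_COLUMN = 4; // "Test time" in milliseconds
    private static final int ERRORS_COLUMN = 5;    // "Errors"

    public static void main(String[] args) throws Exception {
        List<Double> testTimes = new ArrayList<Double>();
        long errors = 0;

        try (CSVReader reader = new CSVReader(new FileReader("build/logs/data_agent-0.log"))) {
            reader.readNext(); // skip the header row
            String[] row;
            while ((row = reader.readNext()) != null) {
                testTimes.add(Double.parseDouble(row[TEST_TIME_COLUMN].trim()));
                errors += Long.parseLong(row[ERRORS_COLUMN].trim());
            }
        }

        double[] times = new double[testTimes.size()];
        for (int i = 0; i < times.length; i++) {
            times[i] = testTimes.get(i);
        }

        Percentile percentile = new Percentile();
        double p95 = percentile.evaluate(times, 95);   // must be <= 30,000 ms for the 30s step to pass
        double p100 = percentile.evaluate(times, 100); // must be <= 60,000 ms for the 60s step to pass

        System.out.printf("95th percentile: %.0f ms, 100th percentile: %.0f ms, errors: %d%n", p95, p100, errors);
    }
}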

The client did ask: "What about CPU usage, memory usage, disk IO, network IO and database load? What about RVCM ledger sizes? Shouldn't we be measuring them as well?" It is a valid question and our answer was “yes and no”.

In regards to the test itself, if the system under test (SUT) overloads the CPU, disk or any other resource, it will be reflected in the message processing times. For instance, if the message logger cannot get enough CPU to process all the messages, they will back up in the RVCM ledger and their processing will be delayed, which we will see in the processing times.

If the message logger cannot get enough memory, it will throw errors, which means the message will not be confirmed, resulting in errors in the test runs.

If the database cannot get enough resources to store the messages fast enough, the message logger’s JDBC activities will take longer, which again will be reflected in the processing times.

Whatever part of the system cannot keep up with the message load will show up in the processing times or as errors.

Specifically, the test tells us whether or not the message logger meets its performance requirements, nothing more.

Adding extra responsibilities to the test, such as measuring CPU and memory usage to tell us why the logger is not meeting its performance requirements, greatly increases development cost for little extra value.

When the SUT does not meet its performance requirements, the bottleneck will most of the time be in an unexpected location. However, when this occurs, we should be able to diagnose why it failed using the same monitoring and alerts that are available in production.

If we need to add extra monitoring to determine why it is not meeting its performance requirements, what hope do the support staff have when a similar situation occurs in production?

So, yes, monitor CPU usage, memory, etc the same way you would in production, but do not measure them directly as part of your test. If the test fails, the data will be there when you investigate.

You could consider adding another step to your acceptance criteria along the lines of "And no high severity monitoring alerts are generated"; however, given that our SUT meets its performance requirements, it is likely that such an additional requirement would result in false negatives, especially if the monitoring and alerting system has not been bedded down well.

The end result? A simple repeatable automated performance test with clear semantics.

Oh, and the reason the logger failed the 60-second threshold test? Database deadlocks, which only occur when the logger is under high load.

As it says in the title, “Performance.Testing.Heaven”.