How Shutterfly masters continuous performance testing by “adjusting the belt tight”
Picture this, you are the owner of an e-commerce website and you want to be sure that its excellent customer experience doesn’t deteriorate over time with the introduction of new functionalities and that your customer retention rate remains healthy. For you, one critical pillar for achieving this is having reliable and highly performant web pages. As they say, your competitor is always one click away.
The same goes for Shutterfly, the go-to platform for creating custom photo creations and gifts (that has recently passed the $1 billion mark in revenue), whose performance engineering team is concerned with testing and monitoring its web and application performance on a daily basis. Making sure that each user smoothly sails through the entire process from uploading photos, to designing their creation, to checkout is of the utmost importance for the company to remain the market leader.
Above is a screenshot of a page that lets me preview my custom mug from the Shutterfly website. You can see how its powerful tool allows you to visualize almost any product you can think of with your photo(s) on it! Imagine all the work that goes into making sure this system works perfectly, all the time!
I have been fortunate enough to work with Shutterfly’s performance engineering team, assisting them in the past few months while learning about their cutting-edge continuous performance testing method. I’m thoroughly impressed by how this particular client has devised a way to test so that they can discover degradations in performance almost immediately!
Today I’d like to discuss how Shutterfly’s continuous performance testing methodology works so that you may also use it to your advantage.
The methodology involves a very different approach than that of having a load simulation for acceptance testing, in which we simulate the whole load scenario with a test infrastructure similar to the production environment. We are not substituting that for this, as this is just a way to complement those tests and discover degradations before the acceptance test. This way, we reduce the risk of having to implement big solutions (aka “solve BIG problems”) just a couple of days before the day we wanted to deliver our new version of the system.
Essentially, the goal is to be able to detect the exact moment when someone enters a line of code that impairs the performance of the system. This methodology I will explain provides a great way to do so and to also be able to automatically identify it as soon as possible, having a record that gives us a clue as to when it was inserted so that we can identify it easier.
So, how does it work?
Methodology
First off, a precondition that must be met is to perform tests frequently. Continuous integration would be best, running different sets of tests at different times: basic regression tests before each commit and a full regression at least 2 or 3 times a week. You must have load scenarios (acceptance testing) before putting a new version in production and tests ready for each fix. In Shutterfly’s case, engineers check in code around the clock and the software is updated in the test environment daily.
Secondly, we must make sure that we have an exclusive environment for tests. With an exclusive test environment, the results will be more or less predictable and won’t be affected by if another person, for example, runs something else at the same time, causing the response times to soar, generating false positives, and wasting a whole lot of time.
Another key aspect is that the tests should have acceptance criteria (assertions) as close as possible so that at the slightest system regression, before any negative impact, some validation fails, indicating the problem. This should be done in terms of response times and throughput.
I like to call this “adjusting the belt tight.”
Now I’ll explain how to create, maintain and use this “belt” in performance tests.
Imagine we have a new test up and ready for stress testing the login. What we will do is see just how much we can “tighten the belt”. That is, until what point can we increase the burden on the login while keeping an acceptable service level?
Say we run the first test with 100 concurrent users and as a result, there are no crashes, the response times are below 100ms (at least the 95th percentile) and the throughput is 50 TPS. Then we run the test with 200 virtual users and again, it results in zero crashes and the times are at 115 ms and the throughput is at 75 TPS.
Great, it’s scaling.
If we continue on this path of testing, we will at some point, reach a certain load in which we see that we are no longer achieving an increase in the throughput. We will also be getting errors (exceeding 1%, for example) which would indicate that we are saturating the server and it’s likely that response times from then on will begin to increase significantly, because some process or connection or something begins to stick amid all the architecture of the system.
Following this scenario, imagine we get to 350 concurrent users and we have a throughput of 150 TPS, with 130 millisecond response times and 0% errors. If we pass 400 virtual users and the throughput is still about 150 TPS, with 450 users the throughput will be even less than 150 TPS.
There is a concept called the “knee,” or the point at which performance “degrades ungracefully,” that we would be encountering with this type of testing illustrated in the graphic below. The TPS is expected to increase when we increase the number of concurrent users and if it doesn’t happen, it’s because we are overloading the system’s capacity.
This is the basic methodology for finding the “knee” when doing stress testing, when we want to know how much our servers can scale under the current configuration.
Then, the test that we will schedule to continue running frequently is the one that executes 350 users, is expected to have less than 1% error with expected response times below 130 * 1.2 ms (this way we give ourselves a margin of 20%), and last but not least, we have to assert the throughput, verifying that we are reaching 150 TPS. This is the way we can detect right on time when something degrades the performance.
It is a good idea to monitor the test for a couple of weeks to ensure that this analysis is valid and endures and that the test stabilizes. It is also important to review these values from time to time, just in case the performance of the service improves, in which case, we’d need to adjust the belt even tighter there! Meaning, we’ll always have to adjust the assertion values.
Tools
What tools are needed to accomplish this?
At Shutterfly, we are using Gatling + Jenkins + Graphite + AppDynamics.
I did not mention tools earlier because the same strategy could potentially be implemented with others. The important components are going to be: performance testing tool (for load simulation) + Continuous Integration Engine + Visualize and storage monitoring data + deeper information on the system’s behavior for analyzing problems and providing more information.
Results
The day someone enters a line of code that worsens the response time by 20%, or harms the scalability of the login, it will be detected almost immediately. This is the beauty of continuous performance testing. We can quickly avoid and eliminate any significant degradations of system performance, making sure that the end user never suffers from slowness when a new line of code is introduced.
Another advantage is that it’s not necessary to have the infrastructure similar to that of production. The baseline would be how the system is working today. If we consider it to be good, later we can see what the “knee” is in our test environment and we can work with that, without letting it degrade.
Where cost prohibits the pre-production test environment from having the same scale and performance as production, you can use a scaled down lower performance system and still detect performance regressions. In Shutterfly’s case, one side benefit of using a scaled down version of production is that we can easily generate all of the necessary load from a single machine, which makes reporting the results much simpler. Otherwise, we would need a huge test infrastructure just to reproduce the load simulation!
What do you think of this method? How have you managed to continuously test performance?
A special thanks to Melissa Chawla and Fred Beringer for your support and feedback, helping me to write this post!
Want to learn even more about Shutterfly’s continuous performance testing practices? Watch this webinar recording!
Recommended for You
Yoda’s “The Way of the Jedi Tester”: A Guide for Agile Testing
Gatling tool review for performance tests (written in Scala)
Tags In
Federico Toledo
Related Posts
Workload or Load Scenario in Performance Testing
In load tests we simulate the workload that an application will have in production, counting the concurrent users accessing the application, the tests cases that will be executed, and the frequency of executions by users, among other things. The concept of workload or load scenario…
How Soon is Too Soon to Carry Out Performance Tests?
Many developers do not know when to carry out performance tests for optimum results. It’s very common of them to leave performance testing for the end as an afterthought if they have enough time. In other cases, they ask around in their spare time to…
8 Comments
Leave a Reply Cancel reply
Search
Contents
I love what you guys are up too. This kind of clever work and
exposure! Keep up the wonderful works guys I’ve included you guys to my own blogroll.
Thank you!
Great article !!
This post has been very usefull to resolve issues in our experience with continuous performance testing. In our case I have developed a website to execute and manage all automated test cases from the cloud.
This tool gives our team the accessibility to all automated test cases. It helped us to give visibility to our performance tests and also because it makes it very easy to generate and analyze historical reports.
I have also built an app to add the tests to our CI tool, update graphical reports on the website and manage JMeter. The main advantages of developing these tools are the flexibility and scalability that our testing team have now.
I’ve set up something similar before, but our Functional Test Automation turned out to be just as effective as detecting timing issues. Also, the Continuous Performance Tests ended up finding more functional regressions than performance regressions. Did shutterfly have a similar experience?
John, we spoke to Melissa Chawla from Shutterfly and this is her reply to your question:
“We agree that the functional behavior must be correct in order for the performance measurements to be credible and actionable. This is why we have a few basic functional tests in our deployment pipeline that need to pass before declaring the test environment open for load testing. Yes, we sometimes experience functional outages that cause days of load test data to be invalid.
Functional tests can certainly be used to detect regressions in performance that occur regardless of how heavily loaded the system is. The type of performance problems we are interested in uncovering with load tests are the slowdowns that only occur under heavy load–the ones that would otherwise only show up in production during a time of peak traffic, when the outage or slowdown will most affect customers and your business. We have found many system bottlenecks under load that posed absolutely no performance problem during functional test.”
Hope this helps!
Daniel, I really like your tool! really useful for continuous integration considering performance. It is really easy to set benchmarks and rules to ensure that the metrics are not degrading.
Thank you! I will consider the tool for future projects
Hi Frederico,
Thank for the interesting read! At my client, we are experimenting with continuous performance testing using a framework build on a number of similar open source components, like Gatling and Graphite. I have build a dashboard application that can be used to organizing, analysis and automated benchmarking of test results. It is available on github: https://github.com/dmoll1974/targets-io. Documentation is still in progress but I have set up a demo environment with Docker Compose so you can check out the features. I’d love to hear your feedback if you find the time to check it out.
Cheers,
Daniel
Hello Daniel! thanks for your feedback!
I will have a look and tell you soon
Regards!