Real attendees’ questions and answers from our webinar about how Shutterfly runs performance testing continuously
Last week, I had the good fortune of hosting a webinar with Melissa Chawla, the Senior Manager of Performance Engineering for one of our clients, Shutterfly. The webinar was called “Learn How Shutterfly Employs Continuous Performance Testing to Deliver Winning Customer Experiences Build After Build.” If you have never heard of it, Shutterfly is an e-commerce company with over $1 billion in revenue in 2015 that specializes in custom photo gift creations. Given its complex software, which lets customers upload photos and create custom products, and the overall nature of the e-commerce industry, its website’s high performance and reliability remain critical factors in Shutterfly’s success.
The webinar viewers asked several great questions that went unanswered due to time constraints. So, Melissa wrote out answers to those questions afterward and we felt they were too good not to share with our readers. Therefore, the purpose of this blog post is to publish them in order to share even more of her and Shutterfly’s performance testing insights.
First, here’s a brief recap of the content of the webinar.
Melissa enlightened us about how the Abstracta team and her team together manage to test performance continuously, in order to detect performance degradations almost instantly. If you missed it, you can watch the webcast or view the slides in order to learn more about the methodology used to profile tests, key assertions, tools used, challenges to overcome, and her recommendations for implementing these tests.
One of my favorite moments of the webinar was when she said:
Q&A with Melissa Chawla
Now, back to the Q&A. The following are her written answers to viewer questions.
Question: “How do you determine what requests should be used in your performance testing script?”
Answer: We typically target the web services that are most critical to the business (aka happy path or money path), those that have undergone significant and/or risky changes since the last big peak period, and also those that have, in the past, caused performance problems in production.
Question: “I’ve read about it and some people recommend to delete some type of requests such as fonts, images, etc.”
Answer: I’m not sure of the intent of this comment/question, but I agree that we don’t have the time or the resources to cover every case with our load tests. We make tough calls every day about what coverage to include and not include. The key here, I think, is to be aware of what isn’t being covered and not to create a false sense of protection that can lead to disappointment later.
Question: “How are you doing the ‘Continuous’ part in this model?”
Answer: You are correct to point out that our testing does not conform to the “continuous” expectations of “continuous integration”. We don’t have enough load testing environments to support running a separate test for every code check-in. Note that load tests are generally more effective when run for longer durations so they can detect slow degradation/leaks. At Shutterfly, we have some code modules that build after every check-in and others that batch and build every 30 minutes or more in some cases. Because our test environment consists of various interdependent server pools, with different code builds/versions, we chose to allow software updates to our test environment on a daily basis to limit the time spent restarting server pools. Because of all of the interdependencies, we find test results can be skewed if any resource is updated/restarted while the test is underway. We deploy the latest builds to all of the server pools in the environment every morning, and then we run through as many tests as we can before the next day’s code deploy starts. What is continuous here is the following:
1. Tests are automatically queued and run 24/7. Humans check the test results during business hours, but no one has to manually operate or monitor the tests while they run.
2. Once a test is developed and profiled, it will automatically run several times per week.
3. A test failure/recovery email is automatically sent.
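To illustrate the third point, a failure/recovery email can be thought of as a notification sent only when a test's status changes. The following is a minimal sketch under that assumption, not Shutterfly's actual tooling; the function names and the test names in the usage note are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def transition(previous_passed: Optional[bool], current_passed: bool) -> Optional[str]:
    """Return "failure" or "recovery" when the status flips, else None.

    previous_passed is None when the test has no earlier run on record.
    In this sketch a failure on the very first run is also reported.
    """
    if current_passed:
        # Only announce a pass if it follows a recorded failure.
        return "recovery" if previous_passed is False else None
    # Announce a failure unless the previous run had already failed.
    return None if previous_passed is False else "failure"


def notifications(runs: List[Tuple[str, bool]]) -> List[str]:
    """Walk (test_name, passed) results in run order and list the emails to send."""
    last_status: Dict[str, bool] = {}
    messages: List[str] = []
    for name, passed in runs:
        event = transition(last_status.get(name), passed)
        if event is not None:
            messages.append(f"{event}: {name}")
        last_status[name] = passed
    return messages
```

Feeding it a sequence such as a passing "checkout" run, two failing ones, and then a passing one yields a single failure message followed by a single recovery message, so repeated failures do not flood anyone's inbox.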
Question: “How does the capacity of your test environment compare to production’s?”
Answer: We have a mini-prod environment that is shrunken but analogous to production. Where possible, all of the units of hardware in our test environment match production (CPU, Memory, RAM, Heap size, OS version, etc.). For cost reasons, some of the databases in our test environment are not as performant as in production, and we have to adjust our performance expectations accordingly for operations that are Database-bound.
In production, we have many server pools performing different types of requests, all of which have anywhere from 2 to 300+ hosts in the pool. In our load test environment, we have all the same server pools, but with 2 hosts in each pool. We require 2 rather than 1 mainly to exercise the load balancer config. Another benefit of having at least 2 hosts per server pool is to have the ability to occasionally test a failover scenario. A 3rd reason to have at least 2 hosts per pool is that we have 1 host with AppDynamics tracing enabled while the other host does not. This way we can see the difference in CPU, heap and memory consumption between hosts with and without AppD (which we also have running on some hosts in production).
In a few of our more heavily used server pools, we have 4 and even 6 physical hosts per server pool, so that we can better saturate modules that are downstream of those pools. The idea is to preserve the relative ratios of servers per pool in non-production.
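As a rough illustration of that sizing idea, the sketch below shrinks each production pool by a common divisor while keeping at least 2 hosts per pool. The pool names, host counts, and divisor are made-up assumptions for the example, not Shutterfly's real figures.

```python
import math
from typing import Dict


def scale_pools(prod_hosts: Dict[str, int], divisor: int, minimum: int = 2) -> Dict[str, int]:
    """Shrink each production pool by `divisor`, never going below `minimum` hosts."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    return {
        # Round up so a pool is never undersized, then apply the floor of `minimum`.
        pool: max(minimum, math.ceil(count / divisor))
        for pool, count in prod_hosts.items()
    }
```

With hypothetical pools of 300, 200, and 2 hosts and a divisor of 50, this gives 6, 4, and 2 hosts: the larger pools keep their relative proportions, while the smallest stays at the 2-host floor.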
Question: “I’m curious about the standards/format of the unit load tests by developers; How much flexibility do they have in building a unit load test? What do their tests look like? Are they conforming or very different?”
Answer: There is no hard standard, but for engineers new to the load testing effort, my team points them at existing tests covering similar web services so they don’t spend time reinventing the wheel. My team does review most of their tests, so we can help avoid ineffective tests. The test code is generally straightforward. The hardest part at Shutterfly is crafting representative data sets for our tests (example: when testing photobooks, how many pages, photos, etc. do the book projects have in them?). The more time you spend varying your test data to match real use cases, the more likely you are to uncover real performance problems. Also, some time should be spent cleaning up your databases so you don’t see performance degradation that production will never encounter. In general, load tests that degrade slowly over time are often an indicator that data cleanup is needed. In some instances, that data cleanup was actually going to become a production problem too, so not just a test bug.
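One way to spot the slow degradation mentioned above is to fit a straight line through response times sampled at regular intervals during a long run and look at its slope. This is a minimal sketch of that idea using an ordinary least-squares fit; the function names and the slope limit are assumptions for illustration, not something described in the webinar.

```python
from typing import Sequence


def slope_per_interval(samples: Sequence[float]) -> float:
    """Least-squares slope of the samples against their index (0, 1, 2, ...)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    covariance = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    variance = sum((i - mean_x) ** 2 for i in range(n))
    return covariance / variance


def degrades_slowly(samples: Sequence[float], max_slope: float) -> bool:
    """True when response times climb faster than `max_slope` per interval."""
    return slope_per_interval(samples) > max_slope
```

A steadily positive slope on a test whose load is constant is a hint to check for leaks or for test data that needs cleaning up.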
Question: “Would you ever use 95th percentile value as a hard SLA threshold? Or just for general measures to more accurately understand the majority of response times included in the frequency distribution?”
Answer: In truth, Shutterfly doesn’t really operate with hard SLA thresholds yet. The threshold we place on our non-production load tests is purely to notice significant or potentially imminent degradations automatically. Depending on what caused the degradations, we sometimes accept them and adjust the thresholds to allow for the slower performance. Example: some security changes, such as shifting from HTTP to HTTPS are known to reduce performance, but the added security is worth it.
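For readers who want to try a percentile-based threshold like the one in the question, here is a minimal sketch of a 95th percentile check using the nearest-rank definition. The function names and the threshold value are assumptions for illustration; load testing tools usually provide their own percentile assertions.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def check_p95(response_times_ms: Sequence[float], threshold_ms: float) -> bool:
    """True when the 95th percentile response time is within the threshold."""
    return percentile(response_times_ms, 95) <= threshold_ms
```

Because the check looks at the 95th percentile rather than the maximum, a single slow outlier among 20 samples does not fail the test, but two of them do.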
Question: “Do you have any automated mechanisms to protect or isolate your performance test environment?”
Answer: For cost reasons, and because they are in the same datacenter, we still have some resources that are shared (or potentially shared) between non-production and production, but the code deployment is totally separate. We have firewalls and alerts in place to let us know if non-prod is making unreasonable requests to prod and vice versa. Our non-prod environment is behind our site firewall, so it is not subject to external DoS attacks. It is a continued challenge to keep our non-production systems (hardware, patches, etc.) up to date with, and ideally one step ahead of, production. This does not come for free, but we think the effort is worth it!
The whole non-prod load testing program cannot work without the support of our ops department building & maintaining the non-prod environment. Approximately 25-50% of the issues our non-prod load testing turns up are ops-related issues, and at least another 25% are code issues that would have caused ops emergencies in production if we had not caught them prior to release, so our ops department is grateful for the preventive efforts of load testing.
Question: “Please explain re-calibration; Calibrating to what? Compare to PROD? Compare to historic trend? What ‘things’ are you adjusting when you re-calibrate?”
Answer: I like to use the term “profiling” whether you’re working with a new test or an existing one whose performance changed. Performance profiling is about finding the knee in the curve, as we discussed in the webinar. In some cases, when the performance characteristics of an existing test change, to save time we keep running tests at the same load and just change the assert thresholds to match the new (and hopefully improved!) performance trends—I would not call this re-profiling but rather an assertion adjustment or a new SLA. When there are dramatic performance changes to an existing test, we should re-profile as though the test has never been run, trying out varying levels of load and finding the new knee in the curve. For both initial profiling and re-profiling, for services that have already shipped to production, we do look at production performance during recent peak periods, and we try to validate that non-production performance is roughly in line with what we would expect given the difference in scale. With re-profiling, we have the benefit of knowing the historical trend for that service in non-production, which helps us put the new numbers in perspective.
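To make "finding the knee in the curve" concrete, here is a minimal sketch of one common heuristic: after normalizing both axes, pick the measured point that lies farthest below the straight line joining the lowest-load and highest-load measurements. This is an assumed technique chosen for illustration, not necessarily how Shutterfly profiles its tests, and the sample numbers in the usage note are invented.

```python
from typing import List, Tuple


def find_knee(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Return the (load, response_time) point farthest below the first-to-last chord."""
    if len(points) < 3:
        raise ValueError("need at least three measurements")
    pts = sorted(points)  # order by load
    (x0, y0), (x1, y1) = pts[0], pts[-1]
    x_span, y_span = x1 - x0, y1 - y0
    if x_span == 0 or y_span <= 0:
        raise ValueError("response time must rise across distinct load levels")

    def gap(point: Tuple[float, float]) -> float:
        # After normalization the chord is the line y = x, so the gap
        # measures how far the curve sits below it at this load level.
        x, y = point
        return (x - x0) / x_span - (y - y0) / y_span

    return max(pts, key=gap)
```

For invented measurements of (10, 100), (20, 105), (30, 110), (40, 120), (50, 300), and (60, 900), in users and milliseconds, the function returns (40, 120): the last load level before response times take off.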
And there you have it: some insight into the complex world of performance engineering at Shutterfly. A huge thanks to Melissa Chawla and the whole team at Shutterfly!
Missed the webinar? View the recording!