Quality Sense Podcast: Andreas “Andi” Grabner – Introduction to Keptn
In this Quality Sense episode, host Federico Toledo chats with Dynatrace’s Andreas “Andi” Grabner. Andi, who is from Austria, has more than 20 years of experience in the field and continuously shares his knowledge about performance engineering, especially through his podcast, Pure Performance. During the interview, the two discuss performance engineering concepts, today’s biggest challenges in the field, the open source project Keptn, and more!
- Understanding SLAs, SLOs, SLIs, and Quality Gates
- How performance engineers can avoid becoming a bottleneck themselves
- Improve the quality of your software with self-service performance engineering using the open source tool Keptn
- Pure Performance Podcast – https://www.dynatrace.com/news/pureperformance/
- Keptn – https://keptn.sh/
- Google: What is Site Reliability Engineering? – https://sre.google/
- Blinkist – https://www.blinkist.com/
(Lightly edited for clarity.)
Hello, Andi, how are you doing today?
Hola Federico. ¡Muy bien! ¿Y tú?
¡Muy bien, muy bien! I know that you have a strong connection with Latin America, right?
Yeah, I think it’s really strong, and the ring on my finger tells me what type of bond it is. Yeah. My wife is Colombian.
Yeah, I know. Perfect. I was saying before that I can see that your Spanish is improving really fast!
Well, thanks for that. It would be better if it improved even faster, but as I told you, I’m doing Duolingo every day. Based on the stats, I’ve been doing it for 306 days straight, but it’s still not enough. It’s just like individual courses that I do, but I would need more time. But yeah, it’s definitely improving. That’s true.
Great, great. Andi, I wanted to talk about some things that I know you have been researching and working on in the last couple of years, more or less, and I know that you have mainly been working on topics related to performance engineering.
And my first question is, why do you like it? Do you find it challenging? Do you recommend other engineers to join this field? Why do you like it?
Yeah, I got hooked on performance from my first day. Back in 2001 I started with a company called Segue Software, which then became Borland, and I think is now Micro Focus or … I think so. It was my second job. I started as a tester on a performance testing tool. The tool back then was called SilkPerformer. I started as an engineer, but in order to start as an engineer you first went through three months of testing. So I did testing on a performance testing tool, which was really cool. I was very lucky, I had a lot of great mentors, especially in that company.
I want to give a shout out to Ian, one of the architects back then at Segue. Now he’s also at Dynatrace, where I’m currently working, and he taught me a lot about performance testing, performance engineering, also [inaudible 00:03:57] who was my first boss in that company. And others along the way.
I think the reason why I still love it is that I believe performance and efficient systems are more important than ever. We’re all frustrated if things are not as fast as we expect and don’t work as expected. I think we can contribute to making software perform better, faster, and more efficiently. And that’s why I still love it.
And apart from what you just mentioned, do you see any other changes in the profession, from when you started until now?
Yeah, definitely. I mean, back in the day, as I said, 2001 is when I started, we were mainly testing classical, monolithic applications. It was .NET, or not even .NET applications. It was Microsoft applications running on some Windows servers. The classical web applications. Stuff running on IIS with a SQL Server in the backend. Then some big Java enterprise applications with the database in the backend. And now things have moved.
Obviously, we all know the trend towards smaller, service-oriented components that together make up very complex systems, with a lot of moving pieces. What’s interesting, though, is that a lot of the basic principles of performance engineering are still the same, regardless of whether you have just a big front-end and one database or 10 different microservices. In the end, there are things that talk to each other through constrained resources. That could be a network, or it could be a connection pool: certain things that connect systems, whether they’re just two components or 20,000 components. So things have changed because there are more moving pieces, but the general principles of performance engineering have stayed the same: looking at the key metrics like CPU, memory, network, disk latency, and looking at connection pool sizes and thread pools.
Yeah. 100% with you. I remember that when I started a couple of years after you started, the main challenges for us were related to how to process the data, and analyze the data.
I remember the first day I had the chance to use Dynatrace, and it was like, “Whoa, this is another way of working, and you don’t spend time in a spreadsheet trying to correlate variables or things like this. You put your efforts into analyzing the behavior of the system.” It’s really interesting to see how the field and the profession are evolving. We have more and more tools to help us in that analysis and in the optimizations we can do, but there are also more challenges involved.
So, what do you think are the most common challenges nowadays for performance engineers?
Well, I think that the challenge is that we have systems that are scaling very rapidly, but we don’t scale the number of performance engineers at the same rapid pace. We’re building more and more services that are deployed more and more frequently, yet performance engineers don’t grow on trees, and you also can’t just clone them. That’s just not possible.
That’s why I think the biggest challenge, and the biggest thing that performance engineers, or organizations in general, should think about, is how we can automate as much as possible of the performance engineering discipline, all the know-how that we have.
How can we automate test execution and test analysis, and how can we integrate this into the regular development workflow and lifecycle?
I’ve been trying to tell people, if you think you’re a great performance engineer, great, but you should take what you have in your head, all of your experience, and try to automate the work that you’re doing, and then make it available through an API, so this can be integrated in your CI/CD system and every developer can benefit from it. Because you, as a performance engineer in a company, should not become the bottleneck when you have more and more developers trying to look at performance feedback from their microservices.
So, that’s why I think that the biggest challenge is automating performance engineering as much as possible, and providing it, as I call it, as a self-service for every engineer that needs to have performance feedback at any given point in time.
It’s a funny paradox that we are trying to find the bottlenecks yet we could be the bottlenecks.
We become the bottlenecks. Yeah, I think so, too. Yeah.
Cool. Before we delve into Keptn, which is the open source project where you are working on all these things, I want to talk a little bit about some basic concepts that are related to it. For example, SLAs, SLOs, SLIs, and Quality Gates.
Maybe you can give us a basic idea of what these terms mean.
Yeah. First of all, we didn’t come up with it. I think even Google didn’t come up with it, but I think Google did a great job in promoting these concepts. So, these three S terms. SLI stands for Service Level Indicator. To keep it very simple, a service level indicator is like a metric, something we can measure: an indicator like throughput, failure rate, or resource consumption like CPU.
Then an SLO is an objective. What is the objective for a metric that you can measure? So, for instance, what is your objective for failure rate if you put the system under production load? What is your objective for memory usage if your system is under peak load? What is your objective for response time or latency if a system is under load? What do you expect from the system? Because if the system does not behave according to your objective, you’re not meeting your criteria.
That’s why you probably go back to the engineers and say, “We need to make this faster, more efficient.” So, SLIs and SLOs. Indicators and objectives.
And then the third one is SLAs, and I think this is a term that has been around and known for longer. Basically, an SLA is what happens if we actually run systems in production and we don’t meet our objectives. There might be a legal obligation. There might be a contract we have with our third parties that are using our software. These are basically what happens if our SLOs are not met in production, then we may need to pay a penalty. Then we may have, as I said, some obligations. Or we lose users because we don’t deliver a good service. These are the three terms.
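Editor’s illustration: the relationship between an indicator and its objective can be sketched in a few lines of Python. This is a minimal sketch, not how Google or Keptn model these terms; the `SLO` class, the `violated_slos` helper, the metric names, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """An objective for one service level indicator (SLI)."""
    sli: str          # name of the measured metric (hypothetical names below)
    threshold: float  # assumed upper bound for the measured value

    def is_met(self, measured: float) -> bool:
        # The objective is met when the measurement stays at or below the bound.
        return measured <= self.threshold


def violated_slos(slos, measurements):
    """Return the names of SLIs whose measured value breaks its objective."""
    return [s.sli for s in slos if not s.is_met(measurements[s.sli])]


# Hypothetical objectives and one set of measurements taken under load.
slos = [SLO("error_rate_percent", 1.0), SLO("response_time_ms", 100.0)]
measured = {"error_rate_percent": 0.4, "response_time_ms": 130.0}
print(violated_slos(slos, measured))  # → ['response_time_ms']
```

An SLA would then sit on top of this as a contract: what the business owes its users when an objective like these is missed in production.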
If people want to read up more, Google has done a phenomenal job in explaining this concept and promoting it as part of their Site Reliability Engineering practices. SRE is the acronym. They have done a great job, and there is great material out there.
Yeah. And I can imagine … I will share the link in the show notes. I think these terms, or these ideas, are very important when you want to, as we were saying before, automate processes and analysis and make automatic decisions, I imagine in order to promote a build from one environment to the other according to the performance results you receive.
Exactly. This is where the Quality Gates come in. But before I talk about those, let me restate something that will make people aware of a very critical thing here. At least in my experience as a performance engineer, I have had many projects where people asked me to run performance tests, but they really had no clear understanding of what the goal actually was. What is good and what is bad. They say, “Run a test with a thousand users and then tell me what the numbers are.” But it should actually be the other way around. You should start from, “What are our objectives? What do we expect from the system? What are our non-functional requirements? What are our SLIs, our SLOs and SLAs, once it’s in production?” And then from there work backwards, and say, “Okay, if we need to ensure these metrics in production, how can we test for them?”
In production, you may have five metrics that you are interested in, like availability, latency, error rate, and so on. But then as a seasoned performance engineer you should think about what is impacting the response time for the end user of this application. Well, there are many moving pieces: the database, the network between the microservices, Kubernetes. Then you as a performance engineer should say, “Before it goes to production, we want to look at all of these metrics as often as we can, fully automated, and alert developers in case these metrics are not going in the right direction.”
And this is where Quality Gates come in. Quality Gates are, again, not a new concept, but what we are trying to promote is integrating SLI and SLO checks on important metrics every time you build, deploy, and test a new version of your system in a pre-production environment.
When you run tests, every time, look at the metrics, and then compare them with your previous build and against your objective. And in case you’re not meeting your objective, or in case you see a slow degradation that will eventually lead to a problem, alert the engineers about it early.
Yeah. I find that, in many cases, the performance engineer needs to assist the different stakeholders in order to define those SLOs, because many times you ask for their requirements and they say, “I don’t know. Give me the numbers and then we’ll see,” as you mentioned, but this is not the best approach, for sure.
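Editor’s illustration: the “slow degradation” idea can be sketched as a check over the history of one metric across builds. This is a simplified sketch under stated assumptions, not Keptn’s implementation; the function name, the three-build window, and the sample values are hypothetical.

```python
def slow_degradation(history, limit, window=3):
    """Return True when the last `window` builds each got worse than the
    build before them while the newest value is still under `limit`.

    `history` holds one metric (higher is worse, e.g. response time in ms),
    ordered from oldest build to newest.
    """
    recent = history[-window:]
    if len(recent) < window:
        return False  # not enough builds yet to talk about a trend
    rising = all(later > earlier for earlier, later in zip(recent, recent[1:]))
    return rising and recent[-1] < limit


# Hypothetical response times (ms) per build, with a 100 ms objective.
creeping = [80, 80, 84, 89, 95]   # still passing, but worse with every build
noisy = [80, 82, 81, 83, 82]      # small ups and downs, no trend
print(slow_degradation(creeping, limit=100))  # → True
print(slow_degradation(noisy, limit=100))     # → False
```

The point of such a check is to alert engineers while the objective is still being met, instead of waiting for the build that finally crosses the limit.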
Great. How do we continue? We can talk about how Keptn helps us to automate all of these things.
Yeah, so, Keptn. First of all, thanks for letting me talk about Keptn.
Keptn is an open source project that we brought into this world, I would say, late last year. It is a CNCF project, so a Cloud Native Computing Foundation project, spelled Keptn. With Keptn, we wanted to solve a couple of problems, or we wanted to provide a new solution to things we know are problems in continuous delivery and in operations of large systems. We wanted to automate a lot of aspects around delivery.
Keptn can actually orchestrate the delivery process of your applications, of your services, in whatever platform. At a very core point, Keptn uses the approach of SLIs and SLOs. Meaning everything you do with Keptn, every deployment Keptn executes, every test it executes, it then always reaches out to your monitoring data or your test data, and looks at the metrics that you specified as your SLIs, and then compares them against your service level objectives.
This is why I think, and this is great that we talk about this here in this podcast, I think for performance engineers especially, or any type of quality-aware person, Keptn is automating an otherwise very manual and long task, which is, after a deployment happens, after you run your tests, it reaches out to your test results, to your monitoring data that was collected while your tests are running, is pulling that data in, and it’s then comparing it for you in a smart way.
What we’ve tried to do here is calculate a so-called deployment score or quality gate score. We can look at two metrics, at 20 metrics, at 200 metrics, but we always calculate a score between 0 and 100, which then indicates the quality of this particular build that we’ve just tested. And with this number between 0 and 100, where 0 is very bad and 100 is the best, you can then make an automated decision on what to do with this particular version. Do you want to keep it or push it to the next stage in your pipeline? Or do you want to throw it back? And if things are bad, Keptn will tell you which metrics have actually shown a regression. Did response time go up with your latest code change? Did memory consumption go up? Do you now make more database calls than before? Do you now run on five pod instances when before you always ran on two pod instances with the same load? These are things that Keptn automates. It really automates the task of analyzing data from one or multiple tools, and aggregating this data up to a single number between 0 and 100 to make automated decisions.
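Editor’s illustration: a single 0 to 100 score and the decision built on it can be sketched as follows. This is a deliberately simple sketch, not Keptn’s scoring algorithm; every metric counts equally here, and the function names, metric names, and the 90-point pass threshold are assumptions made for the example.

```python
def gate_score(results):
    """`results` maps metric name -> True (objective met) or False (regressed).
    Returns the share of passing metrics as a score between 0 and 100."""
    if not results:
        raise ValueError("at least one metric is required")
    passed = sum(1 for ok in results.values() if ok)
    return 100.0 * passed / len(results)


def decide(results, pass_at=90.0):
    """Return (decision, score, regressed_metrics) for one tested build.
    `pass_at` is an assumed threshold, not a Keptn default."""
    score = gate_score(results)
    regressed = sorted(name for name, ok in results.items() if not ok)
    decision = "promote" if score >= pass_at else "reject"
    return decision, score, regressed


# Hypothetical evaluation of one build: only response time regressed.
build = {"response_time": False, "memory": True, "db_calls": True, "pod_count": True}
print(decide(build))  # → ('reject', 75.0, ['response_time'])
```

Returning the list of regressed metrics alongside the score mirrors the point made above: the number drives the automated decision, and the list tells the developer where to look.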
And I guess if you have 20 different metrics, you can weight each of them differently. Right?
Exactly. The great thing is, the way we’ve implemented it, you can define on every metric, or SLI as we call it, what the objective is. The objective could be an absolute value. You can say, “Response time for this service on this endpoint has to be 100 milliseconds.” But you can also combine it: “I want it to be faster than 100 milliseconds, but I also want to make sure we’re not getting slower by more than 5% compared to the previous build.” Because then you have a regression that you want to look at earlier, before you reach the 100 millisecond limit. You can specify this on every single SLI, on every single metric. What you can also do is give a different weight to every metric, which means maybe response time is not the most important metric for you. Maybe failure rate is the most important metric, or maybe it’s five times as important as response time. You can give it a different weight, which will then be included in the calculation of the overall score.
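Editor’s illustration: combining an absolute limit with a relative limit against the previous build, and then weighting the results, can be sketched like this. It is a minimal sketch that borrows the 100 millisecond, 5%, and five-times-weight figures from the conversation; the function names and the measured values are hypothetical and this is not Keptn’s own code or configuration format.

```python
def meets_objective(value, previous, max_value, max_regression_pct):
    """True when `value` is under the absolute limit AND has not grown by
    more than `max_regression_pct` percent relative to the previous build."""
    under_limit = value < max_value
    allowed = previous * (1 + max_regression_pct / 100.0)
    return under_limit and value <= allowed


def weighted_score(checks):
    """`checks` is a list of (passed, weight) pairs; returns a 0 to 100 score."""
    total = sum(weight for _, weight in checks)
    earned = sum(weight for passed, weight in checks if passed)
    return 100.0 * earned / total


# Response time: 90 ms now, 80 ms in the previous build.
# Under the 100 ms limit, but 12.5% slower, so the combined objective fails.
rt_ok = meets_objective(90.0, 80.0, max_value=100.0, max_regression_pct=5.0)

# Failure rate: 0.5% now and before, with an assumed 1% limit.
fr_ok = meets_objective(0.5, 0.5, max_value=1.0, max_regression_pct=5.0)

# Failure rate counts five times as much as response time.
score = weighted_score([(rt_ok, 1), (fr_ok, 5)])
print(rt_ok, fr_ok, round(score, 1))
```

With these weights the failed response time check costs only one sixth of the total score, which shows how the weighting shifts the decision towards the metric the team cares about most.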
Very flexible in that way. Excellent.
Yeah. Another question that comes to my mind, because I’ve seen this in different companies that are at different stages, let’s say. In some cases I say, “Whoa, this is what they need to solve so many inefficiencies they have.” In some other cases, I would say maybe they need to first focus on some other things before they get to that point. Have you identified which stage you should be in, which things you should already have in place?
Yes. What you need to have in place is a system that can produce consistent data. Meaning, you should have invested in a continuous integration and continuous delivery environment, where you can deploy a new version and then run tests that produce, let’s say, a kind of stable result. And with stable results I mean that the tests are not flaky, they’re not breaking all the time. Every time I have a new build, I can trigger my, let’s say, Jenkins pipeline or whatever you have. It builds it, deploys it, and it runs a set of tests that are consistently executed and consistently produce results. Because the thing is, if you have flaky tests, if you have a flaky test environment, then even the best Keptn doesn’t help you. If your metrics are completely off all the time, then you don’t have a baseline. You never get a baseline. Yeah.
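Editor’s illustration: one way to check whether a test setup produces consistent data is to run the same build several times and look at how much a metric varies between runs. This is a sketch of that idea using the coefficient of variation; the function name, the 10% tolerance, and the sample values are assumptions, and nothing here comes from Keptn itself.

```python
from statistics import mean, pstdev


def is_stable(samples, max_cv=0.10):
    """True when the coefficient of variation (standard deviation divided by
    the mean) of repeated runs of the SAME build stays at or below `max_cv`."""
    if len(samples) < 2:
        raise ValueError("need at least two runs to judge stability")
    avg = mean(samples)
    if avg <= 0:
        raise ValueError("expected a positive metric such as a response time")
    return pstdev(samples) / avg <= max_cv


# Hypothetical response times (ms) from five runs against an unchanged build.
steady = [101, 99, 100, 102, 98]
flaky = [100, 180, 60, 150, 90]
print(is_stable(steady))  # → True
print(is_stable(flaky))   # → False
```

If repeated runs of an unchanged build fail a check like this, a comparison against the previous build says more about the environment than about the code change.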
The decision is going to be made on false positives or…
Exactly. To answer the question, “What do people need to have?” They need to have some automation in place to deploy a new version of an app and then run automated tests against it. I’ve seen a lot of our Keptn adopters using Jenkins, where they have done this, or using GitLab for that. Azure DevOps is also very prominent. They have already invested in automated build, automated deployment, and automated testing.
Cool, excellent. What’s the best and easiest way to get started with it?
To get started with it? I think the best is to go to our website, keptn.sh, from where you can go to the GitHub repository. It’s github.com/keptn. This is where you can find all the information about what Keptn actually is, what the different use cases are, and how to get started.
We have a couple of tutorials online, under tutorials.keptn.sh, where we show you how to install Keptn on different environments and how to use the different use cases. To tell you what the architecture looks like, Keptn is an event-driven system and it needs to be installed on Kubernetes. So, while the automation use cases that Keptn provides can be applied to any type of app deployed anywhere, Keptn itself, the application, runs on Kubernetes.
Here we know that, while everybody’s moving to Kubernetes, not everybody is familiar with Kubernetes. Maybe you don’t have a Kubernetes cluster, or you don’t have the expertise. That’s why we are providing installation options with lightweight Kubernetes distributions. One that we’re using heavily is K3s. You can install K3s on any Linux machine, and we have tutorials on how to install Keptn in three minutes, and then you can just use it. It’s like you just install the binary, basically. I want to say, don’t be afraid that Keptn runs on Kubernetes. You don’t have to bring a lot of Kubernetes skills to get it running.
Okay. So you made things easy to just get started. Right.
Exactly. The reason why we made it that easy is that we ourselves struggled. We made a strategic decision for Kubernetes because we needed the event-driven model, we needed the high availability, we wanted all these capabilities that Kubernetes provides. But when we started, we did not have a whole lot of Kubernetes experience, and we saw our first adopters really struggle with it. That’s why we said, “How can we make sure that people get to Keptn and then don’t already stop at the front door, because they don’t know how to open the door?”
It’s amazing to see how open source projects like these are available for anyone to take advantage of what you are working on and providing, and helping us, helping others, to catch up with the latest methodologies and trends. Thank you so much for that. I think that the community is also very thankful.
A couple of other questions. Do you have any habits? Because talking about productivity, I have to mention that last week I sent you the idea of participating in this podcast, and you said yes, and here we are. So, I wonder if you have any trick related to productivity that you want to share?
That’s a good question. I think if you love what you’re doing, it is easier to get stuff done, and you’re actually getting a lot of stuff done. But I also have to say, I am not doing this always well, because I think I suffer from the same problem that we all suffer.
We are constantly getting a Slack message here, an email here, a Twitter feed here. We’re constantly getting distracted and we’re constantly context switching. And this is actually not that good because it just slows you down. And this is actually a funny thing: I am most productive when I’m on the airplane. Unfortunately right now I’m no longer on the plane, but the reason why I am productive on the plane is not the comfortable seats, which in economy class, most of the time, are not that comfortable.
The reason is, I made a decision that I never go online when I’m on the plane. I have zero distractions on the plane, which is why, when I fly, I typically crank out blog posts or a new library, or something like this. I’m doing a really bad job these days in trying to disconnect, also now as we’re working from home, because I’m always connected. But sometimes I say, “Okay, I need to block time,” and I just focus on one piece of work, and then, whatever happens, I’m simulating I’m on the plane. I’m not available. I think this is very important.
Yeah. To have some blocks where you focus on one single task.
That’s what it is. Yeah, exactly.
I don’t know if you like reading, if you have any book to recommend?
Yeah. It’s actually interesting. I’m using an app these days. It’s called Blinkist. I’m not sure if you’ve heard about Blinkist. They have taken a lot of these books, not novels, but industry books, and basically condensed each one into something you can read in 15 minutes. Basically, it’s a summary of every book that is meaningful in your line of work. There are books around technology, books on physical health and mental health, and all sorts of things, like history. They condense it into 15 minutes, and you can either read the blinks, as they call them, or, for some, they now also have a podcast version. Interestingly, the last one I heard or read was called Limitless.
It’s about how our brain is actually limitless, but often we’re limiting ourselves because of, for instance, constant distractions. I can really recommend that book. It’s called Limitless, by Jim Kwik, K-W-I-K. I read it or I heard it on Blinkist, but you can also get the book itself. That’s a great book. Another great book I can recommend, which I read on my last vacation, is called Team Topologies. You can get it on teamtopologies.com. It’s really interesting on how modern organizations should structure their teams, and it applies a lot to software organizations especially.
It covers how you should organize your teams, with value creation teams, platform teams, and then what you could call special teams that are involved with maintaining, let’s say, more complex systems. Really interesting. How to organize your team structures for modern software delivery.
Interesting, thank you. Do you want to invite our listeners to join anything, to join one of your podcasts or…
Yeah, of course. I run a podcast with my colleague, Brian Wilson. It’s called Pure Performance. It’s all about performance. We’ve been doing this for three and a half years now, and we have a lot of people on the podcast. It started as a fun project, and three and a half years later… It’s really cool. It’s still there.
I can just say, feel free to follow me on Twitter or on LinkedIn or GitHub, or whatever you want. My username is typically Grabner Andi. It’s my last name, Grabner, and then A-N-D-I, and you can see my content. It would be great if people could take a look at Keptn, give us feedback, tell us what they like, tell us what they don’t like, tell us what they miss. Join the Keptn Slack channel, slack.keptn.sh, and let’s help the world to build better software.
That’s a good motivation! Andi, muchas gracias. Adios.