The who, what, why, and how of observability for teams building software

In this Quality Sense episode, host Federico Toledo sits down for a virtual chat with Lisa Crispin. She recently assumed the role of Quality Owner at OutSystems, a low-code development platform based in Portugal. She’s also very well known for her work with Janet Gregory, as a co-founder of the Agile Testing Fellowship and a co-author of many influential books on agile testing.

What to Expect?

  • The two discussed what this new term “observability” is, and how it’s related to monitoring, testing and even chaos engineering
  • The challenges of testing modern software with all of its complexities and dealing with the “unknown unknowns”
  • How the project OpenTelemetry and different tools can help to implement observability
  • What to read if you want to learn more about Continuous Delivery, DevOps and Observability
  • How collaboration and pairing is a great way to learn and improve whatever you do

Listen Here:

Episode Transcript:

(Lightly edited for clarity.)

Federico:

Hello Lisa, thank you so much for joining the podcast!

Lisa:

Federico, it’s my pleasure. I’m so excited that you have this new podcast, and I’ve been listening to the episodes. Really great stuff.

Federico:

Thank you. Thank you so much. So you’re starting a new job, right? How is it going?

Lisa:

Yeah. Just at the end of March, I joined OutSystems, which is a low-code platform and it’s quite interesting. 

My team is in Portugal, so that’s a new experience and it’s interesting to start a new job in these strange times when everybody is remote. 

It’s kind of good for me because I was going to be remote anyway, but it’s interesting to meet new people, and not only do you meet them, you meet the people who are living with them [due to COVID-19]. Their kids, their pets… so it’s on a more personal level, I guess.

And of course, people are a little more stressed out with all that’s going on too. So it’s an interesting time to start a new job and learn something brand new. But I’ve been really enjoying it, and I’ve started to learn Portuguese.

Federico:

Is this the first time you’re working for a company outside the U.S.?

Lisa:

No. Actually, years ago I worked for a German software company for a long time. I was based in Denver, but I got to go over to Germany for three or four weeks at a time and work with the developers there, which was really fun. So it’s exciting, it’s really interesting to go and see how people build software in other parts of the world.

I think there are fewer differences now that we’re all so connected, but back in the 80s and 90s you would see really different ideas and some areas more innovative than others. So very interesting.

Federico:

So the main topic I wanted to address with you today is related to observability. I know that you have been working around the topic lately.

Lisa:

Yes, I am. In my new job, my main focus is to help build an observability practice at OutSystems, both internally for us as a company supporting our customers, and also for our customers who build applications with our product, so they can build observability into those too. So it’s a big effort. It’s just getting started.

Federico:

So just to start, what is observability about? Because I think it’s a pretty new term.

Lisa:

It’s really new and definitely in the testing area, a lot of testers haven’t heard of it. But even when you talk to people who work in operations that have done monitoring and logging and all those things, they don’t really know what it is. 

Some people think, “Oh, it’s just a new name for monitoring.” And of course it’s quite different from that.

Federico:

How is observability different from monitoring?

Lisa:

Well we’ve had monitoring for a long time. We’ve had our dashboards, we write log data to a log file so that when something goes wrong, we can debug it. 

The thing is that we tend to instrument our code for the things that we know might go wrong or that we expect could go wrong. We set up alerts when we exceed certain performance thresholds or error thresholds. We provide alerts so that somebody can look into it. If there’s a crash, we record the crash. 

But if something happens that we didn’t expect, we don’t have any logging for that part of the code. I’m sure a lot of people listening have had this experience just say, “Okay, we can’t figure it out. It’s a bad problem in production. Now we have to add some logging to our code and re-deploy.” 

And depending on how easy it is to deploy in your environment, that can be really painful while the issue is still hurting users.
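To make Lisa’s point concrete, here is a minimal sketch of traditional threshold-based monitoring: alerts fire only for the failure modes someone anticipated and wired in ahead of time. Everything here, the thresholds, the function name, the example failure, is invented for illustration:

```python
"""Hypothetical sketch: alerts only cover the failure modes we expected."""
import statistics

# Thresholds chosen in advance -- the "known" failure modes.
LATENCY_THRESHOLD_MS = 500
ERROR_RATE_THRESHOLD = 0.05

def check_alerts(latencies_ms, error_count, request_count):
    """Return the list of alerts triggered by pre-defined rules."""
    alerts = []
    if statistics.mean(latencies_ms) > LATENCY_THRESHOLD_MS:
        alerts.append("mean latency exceeded")
    if request_count and error_count / request_count > ERROR_RATE_THRESHOLD:
        alerts.append("error rate exceeded")
    return alerts

# A failure mode nobody anticipated (say, one customer's requests silently
# returning empty results) trips neither rule, so nothing fires:
print(check_alerts([120, 95, 130], error_count=1, request_count=100))  # []
```

The “unknown unknown” sails straight past both rules, which is exactly the gap observability is meant to close.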

Federico:

And reproduce the issue, right?

Lisa:

Right. Well, you have to get more information about it so you can reproduce it. So it can be a long, drawn-out process. Meanwhile, your customers are feeling pain.

So observability is instrumenting more things in our code, more events in our system, so that if something goes wrong that we totally could not have imagined (despite all of the risk assessment and planning we did in advance), we can still investigate it.

Observability lets us delve in and see, “Oh, what went wrong? What module was that? Look, this one thing took a long time and we didn’t expect that; now we can drill into it. We can also trace what users did.”

We have a lot of really complex distributed systems now, and a user takes a path through our application.

“Observability allows us to trace what users did across all the different services and APIs to see exactly all the steps they took so we can reproduce the problem. It’s a way to explore production using really sophisticated tools that are available to us and also because we’re able to store huge amounts of data at an affordable cost now.”

LISA CRISPIN

So these things have come together, this wasn’t possible a few years ago.
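The tracing Lisa describes can be illustrated with a toy sketch: a trace ID generated at the edge is propagated through every service, and the spans recorded along the way let us reconstruct one user’s whole journey afterwards. The service names and the in-memory span store are invented; a real system would send spans to a tracing backend:

```python
"""Toy sketch of distributed tracing across services (invented names)."""
import uuid

SPANS = []  # in a real system this is a tracing backend, not a list

def record_span(trace_id, service, operation):
    SPANS.append({"trace_id": trace_id, "service": service, "op": operation})

def checkout(trace_id):
    # Each "service" records a span tagged with the same trace ID.
    record_span(trace_id, "cart-service", "load_cart")
    record_span(trace_id, "payment-service", "charge_card")
    record_span(trace_id, "email-service", "send_receipt")

trace_id = str(uuid.uuid4())   # generated at the edge, propagated everywhere
checkout(trace_id)

# Reconstruct the journey for one request by filtering on its trace ID:
journey = [s["service"] for s in SPANS if s["trace_id"] == trace_id]
print(journey)  # ['cart-service', 'payment-service', 'email-service']
```

Filtering on the trace ID is what turns a pile of per-service logs into the end-to-end user journey Lisa is talking about.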

Federico:

Yeah. Now that you mention it, I’m thinking that’s true. This is a key tool for analyzing more things in different parts of the system, because systems have more components and more things to monitor or analyze. So you need more space to store all this data.

But what I understood is that basically, we have to get ready for possible problems that we have no idea could occur, right?

Lisa:

Right, right. The “unknown unknowns,” as they say.

Federico:

Yeah. Huge challenge.

Lisa:

Yeah. One of the best analogies I’ve heard (or maybe it’s a metaphor, I never know the difference), I think it’s from Pierre Vincent, who said something like this in his talk at TestBash Manchester:

You might have a very fast and powerful car, which could get you someplace very fast, but if it’s foggy and you don’t have visibility, the car has to go very slow. 

And you compare that to a jet, it can go fast even in the clouds because it has instrumentation that allows it to have visibility. A different kind of visibility. 

And so he says this is a different kind of visibility into our software that allows us… We’re in a fast paced world, we’re doing continuous delivery, we want to be able to deploy small changes all the time at a sustainable pace. And this is one of the important components of that.

Federico:

Interesting. How can we as testers collaborate on that? Because observability is a property of our systems, right? So we should test how observable our systems are?

Lisa:

That’s a really good point, Federico. 

I’m struggling with that myself. You’re lucky if you work in a team that has already learned how to instrument the code properly and create these spans and events (it’s a whole new terminology), and that has tools that allow you to do the tracing and help you identify exactly where a problem occurred.

Then, you can just build relationships with the people, the platform engineers, site reliability engineers, developers, who are using those tools and maybe pair with them and learn the tools. But I think very few organizations have really mastered this so far and so what I’m doing where I work is just, again, it’s about building the relationships. 

There are people in R&D where I work who know something about it. They certainly know a lot about the logging built into the system and the monitoring that’s already available. But the people on the front lines helping the customers tend to know the most about it, because they’re the ones struggling with debugging the problems.

We’ve had to create teams with people from both sides of that organization and work together to do some proofs of concept and figure out what’s the best way to instrument our code. 

There’s a new project called OpenTelemetry that looks like it’s going to become an industry standard: it provides a way to instrument your code to create and store these events that’s compatible with a lot of tools.

So different vendors are contributing and then supporting the standard so that if you want to choose a different tool for observability, you don’t have to re-instrument your code.

So that’s really helpful. It’s a pretty recent thing; these things are just coming together.

It’s like this is really the infancy of observability, I think. It has helped me a lot to talk to people like Abby Bangser, who’s also a tester. She’s the one who got me interested in this, and she’s on the platform engineering team at MOO right now.

And talking to her, one thing that helps her is that she has pretty good coding skills. She has a development background, and she can actually go into the code and poke around herself and understand what’s going on better. I’d have to be pairing with a developer to do that, because my coding skills aren’t that good. So I do feel that’s a limitation on me getting hands-on with it.

So I have to rely on collaboration. I need to get the engineers to pair with me, to work together with me and that’s an extra challenge. But at the same time, I find I can bring them value because I’m thinking of questions that they haven’t thought to ask themselves yet.
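The “instrument once, swap backends freely” idea behind OpenTelemetry that Lisa describes can be sketched as a pluggable exporter. This is a simplified stand-in, not the real opentelemetry-api, but the shape is similar: application code talks only to a neutral tracer, and the vendor backend can be replaced without touching the instrumentation:

```python
"""Sketch of vendor-neutral instrumentation (simplified stand-in,
not the actual OpenTelemetry API)."""

class Exporter:
    """Interface every backend implements."""
    def export(self, span): ...

class ListExporter(Exporter):
    """Stand-in for a vendor backend (Honeycomb, Lightstep, Jaeger...)."""
    def __init__(self):
        self.received = []
    def export(self, span):
        self.received.append(span)

class Tracer:
    def __init__(self, exporter):
        self.exporter = exporter
    def start_span(self, name, **attributes):
        span = {"name": name, **attributes}
        self.exporter.export(span)
        return span

# Application code depends only on Tracer, never on the backend:
def handle_request(tracer, user_id):
    tracer.start_span("handle_request", user_id=user_id)

backend_a = ListExporter()
handle_request(Tracer(backend_a), user_id=42)

backend_b = ListExporter()                     # "switching vendors"
handle_request(Tracer(backend_b), user_id=42)  # no re-instrumentation needed

print(len(backend_a.received), len(backend_b.received))  # 1 1
```

This is exactly why a shared standard matters: the instrumentation investment survives a change of observability tool.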

Federico:

I don’t think collaboration is a bad thing.

It requires two people to solve the issue, but at the end of the process, you will probably learn something about how the solution is built, and he or she will learn something about how to think critically about the way users use the system.

Lisa:

Yeah. And of course it also holds me back that I don’t know the product very well and I don’t know how the customer uses it. So that’s my other effort is really ramping up and taking the tutorials and learning the product because that’s important. You have to understand the data, you have to understand the types of things people are doing so that you can know how you want to investigate the problems.

Federico:

Something you told me the last time we spoke, that I found really interesting about the relationship between observability and testing: basically, the goals of both are more or less the same.

Because it’s like getting information about the usage of the system or its quality, considering different aspects of the application, with the goal of providing this information to someone who is going to make a decision.

Lisa:

Right. Well, it definitely helps us. There are a lot of different ways, and the lines are kind of blurred here, I think, between observability and analytics. But I think it’s also important how you instrument your code for the analytics: what are users doing?

If we have a new customer, what do they do first? Where do they give up and abandon the product because they couldn’t figure it out? 

So we need that information for building our product and building the next new feature. But we also need the information to know where they are spending the most time. What seems to be the most valuable to them or where are the most errors occurring? So where do we want to focus our testing? We want to focus our testing on all the things that are valuable and all the things that are problematic.

And so having this insight into production… It used to be, we just guessed at that, right? 

Maybe we interviewed customers, and we certainly had a lot of information in terms of tickets or problems people reported, but that didn’t really tell us what was most valuable to them.

So this is an extra tool, because it’s getting harder and harder to test everything. We have continuous delivery, we’ve got a fast pace. Our test environments don’t look like production, even less so nowadays with distributed systems.

We still want to do all the testing we can before we release, but we have to take advantage of all these extra tools like observability, testing in production and using feature toggles, chaos engineering. 

So I think it’s all part of the package and testers really need to be involved on both sides of that continuous delivery loop or DevOps loop. We need to be involved all the way around because we need that firsthand knowledge of what’s going on in production as well.

Federico:

Yeah. Shifting right, focusing also on the right side of the loop.

Lisa:

Yeah. The right side of that loop and the left side of that loop. So it really informs us the whole way. 

Federico:

I hadn’t thought about that… Chaos engineering is also related to observability, because it’s maybe a way of trying to improve the observability of the system by injecting errors, right? And trying to verify that your team is capable of analyzing and fixing the issue with the information they have.

Lisa:

Right. And I’m definitely not a chaos engineering expert, but…

Chaos engineering is one way to discover the unknown unknowns. It’s like, “Oh, let’s bring a server down and see what happens. Let’s drop a database table, see what happens.” 

And you don’t have to do it in production; you can do it in a staging environment. But a couple of years ago at the European Testing Conference, Sarah Wells did a keynote on what they do at the Financial Times, and she called chaos engineering “tool-assisted exploratory testing in production.” I thought that was so apt, since these are all forms of exploratory testing and learning about our production system.
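A chaos experiment in the spirit Lisa describes can be sketched as: inject a dependency failure on purpose, then check both that the system degrades gracefully and that the failure shows up in your observability data. All the components and names here are invented:

```python
"""Minimal chaos-experiment sketch: fail a dependency, verify graceful
degradation AND that the failure is observable. Everything is invented."""

events = []  # stand-in for our observability pipeline

def fetch_recommendations(chaos=False):
    """Primary call with a fallback; chaos=True simulates the dependency
    being taken down, as in a chaos experiment."""
    try:
        if chaos:
            raise ConnectionError("recommendation-service unreachable")
        return ["personalised", "items"]
    except ConnectionError as exc:
        # The failure is recorded, so the team can see it happened.
        events.append({"event": "fallback_used", "reason": str(exc)})
        return ["generic", "bestsellers"]  # degraded but still working

# Steady state: normal behaviour.
assert fetch_recommendations() == ["personalised", "items"]

# Inject the failure and verify both resilience and observability:
result = fetch_recommendations(chaos=True)
print(result, events[0]["event"])  # ['generic', 'bestsellers'] fallback_used
```

The experiment passes only if both conditions hold: users still get something useful, and the team can see from the recorded event that the fallback kicked in.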

Federico:

To do that, to do chaos engineering or to work on improving the observability of the system, which skills do you think we should develop or improve?

Lisa:

There’s a whole range of skills. Definitely, when we’re working on our code, building new features, we want to think about testability and operability. What’s the best way to instrument our code? How do we make sure we capture all the information we need? And as testers, we have a lot of new tools to learn, right?

As a tester, I find this kind of tool very interesting because you can go in and do queries, and it gives you help in doing that. It’s like doing SQL queries on the data, but with a UI that helps you do it.

And as soon as you start to see a pattern, like “There was a long response time here,” you can dig into it. It has a bubble-up feature that suggests, “Here are the different pieces of data that had anomalies in them; you may want to investigate these. This particular module took a long time, or this particular function took a long time.”

And it speeds up looking into it. There are other tools too: LightStep is one, and Kibana has a new APM that does tracing, so that supports observability as well.

But the other thing I see (and I don’t know what these tools use behind the scenes) is that I wonder if machine learning could be applied to detect these kinds of anomalies. Not to solve the problem for you; people talk about AIOps and how artificial intelligence could do all this for you. I don’t think it can do it for you, but I think it can help you investigate more quickly, because it can surface those patterns.
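A “bubble up” style analysis like the one Lisa describes can be approximated by comparing attribute distributions between slow and fast events: values that dominate the slow group but are rare in the fast group are the suspects to investigate. The events and the 200 ms cutoff below are made up for illustration:

```python
"""Sketch of bubble-up style anomaly surfacing over invented event data."""
from collections import Counter

events = [
    {"module": "search", "duration_ms": 40},
    {"module": "search", "duration_ms": 55},
    {"module": "export", "duration_ms": 900},
    {"module": "export", "duration_ms": 1200},
    {"module": "login",  "duration_ms": 35},
]

SLOW_MS = 200  # arbitrary cutoff separating "slow" from "fast" requests

def bubble_up(events, attribute):
    """Which values of `attribute` dominate the slow events?"""
    slow = Counter(e[attribute] for e in events if e["duration_ms"] > SLOW_MS)
    fast = Counter(e[attribute] for e in events if e["duration_ms"] <= SLOW_MS)
    # Values common in slow events but rare in fast ones are suspects.
    return [v for v, n in slow.most_common() if n > fast.get(v, 0)]

print(bubble_up(events, "module"))  # ['export']
```

Real tools run this kind of comparison across many attributes at once, which is why they can point a human at the anomalous module so quickly.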

Federico:

Yeah. You mentioned finding patterns, and I think that analyzing the history of your logs and metrics and everything can help to identify patterns more easily, right?

Lisa:

Yeah. Testers, we’re good at spotting funny patterns. We’re good at identifying risks and other people on the team are too. 

And now that they’re all online, I’ve been able to attend a lot more DevOps and continuous delivery type conferences, and what I find is that mostly they’re talking about testing and quality. They’re not specialist testers themselves, but obviously that’s a lot of what they’re doing. And I’ve found them very welcoming to testers: “Yeah, come and help us.”

Federico:

Yeah. I had the same feeling the first time I read the continuous delivery book.

Lisa:

Exactly. Yeah.

Federico:

I said this is a book about testing, right? But-

Lisa:

It sure is, yeah. 

So I want to encourage testers: it sounds scary to talk about DevOps, and these tools can be scary because they’re not intuitive. You can’t just jump in and use them because, like I said, you have to understand your data and your application and what your customers are doing.

But testers, if you already know your domain and you already know your customers, it’s going to be easier for you. 

Federico:

Yeah. And not only that, because from what you mentioned, I understand that you need an understanding of the architecture and the technology behind the system in order to understand where to pay attention or what to look for, right?

Lisa:

Very true.

Federico:

Maybe training with different people in your team could be very useful.

Lisa:

Yeah, I’m counting on that myself.

Federico:

Cool. What are the typical sources of information? 

Lisa:

That’s a really good question. I have found it really helpful to have engineers walk me through their diagrams of architecture. I may not understand it, just looking at it, but having them walk me through it makes so much sense. I love the visuals, they’ve really helped me learn. So that’s one area. 

I think just learning about that data, what’s all the log data being gathered? How could you put that data together to say something meaningful, to look into things?

There’s the monitoring data, there’s data from our systems as a company, there’s data from our customer systems and their applications. So there are all these different levels. And I actually made a big mind map of all the teams who were doing any kind of logging or any kind of monitoring or working on observability and all the tools that we’re using. And it ended up being this giant mind map and I was really surprised, of course it’s a big organization.

But people in different areas have realized they need data, they need better data and they’re not all doing it in the same place or in the same way. But it’s definitely a big focus and I think more and more companies are going to see they need to make a big push in this area because it’s an opportunity. 

It’s something we didn’t have before, and it can reduce our pain in investigating the cause of our problems or in preventing customer problems… because now that we have things like dark launches, release feature toggles, and progressive rollouts, we can put something in production and not only monitor it, but use the observability tools to spot unusual patterns.

So it just gives us all these extra tools to help us solve our customer problems faster. 

Federico:

Another question that comes to mind: When should we start working on observability in the development process? 

Lisa:

Yeah. I think when we’re planning new features, new changes, we have to talk about how we need to instrument our code. What do we want to capture as we’re doing this? It’s all part of making your code testable and operable. Operability, I think, is what relates to observability: making sure that you’re capturing all the events you need to capture, and capturing all the data so you can trace user journeys through the application. Be thinking about that as you’re creating them.

And even if you’re working on a legacy system, I had a really interesting conversation with Austin Parker last week. He works for LightStep and that’s one of the observability tools and he’s also contributing a lot to the OpenTelemetry open source project.

And he has free office hours! You can just sign up to spend time with him and ask questions. I talked to him last week, and one of the things he said is that if you’re breaking up a monolith, don’t wait until you’ve got that done to instrument it.

You can go ahead and instrument your code with something like OpenTelemetry and start capturing all this information, so that you can see things like performance and latency. Then, as you break a piece of that monolith off into an API or service, you can see how the performance is now that you’re going from the monolith through the API and back to the monolith.

And I hadn’t thought about that. It helps guide you; it helps you know if you’re going about breaking it up the right way. If you suddenly have a performance problem when you do that, you know, “Oh, you did something the wrong way.” You need to go back and rethink it.

So OpenTelemetry can also act as a safety net to help guide refactoring a big legacy application. I thought that was pretty fascinating. So there are a lot of different applications for it.

As I learn, I’m finding more and more uses for this kind of data. And it’s just like anything else. It’s just like when you don’t have any automated unit tests and you’ve got a legacy code base: what do you do? You start refactoring it, and as you refactor it and try to design it better, you use unit tests to help you design it and to build up a safety net. Over time you build that safety net up bigger and bigger. And I think you can do that with telemetry as well, both for monitoring and for observability.
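Austin Parker’s suggestion (instrument the monolith before breaking it up, then watch latency as pieces are extracted) can be sketched with simple timing spans. The function and span names here are illustrative, and a real setup would record these through OpenTelemetry rather than a Python list:

```python
"""Sketch of timing spans as a safety net while breaking up a monolith."""
import time
from contextlib import contextmanager

timings = []  # stand-in for a telemetry backend

@contextmanager
def span(name):
    """Record how long the wrapped block took, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings.append((name, (time.perf_counter() - start) * 1000))

def monolith_pricing():      # still inside the monolith
    with span("pricing.monolith"):
        time.sleep(0.001)    # stand-in for the pricing logic

def extracted_pricing():     # same logic, now behind an API boundary
    with span("pricing.service"):
        time.sleep(0.001)    # a real network hop would add latency here

monolith_pricing()
extracted_pricing()

# Compare before/after: a big jump flags a bad cut of the monolith.
for name, ms in timings:
    print(f"{name}: {ms:.1f} ms")
```

Because both versions report through the same spans, a latency regression after extracting a piece shows up immediately, which is the safety-net effect Lisa describes.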

Federico:

Excellent. Some final questions because I think we could continue talking about this for hours.

Lisa:

Well, I feel like I’m just talking off the top of my head and I don’t really know it in depth, but I just see so much potential in it and so much value in it.

Federico:

That’s amazing. So one of the questions I have is whether you have any books to suggest. It could be related to observability or to anything else.

Lisa:

The book I recommend the most, and I hear a lot of people recommend it, so maybe it’s kind of boring, is Accelerate by Dr. Nicole Forsgren, Jez Humble, and Gene Kim.

I’ve found it really valuable because my own teams are now trying to make this shift: build in observability, succeed with continuous delivery. We’re trying to do a lot, so we need to know what we can measure to see if we’re progressing on our journey.

So the Accelerate book provides those metrics that correlate with high performing teams. So we know good things to measure and it helps us with the culture and the leadership that we need.

So these are big efforts, especially when you have a legacy code base and you need visionary leadership and you need a lot of support and models and things to help the teams know how to go along on this journey. 

And it’s all part of getting to the point of being able to frequently deliver small changes that are valuable to customers, at a sustainable pace and with lower risk, because we’re making really small changes that we can revert if we need to, or turn that feature flag off if we need to.

It’s all part of a piece. And I like how the information in Accelerate can provide a foundation no matter what business domain you’re in; it can provide a lot of guidance for you, I think.

Federico:

You’ve reminded me of a great webinar I saw a couple of weeks ago… Someone talking about BDD and continuous delivery!

Lisa:

Oh, that was a fun webinar! These things support each other, right? Like I say, it’s all part of a piece. And I think some people, maybe Abby Bangser, are exploring: how do we do BDD as we develop our infrastructure? The infrastructure that supports continuous delivery, the infrastructure that supports observability and monitoring. We can drive those with business-facing or operations-facing tests as well. All these things we know as testers, we can apply them in so many ways that help our teams.

Federico:

Yeah, in different stages or different parts of the processes we’re running. Another question: do you have anything to suggest to our listeners to check out, like any upcoming trainings?

Lisa:

Oh, such a good question, Federico. We do have our Agile Testing for the Whole Team course. 

It was a three-day live course, and now Janet has adapted it to be virtually facilitated, on Zoom or something similar, so the course can now be done in five days. Janet is going to do it over five days, four hours a day. Some people are stretching it out even into ten days.

But Janet is going to be offering it herself for the first time in June, and this is a great opportunity if you’re in a time zone where it works for you: you can learn directly from Janet, ask her questions, and pick her brain.

And she’s absolutely the best facilitator. She asks the best questions, the ones that really make you think. You get hands-on practice working with a small group… and she’s only going to have 10 people in the course. That’s a really, really small group.

I think it’s a really special opportunity. But all our trainers do an awesome job. 

So if you go to agiletestingfellow.com, you can see all the training being offered. It can also be done as a private course within your company. So we’re really excited that we’ve been able to adapt it for remote.

Federico:

Thank you so much, Lisa. It was a pleasure. I enjoyed-

Lisa:

Thank you, Federico. Yeah, that’s wonderful. And I’m honored to be here.

Federico:

Thank you. See you around. Bye, bye.

Lisa:

Bye, thank you so much.


Recommended for You

Q&A with Lisa Crispin: BDD and Continuous Delivery
Quality Sense Podcast: Alon Girmonsky – Testing Microservices with UP9