The importance of concrete observability
I recently participated in an AWS region migration of services, and issues happened. Having concrete observability will move your Time To Detect from hours to a few seconds.
“You never know you need it until you really need it.” This is something you may have noticed at some point while operating software.
This month I was on On-Call duty during a weekend to assist in the migration of an AWS cloud region. Basically, I had to watch for anything going wrong when my services were deployed in a brand-new EKS cluster in the target region, and act if something was not working properly.
And yes, a problem happened (lucky me, not at night this time). However, thanks to having concrete observability, I was able to understand the issue in a couple of minutes instead of hours, and then build a working theory and solve the situation.
In today’s issue, I cover:
What is the Time To Detect
What is concrete observability
How to achieve concrete observability
Lessons learned
Time To Detect
You will find multiple pieces of literature about this DORA metric. Nowadays, there is some debate about whether it should be measured via the average (mean) or the median. This is a very interesting topic framed in the latest State of DevOps 2023 report (analysis of 2022 results), and the
newsletter (by the master Abi Noda) had a very good issue about it in October 2023, which I encourage you to read (and subscribe to, too! 👇).
For me, the Time To Detect is the time elapsed between the moment the problem starts happening and the moment you notice that it is happening.
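To make the definition tangible, here is a minimal sketch (in Python, with made-up incident timestamps) of how you could compute the TTD per incident and then compare the mean against the median, which is exactly where the debate above comes from:
```python
from datetime import datetime
from statistics import mean, median

# Hypothetical incidents: (moment the problem started, moment someone noticed it).
incidents = [
    (datetime(2023, 11, 4, 10, 0), datetime(2023, 11, 4, 10, 3)),  # 3 minutes
    (datetime(2023, 11, 4, 14, 0), datetime(2023, 11, 4, 14, 5)),  # 5 minutes
    (datetime(2023, 11, 5, 2, 0), datetime(2023, 11, 5, 6, 0)),    # 4 hours, nobody was watching
]

# Time To Detect for each incident, in minutes.
ttd_minutes = [(noticed - started).total_seconds() / 60 for started, noticed in incidents]

print(f"Mean TTD:   {mean(ttd_minutes):.1f} minutes")    # skewed by the 4-hour outlier
print(f"Median TTD: {median(ttd_minutes):.1f} minutes")  # closer to the typical case
```
With these made-up numbers, the mean is pulled above 80 minutes by the single night-time incident, while the median stays at 5 minutes; that is why the choice of statistic matters.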
In my case that weekend, if an issue happened, the Time To Detect was expected to be short, because a suite of Smoke Tests was scheduled to ensure the platform was working properly once the new region was deployed. And so it was: one of the tests spotted that something was not working as expected and, in a matter of minutes, I was called to address the issue.
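As an illustration of that safety net, here is a minimal smoke test sketch with pytest and requests; the base URL and the endpoints are hypothetical, not the actual tests that ran that weekend:
```python
import requests

# Hypothetical base URL of the service deployed in the new region.
BASE_URL = "https://my-service.eu-west-1.example.com"


def test_service_is_healthy():
    # The platform should answer on its health endpoint once the new region is up.
    response = requests.get(f"{BASE_URL}/health", timeout=5)
    assert response.status_code == 200


def test_packages_are_available():
    # A critical use case: the packages needed for installation must be reachable.
    response = requests.get(f"{BASE_URL}/packages", timeout=5)
    assert response.status_code == 200
    assert len(response.json()) > 0, "No packages found in the new region"
```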
What is concrete observability?
Concrete observability in Software Engineering, for me, is a set of metrics, logs, and traces that provide all possible information about the context related to a use case.
When you are in a production incident at night, your main goal is to mitigate the issue so that, later in the morning, you can address it properly. Having a logline or a metric that tells you something like:
The file “mypackage.xml” was not found in the AWS S3 bucket named “important-bucket” in region “eu-west-1”. The installation will not happen.
will help you, very quickly, to:
Understand the root cause of the issue: A file is missing in the AWS region, and without that file, the use case for “installing a package” cannot happen.
Have a working theory for mitigation: most likely, “mypackage.xml” was not uploaded to the S3 bucket (for some reason), so if I ensure once again that the XML files are present in the S3 bucket, I’ll have this issue mitigated and I can go back to sleep.
You can appreciate in the logline above that there are three key parts, which in my opinion make it concrete observability (see the sketch after this list):
The what: in this case, “mypackage.xml” was not found, and it should have been.
The dependency: in this case, an AWS S3 bucket and its region (very important).
What should happen and will not: in this case, the installation of that XML package.
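As a sketch of how such a logline could be emitted (the bucket, key, and function names are hypothetical, not the real service), note how the what, the dependency, and the consequence all end up in a single line:
```python
import logging

import boto3
from botocore.exceptions import ClientError

logger = logging.getLogger("installer")


def fetch_package(bucket: str, key: str, region: str) -> bytes | None:
    s3 = boto3.client("s3", region_name=region)
    try:
        return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    except ClientError:
        # The what, the dependency (bucket and region), and what will not happen.
        logger.error(
            'The file "%s" was not found in the AWS S3 bucket named "%s" '
            'in region "%s". The installation will not happen.',
            key, bucket, region,
        )
        return None
```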
That said, having concrete observability is not always easy. When you are developing your application or service, you cannot always see the forest, only the tree in front of you.
How to achieve concrete observability?
First, you have to identify your critical use cases. Those are the ones that, most likely, will get you a call at 3 a.m. You know your services and your code; you know which ones they are.
For the critical use cases, you should focus on the following (a sketch follows the list):
The entry point. Either a REST endpoint or the function that starts a process. Here you want to have metrics and logs.
Metrics, to understand the frequency of usage.
Logs, to record the input parameters.
The relationship with your dependencies. Dependencies like the example from the previous section: an AWS S3 bucket. Here you want to have metrics and logs too.
Metrics, to understand how often this dependency is used, and how long it takes to get an answer from it.
Logs, with all the context coming from the input parameters, all the context about the dependency (bucket name and region in the previous example), and what should happen once the dependency call is resolved.
Delicate code. You know that part of your service that has “sensitive” code.
Logs, to ensure that, in a troubleshooting session, the logs will give you the full journey.
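Here is a minimal sketch of what this could look like for an entry point and its S3 dependency, using the prometheus_client library for metrics and the standard logging module; the metric names, the install_package function, and the download_from_s3 stub are made up for the example:
```python
import logging
import time

from prometheus_client import Counter, Histogram

logger = logging.getLogger("installer")

# Entry-point metric: how often the critical use case is triggered.
INSTALL_REQUESTS = Counter("install_requests_total", "Package installation requests")
# Dependency metric: how long the S3 dependency takes to answer.
S3_LATENCY = Histogram("s3_get_object_seconds", "Latency of S3 get_object calls")


def download_from_s3(bucket: str, key: str, region: str) -> bytes:
    # Placeholder for the real S3 call (see the earlier sketch).
    return b"<package/>"


def install_package(package_name: str, bucket: str, region: str) -> None:
    INSTALL_REQUESTS.inc()
    # Entry-point log: record the input parameters.
    logger.info('Installing "%s" from bucket "%s" in region "%s"', package_name, bucket, region)

    start = time.monotonic()
    content = download_from_s3(bucket, f"{package_name}.xml", region)
    S3_LATENCY.observe(time.monotonic() - start)

    # Dependency log: the full context, plus what should happen next.
    logger.info(
        'Fetched "%s.xml" (%d bytes) from bucket "%s" in region "%s"; proceeding with the installation',
        package_name, len(content), bucket, region,
    )
```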
On top of that, it’s really useful to have tracing all over the place, so you have the full picture of the interactions.
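For instance, a minimal sketch with the OpenTelemetry Python API could look like the following; the span and attribute names are made up, and the exporter/provider setup is omitted:
```python
from opentelemetry import trace

tracer = trace.get_tracer("installer")


def install_package(package_name: str, bucket: str, region: str) -> None:
    # One span per critical use case, with the context attached as attributes.
    with tracer.start_as_current_span("install_package") as span:
        span.set_attribute("package.name", package_name)
        span.set_attribute("s3.bucket", bucket)
        span.set_attribute("s3.region", region)

        # A child span around the dependency call shows its share of the latency.
        with tracer.start_as_current_span("s3.get_object"):
            ...  # fetch the package from S3 here (see the previous sketches)
```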
Do not be afraid of being concrete and verbose in your metrics and logs; remember, these are the critical use cases of your application.
Lessons learned
In my case, the Time To Detect was expected to be short, thanks to the Smoke Tests that were scheduled. However, that might not be the case all the time, and having the right alerts is crucial for having a short TTD. When I say the right alerts, I do not mean having a lot of them to cover all possible scenarios, but the ones you really want to be notified about because your critical paths are affected.
Also, the alerts should be timely and meaningful for the operator, providing a way to find the right Runbook to apply in order to mitigate the problem. Yes, Runbooks, those short procedures that the operator should perform when a certain type of issue is raised, are very important to avoid getting you pulled into a call at 3 a.m.
Last but not least, for every call you are involved in, try to think about a way of not being involved in a similar call in the future. What can you put in place to avoid that situation?
Final words
Hope you enjoyed this read. Have you faced an On-Call situation recently? How was it and, more importantly, what did you learn from it? I would love to read your comments about this!