In today’s issue, I want to tell you about an interesting technique that gets Software Engineers to take care of the observability of their services and, as a collateral benefit, of the issues in production.
The problem most of us Software Engineers have is that we create plenty of good metrics, traces, and logs, then set up an amazing dashboard (in Grafana, New Relic, Datadog, etc.), and end up never looking at it afterward. I have seen this many times.
But what about this? In your office, place a TV in the desk area where the Software Engineers work their 8-hour days. Ideally, position it at a height where it is visible from the entire office, and use it to display the alerts dashboard.
You may say “Marcos, having a dashboard is not practical; you cannot keep switching browser tabs all the time to check for alerts”. You are absolutely right; we are software engineers, not watchdogs. This proposal is a complement, something on top of the direct alert notifications that your team members receive when they are on call. But during office hours, you may miss a notification (an email, a ping in Slack).
If an alert is raised on this dashboard during office hours, where it is highly visible to everybody, everyone will start to ask “What’s going on?”. Someone else may notice the alert popping up before you (the owner) do, but this will happen only one or two more times, because after that you will not want that feeling again: being the last one to find out that your service is in trouble. You will see that on some occasions the Software Engineers jump on the incident before they even get the PagerDuty notification in their email.
The benefit of this proposal is that your Software Engineers will make sure the right observability is in place, the kind that really matters, because they will have to jump in immediately whenever an alert pops up, and that is something no one wants; a production incident hurts your customers and, therefore, you!
Another benefit is that you will find the blind spots in your observability. Sooner or later an incident in production will pop up, for example a problem in the API Gateway or the DNS resolver, while your beautiful dashboard stays completely green. Yes, your product is not working properly for your customers, but your citadel is perfectly OK. That’s not cool, and you have just found a spot to improve in your observability! Be careful here though, and focus on using the SLIs that fall under your ownership. You can find an interesting article about this written by
👇🏻 Let’s wrap up for today. The takeaways:
Create an alert dashboard for the teams in the office.
Set it up on a TV screen that is visible to everyone (within your company).
Yes, this does not work for fully remote teams.
It’s a complement to direct alert notifications via email, Slack, SMS, etc.
It will help to improve your observability.
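To make the takeaways a bit more concrete, here is a minimal sketch in Python of how the content of such a TV screen could be selected. Everything in it is an assumption for illustration only: the `Alert` fields, the severity names, and the `wall_view` and `render` helpers are hypothetical and do not come from Grafana, New Relic, Datadog, or any other tool. It only shows the idea: display the alerts that are firing and that belong to the teams sitting in that office, most severe first.

```python
from dataclasses import dataclass

# Hypothetical severity ranking: a lower number is shown first on the screen.
SEVERITY_ORDER = {"critical": 0, "warning": 1, "info": 2}


@dataclass(frozen=True)
class Alert:
    name: str       # what is being alerted on, e.g. an SLI
    owner: str      # team that owns the underlying service
    severity: str   # expected to be one of SEVERITY_ORDER's keys
    firing: bool    # True while the alert is active


def wall_view(alerts, teams_in_office):
    """Return the firing alerts owned by the given teams, most severe first."""
    visible = [a for a in alerts if a.firing and a.owner in teams_in_office]
    # Unknown severities sort last; ties are broken by name for a stable screen.
    return sorted(
        visible,
        key=lambda a: (SEVERITY_ORDER.get(a.severity, len(SEVERITY_ORDER)), a.name),
    )


def render(alerts):
    """Turn the selected alerts into the text shown on the TV."""
    if not alerts:
        return "ALL GREEN"
    return "\n".join(f"[{a.severity.upper()}] {a.owner}: {a.name}" for a in alerts)


if __name__ == "__main__":
    alerts = [
        Alert("checkout latency SLI", "payments", "warning", True),
        Alert("login error rate", "identity", "critical", True),
        Alert("disk usage", "platform", "info", False),
    ]
    print(render(wall_view(alerts, {"payments", "identity"})))
```

Filtering by owner keeps the screen focused on the SLIs your teams are responsible for, and the explicit "ALL GREEN" state is exactly the one to be suspicious of when customers are complaining anyway.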
If you put this technique (a trick?) in place, I would love to hear how it worked for your use case! Share your experience in the comments or just reply to this email!
Best,
Marcos.