Template On-Call

On-call rotations in SaaS companies exist to ensure continuous platform availability and reliability, and keeping a good record of each shift will help you improve every time.

Apr 17, 2024

In today’s issue, I want to share with you another resource that I’ve built over time, based on my experiences with On-Call duties.

Here are the links to this template:

Now, let me explain how to use it.

On duties table

In this table, you list the people involved in the On-Call duty, along with the dates and times. I find this useful because on occasions like the weekend, more than one member of the team may participate.

In my experience, the timing is important because it helps team members understand exactly when their duty starts and when it ends.
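To illustrate why explicit start and end times matter, here is a minimal Python sketch of the same table as data. The names, dates, and the `Shift` and `on_duty` helpers are hypothetical and not part of the template, and it assumes all times are in a single time zone.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Shift:
    person: str      # hypothetical team member
    start: datetime  # inclusive
    end: datetime    # exclusive


def on_duty(shifts, moment):
    """Return the names of everyone whose shift covers `moment`."""
    return [s.person for s in shifts if s.start <= moment < s.end]


# Hypothetical weekend rota: two people overlap on Saturday afternoon.
SHIFTS = [
    Shift("Alice", datetime(2024, 4, 20, 8, 0), datetime(2024, 4, 20, 20, 0)),
    Shift("Bob", datetime(2024, 4, 20, 14, 0), datetime(2024, 4, 21, 8, 0)),
]

print(on_duty(SHIFTS, datetime(2024, 4, 20, 15, 0)))  # ['Alice', 'Bob']
```

With the times written down, the overlap on Saturday afternoon is explicit instead of something the team has to guess.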

General Info

General info is the place where you put the context: all the information that the person on call might need but that is not required to operate the system itself. For example, if the on-call is about watching a migration process from a cluster in Region A to Region B, you may want to have that information in this section.

So, it’s not crucial information for the duties, but it is there for reference. Imagine data that sits on the hard drive rather than in RAM because it’s accessed only occasionally.

Runbooks

This section will contain code: the runbooks that the person on duty might need, each with a short and clear description of what it solves or does. Keep in mind that the person on duty will have to operate fast, so the information should not have a complex structure.

The operator should not have to take time to understand what to do; copy/paste is the aim here.

Typically, these are bash commands: kubectl commands, connecting to a cluster, restarting a Linux service, etc.

I repeat: commands you can copy/paste, and they run like a charm.
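As a rough sketch of that "short description plus exact command" shape, the Python below keeps each runbook entry as a pair and renders it as a flat block ready to paste into the template. The context, namespace, deployment, and service names are placeholders I made up, and `RunbookEntry` and `render` are hypothetical helpers, not something the template provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunbookEntry:
    purpose: str  # one short line: what the command solves or does
    command: str  # the exact command, ready to paste


# Hypothetical entries: every name below is a placeholder.
RUNBOOK = [
    RunbookEntry("Point kubectl at the production cluster",
                 "kubectl config use-context prod-cluster"),
    RunbookEntry("Rolling restart of the API pods",
                 "kubectl rollout restart deployment/api -n payments"),
    RunbookEntry("Restart a Linux service on the host",
                 "sudo systemctl restart nginx"),
]


def render(entries):
    """Render entries as flat text: a comment line, then the command."""
    return "\n\n".join(f"# {e.purpose}\n{e.command}" for e in entries)


print(render(RUNBOOK))
```

The flat output is deliberate: one comment line and one command per entry, with nothing to parse under pressure.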

Known workarounds

You know, sometimes not everything works out of the box, and there are some glitches you have identified.

Here, as well, you put code. For those "known issues", you have a workaround ready in case they happen. If you know that a restart of “service Z” will solve problem “X”, indicate it here with the exact command to run.

Observability

Here you will have, in bullet points, the list of links to the relevant observability tools. The exact links, taking into account the possible regions, clusters, namespaces, or whatever applies. The exact link.

Just as you are interested in your own services, you should also be interested in the links for the services that your system depends on. If your service uses a database, you want a link to check how it is behaving.
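If your dashboards follow a predictable URL pattern, one way to get the exact links is to generate the bullet list once and paste it into the template. The sketch below assumes a made-up URL pattern, regions, and namespaces; replace them with whatever your observability tooling actually uses.

```python
from itertools import product

# Hypothetical URL pattern and values; none of these come from the template.
DASHBOARD_URL = "https://dashboards.example.com/api-overview?region={region}&namespace={namespace}"
REGIONS = ["eu-west-1", "us-east-1"]
NAMESPACES = ["payments", "payments-db"]  # the service and the database it depends on


def dashboard_links(url_template, regions, namespaces):
    """Return one bullet per region/namespace pair, each with its exact link."""
    return [
        f"- {namespace} ({region}): " + url_template.format(region=region, namespace=namespace)
        for region, namespace in product(regions, namespaces)
    ]


print("\n".join(dashboard_links(DASHBOARD_URL, REGIONS, NAMESPACES)))
```

Note that the dependency (the database namespace here) gets its own links, next to the service itself.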

Logs

Like the captain of a ship, the person on duty will keep a log of what happened during her/his watch. See the example provided in the template.

The log should be human-readable, so that a colleague can understand how things are going (or went), but without too many details.
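For a sense of the level of detail, here is a small Python sketch that formats log entries as one line each: when, who, and what happened. The format and the sample entries are my own assumptions for illustration, not the example shipped with the template.

```python
from datetime import datetime


def log_line(moment, author, note):
    """Format one watch-log entry: timestamp, who wrote it, and a short note."""
    return f"{moment:%Y-%m-%d %H:%M} [{author}] {note}"


# Hypothetical entries from a migration watch.
entries = [
    log_line(datetime(2024, 4, 20, 14, 5), "Alice", "Migration to Region B started."),
    log_line(datetime(2024, 4, 20, 15, 40), "Alice", "Latency alert fired; restarted service Z; recovered."),
]

print("\n".join(entries))
```

One line per event is usually enough for the next person on watch to catch up quickly.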

Final words

I hope you find it useful, and I would love to hear your feedback about it. Put it in the comments or just reply to the email!

The Optimist Engineer
