Reliability in a cloud-based microservice architecture
Let me share some of the lessons I've learned to help make your cloud-based microservice architecture reliable.
I’ve recently been told that it is not so common to see a cloud-based microservice architecture in companies that make software, either for their own product or for other companies. When you work in a company that leverages a cloud-based microservice architecture, and considering that “microservice” has been a buzzword for some years already, you could make the mistake of assuming that all companies work the same way. That’s not the case, though.
One of the many reasons is that microservice architectures are hard and, if they are distributed across one (or many) cloud environments, even harder. So ensuring that the services and systems living in such an environment work as expected is not an easy thing.
For the rest of this issue, I would like to share with you the strategies I put in place to (try to) achieve a good level of reliability for the services and systems that my team owns in a cloud-based microservice architecture.
I would group the strategies into three dimensions:
Test strategy
Logging
Metrics, traces, and alerts
Test strategy
What I see working pretty well is implementing your services via Test-Driven Development (a.k.a. TDD). Yes, it’s tough, I know; at least it was for me, and not only the first few times I followed this practice but for quite a while. However, once you implement the first feature with TDD you will see the benefit, and the beauty, from minute 1; the quality of the code is just better, flexible in the parts that must be flexible and rigid in the parts that must be rigid.
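To make that “minute 1” feeling concrete, here is a minimal sketch of the first TDD step: a failing test written before the production code exists. PriceCalculator and applyDiscount are hypothetical names invented for illustration, and I’m assuming JUnit 5.

```java
// The "red" phase: this test is written first, so PriceCalculator does not
// even exist yet; the compiler error and the failing test drive the design.
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class PriceCalculatorTest {

    @Test
    void appliesTenPercentDiscountToLargeOrders() {
        PriceCalculator calculator = new PriceCalculator();

        // The assertion states the behavior we want before writing any production code.
        assertEquals(90.0, calculator.applyDiscount(100.0), 0.001);
    }
}
```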
Now, I will try to be more specific about the kinds of tests I see as important to implement.
Acceptance tests. For me, this is a mix of unit testing and component testing, both practicing TDD. With unit tests, we test small pieces of code, and with component tests, using Testcontainers, we test broader features of the service.
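As a sketch of what such a component test could look like (not my exact setup), here is a JUnit 5 test that uses Testcontainers to spin up a throwaway PostgreSQL database; the table and values are invented for illustration.

```java
// A minimal component-test sketch: a real database in Docker, thrown away after the test.
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.junit.jupiter.api.Test;
import org.testcontainers.containers.PostgreSQLContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;

@Testcontainers
class OrderStorageComponentTest {

    // Testcontainers starts this PostgreSQL instance before the tests and stops it afterwards.
    @Container
    static PostgreSQLContainer<?> postgres = new PostgreSQLContainer<>("postgres:16");

    @Test
    void storesAndReadsBackAnOrder() throws Exception {
        try (Connection connection = DriverManager.getConnection(
                postgres.getJdbcUrl(), postgres.getUsername(), postgres.getPassword());
             Statement statement = connection.createStatement()) {

            // In a real component test this would exercise the service's own code path.
            statement.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total NUMERIC)");
            statement.execute("INSERT INTO orders VALUES ('42', 99.90)");

            ResultSet rs = statement.executeQuery("SELECT total FROM orders WHERE id = '42'");
            assertTrue(rs.next());
        }
    }
}
```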
Contract tests. We are in a microservice architecture, which means that, for a feature to work, the responsibilities might be distributed across different services or systems; these are its dependencies. This kind of test helps you check, very quickly, whether the changes you introduce in your service might break the communication with its dependencies (API, database, message broker, etc.). In my opinion, it should be possible for a developer to run these tests locally, so the feedback is fast. Also, in your Continuous Integration cycle, you will want a notification in your inbox, Slack, or MS Teams channel when the contract is broken.
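One way to write this kind of test (a sketch, not necessarily the tool you have to use) is a consumer-driven contract with Pact JVM; the service names and the /orders/42 endpoint below are invented for illustration.

```java
// A consumer-driven contract test sketch with Pact JVM (JUnit 5 flavour).
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;

import au.com.dius.pact.consumer.MockServer;
import au.com.dius.pact.consumer.dsl.PactDslJsonBody;
import au.com.dius.pact.consumer.dsl.PactDslWithProvider;
import au.com.dius.pact.consumer.junit5.PactConsumerTestExt;
import au.com.dius.pact.consumer.junit5.PactTestFor;
import au.com.dius.pact.core.model.RequestResponsePact;
import au.com.dius.pact.core.model.annotations.Pact;

@ExtendWith(PactConsumerTestExt.class)
@PactTestFor(providerName = "order-service")
class OrderClientContractTest {

    // Describes what this consumer expects from its provider dependency.
    @Pact(consumer = "billing-service")
    RequestResponsePact orderById(PactDslWithProvider builder) {
        return builder
                .given("order 42 exists")
                .uponReceiving("a request for order 42")
                .path("/orders/42")
                .method("GET")
                .willRespondWith()
                .status(200)
                .body(new PactDslJsonBody().stringType("id", "42").numberType("total", 99.90))
                .toPact();
    }

    // Runs the consumer code against a local Pact mock server, so feedback is fast.
    @Test
    @PactTestFor(pactMethod = "orderById")
    void fetchesAnExistingOrder(MockServer mockServer) throws Exception {
        HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(mockServer.getUrl() + "/orders/42")).GET().build(),
                HttpResponse.BodyHandlers.ofString());

        assertEquals(200, response.statusCode());
    }
}
```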
Integration tests. In a cloud environment, you want to ensure that your service plays well with its dependencies. Even though the contract tests mentioned above help you ensure that the data you exchange remains stable, there are other problems, like deployment, connectivity, permissions, and more, that you cannot tackle until you integrate. I would recommend running these tests as early as possible in your development life cycle (on each merge to the main branch, maybe?).
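As a rough illustration, an integration check can be as simple as a test that the CI pipeline points at the freshly deployed environment; the SERVICE_BASE_URL variable and the endpoint are assumptions, not a prescription.

```java
// A minimal integration-check sketch run against a real (non-production) environment.
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import org.junit.jupiter.api.Test;

class OrderServiceIntegrationTest {

    @Test
    void reachesTheDeployedServiceAndItsDependencies() throws Exception {
        // Points at whatever environment the CI pipeline just deployed to.
        String baseUrl = System.getenv("SERVICE_BASE_URL");

        HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(baseUrl + "/orders/42")).GET().build(),
                HttpResponse.BodyHandlers.ofString());

        // A 200 here exercises deployment, networking, and permissions, not just the code.
        assertEquals(200, response.statusCode());
    }
}
```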
Synthetic tests. I must confess I discovered this kind of test only recently. I like the definition from Datadog: “[…] Companies can leverage synthetic testing to proactively monitor the availability of their services, the response time of their applications, and the functionality of customer transactions. […]“. For me, these tests should focus on the critical path only (meaning a small set of tests, maybe 1 or 2) and should run on a schedule in every cloud environment where your service should be validated before jumping into production. And yes, I would recommend keeping them running in production too; this will help you know your service is misbehaving even before your customers do. In these tests, you have to mimic the behavior of a real user, which might imply deploying, and in the end cleaning up, some parts of the architecture and dependencies your service uses. Send me a comment if you would like a dedicated newsletter issue about synthetic tests and how I perform them.
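To give you an idea, here is a minimal sketch of a synthetic check: a tiny program that a scheduler (or your observability provider) runs every few minutes against a live environment. The URL, payload, and expected status code are assumptions for illustration.

```java
// A synthetic-check sketch mimicking the critical user action end to end.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class CheckoutSyntheticCheck {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();

        // Behaves like a real user creating an order (URL and body are hypothetical).
        HttpRequest request = HttpRequest.newBuilder(
                        URI.create("https://orders.staging.example.com/orders"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{\"sku\":\"demo\",\"quantity\":1}"))
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        if (response.statusCode() != 201) {
            // A non-zero exit code lets the scheduler raise an alert.
            System.err.println("Synthetic check failed: " + response.statusCode());
            System.exit(1);
        }
        System.out.println("Synthetic check passed");
    }
}
```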
This would be my test pyramid.
Maybe you noticed that I did not write about frameworks or tools (well, I did recommend Testcontainers) but focused on the strategy: the kinds of tests I find valuable, and how and when to use them.
From here, let’s jump to the next dimension.
Logging
It’s important to log only what is important. Why? You do not want to end up next month with a 1M bill from your observability provider because you decided to log a message, endlessly, every time your Kafka consumer could not connect to brokers that were down.
But what’s important? From my point of view, it’s important to log the full workflow of the features your service offers, from input to output. For example, if your service consumes data from a Kafka topic and puts the data in a mounted volume, you may want to log the following (a minimal code sketch follows the list):
When the data is consumed from Kafka, with relevant and key properties.
If your service does some kind of processing, put extra effort into logging the parts you see as risky.
And when the data goes into the mounted volume, log things like the target volume name, region, and other key values.
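Here is a minimal sketch of that workflow-level logging with SLF4J, following the Kafka-to-volume example above; the class and field names are hypothetical.

```java
// A sketch of logging the input and output sides of one feature's workflow.
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class MeasurementArchiver {

    private static final Logger log = LoggerFactory.getLogger(MeasurementArchiver.class);

    private final Path targetVolume;

    MeasurementArchiver(Path targetVolume) {
        this.targetVolume = targetVolume;
    }

    void archive(ConsumerRecord<String, String> record) throws Exception {
        // Input side: key properties of the consumed message, not its full payload.
        log.info("Consumed record topic={} partition={} offset={} key={}",
                record.topic(), record.partition(), record.offset(), record.key());

        Path target = targetVolume.resolve(record.key() + ".json");
        Files.writeString(target, record.value());

        // Output side: where the data ended up, so your future self can find it.
        log.info("Wrote record key={} to volume={} file={}",
                record.key(), targetVolume, target.getFileName());
    }
}
```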
Log the full trace of the features your service exposes. Think about the moment your future self will have to troubleshoot the feature and ask yourself: what would I need at that time?
Logging is one of the three pillars of observability, together with tracing and metrics. In the next dimension, I address observability in general and the alerting strategy needed for a reliable service.
Metrics, traces, and alerts
For metrics, I usually rely on the ones that come by default with the framework I use to implement microservices; for example, HTTP requests per second, JVM heap usage, and so on. It’s not often that I need to create a custom metric, but there are cases for sure. Do not go crazy with custom metrics, though.
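If you do need a custom metric, here is a minimal sketch using Micrometer (one common metrics library on the JVM, not necessarily the one your framework uses); the metric and class names are invented for illustration.

```java
// A single business-level counter on top of the out-of-the-box HTTP/JVM metrics.
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;

class OrderMetrics {

    private final Counter rejectedOrders;

    OrderMetrics(MeterRegistry registry) {
        this.rejectedOrders = Counter.builder("orders.rejected")
                .description("Orders rejected by validation")
                .tag("service", "order-service")
                .register(registry);
    }

    void onOrderRejected() {
        rejectedOrders.increment();
    }
}
```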
For tracing, it’s important for me to have a service map: a visual representation of the traces generated by the communication between my service and its dependencies, so I can detect possible issues or bottlenecks. Usually, observability tools and providers have this feature. You will want traces from the input to the output, inside some complex parts of your code, and showing which dependencies the input comes from.
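For the complex parts that automatic instrumentation does not cover, you can add a manual span. Here is a sketch with the OpenTelemetry API (assuming that is what your observability setup is based on); the names are illustrative.

```java
// A sketch of a manual span around one risky processing step.
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

class EnrichmentStep {

    private static final Tracer tracer =
            GlobalOpenTelemetry.getTracer("order-service");

    void enrich(String orderId) {
        Span span = tracer.spanBuilder("enrich-order").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            span.setAttribute("order.id", orderId);
            // ... the risky or slow processing happens here ...
        } finally {
            // Ending the span makes this step visible on the service map.
            span.end();
        }
    }
}
```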
Alerts are important. It’s often difficult to figure out which alerts you really need without being overwhelmed by too many; alerts that are noisy for nothing will make you relax and stop paying attention to them at some point. My approach to alerts is always the same:
Start small: Not too many alerts, just the ones for the critical functionality.
Stay concrete: Ensure the alerts can be understood by people outside the team and provide information about what to do in each case.
Final words
As you read, I’ve put extra emphasis on the testing part, since it’s crucial (in my opinion) for cloud services. My test “pyramid” is not the most orthodox, but it’s based on my experience. I would encourage you to give synthetic tests a try; as I said before, they were a great discovery for me.
Regarding logging, keep it simple and concrete, and the same goes for alerts, traces, and metrics. Don’t try to capture everything, or you will end up with too much (and useless) information.
I would love to read your comments about your experiences ensuring the resilience of microservices. Drop a comment or reply to this email.
Best,
Marcos.