How to do observability of distributed systems
20 May 2025
This is a guide on how to set up observability in a fairly mature company.
I assume that you have product-market fit. If you work in a scale-up or an enterprise, you might find something useful here. If you're building a startup and have no idea whether the app will work at all, this guide is not for you.
This guide also assumes that you already have some monitoring in place, or at least the infrastructure for observability. That means you might already have some metrics, alerts and logs, but it's a mess: fixing issues takes a lot of time, and you frequently hear about problems from your users.
Start with defining SLOs for features
First, you define SLOs for user-facing features.
Status pages are a very good example of how to break a product down into features and define uptime for each.
Incident.IO is a particularly good example, because it not only has a breakdown of features, but also shows a history of incidents and defines uptime for each one.
How to define an SLO? I have a rule of thumb: I want to measure two things:
- Is my feature working?
- Is my feature working fast enough?
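To make that concrete, here is a minimal Python sketch of the two service level indicators behind those questions. It assumes you can get hold of per-request success and latency data; the `Request` type, the sample traffic and the targets (99.5% availability, 99% of requests under 300 ms) are all made up for illustration:

```python
from dataclasses import dataclass

@dataclass
class Request:
    ok: bool           # did the request succeed (no error, no timeout)?
    latency_ms: float  # how long it took

def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests that succeeded (is my feature working?)."""
    return sum(r.ok for r in requests) / len(requests)

def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Fraction of requests under the threshold (is it working fast enough?)."""
    return sum(r.latency_ms <= threshold_ms for r in requests) / len(requests)

# Made-up traffic and made-up targets, purely for illustration.
requests = [
    Request(ok=True, latency_ms=120),
    Request(ok=True, latency_ms=450),
    Request(ok=False, latency_ms=900),
]
print("availability SLO met:", availability_sli(requests) >= 0.995)
print("latency SLO met:", latency_sli(requests, 300.0) >= 0.99)
```

In practice you would compute these over a rolling window (say, 30 days) from your metrics store rather than from an in-memory list, but the two questions stay the same.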
What if SLOs are not met?
The reasons for not meeting SLOs fall broadly into two categories:
- Your system is not resilient enough
- Your engineers can’t fix a problem fast enough
Resiliency issues typically show up as temporary blips. For instance, a page's loading time slows down for 3 minutes and then resolves itself.
Issues with implementing fixes show up when an outage doesn't resolve itself and an engineer needs to act on it. For example, one endpoint frequently times out because the database has no index for the query.
Sometimes the line between those two categories is blurry, and that’s ok. Just use your experience, take a guess and move on.
How to improve the metric to meet the SLO?
That depends on the category of the issue, so let’s break it down separately.
How to improve resiliency?
- Throw money at your infrastructure and upgrade instances: increase the number of instances, the size of each instance, or both. This solution is not future-proof, however, because when traffic or data volume grows, you'll need to upgrade again.
- Add autoscaling. This can get expensive if the autoscaling is unbounded, though, so put a cap on it (see the sketch after this list).
- Optimise your code. Instead of spending money on infrastructure, you can spend money on engineers' salaries.
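On the autoscaling point, the part worth getting right is the cap. Below is a rough Python sketch of a target-tracking scaling rule with a hard upper bound, loosely modelled on how horizontal autoscalers such as the Kubernetes HPA compute a desired replica count; the function name, targets and limits are illustrative assumptions, not a real autoscaler:

```python
import math

def desired_replicas(current_replicas: int,
                     current_utilisation: float,
                     target_utilisation: float,
                     min_replicas: int = 2,
                     max_replicas: int = 20) -> int:
    """Target-tracking scaling: scale in proportion to how far utilisation
    is from the target, but never beyond a hard cap so the bill stays bounded."""
    wanted = math.ceil(current_replicas * current_utilisation / target_utilisation)
    return max(min_replicas, min(max_replicas, wanted))

# CPU at 90% against a 60% target on 6 instances -> scale out to 9,
# still well under the cap of 20.
print(desired_replicas(6, 0.90, 0.60))
```

The hard cap is what keeps the "unlimited autoscaling is expensive" problem in check: beyond it you are back to choosing between bigger bills and optimising the code.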
How to improve Mean Time To Recovery (MTTR)?
- First, build good dashboards. I personally like the Four Golden Signals defined in the Google SRE book: latency, traffic, errors and saturation.
- Second, have good alerting. First and foremost, you should have alerts on the SLOs, to understand the impact on users. Then it's worth adding more granular alerts, such as on saturation metrics.
- Third, have good logging. That includes structured logs, so it's easy to query on particular attributes. In a distributed system, tracing, with trace IDs included in the logs, is absolutely essential (see the sketch after this list).
- Fourth, run periodic drills. If an alert only goes off once a year, engineers might not be prepared to fix the issue. It's a good idea to break things on purpose in a separate environment and train engineers on how to fix them.
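On the logging point, here is a minimal sketch of structured JSON logs with a trace ID attached, using only Python's standard logging module. The logger name, the fields and the way the trace ID is generated are placeholder assumptions; in a real service the ID would come from the incoming request context:

```python
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, so log aggregators
    can filter on individual attributes."""
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # Attached via the `extra` argument below; "-" if missing.
            "trace_id": getattr(record, "trace_id", "-"),
        })

logger = logging.getLogger("checkout-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# In a real system the trace ID comes from the incoming request
# (e.g. a W3C traceparent header); here we just generate one.
trace_id = uuid.uuid4().hex
logger.info("payment authorised", extra={"trace_id": trace_id})
```

Once every log line carries the trace ID, you can pull up everything a single request did across services with one query, which is exactly what you need when an engineer is trying to bring MTTR down during an incident.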
All of the above is my baseline. It should cover most cases, but bear in mind it's not perfect: it won't solve 100% of your issues, and you might need to adjust it.
But please keep it simple.