How to do observability of distributed systems

20 May 2025

This is a guide on how to set up observability in a fairly mature company.

I assume that you have product-market fit. If you work at a scale-up or an enterprise, you might find something useful here. If you’re building a startup and have no idea yet whether the app will work, this guide is not for you.

This guide also assumes that you already have some monitoring in place, or at least the infrastructure for observability. In other words, you might already have some metrics, alerts and logs, but they are a mess: issues take a long time to fix, and you frequently hear about problems from your users.

Start with defining SLOs for features

First, define SLOs for user-facing features.

Status pages are a very good example of how to break a product down into features and how to define uptime.

Incident.IO is a particularly good example, because it not only has a breakdown of features, but also shows a history of incidents and defines the uptime.
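To make the idea of a per-feature uptime definition concrete, here is a minimal sketch of how uptime could be computed from an incident history and compared with an SLO target. The function names, the 30-day window, the "checkout" feature and the incident times are all made up for illustration, and the sketch assumes incidents do not overlap.

```python
from datetime import datetime, timedelta


def uptime_percent(window_start, window_end, incidents):
    """Return uptime over the window as a percentage.

    incidents is a list of (start, end) datetimes during which the
    feature was down. Assumption: incidents do not overlap each other.
    """
    window_seconds = (window_end - window_start).total_seconds()
    downtime_seconds = 0.0
    for start, end in incidents:
        # Clip each incident to the reporting window.
        clipped_start = max(start, window_start)
        clipped_end = min(end, window_end)
        if clipped_end > clipped_start:
            downtime_seconds += (clipped_end - clipped_start).total_seconds()
    return 100.0 * (1 - downtime_seconds / window_seconds)


def meets_slo(uptime, target_percent):
    """True if the measured uptime is at or above the SLO target."""
    return uptime >= target_percent


# Hypothetical 30-day window for a hypothetical "checkout" feature.
window_start = datetime(2025, 4, 1)
window_end = window_start + timedelta(days=30)
incidents = [
    (datetime(2025, 4, 3, 10, 0), datetime(2025, 4, 3, 10, 3)),    # 3-minute blip
    (datetime(2025, 4, 17, 14, 0), datetime(2025, 4, 17, 14, 45)),  # 45-minute outage
]

checkout_uptime = uptime_percent(window_start, window_end, incidents)
print(f"checkout uptime: {checkout_uptime:.3f}%")
print("meets 99.9% SLO:", meets_slo(checkout_uptime, 99.9))
```

With 48 minutes of downtime in 30 days, this feature would miss a 99.9% target but meet a 99.5% one, which is the kind of comparison a status page makes visible.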

How to define an SLO? I have a rule of thumb: I want to measure two things:

What if SLOs are not met?

The reasons for not meeting SLOs fall into roughly two categories: issues with resiliency, and issues with implementing fixes.

Issues with resiliency usually show up as temporary blips. For instance, page load time slows down for 3 minutes and then recovers on its own.

Issues with implementing fixes show up when an outage doesn’t resolve itself and an engineer needs to act on it. For example, one endpoint frequently times out because the database has no index for the query.

Sometimes the line between those two categories is blurry, and that’s ok. Just use your experience, take a guess and move on.
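As a rough illustration of that split, the sketch below tags each incident by a single question: did it resolve itself, or did an engineer have to step in? The `Incident` fields, the category labels and the sample data are hypothetical, and a real incident log would need a judgment call for the blurry cases.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Incident:
    description: str
    duration_minutes: float
    needed_engineer: bool  # True if a person had to intervene


def categorise(incident: Incident) -> str:
    """Hypothetical rule: self-resolving blips point at resiliency,
    everything that needed a person points at how fixes get implemented."""
    return "fix" if incident.needed_engineer else "resiliency"


# Sample data made up for illustration.
incidents = [
    Incident("page load slowdown", 3, needed_engineer=False),
    Incident("endpoint timeouts, missing database index", 90, needed_engineer=True),
    Incident("brief spike in errors", 2, needed_engineer=False),
]

counts = Counter(categorise(i) for i in incidents)
print(counts)
```

Counting incidents per category over a quarter gives a quick hint about which of the two areas deserves attention first.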

How to improve the metric to meet the SLO?

That depends on the category of the issue, so let’s look at each one separately.

How to improve resiliency?

How to improve Mean Time To Recovery (MTTR)?
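Before trying to improve MTTR, it helps to measure it. A minimal sketch, assuming each incident is recorded as a pair of timestamps (when it was detected and when it was resolved); the function name and the sample incidents are made up for illustration.

```python
from datetime import datetime


def mttr_minutes(incidents):
    """Mean time to recovery in minutes.

    incidents is a list of (detected_at, resolved_at) datetimes.
    Returns None when there are no incidents to average over.
    """
    if not incidents:
        return None
    total_seconds = sum(
        (resolved_at - detected_at).total_seconds()
        for detected_at, resolved_at in incidents
    )
    return total_seconds / len(incidents) / 60


# Hypothetical incident log: one 45-minute and one 15-minute recovery.
incident_log = [
    (datetime(2025, 4, 17, 14, 0), datetime(2025, 4, 17, 14, 45)),
    (datetime(2025, 4, 22, 9, 30), datetime(2025, 4, 22, 9, 45)),
]

mttr = mttr_minutes(incident_log)
print(f"MTTR: {mttr} minutes")
```

Tracking this number per feature over time shows whether changes to alerting, runbooks or tooling actually shorten recovery.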

All of the above is my baseline. It should cover most cases, but bear in mind that it’s not perfect. It won’t solve 100% of the issues, and you might need to adjust it.

But please keep it simple.