Bullshit and nonsense.
But let’s take it from the beginning.
The industry’s story goes something like this:
“‘Monitor everything’ is universally recognized as an anti-pattern.”
“You’ll drown in metrics, burn out your engineers, and blow your budget.”
“Just focus on 3–10 signals — the Four Golden Signals, RED, USE — and ignore everything else.”
“Trust us, you don’t want that much telemetry.”
(true, read the whole story here)
Then, in the same breath:
“During the incident, we discovered we had monitoring gaps.”
“We had to build new dashboards in the middle of the outage.”
“Our tooling couldn’t distinguish internal vs external failures.”
“We had no visibility into that dependency.”
“We’re adding new dashboards now.”
82% of organizations now have MTTR over 1 hour (and rising),
51% admit they lack sufficient observability,
and only 10–14% have mature capabilities.
(also true, read the whole story here)
You see the contradiction already, right?
The same industry that tells you “collect less, simplify, trust the experts” is also the industry where:
- Cloudflare had a 36-hour outage because their observability stack couldn’t see a power source change.
- Resend’s alerts auto-resolved themselves before escalating.
- AWS, Datadog, and others consistently discover blind spots during incidents.
- Dashboards fail to reflect business impact until after real users scream.
This isn’t an observability strategy. It’s observability by hindsight.
Right. Good. Now we’re having fun.
I think it’s absolutely fair to call bullshit on how “monitor everything is an anti-pattern” is usually presented. Because here’s the trick: the research is mostly factually correct, but the conclusion drawn from it is wrong.
And it’s wrong because the entire discussion confuses technique with architecture, and cost models with capability.
The truth is simple:
The pain is real.
The cause is misdiagnosed.
And Netdata solves the problem at the architectural level instead of telling you to lower your expectations.
Let me untangle this cleanly.
The key distinction nobody else seems able to articulate
Everyone loves to debate “monitoring too much.” Almost nobody distinguishes between:
- Instrumentation / collection: what the system knows.
- Shipping / retention / storage: where the data lives, at what cost, and at what resolution.
- Attention / action: what humans see, and what pages humans at 3 a.m.
These layers are not the same. Yet most vendors treat them as if they are.
So they end up giving you blanket dogmas like:
- “Don’t collect too much.”
- “Don’t increase cardinality.”
- “Don’t store high-resolution data.”
- “Don’t monitor everything — it’s an anti-pattern.”
Why? Because their architectures can’t handle it.
Their collectors are heavy. Their pipelines are centralized. Their storage is priced by the byte and by the series. Their business model depends on you not sending them too much data.
When a SaaS vendor tells you “monitor everything is an anti-pattern,” what they mean is:
“Please don’t send us everything because our cloud bill will explode before yours does.”
Netdata flips this entirely.
Netdata’s philosophy is:
Layer 1: instrument the hell out of everything. Cheap, local, per-second, automatic.
Layer 2: store it on the node. No $/metric billing. No cardinality tax. No centralized choke point.
Layer 3: be ruthless about what hits human attention. Alerts on symptoms. Dashboards that prioritize what matters now. ML and correlations surface the rest when needed.
This is why Netdata can say yes to the ambition (“see everything”) without burning you.
Everyone else says no because their architecture forces them to.
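To make the three layers concrete, here is a minimal sketch in Python. It is an illustration under my own assumptions (the metric names, retention window, and threshold are invented, and this is not Netdata's code): collect everything every second, retain all of it locally, and let only one symptom ever reach a human.

```python
import collections
import random
import time

# Layer 1: instrument everything, cheaply and locally, once per second.
def collect_all_metrics():
    # Stand-in for the hundreds of per-second readings a real agent gathers.
    return {
        "cpu.user": random.uniform(0, 100),
        "disk.io.await_ms": random.uniform(0, 50),
        "app.request_error_rate": random.uniform(0, 0.02),
    }

# Layer 2: retain at full resolution on the node (a ring buffer here),
# so nothing is discarded just because nobody is looking at it yet.
RETENTION_SECONDS = 3600
history = collections.defaultdict(lambda: collections.deque(maxlen=RETENTION_SECONDS))

# Layer 3: be ruthless about human attention. Only a user-facing symptom
# can page anyone; everything else waits in local storage for drill-down.
ERROR_RATE_PAGE_THRESHOLD = 0.05  # illustrative, not a recommended value

def maybe_page(sample):
    if sample["app.request_error_rate"] > ERROR_RATE_PAGE_THRESHOLD:
        print("PAGE: user-visible error rate is elevated")

if __name__ == "__main__":
    for _ in range(5):  # a few iterations instead of an endless loop
        sample = collect_all_metrics()
        for name, value in sample.items():
            history[name].append(value)
        maybe_page(sample)
        time.sleep(1)
```

The point of the sketch is the asymmetry: layers 1 and 2 are indiscriminate and cheap, layer 3 is tiny and deliberate.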
Metric fatigue? A presentation failure, not a data failure.
Metric fatigue is real — but it happens at the human layer.
If you give people:
- 20,000 static charts
- no hierarchy
- no grouping
- no ML
- no “start here, drill down there” logic
…then of course they stop looking.
That’s not because you collected too much. It’s because the UI is a wall of chaos.
Netdata fixes this by:
- Automatically building layered dashboards
- Grouping everything by role, app, process, container, network interface, disk, DB, etc.
- Using ML anomalies and correlations to point to what matters
- Making the long tail of metrics available only when someone drills down
You don’t scroll through noise. You follow the trail of evidence.
Alert fatigue? A policy failure, not a telemetry failure.
Alert fatigue happens in systems that page on:
- Raw CPU percentages
- Latency jitter
- Every spike
- Every error
- Every dip
- Every threshold breach
That’s just sloppy alert design.
Netdata’s stance:
- Collect everything
- Alert on almost nothing
- Prioritize symptoms, not raw counters
- Correlate before you notify
- Keep human attention sacred
Where classic tools force you to choose:
- collect little (fly blind)
- or collect everything (drown)
Netdata eliminates the trade-off entirely.
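To show what “alert on symptoms, not raw counters” can look like in practice, here is a hedged sketch of a multi-window error-budget burn-rate check in Python. The SLO target, window sizes, and the 14.4 fast-burn multiplier are assumptions borrowed from common SRE practice, not Netdata defaults: a CPU spike never pages; a sustained burn of the error budget does.

```python
# Symptom-based paging sketch: page on error-budget burn rate, never on a
# raw counter or a single spike. All numbers below are illustrative.

SLO_TARGET = 0.999                # 99.9% of requests should succeed
ERROR_BUDGET = 1.0 - SLO_TARGET   # so 0.1% of requests may fail

def burn_rate(errors, requests):
    """How fast the error budget is being consumed (1.0 = exactly on budget)."""
    if requests == 0:
        return 0.0
    return (errors / requests) / ERROR_BUDGET

def should_page(short_window, long_window):
    """Page only if both a short and a long window burn fast.

    Each window is a tuple of (errors, requests). Requiring both windows
    avoids paging on a brief blip (short-only) or on slow, ticket-worthy
    drift (long-only).
    """
    return burn_rate(*short_window) > 14.4 and burn_rate(*long_window) > 14.4

# Example: 5-minute and 1-hour windows.
print(should_page((120, 6_000), (1_200, 70_000)))  # True: sustained fast burn
print(should_page((120, 6_000), (30, 70_000)))     # False: just a blip
```

Whatever the exact policy, the property that matters is that the paging rule reads from a tiny, symptom-shaped slice of the telemetry while everything else stays collected and queryable.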
High cardinality? A billing problem, not a physics problem.
Centralized TSDB vendors have a natural enemy: cardinality.
Every new label combination = new bill.
Every new dimension = new cardinality explosion.
Every new service = new cost multiplier.
So of course they tell you “don’t monitor everything.”
Netdata solves this the only correct way:
- store high-res metrics locally, at the edge
- ship only what you choose
- no per-series billing
- no centralized bottleneck
- no latency
- no resource starvation
- no “sorry, you hit your tier limit”
The conclusion is obvious:
The data was never the problem.
The architecture was the problem.
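As a rough sketch of that edge-first shape (the class, its policy, and the allowlist are my own illustration, not Netdata internals): keep every series at per-second resolution on the node, and ship only an explicitly chosen, downsampled subset anywhere central, so cardinality never turns into a bill.

```python
from collections import defaultdict, deque

class EdgeStore:
    """Keep every series at per-second resolution locally; export selectively."""

    def __init__(self, retention_seconds=3600, export_allowlist=(), export_window=60):
        self.series = defaultdict(lambda: deque(maxlen=retention_seconds))
        self.export_allowlist = set(export_allowlist)  # what we *choose* to ship
        self.export_window = export_window             # downsample before shipping

    def ingest(self, name, value):
        # Local cardinality is bounded by the node's RAM and disk,
        # not by a per-series line item on someone else's invoice.
        self.series[name].append(value)

    def export_batch(self):
        """Return one downsampled average per allowlisted series."""
        batch = {}
        for name in self.export_allowlist:
            points = list(self.series[name])[-self.export_window:]
            if points:
                batch[name] = sum(points) / len(points)
        return batch

store = EdgeStore(export_allowlist={"app.request_error_rate"})
for second in range(120):
    store.ingest("cpu.user", 40.0 + second % 5)      # stays local, full resolution
    store.ingest("app.request_error_rate", 0.001)    # local + shipped as a 1-minute average
print(store.export_batch())
```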
The hidden hypocrisy: vendors preaching limitations they created
Let’s be blunt.
Most of the anti-pattern rhetoric comes from vendors whose products simply cannot ingest, store, or process the volume of data modern systems generate at the edge.
So what do they do?
They don’t fix the architecture. They redefine “best practice” around their limitations.
They turn scarcity into doctrine.
And then customers internalize this doctrine as gospel.
Meanwhile, during every real outage, teams discover that:
- the missing metric was important
- the missing label was critical
- the missing dependency was the cause
…and they add it after the fact.
This cycle repeats endlessly.
Netdata breaks that cycle by ensuring the data is always there before you even know you’ll need it.
What’s actually non-negotiable: actionability at the human boundary
Here is where the SRE community is 100% right:
- Alerts must be rare.
- Alerts must be actionable.
- Dashboards must reduce cognitive load.
- On-call must be humane.
What they get wrong is: “therefore you should collect less.”
No.
You should show humans less. You should page humans rarely.
You should not delete telemetry to achieve this.
Netdata collects everything precisely so it can automatically highlight the right subset when something breaks:
- anomalies
- correlated metrics
- degraded SLOs
- saturation symptoms
- per-process breakdowns
- per-container patterns
- per-second context
That’s why Netdata feels like cheating when you debug — because you always have the missing dimension.
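Here is a toy version of that “highlight the right subset” step in Python, so the mechanism is concrete. The scoring is a deliberately naive mean-shift I made up for illustration (Netdata's actual metric correlations are more sophisticated); what it demonstrates is that ranking every series against an incident window is only possible if every series was collected in the first place.

```python
def rank_by_change(series, baseline, incident):
    """Rank metrics by how much their incident-window mean shifted vs baseline.

    `series` maps metric names to lists of samples; `baseline` and `incident`
    are slice objects selecting the two windows to compare.
    """
    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    scores = {}
    for name, values in series.items():
        base, inc = mean(values[baseline]), mean(values[incident])
        scores[name] = abs(inc - base) / (abs(base) or 1.0)  # relative shift
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

metrics = {
    "disk.io.await_ms":       [5, 5, 6, 5, 40, 45, 50, 48],         # shifted hard
    "cpu.user":               [30, 32, 31, 29, 33, 30, 31, 32],     # barely moved
    "app.request_error_rate": [0.001, 0.001, 0.001, 0.001, 0.02, 0.03, 0.02, 0.02],
}
# First half of each list is the baseline, second half the incident window.
for name, score in rank_by_change(metrics, slice(0, 4), slice(4, 8)):
    print(f"{name}: {score:.1f}x relative shift")
```

In Netdata the equivalent ranking is automated and per-second, but the dependency is the same: the missing dimension has to exist before anything can surface it.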
The punchline
If I had to summarize the whole thing in one line:
Treating every metric as equally important is an anti-pattern.
Not having the metric when you need it is an even worse one.
Netdata lives right in the middle:
- We refuse to lose data you’ll need in the worst hour of your year.
- We refuse to shove all of that data in your face or page you with it.
- We architect so that “everything” is cheap and local, and “what matters now” is global and human.
So yes — you can keep all the facts from the anti-pattern research. You should reject its conclusion. And you should weaponize the facts:
The industry is right about the pain of naive “monitor everything.”
Netdata exists so you can have “monitor everything” without the pain.
This is not bullshit. This is just what happens when architecture finally catches up to the ambition.