Use Case · Alert Fatigue

Stop managing alert fatigue. Prevent it.

Q: What's the real cause of alert fatigue?

Two causes dominate. First, static thresholds that don’t track workload changes — a threshold set six months ago fires constantly on healthy infrastructure today. Second, single-model anomaly detection that fires on every deviation — without consensus voting, normal noise looks anomalous. Both are mechanism problems, not headcount problems. No amount of better on-call scheduling fixes them.

Q: Can I still write threshold-based alerts when I want to?

Yes. Static-threshold alerts are still useful for SLO breaches, hard capacity ceilings, and known-failure-mode signatures. Netdata supports them alongside the ML-driven anomaly signals. The advantage is that you can now reserve threshold alerts for the cases where they genuinely matter — instead of using them as a stand-in for anomaly detection on every metric.

Q: How does Netdata's pricing work for ML-on-every-metric?

Per node. Cloud Business starts at $4.50 per node per month on annual plans, and the per-node price decreases as your node count grows. ML and anomaly detection are included, not sold as a paid AI tier. The architecture works because the models run on the agent itself, not in a per-metric-billed SaaS backend. There are no per-metric, per-ingest, per-user, or per-seat fees layered on top.

Most alert-fatigue tools cluster, correlate, and suppress the noise after it fires. That’s symptom management. Netdata prevents the noise at the source — per-metric ML with consensus voting means only anomalies that multiple independent models agree on ever become alerts.

Static thresholds set six months ago

Workload changed; thresholds didn’t.

Alerts that fire every shift on healthy infrastructure

Operators silence the alert. Then a real incident fires the same alert, and nobody sees it.

One ML model per metric with no consensus check

A single statistical baseline flags normal noise as an ‘anomaly’.

AIOps that pages on every traffic spike

On-call learns to dismiss the AI’s alerts. The AI tier was a waste.

Cardinality-limited monitoring forces aggregation

The label that would have explained which user/region/pod was affected got dropped.

Alerts that fire but can’t be diagnosed

Operator spends 30 minutes pulling logs to figure out what the alert was even about.

Alerts on every individual metric in isolation

A cascading failure trips 50 alerts, each from a different metric in the same chain.

Pager floods on real incidents — the worst time for noise

Operator spends the first 15 minutes muting alerts before triaging.

18 ML models per metric

Each metric gets 18 independent unsupervised models trained on rolling windows. An anomaly is declared only when a consensus of models agrees — single-model false positives are suppressed automatically.

Per-second granularity

Transient spikes that look like noise at 60-second resolution are visible at 1-second resolution, so the alert fires on the underlying cause rather than the most obvious symptom.
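To make the resolution point concrete, here is a minimal sketch in Python. The latency values and the `downsample` helper are hypothetical illustrations, not Netdata code: a 3-second spike is obvious in per-second data and nearly vanishes once the same samples are averaged into 60-second buckets.

```python
from statistics import mean

def downsample(samples, bucket):
    """Average consecutive `bucket`-sized groups of per-second samples."""
    return [mean(samples[i:i + bucket]) for i in range(0, len(samples), bucket)]

# Hypothetical data: 120 seconds of latency at 20 ms with one 3-second spike to 400 ms.
per_second = [20.0] * 120
per_second[30:33] = [400.0, 400.0, 400.0]

per_minute = downsample(per_second, 60)

print(max(per_second))  # peak visible at 1-second resolution
print(max(per_minute))  # same spike after averaging into 60-second buckets
```

In this made-up series the per-second peak is 400 ms, while the highest 60-second average is 39 ms, which is easy to mistake for ordinary jitter.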

Per-metric ML, not a curated subset

Per-node pricing means scoring every collected metric is economical. No ‘pick your golden signals’ tradeoff — every metric has anomaly detection on it.

Anomaly Advisor consolidates alerts

Anomaly Advisor groups correlated anomalies during an incident and ranks them by relevance. The operator sees the top 30–50 correlated metrics, not 500 individual pages.

Rolling baselines, not static thresholds

Models retrain continuously on recent data, so the baseline tracks workload changes automatically. No quarterly threshold re-tuning ritual.
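The difference between a static threshold and a rolling baseline can be sketched in a few lines of Python. Everything here is an assumption for illustration: the threshold value, the window length, and the simple mean-and-standard-deviation test stand in for whatever model a real agent trains.

```python
from statistics import mean, pstdev

STATIC_THRESHOLD = 70.0  # hypothetical: chosen months ago, when healthy load was ~50

def static_alert(value):
    """Fire whenever the value exceeds the fixed threshold."""
    return value > STATIC_THRESHOLD

def rolling_alert(history, value, window=60, k=3.0):
    """Fire when value is more than k standard deviations from the
    mean of the most recent `window` samples."""
    recent = history[-window:]
    mu, sigma = mean(recent), pstdev(recent)
    return abs(value - mu) > k * max(sigma, 1.0)  # floor sigma to avoid a zero-width band

# Hypothetical workload that has grown: healthy values now cycle between 80 and 84.
history = [80.0 + (i % 5) for i in range(120)]

healthy, spike = 82.0, 120.0
print(static_alert(healthy), rolling_alert(history, healthy))  # static fires, rolling stays quiet
print(static_alert(spike), rolling_alert(history, spike))      # both fire on a real deviation
```

The static rule pages on every healthy sample once the workload shifts; the rolling baseline only reacts to the value that departs from recent behavior.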

Native routing to PagerDuty, Slack, OpsGenie, Teams

When an alert does fire, it routes to your existing on-call workflow without an integration project. Suppression at the source means fewer pages reach the rotation.

Alert fatigue approaches compared

How vendors approach alert noise

Two strategies dominate the category. Most vendors manage noise after the fact; Netdata’s strategy is to suppress false positives at the source.

Where the work happens
Netdata: At the metric. An anomaly only fires if a consensus of the 18 models agrees.
Alert-correlation tools: At the alert stream. Clusters and deduplicates alerts already firing.

False-positive suppression mechanism
Netdata: Consensus voting across independent ML models.
Alert-correlation tools: Statistical clustering of historically-correlated alerts.

Per-metric ML coverage
Netdata: Every collected metric scored continuously.
Alert-correlation tools: Typically none. Relies on the upstream tool’s alerts.

Pricing dimension
Netdata: Per-node (ML included).
Alert-correlation tools: Per-event or quote-based enterprise.

Works without upstream tooling?
Netdata: Yes. Netdata generates the underlying signals.
Alert-correlation tools: No. Requires existing alerting infrastructure.

Latency to suppression
Netdata: Suppressed before the alert fires (zero false-positive pages).
Alert-correlation tools: Suppressed after the cluster groups the alerts (page already fired).

The mechanics of consensus-vote anomaly detection

Per-metric models, not curated golden signals

Most observability platforms apply ML to a small curated set of metrics, typically the four golden signals, because their pricing makes it too expensive to score every metric. Netdata’s per-node pricing lets the agent run ML on every metric it collects, at the edge, with no per-metric cost dimension.

18 models / metric

How Netdata's anomaly detection works

Consensus voting suppresses the false-positive flood

Each of the 18 models is trained on a different rolling window. When new data arrives, all 18 score it independently. An anomaly is declared only when a consensus of models agrees — a threshold that suppresses the single-model false positives that drive most alert fatigue. The operator’s page only fires when 18 independent baselines, trained on different recent histories, all flag the same observation as off-normal.
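The voting step can be sketched in Python under some loud assumptions. The z-score test, the window lengths, and the `quorum` parameter below are hypothetical stand-ins; this page does not specify the model type or the exact agreement rule, so the sketch defaults to requiring every model to agree.

```python
from statistics import mean, pstdev

def model_flags(history, value, window, k=3.0):
    """One hypothetical model: a z-score test against the last `window` samples."""
    recent = history[-window:]
    mu, sigma = mean(recent), pstdev(recent)
    return abs(value - mu) > k * max(sigma, 1e-9)

def consensus_anomaly(history, value, windows, quorum=None):
    """Return (is_anomaly, votes). An anomaly is declared only if at least
    `quorum` models flag the value; quorum=None means all must agree."""
    votes = sum(model_flags(history, value, w) for w in windows)
    needed = len(windows) if quorum is None else quorum
    return votes >= needed, votes

# 18 models, each looking at a different amount of recent history (assumed lengths).
windows = [10 * (i + 1) for i in range(18)]

# Hypothetical metric: noisy around 50, then perfectly flat for the last 10 samples.
history = [40.0, 60.0] * 100 + [50.0] * 10

print(consensus_anomaly(history, 55.0, windows))   # only the shortest-window model objects
print(consensus_anomaly(history, 120.0, windows))  # every model objects
```

With this data, the value 55 is flagged only by the model whose window happened to cover the flat stretch, so no anomaly is declared. The value 120 is off-normal against every window and passes the vote.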

Multi-model consensus

Read about the Anomaly Advisor

Anomaly Advisor groups correlated incidents

When a real incident happens, dozens of metrics across the affected service typically anomalize together. The Anomaly Advisor groups these into a single timeline view ranked by relevance — operators see the top 30–50 correlated metrics, not 500 individual pages. The first-responder workflow shifts from ‘mute the alerts’ to ‘read the timeline.’
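As a rough illustration of grouping and ranking, the Python sketch below orders metrics by how often they were flagged during an incident window. Ranking by anomaly rate is an assumption made for this example (the page does not define ‘relevance’), and the metric names are hypothetical.

```python
def rank_by_anomaly_rate(anomaly_bits, top_n=50):
    """anomaly_bits maps metric name -> list of 0/1 flags, one per sample
    in the incident window. Returns the top_n (metric, rate) pairs with a
    non-zero rate, highest rate first, ties broken by name."""
    rates = {m: sum(bits) / len(bits) for m, bits in anomaly_bits.items() if bits}
    ranked = sorted(rates.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(m, r) for m, r in ranked if r > 0][:top_n]

# Hypothetical incident window of four samples per metric.
incident = {
    "db.query_latency": [1, 1, 1, 1],
    "app.error_rate":   [0, 1, 1, 1],
    "net.retransmits":  [0, 0, 1, 0],
    "disk.free":        [0, 0, 0, 0],
}

for metric, rate in rank_by_anomaly_rate(incident, top_n=2):
    print(f"{metric}: {rate:.0%} of samples anomalous")
```

The operator reads one ordered list for the incident instead of receiving a separate page for each metric that moved.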

Top 30–50 ranked

Explore real-time troubleshooting

Frequently asked questions

How is alert prevention different from alert correlation?

Alert correlation (BigPanda, Moogsoft) clusters alerts that have already fired and suppresses duplicates. It’s symptom management — it reduces the volume of pages but doesn’t change the underlying signal quality. Alert prevention works at the source: anomaly detection only triggers an alert when multiple independent ML models agree something is wrong. The false-positive page never gets generated. Both approaches are valid; the prevention approach has a lower ceiling on noise and a fundamentally different cost model.

Why do 18 ML models per metric reduce false positives more than one?

Each of the 18 models is trained on a different rolling time window, so each has a different view of ‘recent normal.’ When new data arrives, all 18 score it independently. If only one model flags the data as anomalous, that is most likely a model-specific quirk (say, its training window happened to cover an unusually narrow range of values). If a consensus of models agrees, the signal is far more likely to be real. Because a spurious flag would have to occur in many models at once, consensus voting yields a much lower false-positive rate than a single-model statistical baseline or a threshold-based alert.
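A back-of-the-envelope calculation shows why agreement matters. The Python sketch below assumes each model falsely flags a normal data point with the same probability `p` and that the models err independently. Both are simplifying assumptions: models trained on overlapping windows of one metric are correlated, so real-world gains will be smaller than this idealized figure.

```python
from math import comb

def consensus_false_positive_rate(p, models=18, quorum=18):
    """Probability that at least `quorum` of `models` independent detectors,
    each with false-positive rate p, flag the same normal data point."""
    return sum(
        comb(models, j) * p**j * (1 - p) ** (models - j)
        for j in range(quorum, models + 1)
    )

p = 0.01  # hypothetical per-model false-positive rate
print(consensus_false_positive_rate(p, models=1, quorum=1))    # single model
print(consensus_false_positive_rate(p, models=18, quorum=10))  # majority-style vote
print(consensus_false_positive_rate(p, models=18, quorum=18))  # unanimous vote
```

Under these idealized assumptions, a stricter quorum drives the joint false-positive probability down by many orders of magnitude relative to a single model.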

Does this work for metrics I haven’t told it about?

Yes. Netdata auto-discovers services at install time and begins collecting their metrics immediately. Every collected metric gets ML treatment; there is no ‘enable anomaly detection for this metric’ configuration step. Most vendors require that step because their pricing makes it expensive to score every metric. Netdata’s per-node model includes scoring as part of the agent’s job.
