Costa Tsaousis presented “Practical AI with Machine Learning for Observability in Netdata” at Conf42 Cloud Native 2024 (virtual, March 2024). The talk was a technical walkthrough of how Netdata applies unsupervised machine learning to metrics – not as a feature checkbox, but as a way to surface problems that static thresholds miss.
The key insight Costa presented: individual anomalies on individual metrics are often noise. A CPU spike on one node, a latency bump on one service – these happen constantly and usually mean little on their own. But when anomalies converge across multiple metrics and services at the same time, that convergence is a strong signal that something unusual is actually happening. As he put it: “The power of ML becomes evident when seemingly noisy anomalies converge across various services, serving as indicators of something exceedingly unusual.”
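To make the convergence idea concrete, here is a minimal Python sketch. It assumes each metric has already been reduced to a series of 0/1 anomaly flags, one per time step. The function names, the metric names, and the 50% cutoff are illustrative assumptions, not Netdata's actual implementation or defaults.

```python
from typing import Dict, List


def anomaly_rate(bits_by_metric: Dict[str, List[int]]) -> List[float]:
    """Fraction of metrics flagged anomalous at each time step.

    Assumes every metric has the same number of time steps.
    """
    series = list(bits_by_metric.values())
    n_steps = len(series[0])
    return [sum(s[t] for s in series) / len(series) for t in range(n_steps)]


def converged_steps(bits_by_metric: Dict[str, List[int]],
                    min_rate: float = 0.5) -> List[int]:
    """Time steps where at least `min_rate` of metrics are anomalous together."""
    rates = anomaly_rate(bits_by_metric)
    return [t for t, rate in enumerate(rates) if rate >= min_rate]


# Hypothetical anomaly flags: isolated blips at steps 0, 1 and 3,
# and one step (4) where every metric is anomalous at once.
bits = {
    "node1.cpu":     [0, 1, 0, 0, 1, 0],
    "svc_a.latency": [0, 0, 0, 1, 1, 0],
    "svc_b.errors":  [0, 0, 0, 0, 1, 0],
    "node2.disk_io": [1, 0, 0, 0, 1, 0],
}

print(anomaly_rate(bits))
print(converged_steps(bits))
```

The isolated flags each touch only a quarter of the metrics and are ignored. Only the step where all four fire together crosses the cutoff.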
Netdata runs ML models at the edge, on each node, without requiring labeled training data or a centralized ML backend. Each model learns what “normal” looks like for its specific metric and flags deviations. The interesting part is not any single model’s output – it is the correlation layer that watches for coordinated anomalies across the infrastructure.
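As a rough illustration of a per-metric model that learns “normal” without labels, the sketch below uses a rolling z-score. This is a deliberately simple stand-in chosen for brevity, not the algorithm Netdata ships. The class name `MetricModel` and its parameters (`window`, `threshold`, `min_samples`) are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


class MetricModel:
    """Learns a rolling baseline for one metric and flags deviations.

    Stand-in technique: rolling z-score over the last `window` samples.
    """

    def __init__(self, window: int = 60, threshold: float = 3.0,
                 min_samples: int = 10) -> None:
        self.history = deque(maxlen=window)  # recent values only
        self.threshold = threshold           # allowed deviation, in std devs
        self.min_samples = min_samples       # warm-up before flagging

    def observe(self, value: float) -> bool:
        """Return True if `value` deviates from the learned baseline."""
        anomalous = False
        if len(self.history) >= self.min_samples:
            mu = mean(self.history)
            sigma = pstdev(self.history)
            if sigma > 0:
                anomalous = abs(value - mu) > self.threshold * sigma
            else:
                # Perfectly flat baseline: any change counts as a deviation.
                anomalous = value != mu
        self.history.append(value)
        return anomalous


# One model per metric, fed locally on the node that produces the data.
model = MetricModel(window=30, min_samples=10)
values = [50.0 + (i % 3) for i in range(30)] + [95.0]  # steady, then a jump
flags = [model.observe(v) for v in values]
print(flags[-1], any(flags[:-1]))
```

Each instance keeps only a short window of its own metric's history, so it can run on the node itself with no labeled data and no central service. The 0/1 flags it emits are the kind of per-metric signal a correlation layer would then watch for coordinated anomalies.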