Test Driving Machine Learning (ML) Anomaly Advisor

Netdata’s new Anomaly Advisor feature lets you quickly identify potentially anomalous metrics during a particular timeline of interest. This considerably speeds up your troubleshooting workflow and saves valuable time when you are trying to root cause an outage or issue.

Anomaly Advisor uses machine learning to detect if any one of the thousands of metrics that Netdata monitors is behaving anomalously. Thousands of machine learning models (one per metric) are trained at the edge on the Netdata agent running in your system – preserving privacy by not storing your metric data on our servers. And as always, the Netdata agent is incredibly lightweight, even considering the ML training and inference that are required by Anomaly Advisor. To read more about how Netdata does anomaly detection, you can head on over to our docs or reach out to us on our Community Discord.
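If you want to try this against your own nodes, remember that the training happens on each agent. On recent agent versions it is controlled from the [ml] section of netdata.conf; depending on your version it may already be enabled by default. A minimal sketch:

```
# /etc/netdata/netdata.conf – open it from your Netdata config directory
# with `sudo ./edit-config netdata.conf`; the [ml] section controls
# on-agent training and inference
[ml]
    enabled = yes
```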

We always ‘dogfood’ our new features extensively within Netdata; this means we use our product ourselves to monitor and troubleshoot our own production servers, staging labs and home labs. Anomaly Advisor is no exception and has been tested in a variety of environments by a diverse set of users.

Costa Tsaousis, Netdata’s founder and CEO, was one of our first alpha testers and got his hands on an early build of the feature. Costa ran Anomaly Advisor on his cluster of Raspberry Pi nodes and almost immediately stumbled upon a real-world bug in Raspbian (the Linux distro used by the Raspberry Pi)! It was a pleasant surprise to see the feature fulfilling its purpose, even during alpha testing.

So let’s dive a little deeper into the bug that was identified. Writing to /dev/null uses up a LOT of CPU on Raspbian, way more than you would expect. In fact, it eats up half a core just to do this.

You can reproduce this yourself if you have a Raspberry Pi, by running a command along these lines:
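```bash
# sketch of the reproduction (the exact command may have differed):
# repeatedly redirect an empty echo to /dev/null
while true; do echo -n "" > /dev/null; sleep 0.1; done
```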

On Netdata’s new anomalies tab you will see a spike in anomaly rate that corresponds to running the above command:

Highlighting this area of the chart will bring up all of the metrics that were anomalous during that time. You may see metrics that you can discount, such as SSH-related metrics if you have only just logged into the device to run this test.

What’s interesting is that you’ll also see charts related to the X server – System CPU, CPU and Logical reads. This is a clue that puts us on the right troubleshooting track.

Expanding one of these charts – the System CPU chart, for example – tells us that yes, just running a while loop writing nothing to /dev/null is consuming nearly 50% CPU.

Using these clues, going back to the terminal and running top points us to the process "pipewire-media", which is the CPU hog while the command is running. Netdata cannot, at the moment, point you to this process from within the chart.
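If you prefer a one-shot snapshot over watching top interactively, a plain ps sorted by CPU usage (standard procps, nothing Netdata-specific) surfaces the same culprit:

```bash
# list the heaviest CPU consumers first
ps -eo pid,comm,%cpu --sort=-%cpu | head
```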

What is pipewire-media? And why does Netdata report it as an X server metric? Let’s run “ps fax” to find out more. 
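The f flag asks ps for a "forest" view, so each process is shown indented under its parent – exactly what we need to see where pipewire-media hangs in the tree:

```bash
# a and x together list every process (other users' and tty-less ones);
# f draws the parent/child tree
ps fax
```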

You can see that pipewire-media is in the same process tree as systemd, and opening up the Netdata apps groups configuration shows that systemd is reported under the X app. Currently Netdata does not take this extra step of automatically identifying the process concerned; this is something we are still working on, to make this troubleshooting process even simpler and quicker.
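That mapping comes from apps_groups.conf, where each line assigns process name patterns to a group. The excerpt below is illustrative rather than the exact stock file, but it shows the kind of entry involved:

```
# /etc/netdata/apps_groups.conf (illustrative – the actual patterns vary
# between versions; edit it with `sudo ./edit-config apps_groups.conf`)
# format:  group_name: pattern pattern pattern
X: X Xorg xinit lightdm
```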

Now, what happens if we remove the IO action of sending to /dev/null and stick with the same infinite loop? Something along these lines:
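```bash
# sketch: the same paced loop, with the write to /dev/null removed
while true; do echo -n ""; sleep 0.1; done
```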

This time there’s no CPU spike, and no anomalies triggered in Anomaly Advisor either – so the problem can clearly be localized to the IO event.

So, in conclusion, there is a pretty severe bug in Raspbian that consumes up to half a core of CPU just by sending emptiness to /dev/null, and Netdata’s Anomaly Advisor helped us root cause this problem in a few minutes – instead of potentially spending hours or days figuring out why the CPU runs hot and which app, script or process causes it to do so.