Netdata versus Datadog: root cause analysis with metric correlations

Sep 16, 2020 | Blog, Product

When an incident strikes, every minute spent on root cause analysis delays resolution, and the real-world consequences can be dire. Troubleshooting an event requires a certain data set: every metric, at the greatest granularity, in one place, available in real time. Limits on the number or type of metrics, collection frequency, or time to visualization can mean the difference between timely resolution and unacceptable losses in time, money, and productivity.

Netdata was built to address the specific problem of making that data set available with no pre-planning and zero configuration, so you have every metric at per-second or better granularity across your entire infrastructure in real time in Netdata Cloud. Even with the right data set available, however, we still have the problem of manually identifying and analyzing the metrics affected by an incident and, more importantly, the metrics that will lead us to the incident’s root cause.

Introducing Metric Correlations

To address this problem, we have just introduced our first Insights feature in Netdata Cloud, Metric Correlations. Metric Correlations is an automated analytics tool that assesses all available metrics to find relevant correlations for a given time period. For a deep dive into the feature, be sure to review our blog post about the philosophy behind how it was built.
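
To make the idea concrete, here is a minimal sketch of one way a tool could score metrics against a highlighted time window. It is not taken from Netdata's implementation: the two-sample Kolmogorov-Smirnov statistic is our assumption for the scoring method, and the metric names, the synthetic data, and the functions `ks_statistic` and `rank_metrics` are all hypothetical.

```python
from bisect import bisect_right
import random


def ks_statistic(baseline, window):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples (0 = identical, 1 = disjoint)."""
    a, b = sorted(baseline), sorted(window)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def rank_metrics(metrics, start, end):
    """Score every metric by how much its values inside [start, end)
    differ from its values before `start`, highest score first."""
    scores = {
        name: ks_statistic(values[:start], values[start:end])
        for name, values in metrics.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Synthetic per-second data (hypothetical metric names): a 300 s baseline,
# then a 100 s window in which only the network-related metrics move.
rng = random.Random(0)
metrics = {
    "net.sent_kbps": [rng.gauss(50, 5) for _ in range(300)]
    + [rng.gauss(900, 50) for _ in range(100)],
    "nginx.requests": [rng.gauss(10, 2) for _ in range(300)]
    + [rng.gauss(250, 20) for _ in range(100)],
    "disk.util_pct": [rng.gauss(20, 3) for _ in range(400)],
}

ranking = rank_metrics(metrics, start=300, end=400)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In this toy data the two metrics that change during the window score far higher than the one that does not, which is the kind of ranking that lets you discard most metrics before looking at any chart.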

The primary impetus behind Metric Correlations is to reduce mean time to resolution (MTTR). While Netdata is already known and loved for providing every metric at per-second intervals in real time, sifting through thousands of metrics to find what is relevant is a massive time sink when every minute counts. In our tests, it took our team about 30 minutes to go through the available data. With Metric Correlations, that time decreased by an order of magnitude to just 3 minutes. Let’s take a closer look at how it works and how it compares to another monitoring solution, Datadog.

A simulation: troubleshooting a network traffic incident

As part of our research when building this feature, we reviewed how metric correlations are handled in other monitoring solutions. In the interest of full disclosure, we already knew that Netdata had some key advantages. First and foremost, Netdata collects and meaningfully presents every metric at per-second intervals in real time, something unavailable in competing solutions that centralize metrics rather than using a distributed data architecture like Netdata's. This means that real-time troubleshooting with those solutions can be challenging. No surprise there, since this was one of the things that spurred us on to build Netdata in the first place.

To demonstrate some of the key differences between how metric correlations work in Netdata versus other products, we set up a lab test to simulate a network traffic incident, then completed troubleshooting steps with both Netdata and Datadog.

The methodology

A primary virtual machine¹ (VM1) hosted an nginx web server, a Netdata agent (v1.24), and a Datadog agent (v7). Four secondary VMs, located in different geographic locations, generated large volumes of network requests to the primary virtual machine using siege². As a result, outbound network traffic was created on VM1 for about 100 seconds every 5 minutes.

Outbound network traffic on the Netdata Agent running on primary virtual machine (VM1)

¹ The primary virtual machine was an AWS t2.xlarge, which features 16 GiB of memory and 4 vCPUs. t2 instances are backed by Intel Xeon processors with clock speeds up to 3.3 GHz.
² */5 * * * * /usr/local/bin/siege -c250 -t100S XX.XX.XXX.XX:XX > /dev/null 2>&1
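
The load pattern produced by that schedule can be sketched in a few lines. This is an idealized model under stated assumptions: it takes the 100-second run time and 5-minute period from the siege command, ignores startup and shutdown time, and assumes the first sample lines up with the start of the burst. The function names are hypothetical, and the 1-second and 15-second intervals are the collection frequencies reported in the summary of key differences.

```python
BURST_SECONDS = 100    # siege runs for 100 s (-t100S)
PERIOD_SECONDS = 300   # cron fires every 5 minutes (*/5)


def load_active(t):
    """True if siege is running `t` seconds after a cron tick at t=0."""
    return t % PERIOD_SECONDS < BURST_SECONDS


def samples_in_burst(interval):
    """Number of collection timestamps that land inside one burst,
    assuming the first sample is aligned with the start of the burst."""
    return len(range(0, BURST_SECONDS, interval))


duty_cycle = BURST_SECONDS / PERIOD_SECONDS
print(f"duty cycle: {duty_cycle:.0%}")
print("samples per burst at 1 s:", samples_in_burst(1))
print("samples per burst at 15 s:", samples_in_burst(15))
```

Under these assumptions, a 1-second collector sees 100 data points per burst, while a 15-second collector sees 7.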

The monitoring and troubleshooting experience

We started with Datadog. The zero-configuration dashboard consisted of ~8 charts and ~30 metrics. The metric correlation computation took into account only the metrics presented in the default dashboard (i.e. ~30), not all of the metrics that the Datadog Agent was collecting.

Datadog did not auto-detect applications, leaving nginx invisible as an application unless the user configures it. While the increase in received packets was visible at the system level, when we ran the metric correlations feature, it could correlate only 1 metric. That metric is self-evident: it shows the number of sent packets increasing in response to the number of request packets. To go deeper, we would need to know which specific metrics to dive into and manually set up dashboards to see them.

Netdata metric correlations work out of the box, without training, running against all 2,000+ real-time metrics, and presenting meaningful results in seconds.

Using Netdata and its metric correlations feature, we can quickly reduce the number of metrics to analyze from 2,000+ to roughly 100 (123 in our initial result). Of those, the only correlated application chart is nginx. We can then verify that the increased network traffic originates from the nginx server. It is worth mentioning that Netdata auto-detects every application for which there is a collector, so the Agent can start collecting data immediately with zero configuration required.

Here is a summary of the key differences we saw in our tests:

| | Netdata | Datadog |
|---|---|---|
| Input: number of metrics analysed that are available with zero configuration | ~2,000 metrics, ~600 charts | ~30 metrics, ~8 charts |
| Time to calculate results | 10-15 seconds for ~2,000 metrics | 10 seconds for 30 metrics |
| Output: number of metrics identified as correlated (initial result) | 123 | 1 |
| Metric collection frequency | 1 second | 15 seconds |
| Time to train the metric correlation model | Zero | Variable, from hours to days |
| Was nginx identified as a correlated metric? | Yes | No |
| Multi-node correlation | No | Yes |
| Price | Free | Datadog Enterprise plan at $23 per node per month, with a minimum of 100 nodes, is $27,600 per year |
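
For readers who want to check the price figure, the arithmetic is simply the per-node monthly price times the minimum node count times twelve months. The constant names below are ours; the numbers come from the price row above.

```python
PRICE_PER_NODE_PER_MONTH = 23   # USD, from the price row
MINIMUM_NODES = 100             # minimum commitment, from the price row
MONTHS_PER_YEAR = 12

annual_minimum = PRICE_PER_NODE_PER_MONTH * MINIMUM_NODES * MONTHS_PER_YEAR
print(f"${annual_minimum:,} per year")  # $27,600 per year
```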

The idea behind this lab test was to show Netdata's novel approach to real-time troubleshooting, which is our core focus. Netdata Cloud offers thousands of metrics out of the box that can be analyzed in a few seconds in response to an anomaly, incident, or outage. Datadog offers cross-node correlations, but requires considerably more configuration to surface similar insights and comes at considerable cost.

Ready to try it for yourself? Visit Netdata Cloud to create a free account and get started.

Not ready to jump in yet? Learn more about what went into building metric correlations or join the discussion on our Community forums.