Netdata Now Troubleshoots Your Alerts for You

Your AI Co-Engineer for Alert Investigation

The 2 AM pager alert. For anyone in Ops, SRE, or IT administration, those words trigger a familiar sense of dread. An alert has fired. Is it a real fire, or another false alarm waking you from a dead sleep? The pressure is on. Every minute of downtime costs money and reputation, but troubleshooting a complex system when you’re sleep-deprived is a Herculean task.

This cycle is a massive drain on engineering resources. The daily grind of sifting through alerts, trying to distinguish signal from noise, and manually correlating metrics to find a root cause consumes countless hours. This constant firefighting leads to alert fatigue, where even critical notifications start to get ignored. The core questions are always the same: Is this a real problem? What is the potential impact? Why did this trigger? What do I do next? Answering them is a slow, manual, and often stressful process.

At Netdata, we’ve always focused on reducing this burden. Our platform ships with hundreds of preconfigured alerts that give you immediate coverage with zero configuration. But we know, and have heard from many of you, that we could do more. Troubleshooting alerts, understanding what is really going on, and deciding what to do about it were still major pain points for our users.

Until now.

Your New AI Co-Engineer for Troubleshooting

We’re excited to introduce the next step in our AI journey: Automated Alert Troubleshooting.

Think of it as your new AI co-engineer. Its job is to perform the initial, time-consuming investigation for you. When an alert triggers, you no longer start your analysis from zero. Instead, you begin with a comprehensive report that has already done the heavy lifting.

With the click of a button, Netdata’s AI will:

  • Analyze the Alert: It examines the alert’s history and the behavior of the underlying metrics to determine if it looks like a transient false positive or a legitimate issue.

  • Identify Correlated Factors: The AI engine scans thousands of other metrics and log patterns across your infrastructure to find what else was behaving abnormally at the same time. This helps uncover hidden dependencies and potential root causes that are incredibly difficult for a human to spot.

  • Provide a Root Cause Hypothesis: Based on its analysis, the AI provides a summary of its findings and suggests a likely root cause, pointing you directly to the metrics or dimensions that matter most.

This means you can skip the grunt work of data collection and correlation. You can jump straight to validation and resolution, armed with a powerful baseline analysis.
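
To make the correlation step more concrete, here is a minimal Python sketch of one way to rank metrics by how abnormally they behaved during an alert window. It is an illustration only, not Netdata’s implementation: the z-score method, the function name rank_correlated_metrics, the threshold, and the sample metric names and values are all assumptions made for this example, and it does not cover log patterns.

```python
from statistics import mean, stdev

def rank_correlated_metrics(series, window_start, window_end, min_score=3.0):
    """Rank metrics by how far the alert window deviates from their baseline.

    series: dict mapping a metric name to a list of samples, one per time step.
    window_start, window_end: slice indices of the alert window (end exclusive).
    Hypothetical helper for illustration; not part of any Netdata API.
    """
    ranked = []
    for name, values in series.items():
        baseline = values[:window_start]
        window = values[window_start:window_end]
        if len(baseline) < 2 or not window:
            continue  # not enough data to compare
        sigma = stdev(baseline)
        if sigma == 0:
            continue  # flat baseline: the score is undefined in this simple sketch
        # Distance between window mean and baseline mean, in baseline standard deviations
        score = abs(mean(window) - mean(baseline)) / sigma
        if score >= min_score:
            ranked.append((name, round(score, 1)))
    # Most abnormal metrics first
    return sorted(ranked, key=lambda item: item[1], reverse=True)

# Made-up samples: the last three points are the alert window
samples = {
    "disk.io_wait": [2, 3, 2, 3, 2, 3, 20, 22, 21],
    "net.packets": [100, 102, 98, 101, 99, 100, 101, 99, 100],
    "mem.swap_in": [0, 1, 0, 1, 0, 1, 9, 10, 11],
}
suspects = rank_correlated_metrics(samples, window_start=6, window_end=9)
print(suspects)
```

In this toy data, the two metrics that jump during the alert window are flagged, while the steady one is filtered out. A real system would need to handle seasonality, noisy baselines, and far larger metric counts.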

Putting Your AI Co-Engineer to Work

We’ve integrated this new capability directly into your existing workflows, making it simple to access whenever you need it. You can trigger an AI troubleshooting investigation in several ways:

  1. From the Alerts Tab: When viewing your active or recent alerts, you’ll see a new “Ask AI” button next to each one. Click it, and Netdata will begin the investigation.

  2. From the Insights Tab: Navigate to the “Insights” tab and select “Alert Troubleshooting” from the investigations section. From there, you can choose any recent alert from the dropdown menu to begin your analysis.

  3. From Your Alert Notifications: When you receive an alert notification via email, you will now see an option to “Troubleshoot with AI”. Clicking this link will automatically kick off the investigation for that specific alert.

An analysis typically takes one to two minutes to complete. Once ready, the report will appear in your Alerts tab and be saved in the Insights tab under “Investigations”. You’ll also receive an email notification letting you know your report is ready for review.

Coming Soon: We are already working on the ability to have the AI co-engineer work proactively. Soon, you will be able to schedule automatic troubleshooting for specific alerts. Any time a new alert of that type fires, an investigation will be run automatically, ensuring the analysis is waiting for you the moment you see the notification.

These capabilities are the first of many powered by our new AI co-engineer. To learn more about the core technology and our vision for a proactive future that goes beyond reactive troubleshooting, read our introduction to the Netdata Insights platform.

The First Step on a New Journey

This is the first iteration of our Automated Alert Troubleshooting feature. Our team is working continuously to refine the AI models, improve the quality and depth of the analysis, and expand its capabilities. You can expect to see significant enhancements and new functionality in the near future as we teach our AI co-engineer to handle even more complex scenarios.

How to Get Access

Automated Alert Troubleshooting is now available in preview mode.

  • All users on a Business plan will get 10 AI troubleshooting sessions per month during their Business plan trial.
  • All newly signed-up users will get 10 AI troubleshooting sessions as part of their free trial.
  • Community users who want to try the feature can request access by contacting us at product@netdata.cloud.

Soon, we will also offer credit bundles, allowing you to purchase troubleshooting sessions and other AI-powered insights on demand, all at the cost-optimal pricing you expect from Netdata.

Stop wasting time on the initial drudgery of alert investigation. Let your AI co-engineer handle the first pass and get back your most valuable resource: time to build, innovate, and solve real problems.

Try it today in Netdata Cloud →

P.S. We know that customizing alerts and creating new ones could be much simpler and smoother, and we are tackling that problem head-on as well. In the meantime, automated troubleshooting can hopefully take the edge off your alert fatigue.