Save Hours on Troubleshooting with Automated Investigations

Ask Netdata Anything, Get an Expert Analysis in Minutes
by Shyam Sreevalsan · August 4, 2025

How many times has your team stared at a dashboard, pointed to a spike, and asked a question that charts alone can’t answer? “What was the real impact of that deployment?” “Why are our Kubernetes pods in the us-east-1 cluster suddenly crashing?” “Are we wasting money on overprovisioned servers?”

Answering these questions is the real work of operations and SRE. It often kicks off a time-consuming scramble, sending engineers down rabbit holes for hours, days, or even weeks. You dig through logs, correlate metrics across services, and piece together clues from Slack conversations and Jira tickets.

What if you could skip the manual scramble? What if you could just… ask?

We’re thrilled to introduce a new way to interact with your systems: Netdata Investigations. This isn’t another dashboard or a rigid query language. It’s a conversational interface to your entire infrastructure, powered by your Co-SRE, which does the deep-dive analysis for you.

Netdata AI: Your Co-SRE, Ready for Any Task

With Netdata Investigations, you can ask open-ended questions and receive a deeply researched report in minutes. Your Co-SRE uses the real-time, high-fidelity data that Netdata already collects from your servers, VMs, and applications to find an answer.

The key to a powerful investigation is context. Think about how you’d assign a task to a human teammate. You wouldn’t just say, “The site is slow.” You’d provide details from the support ticket, mention the recent deployment, and share your initial hypothesis. It’s the same with Netdata. The more context you provide (for example, by pasting in details from a GitHub discussion, a Jira ticket, or a Slack thread), the more insightful the report will be.

Investigations aren’t just for when things are on fire. You can use them to:

Troubleshoot a live incident: Instead of tying up your whole team chasing a theory, you can delegate multiple investigations to Netdata in parallel and get results back in minutes.

Analyze changes: Understand the performance impact of a new software release or a change in infrastructure configuration.

Optimize performance and cost: Find underutilized resources or identify bottlenecks before they impact users.

Explore trends: Get a summary of how system behavior has changed over the last week or month.

If you’re curious about how your servers, Kubernetes clusters, and applications are doing, you can now just ask them.

The Art of the Investigation: Examples

To get the best results, you need to provide good context. Here are a few examples that show how you can frame your requests for different scenarios.

Example 1: Troubleshooting a Problem

Your checkout service is failing. Instead of manually correlating pod restarts with deployment logs, you can ask Netdata to do it.

Your Request: Why are my checkout-service pods crashing repeatedly?

Your Context:

- Started after: deployment at 14:00 UTC of version 2.3.1
- Impact: Customer checkout failures, lost revenue ~$X/hour
- Recent changes: Updated payment gateway integration, increased worker threads from 10 to 20
- Error pattern in logs: "connection refused to payment-service:8080", "Java heap space"
- Environment: production / eks-prod-us-east-1
- Related services: payment-service, inventory-service, redis-session-store

Netdata will analyze metrics and logs from the checkout-service, correlate them with the behavior of payment-service around the deployment time, and connect the dots between the increased thread count and the heap space errors.
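If you collect incident details in a script or runbook, a small helper can keep the request and its context in the same layout as the example above before you paste it into the investigation panel. This is only an illustrative sketch: `InvestigationRequest` and `render` are hypothetical names, not part of any Netdata API, and the output is plain text meant to be pasted by hand.

```python
from dataclasses import dataclass, field


@dataclass
class InvestigationRequest:
    """Hypothetical container for a question plus its supporting context."""

    question: str
    context: dict[str, str] = field(default_factory=dict)

    def render(self) -> str:
        # One "- key: value" line per context item, mirroring the example layout.
        lines = [f"Your Request: {self.question}", "", "Your Context:", ""]
        lines += [f"- {key}: {value}" for key, value in self.context.items()]
        return "\n".join(lines)


request = InvestigationRequest(
    question="Why are my checkout-service pods crashing repeatedly?",
    context={
        "Started after": "deployment at 14:00 UTC of version 2.3.1",
        "Environment": "production / eks-prod-us-east-1",
        "Related services": "payment-service, inventory-service, redis-session-store",
    },
)

print(request.render())
```

Keeping the context as key/value pairs makes it easy to spot a missing detail, such as the deployment time, before the investigation starts.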

Example 2: Analyzing a Change

You just deployed a new authentication service and users are reporting strange behavior. You want to see the before-and-after picture.

Your Request: Compare system metrics before and after the recent user-authentication-service deployment.

Your Context:

- Service: user-authentication-service v2.2.0
- Deployed: 2025-01-24 09:00 UTC
- Changes: Switched from JWT to Redis sessions, added Argon2 password hashing
- Specific concerns: Users reporting intermittent logouts, suspicious increase in redis_connected_clients
- Time windows: 24h before deployment vs 24h after

The AI will generate a comparative report, highlighting changes in CPU, memory, network traffic, and Redis metrics, giving you a clear picture of the deployment’s impact.
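Stating the time windows precisely removes ambiguity from a before-and-after request. As a minimal sketch, assuming the deployment timestamp from the example and a hypothetical helper named `comparison_windows`, the two 24-hour windows can be derived like this:

```python
from datetime import datetime, timedelta, timezone


def comparison_windows(deployed_at, span=timedelta(hours=24)):
    """Return ((before_start, before_end), (after_start, after_end))."""
    return (deployed_at - span, deployed_at), (deployed_at, deployed_at + span)


# Deployment time taken from the example context above (UTC).
deployed = datetime(2025, 1, 24, 9, 0, tzinfo=timezone.utc)
before, after = comparison_windows(deployed)

print("before:", before[0].isoformat(), "to", before[1].isoformat())
print("after: ", after[0].isoformat(), "to", after[1].isoformat())
```

Using timezone-aware timestamps avoids confusion when teammates in different regions read the same report.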

Example 3: Optimizing for Cost

Your cloud bill is climbing. You suspect you’re overprovisioned, but need data to prove it.

Your Request: Identify underutilized nodes for cost optimization.

Your Context:

- Monthly AWS bill: $12K for compute
- Environment: Mixed workloads (prod + staging on same cluster)
- Known issues: Dev environments run 24/7, batch processing nodes idle 20h/day
- Goal: Find $2-3K/month in savings without impacting reliability

The investigation will analyze CPU, memory, and network utilization across your nodes, identifying candidates for downsizing or consolidation and quantifying the potential savings.
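To make "underutilized" concrete, here is a rough sketch of the kind of check such an analysis implies. It is not how Netdata performs the investigation: the node names, the hourly CPU samples, and the 20% threshold are all made-up assumptions for illustration.

```python
from statistics import mean

# Hypothetical hourly CPU utilization samples (percent) over one day.
samples = {
    "batch-node-1": [2.0] * 20 + [85.0] * 4,  # idle 20h/day, busy for 4h
    "prod-node-1": [55.0] * 24,
    "dev-node-1": [5.0] * 24,
}


def underutilized(cpu_by_node, threshold=20.0):
    """Return node names whose average CPU utilization is below the threshold."""
    return sorted(
        node for node, values in cpu_by_node.items() if mean(values) < threshold
    )


candidates = underutilized(samples)
print(candidates)
```

A daily average hides the batch node's four busy hours, so a candidate list like this is a starting point for review, not a decision to downsize.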

While Custom Investigations offer the ultimate flexibility for exploring complex issues, we know that much of an engineer’s time is spent reacting to the daily stream of notifications. For that, you can rely on their counterpart, Automated Alert Troubleshooting, which applies the same powerful AI analysis to triage and root-cause your alerts with a single click, freeing you up to focus on the bigger picture.

These new investigative tools are just the beginning of what’s possible with Netdata AI: Your Co-SRE. To understand the foundational platform that powers them and our broader vision for AI in observability, see our original post introducing Netdata Insights.

How to Start Your First Investigation

We’ve made it simple to launch an investigation from anywhere in Netdata.

From Anywhere in the UI: Click the new “Troubleshoot with AI” button in the top right corner of your screen. This will open the investigation panel and automatically capture the context of what you’re currently viewing (e.g., the specific chart, dashboard, or service). Add your question and any extra context, then start the investigation.

From the Insights Tab: Navigate to the “Insights” tab and click “New Investigation.” This gives you a blank canvas to start any kind of custom investigation you need.

Reports take about two minutes to generate and will appear in the Insights tab. You’ll also get an email as soon as your report is ready.

Coming Soon: We’re already working on the next evolution of this feature, including the ability to schedule recurring investigations. Imagine creating a custom SLO report template and having it run automatically every Monday morning, or scheduling a weekly cost-optimization analysis.

We’re Just Getting Started

This is our first release of Netdata Investigations, and we are obsessively focused on improving the quality and depth of the analysis. Expect to see new capabilities and even more powerful insights soon.

This functionality is now available in preview mode.

- All users on a Business plan will get 10 AI investigation sessions per month.
- All newly signed-up users will get 10 AI investigation sessions as part of their free trial.
- Community users who want to try the feature can request access by contacting us at [email protected]

We will also be introducing credit bundles soon, allowing you to purchase investigation sessions on-demand to meet your needs, always at a cost-effective price.

Stop chasing ghosts in your data. Start asking questions and get the answers you need to move faster and build more reliable systems.

Try it today in Netdata Cloud →