The only agent that thinks for itself

Autonomous Monitoring with self-learning AI built-in, operating independently across your entire stack.

Unlimited Metrics & Logs
Machine learning & MCP
5% CPU, 150MB RAM
3GB disk, >1 year retention
800+ integrations, zero config
Dashboards, alerts out of the box
> Discover Netdata Agents
Centralized metrics streaming and storage

Aggregate metrics from multiple agents into centralized Parent nodes for unified monitoring across your infrastructure.

Stream from unlimited agents
Long-term data retention
High availability clustering
Data replication & backup
Scalable architecture
Enterprise-grade security
> Learn about Parents
Fully managed cloud platform

Access your monitoring data from anywhere with our SaaS platform. No infrastructure to manage, automatic updates, and global availability.

Zero infrastructure management
99.9% uptime SLA
Global data centers
Automatic updates & patches
Enterprise SSO & RBAC
SOC2 & ISO certified
> Explore Netdata Cloud
Deploy Netdata Cloud in your infrastructure

Run the full Netdata Cloud platform on-premises for complete data sovereignty and compliance with your security policies.

Complete data sovereignty
Air-gapped deployment
Custom compliance controls
Private network integration
Dedicated support team
Kubernetes & Docker support
> Learn about Cloud On-Premises
Powerful, intuitive monitoring interface

Modern, responsive UI built for real-time troubleshooting with customizable dashboards and advanced visualization capabilities.

Real-time chart updates
Customizable dashboards
Dark & light themes
Advanced filtering & search
Responsive on all devices
Collaboration features
> Explore Netdata UI
Monitor on the go

Native iOS and Android apps bring full monitoring capabilities to your mobile device with real-time alerts and notifications.

iOS & Android apps
Push notifications
Touch-optimized interface
Offline data access
Biometric authentication
Widget support
> Download apps

Best energy efficiency

True real-time per-second

100% automated zero config

Centralized observability

Multi-year retention

High availability built-in

Zero maintenance

Always up-to-date

Enterprise security

Complete data control

Air-gap ready

Compliance certified

Millisecond responsiveness

Infinite zoom & pan

Works on any device

Native performance

Instant alerts

Monitor anywhere

80% Faster Incident Resolution
AI-powered troubleshooting from detection, to root cause and blast radius identification, to reporting.
True Real-Time and Simple, even at Scale
Linearly and infinitely scalable full-stack observability that can be deployed even mid-crisis.
90% Cost Reduction, Full Fidelity
Instead of centralizing the data, Netdata distributes the code, eliminating pipelines and complexity.
Control Without Surrender
SOC 2 Type 2 certified with every metric kept on your infrastructure.
Integrations

800+ collectors and notification channels, auto-discovered and ready out of the box.

800+ data collectors
Auto-discovery & zero config
Cloud, infra, app protocols
Notifications out of the box
> Explore integrations
Real Results
46% Cost Reduction

Reduced monitoring costs by 46% while cutting staff overhead by 67%.

— Leonardo Antunez, Codyas

Zero Pipeline

No data shipping. No central storage costs. Query at the edge.

From Our Users
"Out-of-the-Box"

So many out-of-the-box features! I mostly don't have to develop anything.

— Simon Beginn, LANCOM Systems

No Query Language

Point-and-click troubleshooting. No PromQL, no LogQL, no learning curve.

Enterprise Ready
67% Less Staff, 46% Cost Cut

Enterprise efficiency without enterprise complexity—real ROI from day one.

— Leonardo Antunez, Codyas

SOC 2 Type 2 Certified

Zero data egress. Only metadata reaches the cloud. Your metrics stay on your infrastructure.

Full Coverage
800+ Collectors

Auto-discovered and configured. No manual setup required.

Any Notification Channel

Slack, PagerDuty, Teams, email, webhooks—all built-in.

Built for the People Who Get Paged
Because 3am alerts deserve instant answers, not hour-long hunts.
Every Industry Has Rules. We Master Them.
See how healthcare, finance, and government teams cut monitoring costs 90% while staying audit-ready.
Monitor Any Technology. Configure Nothing.
Install the agent. It already knows your stack.
From Our Users
"A Rare Unicorn"

Netdata gives more than you invest in it. A rare unicorn that obeys the Pareto rule.

— Eduard Porquet Mateu, TMB Barcelona

99% Downtime Reduction

Reduced website downtime by 99% and cloud bill by 30% using Netdata alerts.

— Falkland Islands Government

Real Savings
30% Cloud Cost Reduction

Optimized resource allocation based on Netdata alerts cut cloud spending by 30%.

— Falkland Islands Government

46% Cost Cut

Reduced monitoring staff by 67% while cutting operational costs by 46%.

— Codyas

Real Coverage
"Plugin for Everything"

Netdata has agent capacity or a plugin for everything, including Windows and Kubernetes.

— Eduard Porquet Mateu, TMB Barcelona

"Out-of-the-Box"

So many out-of-the-box features! I mostly don't have to develop anything.

— Simon Beginn, LANCOM Systems

Real Speed
Troubleshooting in 30 Seconds

From 2-3 minutes to 30 seconds—instant visibility into any node issue.

— Matthew Artist, Nodecraft

20% Downtime Reduction

20% less downtime and 40% budget optimization from out-of-the-box monitoring.

— Simon Beginn, LANCOM Systems

Pay per Node. Unlimited Everything Else.

One price per node. Unlimited metrics, logs, users, and retention. No per-GB surprises.

Free tier—forever
No metric limits or caps
Retention you control
Cancel anytime
> See pricing plans
What's Your Monitoring Really Costing You?

Most teams overpay by 40-60%. Let's find out why.

Expose hidden metric charges
Calculate tool consolidation
Customers report 30-67% savings
Results in under 60 seconds
> See what you're really paying
Your Infrastructure Is Unique. Let's Talk.

Because monitoring 10 nodes is different from monitoring 10,000.

On-prem & air-gapped deployment
Volume pricing & agreements
Architecture review for your scale
Compliance & security support
> Start a conversation
Monitoring That Sells Itself

Deploy in minutes. Impress clients in hours. Earn recurring revenue for years.

30-second live demos close deals
Zero config = zero support burden
Competitive margins & deal protection
Response in 48 hours
> Apply to partner
Per-Second Metrics at Homelab Prices

Same engine, same dashboards, same ML. Just priced for tinkerers.

Community: Free forever · 5 nodes · non-commercial
Homelab: $90/yr · unlimited nodes · fair usage
> Start monitoring your lab—free
$1,000 Per Referral. Unlimited Referrals.

Your colleagues get 10% off. You get 10% commission. Everyone wins.

10% of subscriptions, up to $1,000 each
Track earnings inside Netdata Cloud
PayPal/Venmo payouts in 3-4 weeks
No caps, no complexity
> Get your referral link
Cost Proof
40% Budget Optimization

"Netdata's significant positive impact" — LANCOM Systems

Calculate Your Savings

Compare vs Datadog, Grafana, Dynatrace

Savings Proof
46% Cost Reduction

"Cut costs by 46%, staff by 67%" — Codyas

30% Cloud Bill Savings

"Reduced cloud bill by 30%" — Falkland Islands Gov

Enterprise Proof
"Better Than Combined Alternatives"

"Better observability with Netdata than combining other tools." — TMB Barcelona

Real Engineers, <24h Response

DPA, SLAs, on-prem, volume pricing

Why Partners Win
Demo Live Infrastructure

One command, 30 seconds, real data—no sandbox needed

Zero Tickets, High Margins

Auto-config + per-node pricing = predictable profit

Homelab Ready
"Absolutely Incredible"

"We tested every monitoring system under the sun." — Benjamin Gabler, CEO Rocket.Net

76k+ GitHub Stars

3rd most starred monitoring project

Worth Recommending
Product That Delivers

Customers report 40-67% cost cuts, 99% downtime reduction

Zero Risk to Your Rep

Free tier lets them try before they buy

Never Fight Fires Alone

Docs, community, and expert help—pick your path to resolution.

Learn.netdata.cloud docs
Discord, Forums, GitHub
Premium support available
> Get answers now
60 Seconds to First Dashboard

One command to install. Zero config. 850+ integrations documented.

Linux, Windows, K8s, Docker
Auto-discovers your stack
> Read our documentation
See Netdata in Action

Watch real-time monitoring in action—demos, tutorials, and engineering deep dives.

Product demos and walkthroughs
Real infrastructure, not staged
> Start with the 3-minute tour
Level Up Your Monitoring
Real problems. Real solutions. 112+ guides from basic monitoring to AI observability.
76,000+ Engineers Strong
615+ contributors. 1.5M daily downloads. One mission: simplify observability.
Per-Second. 90% Cheaper. Data Stays Home.
Side-by-side comparisons: costs, real-time granularity, and data sovereignty for every major tool.

See why teams switch from Datadog, Prometheus, Grafana, and more.

> Browse all comparisons
Edge-Native Observability, Born Open Source
Per-second visibility, ML on every metric, and data that never leaves your infrastructure.
Founded in 2016
615+ contributors worldwide
Remote-first, engineering-driven
Open source first
> Read our story
Promises We Publish—and Prove
12 principles backed by open code, independent validation, and measurable outcomes.
Open source, peer-reviewed
Zero config, instant value
Data sovereignty by design
Aligned pricing, no surprises
> See all 12 principles
Edge-Native, AI-Ready, 100% Open
76k+ stars. Full ML, AI, and automation—GPLv3+, not premium add-ons.
76,000+ GitHub stars
GPLv3+ licensed forever
ML on every metric, included
Zero vendor lock-in
> Explore our open source
Build Real-Time Observability for the World
Remote-first team shipping per-second monitoring with ML on every metric.
Remote-first, fully distributed
Open source (76k+ stars)
Challenging technical problems
Your code on millions of systems
> See open roles
Talk to a Netdata Human in <24 Hours
Sales, partnerships, press, or professional services—real engineers, fast answers.
Discuss your observability needs
Pricing and volume discounts
Partnership opportunities
Media and press inquiries
> Book a conversation
Your Data. Your Rules.
On-prem data, cloud control plane, transparent terms.
Trust & Scale
76,000+ GitHub Stars

One of the most popular open-source monitoring projects

SOC 2 Type 2 Certified

Enterprise-grade security and compliance

Data Sovereignty

Your metrics stay on your infrastructure

Validated
University of Amsterdam

"Most energy-efficient monitoring solution" — ICSOC 2023, peer-reviewed

ADASTEC (Autonomous Driving)

"Doesn't miss alerts—mission-critical trust for safety software"

Community Stats
615+ Contributors

Global community improving monitoring for everyone

1.5M+ Downloads/Day

Trusted by teams worldwide

GPLv3+ Licensed

Free forever, fully open source agent

Why Join?
Remote-First

Work from anywhere, async-friendly culture

Impact at Scale

Your work helps millions of systems

Compliance
SOC 2 Type 2

Audited security controls

GDPR Ready

Data stays on your infrastructure

Blog

Data Collection Strategies for Infrastructure Monitoring – Troubleshooting Specifics

Tailoring Data Collection for Targeted Troubleshooting and Insights
by Alex Malkov · September 6, 2022

Monitoring and troubleshooting: unfortunately, these terms are still used interchangeably, which can lead to misunderstandings about data collection strategies.

In this article we aim to clarify some important definitions, processes, and common data collection strategies for monitoring solutions. We will spell out the limitations of each strategy, as well as the key benefits that can also serve troubleshooting needs.

IT infrastructure monitoring is a business process of collecting and analyzing data over a period of time to improve business results.

Troubleshooting is a form of problem solving, often applied to repair failed products, services, or processes on a machine or a system. The main purpose of troubleshooting is to identify the source of a problem in order to solve it.

In short, monitoring is used to observe the current/historical state while troubleshooting is employed to isolate the specific cause or causes of the symptom.

The boundary between the definitions of monitoring and troubleshooting is clear; however, in the context of the monitoring solutions currently available to software engineers, SREs, DevOps, and similar roles, that boundary has become a bit blurry.

A basic monitoring system can be built on top of “The Three Pillars of Observability” (Sridharan, 2018): logs, metrics, and traces. Together they provide visibility into the health of your systems, the behavior of your services, and even some business metrics, letting you understand the impact of the changes you make or of shifts in user traffic patterns.

The main focus in this article is going to be on metrics rather than on logs or traces.

Metrics represent measurements of resource usage or behavior - for example, low-level usage summaries provided by your operating system (CPU load, memory usage, disk space, etc.) or higher-level summaries provided by a specific process currently running on your system.

Many applications and services nowadays provide their own metrics, which can be collected and displayed to the end-user.

Metrics are perfectly suited to building dashboards that display key performance indicators (KPIs) over time: numbers are a data type optimized for transmission, compression, processing, and storage, and they are easy to query.

Incident management KPIs

Let’s review a series of indicators designed to help tech companies understand how often incidents occur and predict how quickly incidents will be acknowledged and resolved; later, we will clarify how these indicators are affected by different data collection strategies.

Mean Time Between Failures (MTBF): MTBF is the average time between repairable failures. The longer the time between failures, the more reliable the system.

Mean Time To Acknowledge (MTTA): MTTA is the average time it takes from when an alert is triggered to when work begins on the issue. This metric is useful for tracking your team’s responsiveness and your alert system’s effectiveness.

Mean Time To Recovery (MTTR): MTTR is the average time it takes to repair a system and return it to a fully functional state. This includes not only repair time, but also testing time and the time spent ensuring that the failure won’t happen again. The lower the time to recovery, the more efficient the troubleshooting process (root cause analysis) and the issue-resolution process.
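To make these definitions concrete, here is a minimal sketch in Python that derives MTTA and MTTR from a list of incident records. The records and field names are invented for illustration; they do not come from any particular tool.

    from datetime import datetime

    # Invented sample incidents; the field names are illustrative only.
    incidents = [
        {"alerted":      datetime(2022, 9, 1, 3, 14),
         "acknowledged": datetime(2022, 9, 1, 3, 19),
         "resolved":     datetime(2022, 9, 1, 4, 2)},
        {"alerted":      datetime(2022, 9, 3, 11, 40),
         "acknowledged": datetime(2022, 9, 3, 11, 43),
         "resolved":     datetime(2022, 9, 3, 12, 10)},
    ]

    def mean_minutes(deltas):
        """Average a list of timedeltas, expressed in minutes."""
        return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

    # MTTA: alert fired -> work begins. MTTR: alert fired -> fully restored.
    mtta = mean_minutes([i["acknowledged"] - i["alerted"] for i in incidents])
    mttr = mean_minutes([i["resolved"] - i["alerted"] for i in incidents])
    print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")  # MTTA: 4.0, MTTR: 39.0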

Troubleshooting

Intermittent issues within your infrastructure interrupt your flow of work, frustrate users, and can wreak havoc on your business. The higher the MTBF, the longer a system is likely to work before failing.

Every working system will eventually break down, and understanding why your services may be failing is your first line of defense against the serious consequences of unplanned downtime.

The best way is to identify the issue and resolve it ASAP, before users notice it and make a buzz about it within their company, or worse, in the community or on social networks.

Configuring alerts is a very good way to flag abnormalities, based either on metrics going above or below specified thresholds, or on changes in the pattern of the data compared to previous time periods (machine learning (ML)-powered anomaly detection). This will reduce your MTTA; however, it requires collecting as many metrics as possible so that alerts can be configured for all of them.
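To illustrate the two alerting styles just described, here is a deliberately simplified Python sketch. The three-sigma check is a crude stand-in for real ML-based anomaly detection (which learns per-metric behavior rather than assuming a normal distribution), and all values are invented.

    from statistics import mean, stdev

    def threshold_alert(value, warn=80.0):
        """Style 1: fire when the metric crosses a fixed threshold."""
        return value > warn

    def anomaly_alert(history, value, sigmas=3.0):
        """Style 2 (crude stand-in for ML anomaly detection): fire when
        the value sits far outside the recent pattern of the data."""
        if len(history) < 10:          # not enough history to judge
            return False
        mu, sd = mean(history), stdev(history)
        return sd > 0 and abs(value - mu) > sigmas * sd

    cpu = [22.0, 25.0, 24.0, 23.0, 26.0, 24.0, 25.0, 23.0, 22.0, 24.0]
    print(threshold_alert(55.0))     # False: 55% is under the 80% threshold
    print(anomaly_alert(cpu, 55.0))  # True: 55% is far from the recent ~24%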

On top of a reduced MTTA, you also want the lowest possible MTTR, which is why monitoring solutions should not just notify you about an issue, but also help identify its root cause and highlight the affected parts of your infrastructure.

Both on individual servers and in dynamic environments, services are started, stopped, or moved between nodes at any given time. It is therefore important to automatically discover changes in the processes running on a node, so that collection of the relevant metrics starts and stops with fine granularity (this also improves your MTTA and MTTR).

Troubleshooting requires not only an organized and logical approach that eliminates variables and identifies the causes of problems in a systematic order, but also enriched data that helps you verify your assumptions and drive your investigation.

The troubleshooting process steps are as follows:

  1. Gather available information.
  2. Describe the problem.
  3. Establish a theory of probable cause.
  4. Test the theory to determine the cause.
  5. Create a plan of action to resolve the problem and test a solution.
  6. Implement the solution.
  7. Test the full system functionality and, if applicable, implement preventive measures.
  8. Document findings, actions, and outcomes.
In many cases the first three steps are the most challenging, so let’s go deeper into steps 1-3 only.

Gather available information

This is the beginning of your investigation process, therefore, limited information can lead to the wrong theory of probable cause of the issue.

The solution for this challenge is to:

  • have all possible metrics collected automatically for you, intervening manually only if you want to tune something
  • have the highest possible granularity (per second)
  • have all data automatically visualized (without prior manual configuration)

Describe the problem

The best way to describe the problem is to list the side effects identified based on the alert, and to understand how other parts of your system are affected. For example: an issue with a particular service generated more logs than usual, and as a side effect the attached storage ran out of free space.

Establish a theory of probable cause

Monitoring solutions should be able not only to expose metrics for investigation purposes, but also to suggest correlations between them. A good theory should take into account all aspects of the problem, including all anomalies that occurred during or before the investigation period. In many cases, alerts are triggered by symptoms rather than by the actual cause of the issue. An extended data retention policy is a good addition to your investigation.

Granularity and retention

Every monitoring solution should provide the current state of the system, but the real power comes with the historical data.

A rich history of data can help you understand patterns and trends over time. In an ideal world, all raw metrics data would be stored indefinitely. However, the costs of storing and processing data require applying a data retention policy.

A data retention policy empowers system administrators to maintain compliance and optimize storage; it clarifies what data should be available in the hot storage, archived, or deleted, and what granularity should be used.

An example of a common data retention policy for time series metrics is presented in the following table:

Retention Period     Data Granularity
0 - 1 week           1 minute
1 week - 1 month     5 minutes
1 month - 1 year     1 hour
up to 2 years        1 day
Alternatively, a data retention policy can work with a tiering mechanism (providing multiple tiers of data, each with a different granularity), as exemplified in the following table:

Tier      Retention Period     Data Granularity
Tier 0    0 - 1 month          1 second
Tier 1    0 - 6 months         1 minute
Tier 2    0 - 3 years          1 hour

In this tiered example, each tier aggregates 60 points of the previous tier into a single point.

When calculating the required storage size for metrics, it is important to remember that aggregated tiers usually store more data per point for a single counter, namely the following:

  • The sum of the points aggregated.
  • The min of the points aggregated.
  • The max of the points aggregated.
  • The count of the points aggregated.
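As a back-of-the-envelope illustration of what the tiered policy above implies for a single metric, the following Python sketch counts stored points and values per tier, assuming one value per raw (Tier 0) point and four values (sum, min, max, count) per aggregated point. Real storage engines compress heavily, so treat these as upper bounds, not actual disk sizes.

    # Stored points and values for ONE metric under the tiered policy above.
    # Assumptions: 1 value per raw point; 4 values (sum, min, max, count)
    # per aggregated point; compression ignored, so upper bounds only.
    tiers = [
        # (name, retention in seconds, granularity in seconds, values/point)
        ("Tier 0", 30 * 24 * 3600,       1,    1),  # 1 month  at 1 second
        ("Tier 1", 6 * 30 * 24 * 3600,   60,   4),  # 6 months at 1 minute
        ("Tier 2", 3 * 365 * 24 * 3600,  3600, 4),  # 3 years  at 1 hour
    ]

    for name, retention, granularity, values_per_point in tiers:
        points = retention // granularity
        print(f"{name}: {points:,} points -> {points * values_per_point:,} values")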

Data collection strategies

MTTA is highly dependent on the data collection strategy.

Data collection isn’t always as straightforward as it might seem. There are plenty of opportunities to stumble at this stage, some of which could affect the accuracy of your metrics or even prevent timely analysis.

There are a few different data collection strategies currently available on the market.

Let’s focus on the most common, which are as follows:

  • Transfer all the data to a third party - Cloud Monitoring Service Provider (CMSP) 
  • Keep all the data inside your infrastructure - On-Premises Monitoring Solution (OPMS)
  • Hybrid, distributed solution (OPMS + CMSP)

Data Collection Strategy Option 1: Transfer all the data to a third party

A CMSP requires sending all the collected data to the cloud. Users do not need to run any monitoring-specific infrastructure.

In this case, the CMSP follows the “Fire and Forget” principle.

Examples: Datadog, New Relic, Dynatrace

Installation and configuration

  1. Install data collector.
  2. Configure data collector for the following: 
    1. Define what metrics you would like to collect.
    2. Specify granularity for each metric.
  3. All collected data will be transferred to the CMSP.
  4. CMSP will store aggregated data on their side.
  5. Based on the pricing plan, a predefined retention policy will be applied.

Usage Requirements

  1. Data available only via the CMSP web app
  2. You have some predefined dashboards for specific integrations
  3. In order to visualize metrics data, you have to configure the chart by performing the following: 
    1. Select the specific metric. 
    2. Configure the visualization options.

Most common cost structure and limitations

  1. Pricing plan (usually based on number of nodes or number of metrics)
  2. Extra data ingestion (outside of your plan)
  3. Extra data processing (outside of your plan)
  4. Machine learning as an extra or part of the most expensive plan
  5. Limited data retention (restricted by your plan)
  6. Limited number of monitored containers (restricted by your plan)
  7. Limited number of metrics (restricted by your plan)
  8. Limited number of events (restricted by your plan)
  9. Cost of sending email notifications usually included in your plan
  10. Low maintenance cost
  11. High networking cost (data transfer; Cloud Service Providers usually charge for outgoing traffic)
  12. In the end, the most expensive option

Key Benefits

  1. Well-rounded feature set
  2. Ease of use
  3. Extensive number of integrations, driven by the CMSP

Data Collection Strategy Option 2: Keep all the data inside your infrastructure

This option is usually available from On-Premises Monitoring Solutions (OPMS), which are mainly open-source based.

An OPMS allows you to keep all collected data on premises and retain full control of your data. Users have to run and support the monitoring-specific infrastructure.

Examples: Prometheus, Grafana, Zabbix, Dynatrace Managed, Netdata (Agent only)

Installation and configuration

  1. Install data collector.
  2. Configure the data collector for the following (a minimal Prometheus sketch follows this list):
    1. Define what metrics you would like to collect 
    2. Specify granularity for each metric
  3. Install storage
  4. Configure storage for the following: 
    1. You can keep all collected data within your network 
    2. Flexible retention policy: you can use the defaults or define your own.
  5. Configure your Email Service Provider (ESP)
  6. Install a visualization tool 
    • Usually available as part of the chosen OPMS 
    • Alternatively, another open-source solution may be used
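For a feel of the manual configuration involved, here is a minimal scrape configuration for Prometheus, one of the OPMS examples above. The target address is hypothetical; scrape_interval is where you trade granularity against storage and load.

    # prometheus.yml: minimal sketch; the target below is hypothetical.
    global:
      scrape_interval: 15s        # granularity: how often metrics are pulled

    scrape_configs:
      - job_name: "node"
        static_configs:
          - targets: ["localhost:9100"]   # a node_exporter on the monitored host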

Usage Requirements

  1. In order to visualize metrics data, you have to configure the chart for the following:
    1. Define data source 
    2. Select specific metric 
    3. Configure the visualization options
  2. Support monitoring infrastructure

Most common cost structure and limitations

  1. Compute cost based on your usage.
  2. Database cost based on your usage.
  3. High installation cost (time spent by SRE/DevOps to have the solution running).
  4. High maintenance cost.
  5. Cost of sending emails via ESP (Note: this is not required).
  6. Machine learning is usually not available.

Benefits

  1. Monitoring-focused feature set.
  2. Extensive number of integrations driven by the open-source community.
  3. Full management of monitoring cost structure.

Data Collection Strategy Option 3: Hybrid, distributed solution

The third option is a mixed approach that lets you take advantage of the best of both Options 1 and 2: the extensive feature set of a CMSP, combined with the flexible data retention and low cost of an OPMS.

Due to the distributed nature of this solution, users are able to collect and store data on their own premises (in other words, to retain full control of the collected data).

In this scenario, the CMSP plays the role of the orchestrator; as a result, only metadata needs to be shared with the CMSP, for request-routing purposes.

In this option, the following metadata can be shared:

  • Nodes topology 
  • The list of metrics collected 
  • The retention information for each metric
Example: Netdata

Netdata can be classified as a hybrid solution because it has two components: the open-source Agent and the cloud-based Netdata solution.

Primary responsibilities of the Agent

  • Collect metrics for the Node the Agent is running on. More than 2k collectors are currently supported.
  • Store the collected metrics. Various database modes are supported: dbengine, ram, save, map, alloc, none.
  • Store data for other Nodes, in case the Agent plays the role of a “Parent” and collects data from other Agents, called “Children” (streaming and replication).

Primary responsibilities of the Netdata Cloud solution

  • Visualize the data collected from multiple Agents. Data requests are routed to the specific Agents; the routing information is built from the metadata received from the Agents.
  • Provide an infrastructure-level view of the data.
  • Keep alert state changes from all Nodes.
  • Dispatch alert notifications.

Installation and configuration

  1. Log in to Netdata.
  2. Install the Agent (includes data collectors with auto-discovery, and storage; the data collectors are already preconfigured with 1-second granularity).
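For reference, the one-line installation mentioned below is the kickstart script, which at the time of writing looks like the following (the exact URL and options may change, so check the official documentation for the current command):

    wget -O /tmp/netdata-kickstart.sh https://my-netdata.io/kickstart.sh && sh /tmp/netdata-kickstart.sh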

Usage Requirements

  1. There is no need to install a visualization tool; Netdata's cloud solution is already there for you.
  2. There is no need to configure charts. Every single metric is already associated with a chart.
  3. You just need to log in to Netdata to see the various dashboards (infrastructure Overview, individual Nodes, Alerts, Machine Learning, etc.) as well as the individual charts associated with alerts.

Most common cost structure and limitations

  1. Compute cost based on your usage (inside your infrastructure)
  2. Database cost based on your usage (inside your infrastructure)
  3. Low installation cost (one-line installation command for manual installations or Ansible playbook for automation)
  4. Low maintenance cost (the Agent is updated automatically)
  5. Netdata will send all emails for free
  6. Machine learning enabled by default on the Agent, visualized for free within Netdata
  7. Free Node-reachability alerts from Netdata
  8. Stated plainly, this is the cheapest option.

Benefits

  1. Mainly troubleshooting-focused feature set.
  2. Ease of installation and maintenance.
  3. Extensive number of integrations, driven by the open-source community.
  4. Data immediately available for querying.

Summary

The following summarizes what is important for troubleshooting purposes: 

  • You should be able to collect as many metrics as you want
  • Metrics should be collected automatically with high granularity (1 sec)
  • You need to retain as much data as you want at minimum cost
  • You need the ability to contribute (i.e., create your own collector)
  • You should be able to easily visualize all metrics (no need to configure a chart for every metric)
  • You need fast access to metrics data (data should be available ASAP, ideally the next second)
  • You should be able to automatically identify anomalies and suggest correlations across all collected metrics
With these in mind, let's come back to our data collection strategies.

Option 1: Transfer all the data to the third party (CMSP)

This option is good for generic monitoring purposes, but its data-flow design limits its troubleshooting capabilities. It is also the most expensive option, leaving you to deal with the following:

  • Manual intervention to enable and configure data collectors
  • High costs for data transfer, processing, and storage, which lead to low data granularity and a limited number of collected metrics
  • Manual chart configuration, which requires prior knowledge of the available metrics
  • Making assumptions based on experience rather than on the available data (you need to know which metric you would like to check)
  • Significant lag before data becomes available for querying (due to the data-flow design)

Option 2: Keep all the data inside your infrastructure (OPMS)

This option is cheap, but it is the least helpful for troubleshooting needs. It has the same limitations as Option 1, due to aggregation needs; in addition, you will be saddled with the following:

  • Collecting fewer metrics at lower granularity is usually the suggested approach.
  • A limited number of features is available; for example, an ML-based chart-suggestion mechanism will not be available.
  • The burden of complete ownership of the monitoring/troubleshooting infrastructure falls on the user.

Option 3: Hybrid, distributed solution

This option is the best for troubleshooting purposes, as it allows the highest granularity with a significant number of metrics automatically collected for you.

  • Full control of cost
  • No need to pay for outgoing traffic: as in Option 2, data is stored inside your own infrastructure and does not need to be transferred outside your network
  • Data immediately available for querying; no need to wait for data transfer and processing
It is worth paying attention to a free infrastructure monitoring solution that focuses on troubleshooting in the first place: Netdata.

The Netdata Agent is free by its open-source definition (licensed under the GNU GPL v3).

The Netdata cloud solution is closed-source software; however, it provides a free orchestration service for everyone. Only metadata is transferred to the Netdata Hub, not the actual data, which is why the cost of the service is negligible and it can be offered free of charge by Netdata.

In the future, you will be able to get a paid support plan if you would like extra help on top of the free community support. Netdata also plans to offer Managed Data Centralization Points (Netdata Parents, which keep not only the metadata but the actual data as well) at additional cost. More details are available here.

On top of the already-described benefits of the hybrid solution, Netdata can automatically show the charts relevant to a highlighted area across all collected metrics (every single metric automatically has a chart representation). Netdata can also surface metric anomalies detected with machine learning (running on the Netdata Agent, client-side, and not in the CMSP).

Questions? Ideas? Comments? Learn more or contact us!

Feel free to dive deeper into the Netdata knowledge and community using any of the following resources:

  • Netdata Learn: Find documentation, guides, and reference material for monitoring and troubleshooting your systems with Netdata.
  • GitHub Issues: Make use of the Netdata repository to report bugs or open a new feature request.
  • GitHub Discussions: Join the conversation around the Netdata development process and be a part of it.
  • Community Forums: Visit the Community Forums and contribute to the collaborative knowledge base.
  • Discord: Jump into the Netdata Discord and hang out with like-minded sysadmins, DevOps engineers, SREs, and other troubleshooters. More than 1100 engineers are already using it!