Monitoring Disks: Understanding Workload, Performance, Utilization, Saturation, and Latency

Proactive Strategies for Disk Health and Performance
by Satyadeep Ashwathnarayana · May 4, 2023

Netdata provides a comprehensive set of charts that can help you understand the workload, performance, utilization, saturation, latency, responsiveness, and maintenance activities of your disks. In this blog we will focus on monitoring disks as block devices, not as filesystems or mount points.

The Disks section in the Overview tab contains all the charts that are mentioned in this blog post.

Disk Workload and Performance

Netdata charts for monitoring the workload and the throughput of your disks:

  • Disk I/O Bandwidth (disk.io): Displays the amount of data transferred to and from the disk. You can monitor read and write operations individually.

  • Disk Completed I/O Operations (disk.ops): Shows the number of completed disk I/O operations.

  • Disk Merged Operations (disk.mops): Shows the number of merged disk operations.

  • Disk Total I/O Time (disk.iotime): Displays the sum of the duration of all completed I/O operations, useful for understanding the overall workload on your disks.

  • Average Completed I/O Operation Bandwidth (disk.avgsz): Shows the average I/O operation size.

Merging of operations in Linux

The Linux kernel has a mechanism to optimize disk I/O performance by merging adjacent I/O operations before they are issued to the disk. This is particularly beneficial for HDDs, which have longer seek times compared to SSDs.

When the kernel receives multiple I/O operations, it first sorts them by their logical block addresses to minimize seek time. Then, it checks if any of the operations are adjacent or overlapping. If they are, the kernel combines them into a single, larger operation. By merging adjacent operations, the kernel can reduce the total number of I/O operations, decreasing the overhead of initiating individual I/O requests and improving overall disk performance.

Even in the case of SSDs, merging operations can help reduce the total number of I/O requests, which in turn can decrease the overhead associated with initiating individual I/O operations. It can also help distribute write operations more evenly across the SSD’s memory cells, which can be beneficial for the SSD’s wear leveling algorithms and overall lifespan.
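The sort-then-merge idea described above can be sketched in a few lines. This is only an illustration of the concept, not the kernel's actual I/O scheduler code; the `(start_sector, sector_count)` request representation and the function name are assumptions made for the example, and a real kernel also enforces limits (such as a maximum request size) that are ignored here.

```python
from typing import List, Tuple

# A request is (start_sector, sector_count). This shape is hypothetical,
# chosen only to keep the illustration small.
Request = Tuple[int, int]


def merge_requests(requests: List[Request]) -> List[Request]:
    """Sort requests by starting sector, then merge adjacent or overlapping ones."""
    merged: List[Request] = []
    for start, count in sorted(requests):
        if merged:
            last_start, last_count = merged[-1]
            last_end = last_start + last_count
            if start <= last_end:  # adjacent (==) or overlapping (<)
                new_end = max(last_end, start + count)
                merged[-1] = (last_start, new_end - last_start)
                continue
        merged.append((start, count))
    return merged


queue = [(100, 8), (0, 8), (8, 8), (104, 8), (500, 8)]
result = merge_requests(queue)
print(f"{len(queue)} requests issued, {len(result)} sent to the disk: {result}")
```

In this example five incoming requests become three, which is the kind of difference the disk.mops chart makes visible next to disk.ops.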

How does the disk.ops chart relate to the IOPS commitment we get from a cloud provider?

The disk.ops chart in Netdata shows the number of completed disk I/O operations per second for reads and writes. While this chart can give you a general sense of your disk’s I/O activity, it might not provide a direct measure of the IOPS commitment you get from your cloud provider.

Cloud providers often define their IOPS commitment as the maximum number of input/output operations per second that a storage volume can handle. This commitment may be subject to factors such as storage type, size, and configuration, as well as the I/O characteristics of the workloads running on the volume.

To verify if you’re getting the IOPS commitment from your cloud provider, you should consider monitoring the following aspects:

  1. Peak IOPS: Monitor the peak IOPS your storage volume achieves during periods of high I/O activity. You can compare these peaks to the IOPS commitment from your cloud provider to ensure you’re receiving the performance you paid for. The disk.ops chart can help you identify peak IOPS, but remember that you need to aggregate the read and write operations to get a total IOPS value.

  2. Sustained IOPS: Ensure that your storage volume can sustain the committed IOPS level over an extended period during high I/O activity. This may require observing the disk.ops chart over a longer time frame to find the average IOPS during high activity periods.

  3. Latency: High IOPS commitments should be accompanied by low latency, as high IOPS performance with high latency can negatively impact the responsiveness of your applications. You can use Netdata’s disk.await chart to monitor the average time for I/O requests issued to the device to be served (beware, it includes both the time spent in the queue and the time spent servicing the requests).

While the disk.ops chart can provide useful insights into your disk’s I/O activity, you may need to supplement it with additional monitoring and analysis to verify that you’re getting the IOPS commitment from your cloud provider. Comparing peak and sustained IOPS values and monitoring latency can help you ensure that you’re receiving the performance you paid for.
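As a rough sketch of the peak and sustained checks, assume you have exported per-second `(reads, writes)` samples from the disk.ops chart into a list. The sample values, the 3,000 IOPS commitment, and the choice of "highest rolling average" as the definition of sustained IOPS are all assumptions made for illustration.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (reads per second, writes per second)


def total_iops(samples: List[Sample]) -> List[float]:
    """Aggregate reads and writes into one total IOPS value per sample."""
    return [reads + writes for reads, writes in samples]


def peak_iops(samples: List[Sample]) -> float:
    """Highest total IOPS seen in any single sample."""
    return max(total_iops(samples))


def sustained_iops(samples: List[Sample], window: int) -> float:
    """Highest average total IOPS over any `window` consecutive samples."""
    totals = total_iops(samples)
    if window <= 0 or window > len(totals):
        raise ValueError("window must be between 1 and the number of samples")
    current = best = sum(totals[:window])
    for i in range(window, len(totals)):
        current += totals[i] - totals[i - window]  # slide the window by one sample
        best = max(best, current)
    return best / window


samples = [(1000, 500), (2000, 1000), (2500, 400), (2400, 500), (300, 100)]
committed = 3000  # hypothetical provider commitment
peak = peak_iops(samples)
sustained = sustained_iops(samples, window=3)
print(f"peak={peak:.0f} sustained={sustained:.0f} committed={committed}")
```

A peak that reaches the commitment while the sustained figure stays well below it suggests the volume bursts to the committed level but does not hold it, which is worth checking against latency before drawing conclusions.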

Disk Utilization and Saturation

Netdata provides the following charts for disk utilization and saturation:

  • Disk Utilization Time (disk.util): Measures the amount of time the disk was busy with something, expressed as a percentage of the total working time. High utilization (near 100%) can be an indication of congestion, but not necessarily. This metric only indicates the percentage of time the disk was busy. Many disks, especially SSD and NVMe, may still be able to process additional requests in parallel. See below for a detailed discussion on parallelism.

  • Disk Busy Time (disk.busy): Measures the amount of time that a disk was busy with I/O operations.

  • Disk Current I/O Operations (disk.qops): Shows the number of I/O operations currently in progress, giving you a snapshot of the current workload on your disks. A high number of concurrent I/O operations could indicate that your storage system is struggling to keep up with demand.

  • Disk Backlog (disk.backlog): Provides an indication of the duration of pending disk operations. By monitoring this metric, you can estimate the expected completion time for operations in progress and identify potential bottlenecks in your storage system. High backlog values may signal that your disks are saturated and unable to process I/O operations quickly enough.

Parallelism, Utilization and Saturation

High disk utilization (near 100%) can be an indication of congestion, especially for HDDs, which have mechanical limitations and longer seek times.

For SSDs and NVMe drives, high utilization (in disk.util) may not necessarily indicate congestion due to their parallelism capabilities, which allow them to handle multiple I/O requests simultaneously.

Parallelism is a feature of modern storage devices that enables them to process multiple I/O operations concurrently. This is achieved through multiple memory chips and I/O queues, allowing the storage device to manage a higher workload without impacting performance significantly.

When interpreting the disk.util chart, consider the following:

  • For HDDs, high utilization could be a sign of congestion and an indication that the disk is struggling to handle the workload. Monitoring other metrics like latency (disk.await) and completed I/O operations (disk.ops) can provide additional insights into the HDD’s performance under various workloads.
  • For SSDs and NVMe drives, high utilization may not immediately signal congestion due to their parallelism capabilities. However, consistently high percentages should be monitored closely, as they could indicate that the storage device is nearing its maximum capacity to service I/O requests in parallel. In these cases, also consider monitoring latency (disk.await), completed I/O operations (disk.ops), and current I/O operations (disk.qops) to gain a comprehensive understanding of the storage device’s performance.
  • For NAS or cloud-provided network storage, high utilization can be influenced by both the underlying storage technology and the network infrastructure. In these scenarios, it is crucial to monitor additional network-related metrics like latency, bandwidth usage, and congestion, along with storage-specific metrics, to gain a complete view of the storage system’s performance and potential bottlenecks.
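To see why utilization alone cannot distinguish a saturated disk from a busy but healthy parallel one, consider a small simulation. It treats utilization as the fraction of a time window during which at least one operation was in flight; the interval lists and function names are hypothetical and are not how Netdata collects the metric.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start_ms, end_ms) of one I/O operation


def busy_ms(intervals: List[Interval]) -> int:
    """Total time covered by at least one operation (union of the intervals)."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def utilization_pct(intervals: List[Interval], window_ms: int) -> float:
    return 100.0 * busy_ms(intervals) / window_ms


# One second of activity, each operation taking 100 ms.
serial = [(i * 100, (i + 1) * 100) for i in range(10)]                        # 1 at a time
parallel = [(i * 100, (i + 1) * 100) for i in range(10) for _ in range(4)]    # 4 at a time

for name, ops in (("serial", serial), ("parallel", parallel)):
    print(f"{name}: {len(ops)} ops, utilization {utilization_pct(ops, 1000):.0f}%")
```

Both workloads report 100% utilization, yet the second one completed four times as many operations in the same second. This is why disk.util should be read together with disk.ops, disk.qops, and disk.await.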

Disk Latency and Responsiveness

Netdata provides the following charts for disk latency and responsiveness:

  • Average Completed I/O Operation Time (disk.await): This chart measures the average time it takes for I/O requests issued to the device to be served, including the time spent in the queue and the time spent servicing the requests.

  • Average Service Time (disk.svctm): This metric represents the average service time for completed I/O operations. It is calculated using the total busy time of the disk and the number of completed operations. Note that if the disk is capable of executing multiple parallel operations, the reported average service time might be misleading (lower than the actual), as it does not account for parallelism.

Generally, SSD and NVMe disks have lower latency than HDDs. For NAS or cloud-provided network storage, the latency and responsiveness can be influenced by both the underlying storage technology and the network infrastructure.

If you notice consistently high latency values, consider investigating other performance metrics, such as disk utilization (disk.util), completed I/O operations (disk.ops), and current I/O operations (disk.qops), to determine if the disk is experiencing performance issues or approaching its maximum capacity to service I/O requests in parallel.
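The difference between the two latency metrics is easier to see with numbers. The sketch below derives both from two snapshots of cumulative counters; the `DiskCounters` class and its field names are hypothetical, chosen to mirror the descriptions above rather than any specific Netdata or kernel structure.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class DiskCounters:
    """Cumulative counters for one disk (hypothetical field names)."""
    ops: int         # completed I/O operations
    io_time_ms: int  # sum of the durations of all completed operations
    busy_ms: int     # time the disk had at least one operation in flight


def latency_metrics(prev: DiskCounters, curr: DiskCounters) -> Tuple[float, float]:
    """Return (await_ms, svctm_ms) for the interval between two snapshots."""
    ops = curr.ops - prev.ops
    if ops <= 0:
        return 0.0, 0.0  # no completed operations, nothing to average
    await_ms = (curr.io_time_ms - prev.io_time_ms) / ops
    svctm_ms = (curr.busy_ms - prev.busy_ms) / ops
    return await_ms, svctm_ms


# One second apart: 400 operations completed, the disk was busy the whole second.
before = DiskCounters(ops=10_000, io_time_ms=50_000, busy_ms=20_000)
after = DiskCounters(ops=10_400, io_time_ms=51_600, busy_ms=21_000)
await_ms, svctm_ms = latency_metrics(before, after)
print(f"await={await_ms} ms, svctm={svctm_ms} ms")
```

Here each operation took 4 ms on average, but because several were in flight at once, dividing busy time by the operation count yields only 2.5 ms. That gap is the parallelism effect that makes the service time figure look lower than it really is.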

Disk Maintenance and Housekeeping Operations

The Linux disk subsystem performs two main maintenance and housekeeping operations on disks: discards and flushes.

Discard Operations

Discard operations, also known as TRIM, are an important maintenance activity for storage devices that use flash memory technology, such as SSDs, NVMe devices, USB sticks, and SD cards. Discard operations ensure that the storage device always has a pool of pre-erased blocks ready to use, which can significantly improve write performance and reduce unnecessary wear and tear on the drive.

In Linux, discard operations are issued by the filesystem when a file is deleted or truncated, or when a block of data is moved from one location to another. When a discard operation is issued by the filesystem, the device driver forwards the command to the disk, which then marks the blocks that were freed as “available”, and puts them into a pool of blocks that can be immediately written to in the future.

By using discard operations, disks can avoid the time-consuming process of erasing blocks when new data needs to be written. This can significantly improve the write performance of the device, especially in cases where small amounts of data need to be written to the device at a time. It also reduces the overall wear and tear on the drive, because it reduces the amount of data that needs to be written to the same blocks repeatedly.

Flush Operations

A flush operation is another type of maintenance activity that is used by the Linux disk subsystem to ensure that data is written to a storage device in a timely and efficient manner.

When data is written to a storage device in Linux, it is first written to a cache in memory called the buffer cache. Once data is in the buffer cache, it is considered to be in a “dirty” state, meaning that it has been modified and needs to be written to the storage device. However, the kernel does not immediately write the data to the storage device. Instead, it waits until one of several conditions is met:

  • The dirty writeback timeout expires

  • The amount of dirty data in the buffer cache exceeds a certain threshold

  • The amount of free memory in the system falls below a certain threshold

When one of these conditions is met, the kernel begins a process called “writeback”, which involves writing all dirty data in the buffer cache to the storage device. Writeback is triggered automatically by the disk subsystem, but it can also be triggered manually by the user or an application.

During writeback, the kernel walks through the list of dirty buffers in the buffer cache and writes each buffer to the appropriate location on the storage device. Once all dirty data has been written, the kernel updates the appropriate metadata on the storage device to reflect the changes.

Flush operations are a specific type of writeback operation that is triggered explicitly by the user or an application. A flush operation forces all dirty data in the buffer cache to be written to the storage device immediately, without waiting for the dirty writeback timeout to expire or for other conditions to be met.

Because flush operations involve writing data to the storage device, they are counted as write operations in the disk statistics, along with other types of write operations, such as when an application writes data directly to the storage device.
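As an illustration of an application explicitly asking for its data to be written out instead of waiting for writeback, here is a minimal Python sketch using the standard `os.fsync` call. Note that `os.fsync` applies to a single open file rather than to all dirty data in the system; the helper name and the temporary file are assumptions made for the example.

```python
import os
import tempfile


def write_durably(path: str, data: bytes) -> None:
    """Write data and ask the kernel to push it to the storage device now."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # move data from Python's userspace buffer to the kernel
        os.fsync(f.fileno())  # ask the kernel to write this file's dirty data to the device


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "example.dat")
    write_durably(target, b"hello disk")
    size = os.path.getsize(target)

print(f"wrote {size} bytes")
```

Applications that call this kind of function frequently, such as databases committing transactions, are a typical source of the activity you may see in the flush-related charts described below.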

With a better understanding of discards and flushes, let’s now explore the relevant metrics for monitoring these maintenance activities using Netdata:

  • Disk Completed Extended I/O Operations (disk.ext_ops): This metric measures the number (after merges) of completed discard and flush requests. The disk.ext_ops chart has two dimensions: discards and flushes. Monitoring this metric can help you understand how frequently your storage device performs these maintenance activities.

  • Amount of Discarded Data (disk.ext_io): This metric measures the amount of discarded data that is no longer in use by a mounted file system.

  • Disk Merged Discard Operations (disk.ext_mops): This metric measures the number of merged discard operations.

  • Disk Total I/O Time for Extended Operations (disk.ext_iotime): This metric measures the sum of the duration of all completed discard and flush operations. Monitoring this metric can help you understand the overall time spent on these maintenance activities.

  • Average Completed Extended I/O Operation Time (disk.ext_await): This chart measures the average time it takes for discard and flush requests issued to the device to be served. This includes the time the requests spend in the queue and the time spent servicing them.

  • Average Amount of Discarded Data (disk.ext_avgsz): Shows the average discard operation size.

Netdata comes pre-configured with 2 alerts that are automatically attached to all disk block devices. Both of them are silent alerts, meaning that they don’t send alert notifications, but they do show pop-up notifications while you are viewing the dashboard.

10min_disk_utilization

This alert triggers a warning when the average disk utilization during the last 10 minutes is above or equal to 98%. Once triggered, it is cleared automatically when the disk utilization over the last 10 minutes falls below 70%.

10min_disk_backlog

This alert triggers a warning when the average disk backlog during the last 10 minutes is above or equal to 5000 ms. Once triggered, it is cleared automatically when the disk backlog over the last 10 minutes falls below 3500 ms.
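Both alerts use a higher threshold to raise the warning than to clear it, so a value hovering around the limit does not make the alert flap. The sketch below shows that behaviour as a small state machine. It is not Netdata's alert configuration syntax, and the function names and sample values are made up for illustration.

```python
from typing import List


def next_state(raised: bool, value: float, raise_at: float, clear_below: float) -> bool:
    """Return whether the warning is raised after seeing `value`."""
    if raised:
        return value >= clear_below  # stays raised until the value drops below the clear level
    return value >= raise_at         # raises only at or above the warning level


def evaluate(values: List[float], raise_at: float, clear_below: float) -> List[bool]:
    raised = False
    states = []
    for value in values:
        raised = next_state(raised, value, raise_at, clear_below)
        states.append(raised)
    return states


# Hypothetical 10-minute average utilization readings, in percent.
utilization = [50, 98, 85, 72, 69, 90, 99]
states = evaluate(utilization, raise_at=98, clear_below=70)
print(states)
```

With the thresholds of 10min_disk_utilization, the warning raised at 98% persists through the 85% and 72% readings, clears at 69%, and is not raised again at 90% because that reading is below the 98% warning level.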

Related to disk block devices, Netdata can also monitor:

  • Disk S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology) hardware attributes. Check this for additional information.

  • Disk temperatures. Check this for additional information.

The Linux disk subsystem is packed with a wealth of additional advanced features, including software RAID arrays, compression, encryption, caching, and an array of filesystems. Netdata is already designed to monitor most of these features, providing real-time, low-latency insights into their activity.

With Netdata’s automatic discovery process, all your disks are automatically detected, and you can easily access a fully automated visualization that provides a rapid and comprehensive overview of your disk and system performance. Whether you’re looking to optimize your disk utilization, prevent congestion, or ensure that you’re receiving the performance you paid for from your cloud provider, Netdata can help monitor, analyze, and optimize your disks with ease.