Apache Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework. It is designed to run on commodity hardware and stores data reliably even when hardware fails. HDFS is highly fault-tolerant, provides high-throughput access to application data, and is suitable for applications with large data sets.
To monitor HDFS with Netdata, you need both HDFS and Netdata installed on your system.
Netdata auto-discovers hundreds of services, and for those it doesn’t, turning on manual discovery is a one-line configuration. For more information on configuring Netdata for HDFS monitoring, please read the collector documentation.
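The HDFS collector reads its metrics from the JMX endpoint that Hadoop daemons serve over HTTP, so it can help to confirm that the endpoint answers before configuring anything. The following is a minimal Python sketch and not part of Netdata. The URL (port 9870, the Hadoop 3 default for the NameNode web interface, on the local host) and the helper names are assumptions to adjust for your cluster.

```python
import json
from urllib.request import urlopen

# Assumed address: NameNode web interface on the local host (Hadoop 3 default port).
JMX_URL = "http://127.0.0.1:9870/jmx"


def fetch_jmx(url=JMX_URL, timeout=5):
    """Download and decode the JMX JSON document served by an HDFS daemon."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def bean_names(payload):
    """Return the names of all beans found in a decoded /jmx payload."""
    return [bean.get("name", "") for bean in payload.get("beans", [])]


def looks_like_hdfs(payload):
    """True if the payload contains at least one NameNode or DataNode bean."""
    prefixes = ("Hadoop:service=NameNode", "Hadoop:service=DataNode")
    return any(name.startswith(prefixes) for name in bean_names(payload))
```

On a host running a NameNode, `looks_like_hdfs(fetch_jmx())` would be expected to return `True`; a connection error suggests the URL or port needs adjusting.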
You should now see the HDFS section on the Overview tab in Netdata Cloud already populated with charts about all the metrics you care about.
Netdata has a public demo space (no login required) where you can explore different monitoring use cases and get a feel for Netdata.
Heap Memory is the amount of memory allocated to the Java Virtual Machine (JVM) that runs HDFS. Monitoring Heap Memory is important for ensuring that the JVM has enough memory to function properly. If the JVM runs short of memory, the result is out-of-memory errors or poor performance.
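To make the heap figures concrete, here is a small Python sketch that turns used and maximum heap into a utilization percentage. It assumes the values come from the `JvmMetrics` bean in a Hadoop daemon's JMX output, where `MemHeapUsedM` and `MemHeapMaxM` are reported in megabytes; the function name and the sample numbers are illustrative only.

```python
def heap_usage_percent(jvm_bean):
    """Return heap utilization in percent, or None if the maximum is unknown."""
    used_mb = jvm_bean["MemHeapUsedM"]
    max_mb = jvm_bean["MemHeapMaxM"]
    if max_mb <= 0:
        # A non-positive maximum means the JVM did not report a usable limit.
        return None
    return 100.0 * used_mb / max_mb


# Illustrative sample, not real cluster data.
sample_bean = {
    "name": "Hadoop:service=NameNode,name=JvmMetrics",
    "MemHeapUsedM": 512.0,
    "MemHeapMaxM": 2048.0,
}
print(heap_usage_percent(sample_bean))  # prints 25.0
```

A value that stays close to 100 after garbage collection runs is the usual sign that the heap is undersized for the workload.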
GC Count Total, or Garbage Collection Count Total, is a metric that shows the total number of garbage collection events during the current time window. Garbage collection is the process of reclaiming memory that is no longer being used. A rising count means the JVM is collecting more often, which usually indicates growing memory pressure.
GC Time Total, or Garbage Collection Time Total, is a metric that shows the total amount of time spent on garbage collection during the current time window. Time spent collecting is time not spent serving requests, so a sustained increase points to a JVM that is struggling to reclaim memory efficiently.
GC Threshold is a metric that shows how many times JVM pauses, which are typically caused by garbage collection, exceeded the configured info and warn thresholds. A growing count alerts administrators to potential memory issues.
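The two garbage collection metrics above are derived from counters that only ever grow, so what matters is the change between two samples. The sketch below shows that calculation in Python. It assumes the cumulative `GcCount` and `GcTimeMillis` fields of the `JvmMetrics` JMX bean; the function name, the sampling interval, and the sample values are hypothetical.

```python
def gc_rates(prev, curr, interval_s):
    """Compute GC activity between two samples taken interval_s seconds apart.

    Returns None when the interval is invalid or a counter went backwards,
    which usually means the daemon restarted between the two samples.
    """
    delta_count = curr["GcCount"] - prev["GcCount"]
    delta_ms = curr["GcTimeMillis"] - prev["GcTimeMillis"]
    if interval_s <= 0 or delta_count < 0 or delta_ms < 0:
        return None
    return {
        # Garbage collection events per second over the interval.
        "collections_per_s": delta_count / interval_s,
        # Share of wall-clock time spent in GC (0.05 means 5%).
        "gc_time_fraction": delta_ms / (interval_s * 1000.0),
    }


# Illustrative samples taken 10 seconds apart.
earlier = {"GcCount": 100, "GcTimeMillis": 4000}
later = {"GcCount": 110, "GcTimeMillis": 4500}
print(gc_rates(earlier, later, 10))
```

With these sample values, 10 collections and 500 ms of GC time occurred in 10 seconds, so the JVM spent about 5% of the interval collecting.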
Threads is a metric that shows the number of active threads in the JVM. Monitoring this metric is important for ensuring that the JVM is able to process tasks efficiently and that there are no threading issues.
Logs Total is a metric that shows the total number of log events generated during the current time window. Monitoring this metric helps you confirm that the system is logging correctly and spot sudden bursts of log activity that may signal a problem.
RPC Bandwidth is a metric that shows the amount of data sent and received through Remote Procedure Calls (RPCs). Monitoring this metric is important for ensuring that the system is able to send and receive data efficiently.
RPC Calls is a metric that shows the number of RPC calls made during the current time window. An unexpected surge or drop in calls indicates a change in client activity that is worth investigating.
Open Connections is a metric that shows the number of open connections to the HDFS cluster. Monitoring this metric is important for ensuring that the system is able to process requests efficiently and that the system is not overloaded.
Call Queue Length is a metric that shows the number of RPC calls waiting in the call queue. A queue that keeps growing means requests are arriving faster than the system can process them.
Avg Queue Time is a metric that shows the average time a call spends waiting in the call queue before it is processed. Rising queue time is an early sign that the system is overloaded.
Avg Processing Time is a metric that shows the average time spent processing a request once it leaves the queue. Monitoring this metric is important for ensuring that the system is able to process requests efficiently.
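Queue time and processing time are most useful when read together, because a request's total latency is roughly the sum of the two. The Python sketch below combines them. It assumes the `RpcQueueTimeAvgTime`, `RpcProcessingTimeAvgTime` (both in milliseconds), and `CallQueueLength` fields of a Hadoop `RpcActivityForPort...` JMX bean; the function name and the sample values are illustrative.

```python
def rpc_latency_summary(rpc_bean):
    """Summarize RPC latency from average queue and processing times."""
    queue_ms = rpc_bean["RpcQueueTimeAvgTime"]
    processing_ms = rpc_bean["RpcProcessingTimeAvgTime"]
    total_ms = queue_ms + processing_ms
    # Share of latency spent waiting rather than working; 0.0 when idle.
    queue_share = queue_ms / total_ms if total_ms > 0 else 0.0
    return {
        "total_ms": total_ms,
        "queue_share": queue_share,
        "backlog": rpc_bean["CallQueueLength"],
    }


# Illustrative sample, not real cluster data.
sample_bean = {
    "RpcQueueTimeAvgTime": 2.0,
    "RpcProcessingTimeAvgTime": 6.0,
    "CallQueueLength": 0,
}
print(rpc_latency_summary(sample_bean))
```

A high `queue_share` together with a non-zero backlog suggests that the server is short of handler capacity, whereas a high processing time with a low queue share points at slow operations instead.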
Capacity is a metric that shows the amount of disk space remaining and used in the HDFS cluster. Monitoring this metric is important for ensuring that the cluster does not run out of disk space.
Used Capacity is a metric that shows how the used disk space is split between HDFS data (DFS) and non-HDFS data (non-DFS). A growing non-DFS share means that something other than HDFS is consuming the disks that HDFS depends on.
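As a worked example of the capacity split, the Python sketch below converts raw byte counts into percentages of total capacity. It assumes the `CapacityTotal`, `CapacityUsed`, `CapacityUsedNonDFS`, and `CapacityRemaining` fields of the NameNode `FSNamesystem` JMX bean, all in bytes; the function name and the sample numbers are hypothetical.

```python
def capacity_breakdown(fs_bean):
    """Return DFS, non-DFS, and remaining space as percentages of the total."""
    total = fs_bean["CapacityTotal"]
    if total <= 0:
        # No DataNodes have reported capacity yet.
        return None
    return {
        "dfs_used_pct": 100.0 * fs_bean["CapacityUsed"] / total,
        "non_dfs_used_pct": 100.0 * fs_bean["CapacityUsedNonDFS"] / total,
        "remaining_pct": 100.0 * fs_bean["CapacityRemaining"] / total,
    }


# Illustrative sample using small round numbers instead of real byte counts.
sample_bean = {
    "CapacityTotal": 1000,
    "CapacityUsed": 600,
    "CapacityUsedNonDFS": 100,
    "CapacityRemaining": 300,
}
print(capacity_breakdown(sample_bean))
```

In this sample, 60% of the capacity holds HDFS data, 10% is taken by non-HDFS data, and 30% remains free.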
Load is a metric that shows the load on the HDFS cluster, expressed as the number of concurrent file accesses (reads and writes) across all DataNodes. Monitoring this metric is important for ensuring that the system is able to process requests efficiently and that the system is not overloaded.
Volume Failures Total is a metric that shows the number of storage volume failures across all DataNodes. Each failed volume reduces capacity and forces HDFS to re-replicate the blocks it held, so any increase deserves attention.
Files Total is a metric that shows the total number of files stored in the HDFS cluster. Because the NameNode keeps the metadata for every file in memory, a growing file count directly increases NameNode heap usage.
Blocks Total is a metric that shows the total number of data blocks stored in the HDFS cluster. Like the file count, the block count drives NameNode memory consumption, so it is worth tracking alongside Heap Memory.
Blocks is a metric that shows the number of corrupt, missing, and under-replicated data blocks. Missing or corrupt blocks mean potential data loss, while under-replicated blocks mean reduced fault tolerance until HDFS restores the missing replicas.
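The three block counters do not carry the same urgency, which a simple classification makes explicit. The following Python sketch assumes the `MissingBlocks`, `CorruptBlocks`, and `UnderReplicatedBlocks` fields of the NameNode `FSNamesystem` JMX bean. The function name and the severity labels are hypothetical and do not correspond to Netdata's built-in alerts.

```python
def block_health(fs_bean):
    """Classify block state as 'critical', 'warning', or 'ok'."""
    if fs_bean["MissingBlocks"] > 0 or fs_bean["CorruptBlocks"] > 0:
        # Data may already be unreadable.
        return "critical"
    if fs_bean["UnderReplicatedBlocks"] > 0:
        # Data is readable, but has fewer replicas than configured.
        return "warning"
    return "ok"


# Illustrative sample, not real cluster data.
sample_bean = {"MissingBlocks": 0, "CorruptBlocks": 0, "UnderReplicatedBlocks": 12}
print(block_health(sample_bean))  # prints warning
```

A short-lived "warning" state is normal after a DataNode restart, since HDFS re-replicates blocks on its own; it becomes a concern when the count does not fall.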
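Data Nodes is a metric that shows the number of live, dead, and stale DataNodes in the HDFS cluster. A stale DataNode is one whose heartbeat has not been received within the configured stale interval; a dead DataNode has been silent long enough for HDFS to stop using it. Monitoring this metric is important for ensuring that DataNodes are running correctly and that the data they hold remains accessible.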
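A quick way to read the DataNode counts is as a share of the nodes the NameNode knows about. The Python sketch below assumes the `NumLiveDataNodes`, `NumDeadDataNodes`, and `StaleDataNodes` fields of the NameNode `FSNamesystem` JMX bean; field names can differ between Hadoop releases, so check them against your own JMX output. The function name and sample values are illustrative.

```python
def datanode_summary(fs_bean):
    """Report the dead-node percentage and the stale-node count."""
    live = fs_bean["NumLiveDataNodes"]
    dead = fs_bean["NumDeadDataNodes"]
    known = live + dead
    # Avoid dividing by zero on a cluster with no registered DataNodes.
    dead_pct = 100.0 * dead / known if known > 0 else 0.0
    return {
        "known": known,
        "dead_pct": dead_pct,
        "stale": fs_bean["StaleDataNodes"],
    }


# Illustrative sample, not real cluster data.
sample_bean = {"NumLiveDataNodes": 9, "NumDeadDataNodes": 1, "StaleDataNodes": 2}
print(datanode_summary(sample_bean))
```

Here one of ten known DataNodes is dead, giving a dead percentage of 10, and two more have stopped reporting on time and may be about to follow.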
Datanode Capacity is a metric that shows the amount of disk space remaining and used on a DataNode. Monitoring this metric is important for ensuring that each DataNode has enough disk space to store new blocks.
Datanode Used Capacity is a metric that shows how the used disk space on a DataNode is split between HDFS data (DFS) and non-HDFS data (non-DFS). Monitoring this metric is important for ensuring that DataNodes have enough disk space to perform their tasks.
Datanode Failed Volumes is a metric that shows the number of failed storage volumes on a DataNode. A failed volume makes the block replicas stored on it unavailable, so it should be replaced or repaired promptly.
Datanode Bandwidth is a metric that shows the amount of data read and written by DataNodes. Monitoring this metric is important for ensuring that DataNodes are able to send and receive data efficiently.