CockroachDB monitoring with Netdata

What is CockroachDB?

CockroachDB is an open source distributed SQL database designed to make it easy to build global, scalable cloud applications that survive disasters. It offers strong consistency, high availability, and horizontal scalability, along with distributed ACID transactions and low-latency reads and writes.

Monitoring CockroachDB with Netdata

To monitor CockroachDB with Netdata, you need both CockroachDB and Netdata installed on your system.

Netdata auto-discovers hundreds of services, and for those it doesn’t, turning on manual discovery is a one-line configuration. For more information on configuring Netdata for CockroachDB monitoring, please read the collector documentation.
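For a sense of what a collector works with: CockroachDB exposes its metrics in the Prometheus text format. The sketch below is a minimal Python parser for that format, not Netdata's implementation. The function name, the sample text, and the metric names in it are illustrative assumptions, and timestamps are assumed to be absent from the samples.

```python
def parse_prometheus_text(text):
    """Parse Prometheus text-format samples into {metric_name: value}.

    Values that share a metric name but differ in labels (for example one
    sample per store) are summed. Timestamps are assumed to be absent.
    """
    totals = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, raw_value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]  # drop the label set
        totals[name] = totals.get(name, 0.0) + float(raw_value)
    return totals


# Hypothetical sample; the metric names are assumptions for illustration.
SAMPLE = """\
# HELP sys_uptime Process uptime
# TYPE sys_uptime gauge
sys_uptime 12345
capacity{store="1"} 1e+11
capacity{store="2"} 2e+11
"""

metrics = parse_prometheus_text(SAMPLE)
print(metrics["sys_uptime"], metrics["capacity"])
```

Summing across label sets is a simplification that suits per-store totals such as capacity; a real collector would keep the labels.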

You should now see the CockroachDB section on the Overview tab in Netdata Cloud already populated with charts about all the metrics you care about.

Netdata has a public demo space (no login required) where you can explore different monitoring use-cases and get a feel for Netdata.

What CockroachDB metrics are important to monitor - and why?

Process

Process CPU Time Combined Percentage

Process CPU Time Combined Percentage is the sum of the user and system CPU time as a percentage of the total CPU time available. This metric is important to monitor as it can help you understand how much of the total CPU time is being used, allowing you to identify potential performance bottlenecks. High CPU utilization can indicate that the application is not running as efficiently as it could be, or that it is being bottlenecked by other parts of the system.
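As a rough illustration of this definition, the combined figure can be derived from cumulative user and system CPU time sampled at two points. This is a minimal Python sketch; the function name and its parameters are hypothetical and not part of Netdata or CockroachDB.

```python
def combined_cpu_percent(user_s0, sys_s0, user_s1, sys_s1, elapsed_s, cores=1):
    """Return user + system CPU time as a percentage of CPU time available.

    Inputs are cumulative CPU seconds at two sample points, the wall-clock
    seconds between the samples, and the number of CPU cores.
    """
    if elapsed_s <= 0 or cores <= 0:
        raise ValueError("elapsed_s and cores must be positive")
    used = (user_s1 - user_s0) + (sys_s1 - sys_s0)
    return 100.0 * used / (elapsed_s * cores)


# 3 s of user time and 1 s of system time over 10 s on a 4-core host.
print(combined_cpu_percent(10.0, 2.0, 13.0, 3.0, 10.0, cores=4))
```

Dividing by the core count normalizes the value to the whole host, so 100% means every core was busy for the full interval.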

Process CPU Time Percentage

Process CPU Time Percentage is the percentage of CPU time used by a process, broken down into user and system mode. Monitoring this metric can help you identify processes that are using more CPU time than expected and can help you troubleshoot performance issues, since the split shows whether the time is spent in the application's own code or in the kernel on its behalf.

Process CPU Time

Process CPU Time is the amount of CPU time used by a process in both user and system mode. It is important to monitor this metric to ensure that the applications are not using too much CPU time, as this can indicate that the application is not running efficiently. High CPU utilization can indicate that the application is not running as efficiently as it could be, or that it is being bottlenecked by other parts of the system.

Process Memory

Process Memory is the amount of RAM used by a process. This metric is important to monitor as it can help you understand how much memory a process is consuming, allowing you to identify potential memory leaks or identify processes that are using more memory than expected.

Process File Descriptors

Process File Descriptors is the number of open file descriptors for a process. It is important to monitor this metric as it can help you identify processes that are not closing their file descriptors properly, which can eventually exhaust the process's file descriptor limit and cause new connections or file opens to fail.
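One simple way to spot a descriptor leak is to check whether the count only ever grows across a window of samples. The following Python sketch shows the idea; the function name and the threshold are hypothetical, and a production check would likely tolerate small dips.

```python
def is_steadily_growing(samples, min_growth):
    """Return True if samples never decrease and grow by at least min_growth."""
    if len(samples) < 2:
        return False
    never_decreases = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_decreases and (samples[-1] - samples[0]) >= min_growth


# Open file descriptor counts taken at regular intervals (made-up values).
print(is_steadily_growing([100, 120, 150, 190], min_growth=50))
print(is_steadily_growing([100, 90, 110, 95], min_growth=50))
```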

Process Uptime

Process Uptime is the total amount of time a process has been running. It is important to monitor this metric because an unexpected drop back to zero shows that the process has crashed or been restarted. Repeated resets in a short period suggest that the process is crash-looping, which can lead to availability and performance issues.

Hosts

Host Disk Bandwidth

Host disk bandwidth is a metric which measures the rate of data read from and written to the disk, typically measured in KiB per second. This metric is important to monitor as it is indicative of the overall performance of the underlying storage infrastructure. A decrease in disk bandwidth can indicate a bottleneck in the system, and can lead to decreased performance of the application.
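Rates like this are usually derived from cumulative byte counters sampled at two points in time. The Python sketch below illustrates the arithmetic under that assumption; the function name is hypothetical, and treating a decreasing counter as a reset is a design choice rather than documented behavior.

```python
def rate_kib_per_second(prev_bytes, curr_bytes, interval_s):
    """Convert two cumulative byte counter samples into a KiB/s rate.

    Returns None when the counter went backwards, which is treated here
    as a counter reset (for example after a process restart).
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    if curr_bytes < prev_bytes:
        return None
    return (curr_bytes - prev_bytes) / 1024 / interval_s


# 1 MiB written over 2 seconds.
print(rate_kib_per_second(0, 1048576, 2))
```

The same calculation, without the division by 1024, applies to operation and packet counters.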

Host Disk Operations

Host disk operations is a metric which measures the rate of read and write operations performed on the disk, typically measured in operations per second. This metric is important to monitor as it is indicative of the utilization of the underlying storage infrastructure. An increase in disk operations can indicate an increase in utilization, and can lead to decreased performance of the application if not monitored.

Host Disk IOPS in Progress

Host disk IOPS in progress is a metric which measures the number of disk I/O operations in progress at any given time, expressed as a count of operations. This metric is important to monitor because it shows how many requests are outstanding at the disk. A sustained increase indicates that requests are queuing, and can lead to increased latency if not monitored.

Host Network Bandwidth

Host network bandwidth is a metric which measures the rate of data sent and received over the network, typically measured in kilobits per second. This metric is important to monitor as it is indicative of the overall performance of the underlying network infrastructure. A decrease in network bandwidth can indicate a bottleneck in the system, and can lead to decreased performance of the application.

Host Network Packets

Host network packets is a metric which measures the rate of packets sent and received over the network, typically measured in packets per second. This metric is important to monitor as it is indicative of the utilization of the underlying network infrastructure. An increase in packets can indicate an increase in utilization, and can lead to decreased performance of the application if not monitored.

Nodes

Live Nodes

Live nodes is a metric that measures the number of healthy nodes in a CockroachDB cluster. This metric is important because it gives an indication of the overall health of the cluster, as well as the availability of the cluster. In a healthy cluster the number of live nodes matches the number of nodes you expect to be running, and if the number of live nodes decreases, it can indicate a problem with the cluster.

Node Liveness Heartbeats

Node liveness heartbeats measure the number of successful and failed heartbeats sent between nodes in a CockroachDB cluster. This metric is important because it can be used to detect whether a node is healthy or not. If a node fails to send a successful heartbeat, it can indicate that the node is unhealthy or has failed. Monitoring this metric can help to identify and prevent potential issues with the cluster.
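A convenient way to read the successful and failed counts together is as a failure ratio. This is a minimal Python sketch; the function name is hypothetical and any alerting threshold is left to the reader.

```python
def heartbeat_failure_ratio(successes, failures):
    """Return the fraction of heartbeats that failed, or None if none were sent."""
    total = successes + failures
    if total == 0:
        return None
    return failures / total


# 95 successful and 5 failed heartbeats in the sampling window.
print(heartbeat_failure_ratio(95, 5))
```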

Storage

Total Storage Capacity

Total storage capacity is a metric that measures the total amount of storage space available in a CockroachDB cluster. This metric is important because it can help to identify potential issues with the size of the cluster, as well as to ensure that the cluster has enough storage space to support the application.

Storage Capacity Usability

Storage capacity usability is a metric that measures the amount of usable and unusable storage capacity in a CockroachDB cluster. This metric is important because it can help to identify and prevent potential issues with the storage capacity of the cluster. Monitoring this metric can help to ensure that the cluster has enough storage space to support the application.

Storage Usable Capacity

Storage usable capacity is a metric that measures the amount of available and used storage capacity in a CockroachDB cluster. This metric is important because it can help to identify and prevent potential issues with the storage capacity of the cluster. Monitoring this metric can help to ensure that the cluster has enough storage space to support the application.

Storage Used Capacity Percentage

Storage used capacity percentage is a metric that measures the percentage of total and usable storage capacity used in a CockroachDB cluster. This metric is important because it can help to identify potential issues with the storage capacity of the cluster. Monitoring this metric can help to ensure that the cluster has enough storage space to support the application and to prevent any potential performance issues caused by over-utilization of storage capacity.
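The four storage metrics above can all be derived from three raw values: total capacity, available bytes, and used bytes. The Python sketch below shows one way to do so. It assumes that usable capacity is the sum of available and used bytes, which is an interpretation of the chart names and not something stated in the text; the function name is hypothetical.

```python
def storage_breakdown(capacity, available, used):
    """Derive usable/unusable capacity and used percentages from raw bytes.

    Assumption: usable capacity = available + used; the remainder of the
    total capacity is treated as unusable by the database.
    """
    usable = available + used
    unusable = capacity - usable
    return {
        "usable": usable,
        "unusable": unusable,
        "used_percent_of_total": 100.0 * used / capacity if capacity else 0.0,
        "used_percent_of_usable": 100.0 * used / usable if usable else 0.0,
    }


# Made-up example: 1000 bytes of capacity, 600 available, 300 used.
print(storage_breakdown(1000, 600, 300))
```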

SQL

SQL Connections

SQL connections are an important metric to monitor in CockroachDB because they are a key indicator of how much load the system is handling. By monitoring the active connections, you can identify any potential issues with the system such as overloaded resources or unexpected spikes in usage. Additionally, by tracking the number of connections over time, you can establish a baseline for normal usage and identify any potential issues before they become a problem.
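The baseline idea can be sketched with a simple statistical check: treat recent samples as the baseline and flag a value that lies more than a few standard deviations away from their mean. This Python example is only an illustration; the function name, the window, and the factor `k` are hypothetical choices, not Netdata's anomaly detection.

```python
from statistics import mean, stdev


def is_spike(history, current, k=3.0):
    """Return True if current deviates from the baseline by more than k sigma."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current != baseline
    return abs(current - baseline) > k * spread


# Recent connection counts (made-up values) and two new observations.
recent = [100, 102, 98, 101, 99]
print(is_spike(recent, 130), is_spike(recent, 101))
```

The same check applies to the other SQL metrics in this section, such as bandwidth or statement counts.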

SQL Bandwidth

SQL Bandwidth is a metric that can help you understand how much traffic is being sent and received by your CockroachDB cluster. By tracking the received and sent bandwidth, you can detect any unexpected spikes in traffic and identify any potential issues with the system’s performance. Additionally, by monitoring the bandwidth metrics over time, you can establish a baseline for normal usage and identify any potential performance issues before they become a problem.

SQL Statements Total

SQL Statements Total is a metric that can help you understand the number of SQL statements that have been started and executed within your CockroachDB cluster. By monitoring the number of started and executed statements, you can detect any unexpected spikes in activity and identify any potential issues with the system’s performance. Additionally, by tracking the number of statements over time, you can establish a baseline for normal usage and identify any potential performance issues before they become a problem.

SQL Errors

SQL Errors is a metric that can help you understand any errors that might be occurring within your CockroachDB cluster. By monitoring the number of statement and transaction errors, you can identify any potential issues with the system’s performance. Additionally, by tracking the number of errors over time, you can establish a baseline for normal usage and identify any potential performance issues before they become a problem.

SQL Started DDL Statements

SQL Started DDL Statements is a metric that can help you understand the number of DDL statements that have been started within your CockroachDB cluster. By monitoring the number of started DDL statements, you can detect any unexpected spikes in activity and identify any potential issues with the system’s performance. Additionally, by tracking the number of DDL statements over time, you can establish a baseline for normal usage and identify any potential performance issues before they become a problem.

SQL Executed DDL Statements

SQL Executed DDL Statements is a metric that can help you understand the number of DDL statements that have been executed within your CockroachDB cluster. By monitoring the number of executed DDL statements, you can detect any unexpected spikes in activity and identify any potential issues with the system’s performance. Additionally, by tracking the number of DDL statements over time, you can establish a baseline for normal usage and identify any potential performance issues before they become a problem.

SQL Started DML Statements

SQL Started DML Statements is a metric that can help you understand the number of DML statements that have been started within your CockroachDB cluster. By monitoring the number of started DML statements, you can detect any unexpected spikes in activity and identify any potential issues with the system’s performance. Additionally, by tracking the number of DML statements over time, you can establish a baseline for normal usage and identify any potential performance issues before they become a problem.

SQL Executed DML Statements

SQL Executed DML Statements is a metric that can help you understand the number of DML statements that have been executed within your CockroachDB cluster. By monitoring the number of executed DML statements, you can detect any unexpected spikes in activity and identify any potential issues with the system’s performance. Additionally, by tracking the number of DML statements over time, you can establish a baseline for normal usage and identify any potential performance issues before they become a problem.

SQL Started TCL Statements

SQL Started TCL Statements is a metric that can help you understand the number of TCL statements that have been started within your CockroachDB cluster. By monitoring the number of started TCL statements, you can detect any unexpected spikes in activity and identify any potential issues with the system’s performance. Additionally, by tracking the number of TCL statements over time, you can establish a baseline for normal usage and identify any potential performance issues before they become a problem.

SQL Executed TCL Statements

SQL Executed TCL Statements is a metric that can help you understand the number of TCL statements that have been executed within your CockroachDB cluster. By monitoring the number of executed TCL statements, you can detect any unexpected spikes in activity and identify any potential issues with the system’s performance. Additionally, by tracking the number of TCL statements over time, you can establish a baseline for normal usage and identify any potential performance issues before they become a problem.

SQL Active Distributed Queries

SQL Active Distributed Queries is a metric that can help you understand the number of active distributed queries that are running within your CockroachDB cluster. By monitoring this metric, you can identify any unexpected spikes in activity and identify any potential issues with the system’s performance. Additionally, by tracking the number of active distributed queries over time, you can establish a baseline for normal usage and identify any potential performance issues before they become a problem.

SQL Distributed Flows

SQL Distributed Flows is a metric that can help you understand the number of active and queued distributed flows that are running within your CockroachDB cluster. By monitoring this metric, you can identify any unexpected spikes in activity and identify any potential issues with the system’s performance. Additionally, by tracking the number of active and queued distributed flows over time, you can establish a baseline for normal usage and identify any potential performance issues before they become a problem.

Data and Transactions

Live Bytes

The amount of live data in kilobytes (KiB) used by applications and by the CockroachDB system itself. By monitoring this metric, DevOps and SRE engineers can gain insight into how much current data the cluster is storing, and can take proactive steps to ensure that the cluster is adequately provisioned with enough storage for optimal performance.

Logical Data

The amount of logical data in kilobytes (KiB) that the CockroachDB cluster is managing, including both keys and values. This metric can be used to identify if the cluster is managing too much data, as this can lead to poor performance. Monitoring this metric can help identify possible performance issues, and help determine whether additional resources such as additional nodes or more memory need to be provisioned to the cluster.

Logical Data Count

The total number of logical data elements (keys and values) that the CockroachDB cluster is managing. This metric can be used to identify if the cluster is managing too much data, as this can lead to poor performance. Monitoring this metric can help identify possible performance issues, and help determine whether additional resources such as additional nodes or more memory need to be provisioned to the cluster.

KV Transactions

The number of CockroachDB transactions, including committed, fast-path committed, and aborted transactions. Monitoring this metric can help identify potential issues with transactions, and can help determine if replication or availability settings need to be adjusted.

KV Transaction Restarts

The number of CockroachDB transaction restarts, including write too old, write too old multiple, forwarded timestamp, possible replay, async consensus failure, read within uncertainty interval, aborted, push failure, and unknown restarts. Monitoring this metric can help identify potential issues with transactions, since a rising restart rate often points to contention between transactions.

Ranges

Ranges

The number of ranges that the CockroachDB cluster is managing. Monitoring this metric can help identify potential issues with ranges and can help determine if replication or availability settings need to be adjusted.

Ranges Replication Problem

The number of ranges that are unavailable, under-replicated, or over-replicated. Monitoring this metric can help identify potential issues with replication, and can help determine if replication or availability settings need to be adjusted.
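In a healthy cluster all three counts are expected to be zero, so a check can simply report any that are not. A minimal Python sketch, with a hypothetical function name and message format:

```python
def replication_problems(unavailable, under_replicated, over_replicated):
    """Return a human-readable list of non-zero range replication problems."""
    problems = []
    if unavailable > 0:
        problems.append(f"{unavailable} unavailable range(s)")
    if under_replicated > 0:
        problems.append(f"{under_replicated} under-replicated range(s)")
    if over_replicated > 0:
        problems.append(f"{over_replicated} over-replicated range(s)")
    return problems


print(replication_problems(1, 2, 0))
```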

Range Events

The number of range events, including split, add, remove, and merge events. Monitoring this metric can help identify potential issues with ranges, and can help determine if replication or availability settings need to be adjusted.

Range Snapshot Events

The number of range snapshot events, including generated, applied Raft-initiated, applied learner, and applied preemptive events. Monitoring this metric can help identify potential issues with range snapshots, and can help determine if replication or availability settings need to be adjusted.

RocksDB

RocksDB Read Amplification

The average number of disk reads RocksDB performs per query. Monitoring this metric can help identify potential issues with reads, since a persistently high value means each query has to touch more files on disk, which typically happens when compactions are falling behind.

RocksDB Table Operations

The number of RocksDB table operations, including compactions and flushes. Monitoring this metric can help identify potential issues with table operations, such as unusually heavy compaction activity competing with foreground traffic for disk I/O.

RocksDB Cache Usage

The amount of RocksDB cache memory in kilobytes (KiB) that is being used. Monitoring this metric can help identify potential issues with cache memory usage, and can help determine if the cache size needs to be adjusted.

RocksDB Cache Operations

The number of RocksDB cache operations, including hits and misses. Monitoring this metric can help identify potential issues with cache operations, and can help determine if the cache size needs to be adjusted.

RocksDB Cache Hit Rate

The RocksDB cache hit rate, expressed as the percentage of cache lookups that were served from the cache. Monitoring this metric can help identify potential issues with cache operations, and a persistently low hit rate can help determine if the cache size needs to be adjusted.
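The hit rate follows directly from the hit and miss counts in the previous metric. A minimal Python sketch of the calculation, with a hypothetical function name:

```python
def cache_hit_rate_percent(hits, misses):
    """Return hits as a percentage of all cache lookups, or None if there were none."""
    lookups = hits + misses
    if lookups == 0:
        return None
    return 100.0 * hits / lookups


# 900 hits and 100 misses in the sampling window.
print(cache_hit_rate_percent(900, 100))
```

Returning None for an idle interval avoids reporting a misleading 0% when no lookups happened.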

Replication

Replicas

Replicas is a metric that measures the number of replicas available in the cluster. It is important to monitor replicas because having a sufficient number of replicas is essential for keeping the cluster available and resilient to failure. With the right number of replicas, the cluster can survive the loss of one or more nodes without any service disruption.

Replicas Quiescence

Replicas Quiescence is a metric that measures the number of replicas that are in a quiescent state. A replica becomes quiescent when its range is idle, so its Raft group stops exchanging periodic heartbeats until new activity arrives. Monitoring this metric shows how much of the cluster is idle versus actively processing traffic, and a sudden drop in quiescent replicas means that many ranges have become active at once.

Replicas Leaders

Replicas Leaders is a metric that measures the number of replicas that are currently acting as leaders. This metric helps to ensure that the cluster is running efficiently and that there is sufficient leadership to perform consensus and ordering operations.

Replicas Leaseholders

Replicas Leaseholders is a metric that measures the number of replicas that are currently acting as leaseholders. A leaseholder is the replica that serves reads and coordinates writes for its range, so this metric helps to ensure that the cluster is running efficiently and that leases are spread evenly across nodes.

Other metrics

Queue Processing Failures

Queue Processing Failures is a metric that measures the number of failures that occur during queue processing. This metric is important to monitor because it can signal potential issues with the cluster, such as slow performance or data corruption.

Rebalancing Queries

Rebalancing Queries is a metric that measures the average number of queries per second received by a store, averaged over a longer period and used as an input to rebalancing decisions. This metric is important to monitor because a store with a much higher value than its peers is carrying a disproportionate share of the query load.

Rebalancing Writes

Rebalancing Writes is a metric that measures the average number of writes per second applied by a store, averaged over a longer period and used as an input to rebalancing decisions. This metric is important to monitor because a store with a much higher value than its peers is carrying a disproportionate share of the write load.

Timeseries Samples

Timeseries Samples is a metric that measures the number of samples that are written to the cluster’s time series database. This metric is important to monitor because it helps to ensure that the cluster is running at optimal performance, and that the time series data is being written quickly and efficiently.

Timeseries Write Errors

Timeseries Write Errors is a metric that measures the number of errors that occur during writing to the cluster’s time series database. This metric is important to monitor because it can signal potential issues with the cluster, such as slow performance or data corruption.

Timeseries Write Bytes

Timeseries Write Bytes is a metric that measures the amount of data (in KiB) that is written to the cluster’s time series database. This metric is important to monitor because it helps to ensure that the cluster is running at optimal performance, and that the time series data is being written quickly and efficiently.

Slow Requests

Slow Requests is a metric that measures the number of requests that take longer than the configured latency threshold to process. This metric is important to monitor because it helps to identify requests that are taking longer than expected to process, which can indicate potential issues with the cluster’s performance.

Code Heap Memory Usage

Code Heap Memory Usage is a metric that measures the amount of heap memory (in KiB) allocated by the process's Go and CGo code. This metric is important to monitor because it helps to ensure that the cluster is running at optimal performance, and that the process is not over-allocating memory.

Goroutines

Goroutines is a metric that measures the number of goroutines that are currently running. This metric is important to monitor because it helps to ensure that the cluster is running at optimal performance, and that goroutines are not over-utilizing resources.

GC Count

GC Count is a metric that measures the number of garbage collection (GC) invocations. This metric is important to monitor because it helps to ensure that the cluster is running at optimal performance, and that GC is not over-utilizing resources.

GC Pause

GC Pause is a metric that measures the amount of time (in microseconds) that garbage collection (GC) pauses the application. This metric is important to monitor because it helps to ensure that the cluster is running at optimal performance, and that GC pauses are not affecting application performance.

CGO Calls

CGO Calls is a metric that measures the number of calls made from Go code into C code through cgo.
