Consul monitoring with Netdata

What is Consul?

Consul is a distributed service discovery and configuration system. It provides a way to reliably store and manage configuration information, and allows services to quickly discover and connect to one another. With Consul, organizations can quickly build and maintain distributed systems.

Monitoring Consul with Netdata

To monitor Consul with Netdata, you need both Consul and Netdata installed on your system.

Netdata auto-discovers hundreds of services, and for those it doesn’t, turning on manual discovery is a one-line configuration. For more information on configuring Netdata for Consul monitoring, please read the collector documentation.

Once the collector is running, you should see the Consul section on the Overview tab in Netdata Cloud populated with charts for the metrics described below.

Netdata has a public demo space (no login required) where you can explore different monitoring use-cases and get a feel for Netdata.

What Consul metrics are important to monitor - and why?

Leadership Changes

Server Leadership Status

Server Leadership Status shows whether a given server is currently the Raft leader or a follower. A healthy cluster has exactly one leader at any time, and a cluster without a leader cannot process writes, so this is one of the first charts to check when the cluster misbehaves.

Raft Leader Last Contact Time

Raft Leader Last Contact Time measures how long it has been since the leader was last able to contact its followers. Values that climb or spike point to network latency or an overloaded leader, and sustained high values can precede a leadership change.

Raft Leader Elections Rate

Raft Leader Elections Rate shows how often leader elections are started. Elections are expected when a server restarts or the leader is deliberately replaced. Frequent elections at other times indicate an unstable cluster, typically caused by network problems or overloaded servers.

Raft Follower Last Contact Leader Time

Raft Follower Last Contact Leader Time shows, from the follower's side, how long it has been since the follower was last in contact with the leader. A follower whose value keeps growing is losing touch with the leader and may trigger an election.

Raft Leadership Transitions Rate

Raft Leadership Transitions Rate shows how often leadership actually moves from one server to another. Occasional transitions are normal during maintenance, but repeated transitions interrupt writes each time and are a sign that the cluster cannot keep a stable leader.
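
As a rough sketch of how a last-contact check could be scripted outside Netdata, the example below scans a metrics payload shaped like the response of Consul's `/v1/agent/metrics` HTTP endpoint. The 200 ms threshold, the helper names, and the sample payload are assumptions made for illustration and are not taken from this article.

```python
import json

# Assumed guideline (not from this article): treat a mean last-contact
# time above 200 ms as worth investigating. Tune this for your cluster.
LAST_CONTACT_WARN_MS = 200.0


def find_sample(metrics, name):
    """Return the first entry of metrics["Samples"] with this name, or None."""
    for sample in metrics.get("Samples", []):
        if sample.get("Name") == name:
            return sample
    return None


def last_contact_is_slow(metrics, threshold_ms=LAST_CONTACT_WARN_MS):
    """True when the mean leader last-contact time exceeds the threshold."""
    sample = find_sample(metrics, "consul.raft.leader.lastContact")
    if sample is None:
        # The metric is absent from this payload, so there is nothing to flag.
        return False
    return sample["Mean"] > threshold_ms


# Hypothetical payload, trimmed to the fields this sketch reads.
example = json.loads("""
{
  "Samples": [
    {"Name": "consul.raft.leader.lastContact", "Count": 12, "Mean": 35.5, "Max": 61.0}
  ]
}
""")

print(last_contact_is_slow(example))
```

In practice the payload would come from an HTTP request to the local agent. The parsing is kept separate here so it can be tried without a running cluster.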

Transaction Timing

KVS Apply Time

KVS Apply Time measures how long it takes to complete an update to the key-value store. A sudden increase means writes are taking longer than usual, which usually points to heavy load on the servers or slow disk and network I/O.

KVS Apply Operations Rate

KVS Apply Operations Rate shows how many key-value store updates are applied per second. Read together with KVS Apply Time, it tells you whether slower updates are caused by more write traffic or by the cluster itself slowing down.

Txn Apply Time

Txn Apply Time measures how long it takes to apply a transaction operation. As with KVS Apply Time, a sudden increase indicates that the servers are struggling to keep up with writes.

Txn Apply Operations Rate

Txn Apply Operations Rate shows how many transaction operations are applied per second. A sharp change in this rate signals a change in write load, and it gives context for any change in Txn Apply Time.

Raft Commit Time

Raft Commit Time measures how long it takes to commit a new entry to the Raft log on the leader. Because every write goes through the Raft log, rising commit times slow down all write operations in the cluster.

Raft Commits Rate

Raft Commits Rate shows how many entries are committed to the Raft log per second. It reflects the overall write load on the servers, so unexpected jumps or drops are worth comparing against the timing metrics above.
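
The timing metrics in this section are easiest to judge against their own recent history rather than a fixed limit. The sketch below shows one possible way to flag a value that drifts away from a baseline. The function name and the 50% tolerance are illustrative assumptions and should be tuned for your environment.

```python
from statistics import mean


def deviates_from_baseline(history_ms, current_ms, tolerance=0.5):
    """Return True when current_ms differs from the mean of history_ms
    by more than `tolerance`, expressed as a fraction of that mean.

    history_ms: recent timing values in milliseconds (the baseline window).
    current_ms: the latest timing value in milliseconds.
    tolerance:  assumed default of 0.5 (50%), chosen for illustration only.
    """
    if not history_ms:
        return False  # no baseline yet, so nothing to compare against
    baseline = mean(history_ms)
    if baseline == 0:
        return current_ms > 0
    return abs(current_ms - baseline) / baseline > tolerance


# Hypothetical apply times, in milliseconds, for the previous intervals.
recent = [10, 12, 11]
print(deviates_from_baseline(recent, 30))
print(deviates_from_baseline(recent, 12))
```

The same check can be applied to KVS Apply Time, Txn Apply Time, or Raft Commit Time, since all three are durations.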

Autopilot

Autopilot Health Status

Autopilot Health Status shows whether Autopilot, Consul's automated server management feature, considers the cluster healthy. The cluster is reported as unhealthy as soon as any server is unhealthy, so this chart gives a quick yes-or-no view of the server group as a whole.

Autopilot Failure Tolerance

Autopilot Failure Tolerance shows how many voting servers the cluster can lose while still keeping quorum and continuing to function. A value of 0 means that losing one more server will cause an outage, so this metric should be watched closely during maintenance and upgrades.
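
Failure tolerance follows from Raft's majority rule: a cluster needs more than half of its voters to make progress. The sketch below works through that arithmetic. The function names are hypothetical, and the calculation is a simplified model of the quorum rule rather than a copy of Autopilot's implementation.

```python
def quorum_size(total_voters):
    """Smallest number of voters that forms a majority."""
    return total_voters // 2 + 1


def failure_tolerance(healthy_voters, total_voters):
    """How many more voters can fail before quorum is lost.

    Simplified model: healthy voters beyond the required quorum,
    never less than zero.
    """
    return max(0, healthy_voters - quorum_size(total_voters))


# Print voters, quorum size and tolerance for common cluster sizes,
# assuming every voter is healthy.
for voters in (1, 3, 5):
    print(voters, quorum_size(voters), failure_tolerance(voters, voters))
```

Under this model a four-server cluster tolerates no more failures than a three-server one, which is why odd cluster sizes are the usual recommendation.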

Autopilot Server Health Status

Autopilot Server Health Status shows whether Autopilot considers each individual server healthy. When the overall Autopilot Health Status turns unhealthy, this chart identifies which server is responsible.

Autopilot Server Stable Time

Autopilot Server Stable Time shows how long each server has been in a stable state. A server whose stable time keeps resetting is flapping between healthy and unhealthy, which is worth investigating even if it looks healthy at the moment you check.

Autopilot Server Serf Status

Autopilot Server Serf Status shows each server's status in Serf, the gossip layer Consul uses for cluster membership. A server that is not reported as alive has failed or left the cluster from the point of view of its peers.

Autopilot Server Voter Status

Autopilot Server Voter Status shows whether each server is a voter in the Raft cluster. Only voters count toward quorum, so an unexpected drop in the number of voters reduces the cluster's failure tolerance.

Memory

Memory Allocated

Memory Allocated shows the amount of memory currently allocated by the Consul process. Steady growth that is not matched by growth in load or stored data can indicate a memory leak.

Memory Sys

Memory Sys shows the total amount of memory the Consul process has obtained from the operating system. It is normally higher than Memory Allocated, and it should stay well below the memory available on the host to avoid swapping or the process being killed.
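
For a quick look at the same two values outside Netdata, the sketch below picks them out of a payload shaped like the response of Consul's `/v1/agent/metrics` endpoint. The gauge names `consul.runtime.alloc_bytes` and `consul.runtime.sys_bytes` are matched by suffix in case the agent's telemetry settings add a hostname to the name. The helper name and the sample values are assumptions made for illustration.

```python
# Gauge name suffixes mapped to the short keys this sketch returns.
WANTED = {
    "runtime.alloc_bytes": "alloc_bytes",
    "runtime.sys_bytes": "sys_bytes",
}


def runtime_memory(metrics):
    """Extract allocated and system memory (in bytes) from metrics["Gauges"]."""
    found = {}
    for gauge in metrics.get("Gauges", []):
        name = gauge.get("Name", "")
        for suffix, key in WANTED.items():
            if name.endswith(suffix):
                found[key] = gauge["Value"]
    return found


# Hypothetical payload, trimmed to the fields this sketch reads.
example = {
    "Gauges": [
        {"Name": "consul.runtime.alloc_bytes", "Value": 4704344, "Labels": {}},
        {"Name": "consul.runtime.sys_bytes", "Value": 23521528, "Labels": {}},
        {"Name": "consul.runtime.num_goroutines", "Value": 87, "Labels": {}},
    ]
}

memory = runtime_memory(example)
print(memory)
```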

Garbage Collection

GC Pause Time

GC Pause Time shows how much time the Consul process spends in stop-the-world garbage collection pauses. While a pause is in progress the process cannot do other work, so long or frequent pauses can delay Raft heartbeats and contribute to leadership instability.

RPC Network Activity

Client RPC Requests Rate

Client RPC Requests Rate shows how many RPC requests per second a Consul agent in client mode makes to the Consul servers. It is a direct measure of the load clients place on the servers, and a sustained rise may mean the servers need more capacity.

Client RPC Requests Exceeded Rate

Client RPC Requests Exceeded Rate shows how many RPC requests per second are rejected because they exceed the configured rate limits. A non-zero value means that requests are being made too frequently for the current limits, either because a client is misbehaving or because the limits are too low for the workload.

Client RPC Requests Failed Rate

Client RPC Requests Failed Rate shows how many RPC requests per second from a client agent to the servers fail. Failures here typically point to servers that are unreachable or overloaded, or to bugs that cause requests to fail.

Raft RPC Install Snapshot Time

Raft RPC Install Snapshot Time measures how long it takes the leader to install a snapshot on a follower that is catching up, for example after downtime or after joining the cluster. Long install times mean that followers take longer to become useful members of the cluster again.

Healthchecks

Node Health Check Status

Node Health Check Status shows the state of the node-level health checks registered in Consul. When a node fails its checks, Consul treats the services on that node as unhealthy as well, so a failing node check can take otherwise working services out of discovery results.

Service Health Check Status

Service Health Check Status shows the state of the health checks attached to individual services. A service instance that fails its checks is left out of health-filtered discovery results, so other services can no longer find or connect to it through Consul.
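
To illustrate the difference between node and service checks, the sketch below summarizes a list of checks shaped like the response of Consul's `/v1/health/state/any` HTTP endpoint. It assumes that node-level checks carry an empty `ServiceID`. The function name and the sample data are hypothetical.

```python
from collections import Counter


def summarize_checks(checks):
    """Count checks by (scope, status).

    scope is "service" when the check has a non-empty ServiceID,
    otherwise "node".
    """
    summary = Counter()
    for check in checks:
        scope = "service" if check.get("ServiceID") else "node"
        summary[(scope, check.get("Status", "unknown"))] += 1
    return summary


# Hypothetical checks, trimmed to the fields this sketch reads.
example = [
    {"Node": "node-a", "CheckID": "serfHealth", "Status": "passing", "ServiceID": ""},
    {"Node": "node-a", "CheckID": "service:web", "Status": "critical", "ServiceID": "web"},
    {"Node": "node-b", "CheckID": "service:db", "Status": "passing", "ServiceID": "db"},
]

for (scope, status), count in sorted(summarize_checks(example).items()):
    print(scope, status, count)
```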

Network RTT

Network LAN RTT

Network LAN RTT shows the estimated round-trip time (RTT) between nodes in the same Consul datacenter. Rising or highly variable RTT slows down gossip and Raft communication, and it is often the underlying cause of the leadership problems described earlier.
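
Consul estimates RTT between nodes from network coordinates, which are available from its `/v1/coordinate/nodes` HTTP endpoint. The sketch below applies the distance calculation described in Consul's network coordinates documentation to two coordinates. The function name and the sample coordinates are assumptions made for illustration, and this is not presented as how Netdata computes its chart.

```python
import math


def estimate_rtt_seconds(a, b):
    """Estimate the round-trip time in seconds between two coordinates.

    Each coordinate is a dict with "Vec" (list of floats), "Height"
    and "Adjustment", as found under "Coord" in the API response.
    """
    if len(a["Vec"]) != len(b["Vec"]):
        raise ValueError("coordinates have different dimensions")

    # Euclidean distance between the vectors, plus both heights.
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a["Vec"], b["Vec"])))
    rtt = distance + a["Height"] + b["Height"]

    # Apply the adjustment terms, but only if the result stays positive.
    adjusted = rtt + a["Adjustment"] + b["Adjustment"]
    return adjusted if adjusted > 0 else rtt


# Hypothetical two-dimensional coordinates (real ones have more dimensions).
node_a = {"Vec": [0.003, 0.004], "Height": 0.001, "Adjustment": 0.0}
node_b = {"Vec": [0.0, 0.0], "Height": 0.001, "Adjustment": 0.0}

print(f"{estimate_rtt_seconds(node_a, node_b) * 1000:.1f} ms")
```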

Raft Saturation

Raft Thread Main Saturation Percent

Raft Thread Main Saturation Percent shows the approximate share of time the main Raft thread is busy and unable to accept new work. Sustained high saturation means the server cannot keep up with its Raft workload, which leads to slower writes and can destabilize the cluster.

Raft Thread FSM Saturation Percent

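Raft Thread FSM Saturation Percent shows the approximate share of time the Raft FSM (finite state machine) thread is busy and unable to accept new work. The FSM thread applies committed log entries to Consul's state, so high saturation here means committed changes are slow to take effect.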

Raft Replication Capacity

Raft FSM Last Restore Duration

Raft FSM Last Restore Duration captures how long the most recent restore operation of the Raft FSM (finite state machine) took, that is, how long it took to rebuild the server's state from a snapshot. It indicates how quickly a restarted or newly added server can catch up, and long restore times are a warning sign when the cluster handles a high write rate.

Raft Leader Oldest Log Age

Raft Leader Oldest Log Age shows how long ago the oldest entry still kept in the leader's log store was written. A follower that restores from a snapshot needs the leader to still hold the logs written since that snapshot, so if this age falls close to the Raft FSM Last Restore Duration, followers may be unable to catch up with the leader.
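
The two metrics in this section are most useful when compared with each other. The sketch below shows one way to express that comparison. The function name is hypothetical, and the safety factor of 2 is an assumed rule of thumb rather than a value taken from this article.

```python
def replication_headroom_ok(oldest_log_age_ms, last_restore_duration_ms,
                            safety_factor=2.0):
    """Return True when the leader keeps logs for long enough that a
    follower restoring from a snapshot can still catch up.

    oldest_log_age_ms:        Raft Leader Oldest Log Age, in milliseconds.
    last_restore_duration_ms: Raft FSM Last Restore Duration, in milliseconds.
    safety_factor:            assumed margin; 2.0 is illustrative only.
    """
    return oldest_log_age_ms >= safety_factor * last_restore_duration_ms


# Hypothetical readings: logs go back 10 minutes, a restore takes 90 seconds.
print(replication_headroom_ok(600_000, 90_000))
# Hypothetical readings: logs go back 2 minutes, a restore takes 90 seconds.
print(replication_headroom_ok(120_000, 90_000))
```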

BoltDB Performance

Raft BoltDB Freelist Bytes

Raft BoltDB Freelist Bytes captures the number of bytes needed to store the freelist, BoltDB's record of free pages, in the database that Consul servers use to store their Raft logs. A large or growing freelist can add overhead to every write, so this metric helps explain slow commits on servers with large databases.

Raft BoltDB Logs Per Batch Rate

Raft BoltDB Logs Per Batch Rate captures the number of log entries written to the BoltDB database per batch. It helps in understanding the write pattern of the database, and it gives context when investigating slow write operations.

Raft BoltDB Store Logs Time

Raft BoltDB Store Logs Time captures how long it takes to write log entries to the BoltDB database. Because every Raft commit has to be stored on disk, rising values usually point to slow disk I/O and translate directly into slower writes for the whole cluster.

License

License Expiration Time

License Expiration Time captures the time remaining until the Consul license expires. Monitoring it lets you renew the license well before it expires and keeps the system compliant with the license agreement.
