VerneMQ monitoring with Netdata

What is VerneMQ?

VerneMQ is an open source, distributed, high-performance MQTT message broker. It is designed to enable massive scalability, with fast message delivery and low overhead. It supports all MQTT features and is designed to be clustered for high availability and scalability.

Monitoring VerneMQ with Netdata

The prerequisites for monitoring VerneMQ with Netdata are to have VerneMQ and Netdata installed on your system.

Netdata auto-discovers hundreds of services, and for those it doesn’t, turning on manual discovery is a one-line configuration. For more information on configuring Netdata for VerneMQ monitoring, please read the collector documentation.

You should now see the VerneMQ section on the Overview tab in Netdata Cloud already populated with charts about all the metrics you care about.
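If the VerneMQ section does not appear, it can help to confirm that the broker is exposing metrics at all. The sketch below is a minimal check, not part of Netdata. It assumes VerneMQ serves Prometheus-format text at http://127.0.0.1:8888/metrics (an assumption, adjust it to your HTTP listener) and that lines carry no trailing timestamps. The function names are hypothetical.

```python
from urllib.request import urlopen


def parse_metrics(text):
    """Parse Prometheus text exposition lines into {series: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field; everything before
        # it is the series name, including any {labels}.
        series, _, value = line.rpartition(" ")
        try:
            samples[series.strip()] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return samples


def fetch_metrics(url="http://127.0.0.1:8888/metrics", timeout=5):
    # Assumed endpoint: change host, port, and path to match your setup.
    with urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

Calling fetch_metrics() on the broker host should return a non-empty dictionary if the endpoint is reachable. An exception or an empty result suggests checking the listener configuration before looking at Netdata itself.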

Netdata has a public demo space (no login required) where you can explore different monitoring use-cases and get a feel for Netdata.

What VerneMQ metrics are important to monitor - and why?

Sockets

The number of socket open and close events per second. It is important to monitor socket activity to ensure that the server is not overwhelmed by too many connections. If sockets are opened much faster than they are closed, the number of connected clients keeps growing, which could lead to system instability and delays in response times.
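The relationship between raw counters and the per-second values on a chart can be sketched as follows. This is an illustration only: it assumes the broker reports lifetime counters for socket opens and closes (named socket_open and socket_close here, which is an assumption), and the helper names are hypothetical.

```python
def per_second_rate(previous, current, interval_seconds):
    """Convert two readings of an increasing counter into a rate."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = current - previous
    # A negative delta means the counter was reset (for example after a
    # broker restart), so treat the current reading as the whole delta.
    if delta < 0:
        delta = current
    return delta / interval_seconds


def open_connections(socket_open, socket_close):
    """Estimate sockets still open from the two lifetime counters."""
    return max(socket_open - socket_close, 0)
```

For example, per_second_rate(100, 160, 10) gives 6.0 opens per second over a ten-second interval, and open_connections(160, 150) gives 10.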

Client keepalive expired

This metric indicates the number of client connections that were closed due to the keepalive timer expiring. If the number of expired client connections is too high, it could be an indication of poor network performance or excessive load on the server.

Socket close timeout

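This metric indicates the number of client connections that were closed due to a timeout, typically because the client opened a socket but did not send its CONNECT packet in time. This could be caused by a high load on the server, by a slow or unreliable network, or by misbehaving clients.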

Socket errors

This metric indicates the number of errors encountered while handling socket operations. It is important to monitor this metric as it can provide insight into potential problems in the system.

Queue processes

This metric indicates the number of queue processes that are currently running. Monitoring this metric can help identify any potential issues related to queue processes, such as too many processes running in parallel or not enough processes running to handle the workload.

Queue processes operations

This metric indicates the number of operations related to queue processes, such as setting up and tearing down processes. Monitoring these operations can help identify any potential issues related to queue processes, such as too many processes being created or not enough processes being destroyed.

Queue process init from storage

This metric indicates the number of queue processes that are initialized from storage. Monitoring this metric can help to identify any potential issues related to the initialization of queue processes, such as slow initialization or too many processes being initialized at once.

Queue messages

This metric indicates the number of messages that are received and sent within the system. It is important to monitor this metric to ensure that the system is able to keep up with the load of messages being sent and received.

Queue undelivered messages

This metric indicates the number of messages that were dropped, expired, or unhandled. It is important to monitor this metric to ensure that messages are being delivered in a timely manner and that there are no potential issues such as messages being dropped or expired.
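A raw count of undelivered messages is easier to judge relative to total traffic. The sketch below computes that share. The four inputs correspond to counters such as queue_message_in, queue_message_drop, queue_message_expired, and queue_message_unhandled; those names are assumptions about what the broker exposes, and the function name is hypothetical.

```python
def undelivered_ratio(messages_in, dropped, expired, unhandled):
    """Share of queued messages that never reached a subscriber."""
    undelivered = dropped + expired + unhandled
    # Avoid dividing by zero when no messages arrived in the window.
    if messages_in <= 0:
        return 0.0
    return undelivered / messages_in
```

With 1000 messages in and 5 dropped, 3 expired, and 2 unhandled, the result is 0.01, or 1% of traffic. Use counter deltas over the same time window for all four inputs so that the ratio reflects recent behavior rather than the lifetime of the broker.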

Router subscriptions

This metric indicates the number of subscriptions that are currently active. Monitoring this metric can help identify any potential issues related to the number of subscriptions, such as too many active subscriptions or too few active subscriptions.

Router matched subscriptions

This metric indicates the number of local and remote subscriptions that were matched. Monitoring this metric can help identify any potential issues related to the number of matched subscriptions, such as too many local or remote subscriptions being matched.

Router memory

This metric indicates the amount of memory being used by the router. It is important to monitor this metric to ensure that the router has enough memory to handle the load of messages being sent and received.

Average scheduler utilization

This metric indicates the average utilization of the scheduler over a given period of time. It is important to monitor this metric to ensure that the scheduler is not overworked and that it is able to keep up with the workload. Monitoring this metric can help to identify any potential issues related to the scheduler, such as too much load or not enough resources being allocated.

System Utilization Scheduler

System utilization scheduler is a metric that measures how much of the available scheduler (CPU) time VerneMQ is using to process requests. It is important to monitor this metric as it can indicate that VerneMQ is overloaded and unable to process requests efficiently. High utilization can lead to performance issues, such as slow response times or even crashes. Monitoring this metric can help identify potential issues early, allowing for quick corrective action.

System Processes

System processes is a metric that measures the number of active processes running within VerneMQ. Because VerneMQ is written in Erlang, these are lightweight Erlang processes, not operating system processes. If the number is unusually high, it could indicate that VerneMQ is not able to handle the load and is struggling to keep up. Monitoring this metric can help identify potential issues early, allowing for quick corrective action.

System Reductions

System reductions is a metric that measures the average number of reductions per second. In the Erlang VM, a reduction is a unit of work, roughly one function call, that the scheduler counts to decide when a running process must yield. The reduction rate therefore reflects how much work VerneMQ is doing. If the number of reductions is unusually high, it could indicate that VerneMQ is under heavy load and struggling to keep up. Monitoring this metric can help identify potential issues early, allowing for quick corrective action.

System Context Switches

System context switches is a metric that measures the number of context switches occurring in VerneMQ. A context switch occurs when a scheduler stops running one process and starts running another. If the number of context switches is too high, it could indicate that VerneMQ is not able to handle the load and is struggling to keep up. Monitoring this metric can help identify potential issues early, allowing for quick corrective action.

System IO

System IO is a metric that measures the amount of data being sent and received by VerneMQ. If the amount of data being sent and received is too high, it could indicate that VerneMQ is not able to handle the load and is struggling to keep up. Monitoring this metric can help identify potential issues early, allowing for quick corrective action.

System Run Queue

System run queue is a metric that measures the number of processes that are ready to run and waiting for a scheduler in VerneMQ. If the number of waiting processes stays high, it could indicate that VerneMQ is not able to handle the load and is struggling to keep up. Monitoring this metric can help identify potential issues early, allowing for quick corrective action.
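Load metrics such as the run queue tend to spike briefly under normal operation, so a single high reading is rarely meaningful. One way to express "stays high" is to require several consecutive samples above a threshold. The following is a minimal sketch of that idea; the function name, threshold, and sample values are hypothetical and this is not Netdata's alerting logic.

```python
def sustained_above(samples, threshold, min_consecutive):
    """Return True if `min_consecutive` samples in a row exceed `threshold`."""
    run = 0
    for value in samples:
        # Extend the current streak, or reset it on a normal reading.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False
```

For instance, sustained_above([1, 9, 9, 2, 9], threshold=5, min_consecutive=2) is True because two adjacent samples exceed 5, while the same check on [1, 9, 2, 9, 2] is False.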

System GC Count

System GC count is a metric that measures the number of garbage collection (GC) operations happening in VerneMQ. Garbage collection cleans up memory that is no longer being used. If the number of GC operations is too high, it could indicate that VerneMQ is not able to handle the load and is struggling to keep up. Monitoring this metric can help identify potential issues early, allowing for quick corrective action.

System GC Words Reclaimed

System GC words reclaimed is a metric that measures the amount of memory, in machine words, reclaimed during garbage collection (GC) operations. If the amount of memory reclaimed is too high, it could indicate that VerneMQ is not able to handle the load and is struggling to keep up. Monitoring this metric can help identify potential issues early, allowing for quick corrective action.

System Allocated Memory

System allocated memory is a metric that measures the amount of memory allocated by VerneMQ, split between memory used by its processes and memory used by the runtime system itself. If the amount of memory allocated is too high or keeps growing, it could indicate that VerneMQ is not able to handle the load and is struggling to keep up. Monitoring this metric can help identify potential issues early, allowing for quick corrective action.

Bandwidth

Bandwidth is a metric that measures the amount of data being sent and received by VerneMQ. If the amount of data being sent and received is too high, it could indicate that VerneMQ is not able to handle the load and is struggling to keep up. Monitoring this metric can help identify potential issues early, allowing for quick corrective action.

Retain Messages

Retain messages is a metric that measures the number of messages retained by VerneMQ. If the number of retained messages is too high or grows without bound, it could indicate that VerneMQ will struggle to keep up. Monitoring this metric can help identify potential issues early, allowing for quick corrective action.

Retain Memory

Retain memory is a metric that measures the amount of memory being used to store retained messages. If the amount of memory used is too high, it could indicate that VerneMQ is not able to handle the load and is struggling to keep up. Monitoring this metric can help identify potential issues early, allowing for quick corrective action.

Cluster Bandwidth

Cluster bandwidth is a metric that measures the amount of data being sent and received between nodes in the VerneMQ cluster. If the amount of data exchanged across the cluster is too high, it could indicate that the cluster is not able to handle the load and is struggling to keep up. Monitoring this metric can help identify potential issues early, allowing for quick corrective action.

Cluster Dropped

Cluster dropped is a metric that measures the amount of data dropped while being sent between nodes of the VerneMQ cluster. Dropped data means that messages meant for clients on other nodes may be lost. If the amount of data being dropped is above zero for any length of time, it could indicate that the cluster is not able to handle the load or that the links between nodes are too slow. Monitoring this metric can help identify potential issues early, allowing for quick corrective action.

Netsplit Unresolved

Netsplit unresolved is a metric that measures the number of unresolved netsplit events in the VerneMQ cluster. Netsplit events occur when communication is disrupted between nodes in the cluster. It is important to monitor this metric because, while a netsplit is unresolved, nodes on either side of the partition cannot communicate with each other and the cluster is degraded. Any value above zero deserves investigation. Monitoring this metric can help identify potential issues early, allowing for quick corrective action.

Netsplits

Netsplits are the result of a network partition in a distributed system. They occur when one or more nodes become disconnected from the others, usually due to a network issue or a failure in hardware or software. Monitoring the number of netsplits is important as it can help to identify and diagnose any underlying network issues that could be causing the disconnections. Additionally, monitoring the rate of netsplits can reveal concerning trends before they lead to further disconnections.
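The number of unresolved netsplits can be thought of as the difference between two counters: netsplits detected and netsplits resolved. The sketch below illustrates that arithmetic. It assumes counters named netsplit_detected and netsplit_resolved (an assumption about what the broker exposes), and the function name is hypothetical.

```python
def unresolved_netsplits(detected, resolved):
    """Netsplits that have been detected but not yet healed."""
    # Clamp at zero in case the two counters are sampled at slightly
    # different moments and `resolved` momentarily leads `detected`.
    return max(detected - resolved, 0)


def cluster_is_partitioned(detected, resolved):
    """True while at least one netsplit remains unresolved."""
    return unresolved_netsplits(detected, resolved) > 0
```

For example, with 3 netsplits detected and 2 resolved, one is still open and cluster_is_partitioned(3, 2) returns True.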

MQTT Auth

The number of MQTT AUTH packets sent and received is an important metric to monitor as it can provide insight into the authentication process. AUTH packets are used for the extended authentication exchange introduced in MQTT 5. Monitoring the number of received and sent AUTH packets can help to identify any potential authentication issues, such as a slow authentication process or a high number of failed authentication attempts. Additionally, monitoring the number of AUTH packets sent and received for each reason (mqtt_auth_sent_reason and mqtt_auth_received_reason) can help to identify any specific authentication issues that could be occurring.

MQTT Connect

The number of MQTT connect and connack packets sent and received is an important metric to monitor as it can provide insight into the connection process. Monitoring the number of received and sent connect and connack packets can help to identify any potential connection issues, such as a slow connection process or a high number of failed connection attempts. Additionally, monitoring the number of connack packets sent for each reason (mqtt_connack_sent_reason) can help to identify any specific connection issues that could be occurring.
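A per-reason breakdown becomes actionable once it is reduced to a failure share. The following sketch does that for a mapping of reason to packet count. The reason names used in the example ("success", "not_authorized") and the function name are hypothetical placeholders; use whatever dimension names your charts actually show.

```python
def failed_share(counts_by_reason, success_key="success"):
    """Fraction of packets whose reason is anything other than success."""
    total = sum(counts_by_reason.values())
    # No packets in the window means there is nothing to report.
    if total == 0:
        return 0.0
    failed = total - counts_by_reason.get(success_key, 0)
    return failed / total
```

With {"success": 90, "not_authorized": 10}, the function returns 0.1, meaning one connection attempt in ten was rejected during the window.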

MQTT Disconnect

The number of MQTT disconnect packets sent and received is an important metric to monitor as it can provide insight into the disconnection process. Monitoring the number of received and sent disconnect packets can help to identify any potential disconnection issues, such as a slow disconnection process or a high number of failed disconnection attempts. Additionally, monitoring the number of disconnect packets sent and received for each reason (mqtt_disconnect_sent_reason and mqtt_disconnect_received_reason) can help to identify any specific disconnection issues that could be occurring.

MQTT Subscribe

The number of MQTT subscribe and suback packets sent and received is an important metric to monitor as it can provide insight into the subscription process. Monitoring the number of received and sent subscribe and suback packets can help to identify any potential subscription issues, such as a slow subscription process or a high number of failed subscription attempts. Additionally, monitoring the number of subscribe errors (mqtt_subscribe_error) and unauthorized subscription attempts (mqtt_subscribe_auth_error) can help to identify any specific subscription issues that could be occurring.

MQTT Unsubscribe

The number of MQTT unsubscribe and unsuback packets sent and received is an important metric to monitor as it can provide insight into the unsubscription process. Monitoring the number of received and sent unsubscribe and unsuback packets can help to identify any potential unsubscription issues, such as a slow unsubscription process or a high number of failed unsubscription attempts.

MQTT Publish

The number of MQTT publish packets received and sent by VerneMQ per second. This metric indicates the amount of traffic that is going through the broker, and it can help identify any issues with connectivity or performance. Monitoring this metric can help prevent problems with the broker being overloaded.

MQTT Publish Errors

The number of MQTT publish errors per second. This metric helps identify any errors that are occurring while publishing messages, such as missing or invalid payloads. Monitoring this metric can help prevent issues related to message delivery.

MQTT Publish Auth Errors

The number of MQTT publish authorization errors per second. This metric helps identify publish attempts that were rejected because the client was not permitted to publish, for example to a topic it has no access to. Monitoring this metric can help detect misconfigured clients and attempts at unauthorized access.

MQTT Puback

The number of MQTT puback packets received and sent by VerneMQ per second. This metric indicates the amount of traffic that is going through the broker, and it can help identify any issues with connectivity or performance. Monitoring this metric can help prevent problems with the broker being overloaded.

MQTT Puback Received Reason

This metric tracks the number of MQTT puback packets received by VerneMQ per second, broken down by the reason they were received. This metric helps identify any issues that may be related to the reasons why the packets were received. Monitoring this metric can help prevent problems related to packet delivery.

MQTT Puback Sent Reason

This metric tracks the number of MQTT puback packets sent by VerneMQ per second, broken down by the reason they were sent. This metric helps identify any issues that may be related to the reasons why the packets were sent. Monitoring this metric can help prevent problems related to packet delivery.

MQTT Puback Invalid Error

The number of invalid MQTT puback errors per second. This metric helps identify any unexpected errors that are occurring while processing incoming puback messages. Monitoring this metric can help prevent issues related to message delivery.

MQTT Pubrec

The number of MQTT pubrec packets received and sent by VerneMQ per second. This metric indicates the amount of traffic that is going through the broker, and it can help identify any issues with connectivity or performance. Monitoring this metric can help prevent problems with the broker being overloaded.

MQTT Pubrec Received Reason

This metric tracks the number of MQTT pubrec packets received by VerneMQ per second, broken down by the reason they were received. This metric helps identify any issues that may be related to the reasons why the packets were received. Monitoring this metric can help prevent problems related to packet delivery.

MQTT Pubrec Sent Reason

This metric tracks the number of MQTT pubrec packets sent by VerneMQ per second, broken down by the reason they were sent. This metric helps identify any issues that may be related to the reasons why the packets were sent. Monitoring this metric can help prevent problems related to packet delivery.

MQTT Pubrec Invalid Error

The number of invalid MQTT pubrec errors per second. This metric helps identify any unexpected errors that are occurring while processing incoming pubrec messages. Monitoring this metric can help prevent issues related to message delivery.

MQTT Pubrel

The number of MQTT pubrel packets received and sent by VerneMQ per second. This metric indicates the amount of traffic that is going through the broker, and it can help identify any issues with connectivity or performance. Monitoring this metric can help prevent problems with the broker being overloaded.

MQTT Pubrel Received Reason

This metric tracks the number of MQTT pubrel packets received by VerneMQ per second, broken down by the reason they were received. This metric helps identify any issues that may be related to the reasons why the packets were received. Monitoring this metric can help prevent problems related to packet delivery.

MQTT Pubrel Sent Reason

This metric tracks the number of MQTT pubrel packets sent by VerneMQ per second, broken down by the reason they were sent. This metric helps identify any issues that may be related to the reasons why the packets were sent. Monitoring this metric can help prevent problems related to packet delivery.

MQTT Pubcomp

The number of MQTT pubcomp packets received and sent by VerneMQ per second. This metric indicates the amount of traffic that is going through the broker, and it can help identify any issues with connectivity or performance. Monitoring this metric can help prevent problems with the broker being overloaded.

MQTT Pubcomp Received Reason

This metric tracks the number of MQTT pubcomp packets received by VerneMQ per second, broken down by the reason they were received. This metric helps identify any issues that may be related to the reasons why the packets were received. Monitoring this metric can help prevent problems related to packet delivery.

MQTT Pubcomp Sent Reason

This metric tracks the number of MQTT pubcomp packets sent by VerneMQ per second, broken down by the reason they were sent. This metric helps identify any issues that may be related to the reasons why the packets were sent. Monitoring this metric can help prevent problems related to packet delivery.

MQTT Pubcomp Invalid Error

The number of invalid MQTT pubcomp errors per second. This metric helps identify any unexpected errors that are occurring while processing incoming pubcomp messages. Monitoring this metric can help prevent issues related to message delivery.

MQTT Ping

The number of MQTT ping packets (PINGREQ/PINGRESP) received and sent by VerneMQ per second. This metric indicates the amount of traffic that is going through the broker, and it can help identify any issues with connectivity or performance. Monitoring this metric can help prevent problems with the broker being overloaded.

Node Uptime

The uptime of VerneMQ nodes in seconds. This metric helps identify any unexpected outages or restarts that are occurring. Monitoring this metric can help prevent issues related to service availability.
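Because uptime only grows while a node keeps running, a restart shows up as a reading that is lower than the one before it. The sketch below counts such drops in a series of uptime samples. The function name and sample values are hypothetical.

```python
def count_restarts(uptime_samples):
    """Count how often uptime (in seconds) went backwards between samples."""
    restarts = 0
    # Compare each sample with the one that follows it.
    for earlier, later in zip(uptime_samples, uptime_samples[1:]):
        if later < earlier:
            restarts += 1
    return restarts
```

For the samples [100, 110, 5, 15, 2], the function reports 2 restarts: once between 110 and 5, and once between 15 and 2. If a node restarts more than once between two samples, only one restart is counted, so a shorter collection interval gives a more accurate picture.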
