Kubernetes Cluster State monitoring with Netdata

What is Kubernetes?

Kubernetes is an open-source container orchestration system for automating software deployment, scaling, and management. It is extremely important to monitor the state of the Kubernetes Cluster to ensure all the applications are running as expected.

Monitoring Kubernetes with Netdata

To monitor Kubernetes with Netdata, you need a running Kubernetes cluster and Netdata installed on your system.

Netdata auto-discovers hundreds of services, and for those it doesn’t, turning on manual discovery is a one-line configuration. For more information on configuring Netdata for Kubernetes Cluster State monitoring, please read the collector documentation.

You should now see the Kubernetes State section on the Overview tab in Netdata Cloud already populated with charts about all the metrics you care about.

Netdata has a public demo space (no login required) where you can explore different monitoring use-cases and get a feel for Netdata.

What Kubernetes Cluster State metrics are important to monitor - and why?

Node Allocatable CPU Requests Utilization

This metric shows the percentage of a node's allocatable CPU that is claimed by the CPU requests of the pods on that node. It is calculated by dividing the sum of those CPU requests by the node's allocatable CPU. Because the scheduler places pods according to their requests, high utilization means the node has little room left for new pods, and pods that fit on no node remain pending.
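The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the collector's implementation: the function names and the sample values are hypothetical, and only the two most common CPU quantity forms (`"500m"` and whole or fractional cores such as `"2"`) are handled.

```python
def cpu_to_millicpu(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as "500m" or "2" to millicpu."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def requests_utilization(pod_cpu_requests: list[str], node_allocatable: str) -> float:
    """Percentage of the node's allocatable CPU claimed by pod requests."""
    used = sum(cpu_to_millicpu(q) for q in pod_cpu_requests)
    allocatable = cpu_to_millicpu(node_allocatable)
    return 100.0 * used / allocatable


# Hypothetical node: 4 allocatable cores, three pods requesting CPU.
utilization = requests_utilization(["500m", "1", "250m"], "4")
print(f"{utilization:.2f}%")  # 1750 of 4000 millicpu
```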

Node Allocatable CPU Requests Used

This metric shows the total CPU requested by the pods on a node. It is measured in millicpu (1000 millicpu equals one CPU core) and tells you how much CPU has been reserved for the workloads running on the node. Comparing it across nodes helps reveal unevenly distributed workloads, which could cause performance issues.

Node Allocatable CPU Limits Utilization

This metric shows the sum of the CPU limits of the pods on a node as a percentage of the node's allocatable CPU. It is calculated by dividing the total CPU limits by the node's allocatable CPU. Unlike requests, limits can add up to more than 100%. A value above 100% means the node is overcommitted: if every container used its full limit at the same time, the containers would compete for CPU and be throttled.

Node Allocatable CPU Limits Used

This metric shows the total CPU limits of the pods on a node. It is measured in millicpu and tells you the maximum amount of CPU the workloads on the node are allowed to consume. Comparing it with the node's allocatable CPU shows how far the node is overcommitted, which could cause performance issues.

Node Allocatable Memory Requests Utilization

This metric shows the percentage of a node's allocatable memory that is claimed by the memory requests of the pods on that node. It is calculated by dividing the sum of those memory requests by the node's allocatable memory. High utilization means the scheduler has little memory left to place new pods on the node.
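As a rough sketch of the same calculation for memory, the Python below converts Kubernetes memory quantities to bytes before dividing. The helper names and sample values are hypothetical, and only a subset of the quantity suffixes is handled.

```python
# Subset of Kubernetes memory suffixes; binary (Ki, Mi, ...) and decimal (k, M, ...).
SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity such as "256Mi" or "1G" to bytes."""
    for suffix, factor in SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # plain number of bytes


def memory_requests_utilization(pod_requests: list[str], allocatable: str) -> float:
    """Percentage of allocatable memory claimed by pod memory requests."""
    used = sum(memory_to_bytes(q) for q in pod_requests)
    return 100.0 * used / memory_to_bytes(allocatable)


# Hypothetical node with 8Gi allocatable memory.
print(memory_requests_utilization(["256Mi", "1Gi", "512Mi"], "8Gi"))
```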

Node Allocatable Memory Requests Used

This metric shows the total memory requested by the pods on a node. It is measured in bytes and tells you how much memory has been reserved for the workloads running on the node. Comparing it across nodes helps reveal unevenly distributed workloads, which could cause performance issues.

Node Allocatable Memory Limits Utilization

This metric shows the sum of the memory limits of the pods on a node as a percentage of the node's allocatable memory. It is calculated by dividing the total memory limits by the node's allocatable memory. A value above 100% means the node is overcommitted: if the containers approach their limits at the same time, the node can run out of memory and containers may be killed or pods evicted.

Node Allocatable Memory Limits Used

This metric shows the total memory limits of the pods on a node. It is measured in bytes and tells you the maximum amount of memory the workloads on the node are allowed to consume. Comparing it with the node's allocatable memory shows how far the node is overcommitted, which could cause performance issues.

Node Allocatable Pods Utilization

This metric shows the percentage of a node's pod capacity that is in use. It is calculated by dividing the number of pods allocated on the node by the number of pods the node can accept. When utilization reaches 100%, the node cannot accept more pods, even if CPU and memory are still available.

Node Allocatable Pods Usage

This metric shows the number of pods allocated on the node and the number of pod slots still available for allocation. It is measured in pods and tells you how many workloads are running on the node and how many more it can accept. Comparing it across nodes helps identify imbalances in how pods are distributed.

Node Condition

This metric shows the current conditions reported for the node, such as Ready, MemoryPressure, DiskPressure, PIDPressure, and NetworkUnavailable. Each condition changes with the state of the node. Monitoring it helps you detect a node that is becoming unhealthy before the problem leads to performance issues or failure.

Node Schedulability

This metric shows whether a node is schedulable. Its state is either “schedulable” or “unschedulable”, and it tells you whether the node can accept new workloads. A node is typically unschedulable because it has been cordoned, for example before maintenance. Monitoring this metric helps you notice nodes that were cordoned and never returned to service.
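In the Kubernetes API, a cordoned node has `spec.unschedulable` set to `true`. The sketch below checks that field on node objects represented as plain Python dictionaries; the node names are hypothetical and the objects are trimmed to the fields used here.

```python
def is_schedulable(node: dict) -> bool:
    """Return True unless the node is marked unschedulable (cordoned)."""
    return not node.get("spec", {}).get("unschedulable", False)


# Hypothetical node objects, reduced to the fields this check reads.
nodes = [
    {"metadata": {"name": "node-a"}, "spec": {}},
    {"metadata": {"name": "node-b"}, "spec": {"unschedulable": True}},
]

cordoned = [n["metadata"]["name"] for n in nodes if not is_schedulable(n)]
print(cordoned)
```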

Node Pods Readiness

This metric shows the percentage of pods on a node that are ready. It is calculated by dividing the number of ready pods on the node by the total number of pods on the node. A low value means many pods on the node cannot serve traffic, which may point to failing readiness probes, crashing containers, or a problem with the node itself.
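A minimal sketch of this ratio, assuming pods are available as dictionaries shaped like Kubernetes API objects and that a pod counts as ready when its `Ready` condition has status `"True"`. The function names and sample pods are hypothetical.

```python
def pod_is_ready(pod: dict) -> bool:
    """A pod is ready when its "Ready" condition has status "True"."""
    conditions = pod.get("status", {}).get("conditions", [])
    return any(c["type"] == "Ready" and c["status"] == "True" for c in conditions)


def readiness_percent(pods: list[dict]) -> float:
    """Percentage of ready pods; 0.0 when there are no pods."""
    if not pods:
        return 0.0
    ready = sum(pod_is_ready(p) for p in pods)
    return 100.0 * ready / len(pods)


def make_pod(ready: bool) -> dict:
    """Build a hypothetical pod object with only a Ready condition."""
    status = "True" if ready else "False"
    return {"status": {"conditions": [{"type": "Ready", "status": status}]}}


pods = [make_pod(True), make_pod(True), make_pod(True), make_pod(False)]
print(readiness_percent(pods))
```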

Node Pods Readiness State

This metric shows the number of ready and unready pods on the node. It is measured in pods and tells you how many workloads on the node are able to serve traffic. A growing number of unready pods helps you spot problems on a node before they cause performance issues.

Node Pods Condition

This metric shows the conditions of the pods on the node, that is, how many pods have been scheduled, have completed initialization, have all their containers ready, and are ready overall. Monitoring it helps you detect changes in the pods' condition that could lead to performance issues or failure.

Node Pods Phase

This metric shows how many pods on the node are in each phase: “running”, “failed”, “succeeded”, or “pending”. It gives a quick overview of the status of the pods on the node. A rising number of failed or pending pods points to issues that could lead to performance problems or failure.
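The per-phase counts can be sketched with `collections.Counter`. This assumes pods are dictionaries shaped like Kubernetes API objects, where the phase is stored in `status.phase`; the function name and sample data are hypothetical.

```python
from collections import Counter

# The four phases listed above, as spelled in the Kubernetes API.
PHASES = ("Pending", "Running", "Succeeded", "Failed")


def count_phases(pods: list[dict]) -> dict[str, int]:
    """Count pods per phase, reporting zero for phases with no pods."""
    counts = Counter(p.get("status", {}).get("phase", "Unknown") for p in pods)
    return {phase: counts.get(phase, 0) for phase in PHASES}


# Hypothetical pods on one node.
pods = [
    {"status": {"phase": "Running"}},
    {"status": {"phase": "Running"}},
    {"status": {"phase": "Pending"}},
    {"status": {"phase": "Failed"}},
]
print(count_phases(pods))
```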

Node Containers

This metric shows the number of containers and init containers on the node. It is measured in containers and tells you how much workload the node carries. Comparing it across nodes helps identify imbalances in how workloads are distributed, which could cause performance issues.

Node Containers State

This metric shows how many containers on the node are in each state: “running”, “waiting”, or “terminated”. It gives a quick overview of the status of the containers on the node. A rising number of waiting or terminated containers points to issues that could lead to performance problems or failure.

Node Init Containers State

This metric shows how many init containers on the node are in each state: “running”, “waiting”, or “terminated”. Because a pod's regular containers start only after its init containers finish, init containers that stay in the waiting state keep their pods from starting.

Node Age

This metric shows the age of the node, that is, the time since the node was created. It is measured in seconds. Monitoring it helps you find long-running nodes that may be due for replacement or an upgrade, and very young nodes that may indicate unexpected restarts or replacements.
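Age in seconds can be derived from the object's `metadata.creationTimestamp`, which the Kubernetes API reports as a UTC timestamp. The sketch below is one way to do that in Python; the function name is hypothetical, and `now` is passed in explicitly so the result is reproducible.

```python
from datetime import datetime, timezone


def age_seconds(creation_timestamp: str, now: datetime) -> float:
    """Seconds elapsed between a UTC timestamp like 2024-01-01T00:00:00Z and now."""
    created = datetime.strptime(creation_timestamp, "%Y-%m-%dT%H:%M:%SZ")
    created = created.replace(tzinfo=timezone.utc)
    return (now - created).total_seconds()


# Hypothetical node created exactly one day before the reference time.
reference = datetime(2024, 1, 2, tzinfo=timezone.utc)
print(age_seconds("2024-01-01T00:00:00Z", reference))
```

In live use, `now` would be `datetime.now(timezone.utc)`.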

Pod CPU Requests Used

The amount of CPU requested by all containers within a pod. This value is the sum of the CPU requests of each container in the pod. Monitoring it supports capacity planning, understanding of resource utilization, and workload optimization. If the CPU requests are consistently higher than the actual usage, there is an opportunity to reduce costs by lowering the requests.
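Summing the per-container requests can be sketched as follows, assuming the pod is a dictionary shaped like a Kubernetes API object, where each container's request is stored under `resources.requests.cpu`. The function names and the sample pod are hypothetical, and containers without a CPU request count as zero.

```python
def cpu_to_millicpu(quantity: str) -> int:
    """Convert a CPU quantity such as "250m" or "1" to millicpu."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def pod_cpu_requests(pod: dict) -> int:
    """Sum the CPU requests of all containers in a pod, in millicpu."""
    total = 0
    for container in pod["spec"]["containers"]:
        requests = container.get("resources", {}).get("requests", {})
        total += cpu_to_millicpu(requests.get("cpu", "0"))
    return total


# Hypothetical pod: an app container, a sidecar, and one container with no request.
pod = {
    "spec": {
        "containers": [
            {"name": "app", "resources": {"requests": {"cpu": "250m"}}},
            {"name": "sidecar", "resources": {"requests": {"cpu": "100m"}}},
            {"name": "helper"},
        ]
    }
}
print(pod_cpu_requests(pod))
```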

Pod CPU Limits Used

The maximum amount of CPU that the containers within a pod are allowed to use. This value is the sum of the CPU limits of each container in the pod. Monitoring it supports capacity planning, understanding of resource utilization, and workload optimization. If the CPU limits are consistently far above the actual usage, they can be lowered to reduce overcommitment on the node.

Pod Memory Requests Used

The amount of memory requested by all containers within a pod. This value is the sum of the memory requests of each container in the pod. Monitoring it supports capacity planning, understanding of resource utilization, and workload optimization. If the memory requests are consistently higher than the actual usage, there is an opportunity to reduce costs by lowering the requests.

Pod Memory Limits Used

The maximum amount of memory that the containers within a pod are allowed to use. This value is the sum of the memory limits of each container in the pod. Monitoring it supports capacity planning, understanding of resource utilization, and workload optimization. If the memory limits are consistently far above the actual usage, they can be lowered to reduce overcommitment on the node.

Pod Condition

The conditions of a pod. A pod reports whether it has been scheduled, whether its init containers have completed, whether all of its containers are ready, and whether the pod as a whole is ready. Monitoring these conditions lets you see at which stage a pod is stuck, so that issues can be addressed before they cause outages or performance problems.

Pod Phase

The phase of a pod. A pod is in exactly one phase at a time: running, failed, succeeded, or pending. Monitoring the phase lets you identify pods that stay pending or that have failed, so that issues can be addressed before they cause outages or performance problems.

Pod Age

The age of a pod, that is, the time since the pod was created. Monitoring the age of a pod helps identify pods that are unexpectedly young, which may indicate that they are being recreated repeatedly, and pods that are unexpectedly old, which may have missed a rollout.

Pod Containers

The number of containers in a pod. Monitoring it helps identify pods that have fewer or more containers than expected, for example a missing or extra sidecar, which can point to a misconfigured pod specification.

Pod Containers State

The state of containers in a pod. This value is the number of the pod's containers in each state: running, waiting, or terminated. Monitoring it lets you identify pods whose containers are stuck waiting or have terminated, so that issues can be addressed before they cause outages or performance problems.

Pod Init Containers State

The state of init containers in a pod. This value is the number of the pod's init containers in each state: running, waiting, or terminated. Monitoring it lets you identify pods that cannot start because an init container is stuck or has failed.

Pod Container Readiness State

The readiness state of a container in a pod. It shows whether the container is currently ready. Monitoring it lets you identify containers that are not ready, so that issues can be addressed before they cause outages or performance problems.

Pod Container Restarts

The number of restarts of a container in a pod. Monitoring it lets you identify containers that restart too frequently, which often indicates crashes, failing liveness probes, or out-of-memory kills, so that issues can be addressed before they cause outages or performance problems.
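One simple way to flag a frequently restarting container is to look at how much its restart count grows over a window of samples. The sketch below assumes the count is sampled periodically as a cumulative number; the function names and the threshold of 3 are hypothetical choices, not Netdata defaults.

```python
def restart_increases(samples: list[int]) -> list[int]:
    """Restarts added between consecutive cumulative samples.

    A drop in the counter (for example after the pod is recreated)
    is treated as zero new restarts.
    """
    return [max(0, later - earlier) for earlier, later in zip(samples, samples[1:])]


def is_restarting_frequently(samples: list[int], threshold: int = 3) -> bool:
    """True when the window contains at least `threshold` new restarts."""
    return sum(restart_increases(samples)) >= threshold


# Hypothetical restart counts sampled at regular intervals.
window = [3, 3, 5, 6]
print(restart_increases(window), is_restarting_frequently(window))
```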

Pod Container State

The state of a container in a pod. A container is in exactly one state at a time: running, waiting, or terminated. Monitoring it lets you identify containers that are stuck waiting or have terminated unexpectedly, so that issues can be addressed before they cause outages or performance problems.

Pod Container Waiting State Reason

The reason a container in a pod is in the waiting state. Monitoring it lets you tell apart causes such as a container that is still being created, an image that cannot be pulled (ImagePullBackOff), or a container that keeps crashing after start (CrashLoopBackOff), so that issues can be addressed before they cause outages or performance problems.
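In the Kubernetes API, the waiting reason is reported per container under `status.containerStatuses[].state.waiting.reason`. The sketch below tallies those reasons for one pod represented as a plain dictionary; the function name and the sample pod are hypothetical.

```python
from collections import Counter


def waiting_reasons(pod: dict) -> Counter:
    """Count the waiting reasons of a pod's containers."""
    reasons = Counter()
    for status in pod.get("status", {}).get("containerStatuses", []):
        waiting = status.get("state", {}).get("waiting")
        if waiting:  # containers that are running or terminated are skipped
            reasons[waiting.get("reason", "Unknown")] += 1
    return reasons


# Hypothetical pod: one healthy container, two that are waiting.
pod = {
    "status": {
        "containerStatuses": [
            {"name": "app", "state": {"running": {}}},
            {"name": "worker", "state": {"waiting": {"reason": "CrashLoopBackOff"}}},
            {"name": "proxy", "state": {"waiting": {"reason": "ImagePullBackOff"}}},
        ]
    }
}
print(dict(waiting_reasons(pod)))
```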

Pod Container Terminated State Reason

The reason a container in a pod is in the terminated state. Monitoring it lets you tell apart containers that completed normally from containers that were terminated because of an error or because they ran out of memory (OOMKilled), so that issues can be addressed before they cause outages or performance problems.

Discovery Discoverers State

This metric shows the state of the collector's running discoverers.
