Chaos engineering with Netdata

Challenges of chaos engineering

In 2017, 98% of organizations reported that an outage of just an hour could cost well beyond $100,000. To proactively build resiliency into their systems, and thus confidence in the environment, more organizations are engaging in chaos engineering using platforms like Gremlin, Chaos Monkey, Pumba, and others. But to extract valuable conclusions from any chaos engineering experiment, you need robust visibility into the secondary and tertiary effects of a given slowdown, crash, or networking fault.

Chaos engineering is all about quick iteration and constant improvement. If you’re putting hours into every experiment to ensure you’re collecting the right metrics, properly querying the time-series database, and organizing series into meaningful visualizations, that’s time taken away from designing experiments and drawing data-driven conclusions. By pairing your chaos engineering efforts with a monitoring solution that installs in seconds and requires no configuration, you can spend more time building resiliency into your infrastructure and less time setting up charts.

Figure: CPU usage monitoring of a web server with Netdata

How Netdata enables infrastructure-wide chaos engineering

Certain failures are intermittent bursts that last only a few seconds but cause serious repercussions. These are effectively invisible to monitoring solutions with 10-second granularity. By contrast, Netdata gathers and visualizes all metrics every second, giving you a high-resolution view of how a chaos engineering experiment impacts your environment. Per-second granularity helps you observe secondary and tertiary effects with ease, such as how an increase in a node’s CPU utilization could cause slow MySQL queries, which in turn cause 503 errors from the Nginx reverse proxy.
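To make the granularity point concrete, here is a minimal sketch using invented numbers: a hypothetical 3-second CPU burst on an otherwise quiet node, viewed at 1-second resolution and then averaged into 10-second buckets. The sample values and the 80% threshold are assumptions for illustration, not Netdata measurements, and the sketch assumes the coarser tool averages its samples (a tool that polls once every 10 seconds could miss the burst altogether).

```python
from statistics import mean

# Hypothetical per-second CPU utilization (%): a quiet node with a
# 3-second burst at seconds 4-6. These numbers are invented for illustration.
per_second = [12.0] * 20
per_second[4:7] = [98.0, 99.0, 97.0]

ALERT_THRESHOLD = 80.0  # assumed alerting threshold, in percent


def downsample(samples, window):
    """Average consecutive, non-overlapping windows of `window` samples."""
    return [mean(samples[i:i + window]) for i in range(0, len(samples), window)]


# What a collector that stores one averaged point per 10 seconds would keep.
ten_second = downsample(per_second, 10)

peak_1s = max(per_second)
peak_10s = max(ten_second)

print(f"peak at 1 s resolution:  {peak_1s:.1f}%")
print(f"peak at 10 s resolution: {peak_10s:.1f}%")
print("burst crosses threshold at 1 s: ", peak_1s > ALERT_THRESHOLD)
print("burst crosses threshold at 10 s:", peak_10s > ALERT_THRESHOLD)
```

In this toy series the burst peaks at 99% when sampled every second, but the 10-second average containing it is only 37.8%, well under the assumed threshold.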

Netdata’s zero-configuration approach, with hundreds of data sources, lets you focus solely on designing your chaos engineering experiments and extracting the most value from their results.

  • In container environments, Netdata autodetects both the services and their respective containers, so you can see the effects of attacking distinct containers that run certain parts of the application, such as databases.
  • In Kubernetes, Netdata not only autodetects available services and pods, but also offers beautiful zero-configuration visualizations of the entire Kubernetes cluster. See which pods start failing, then drill down to see exactly how with per-second application metrics.
  • Quickly view any number of virtual machines (VMs) on a single pane of glass using Netdata Cloud’s Overview screen. When attacking a single machine (e.g., a simulated database crash) in your MySQL cluster, Netdata instantly reveals both the impact on the rest of the MySQL hosts and the sharding of the database.

With hundreds of preconfigured alarms and centralized alarm notifications in Netdata Cloud, you can easily find nodes affected by the chaos engineering attack, then drill down to find secondary and tertiary effects on service and system metrics.
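As a rough sketch of how raised alarms could be listed programmatically during an experiment, the Python below filters an alarms payload for WARNING and CRITICAL entries. It assumes a Netdata Agent reachable at `localhost:19999` and the Agent's `/api/v1/alarms` endpoint; the `sample` payload is hand-written and trimmed to the fields the sketch reads, and the function names are hypothetical. Check the response shape against your own Agent before relying on it.

```python
import json
from urllib.request import urlopen

# Assumption: a Netdata Agent is reachable here (19999 is the Agent's default port).
AGENT_URL = "http://localhost:19999"


def fetch_active_alarms(base_url=AGENT_URL, timeout=5):
    """Fetch currently raised alarms from the Agent (not called in this sketch)."""
    with urlopen(f"{base_url}/api/v1/alarms?active", timeout=timeout) as resp:
        return json.load(resp)


def raised_alarms(payload):
    """Return (name, status, chart) tuples for WARNING/CRITICAL alarms, sorted by name."""
    alarms = payload.get("alarms", {})
    return sorted(
        (name, info.get("status"), info.get("chart"))
        for name, info in alarms.items()
        if info.get("status") in {"WARNING", "CRITICAL"}
    )


# Hand-written sample in the assumed response shape; alarm names are illustrative.
sample = {
    "hostname": "db-node-1",
    "alarms": {
        "system.cpu.10min_cpu_usage": {"status": "WARNING", "chart": "system.cpu"},
        "system.ram.ram_in_use": {"status": "CLEAR", "chart": "system.ram"},
        "mysql_local.mysql_10s_slow_queries": {"status": "CRITICAL", "chart": "mysql_local.queries"},
    },
}

for name, status, chart in raised_alarms(sample):
    print(f"{status:<8} {name} (chart: {chart})")
```

Against a live Agent, `raised_alarms(fetch_active_alarms())` would replace the sample payload.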

The impact of chaos engineering with Netdata

Focus on your chaos engineering experiments while Netdata takes care of visualizing your infrastructure and applications. Move quickly from the business repercussions down into your infrastructure, and understand the dependencies between services in minutes, confident that you’re not missing any metrics.

  • Let anyone run chaos engineering, from junior frontend developers to hardened Linux system administrators. Netdata’s zero-configuration and autodetection features help anyone go quickly from idea to design and visualization.
  • Observe every system, container, and application with hundreds of collectors for your environment’s mission-critical applications, plus multiple virtualization methods (like LXC containers).
  • Run chaos engineering against every environment, from on-premises bare metal, to VMs running in a datacenter on the other side of the planet, to container/Kubernetes deployments on platforms like Google Kubernetes Engine (GKE) or Amazon Elastic Kubernetes Service (EKS).
  • Test the validity and severity level of your alarms with Netdata’s hundreds of preconfigured alarms, then modify them according to your environment’s unique needs. You can either use the email-based centralized Netdata alarms or configure any number of custom alarm services, such as Slack or Opsgenie.
  • Measure the impact of your chaos engineering experiment and easily drill down from the high-level business impact to the minuscule technical detail that can make your systems more robust and fault tolerant. By collecting every metric, every second, you bring certainty to your chaos engineering. Observe systems you expect to fail and those you don’t.
  • Extend with the Netdata API, which you can easily use to access any collected metric. Add chaos engineering experiments to your CI/CD pipelines that automatically fail or pass based on collected metrics that you access through the Netdata API.
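The CI/CD idea in the last bullet could look roughly like the following Python sketch, which builds a query against the Agent's `/api/v1/data` endpoint and turns the averaged result into a pass/fail decision. The Agent address, the `system.cpu` chart with its `user` and `system` dimensions, the 60-second window, and the 80% limit are all assumptions for illustration; the helper names are hypothetical, and the `sample` payload is hand-written in the `labels`/`data` shape assumed for `format=json`.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a Netdata Agent is reachable here (19999 is the Agent's default port).
AGENT_URL = "http://localhost:19999"


def build_query(chart, seconds, base_url=AGENT_URL):
    """Build a URL asking for one averaged point over the last `seconds` seconds."""
    params = {
        "chart": chart,
        "after": -seconds,   # negative values are relative to "now"
        "points": 1,
        "group": "average",
        "format": "json",
    }
    return f"{base_url}/api/v1/data?{urlencode(params)}"


def fetch(url, timeout=5):
    """Fetch and decode a JSON response (not called in this sketch)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def total_of_dimensions(payload, dimensions):
    """Sum the selected dimensions of the first returned row."""
    labels = payload["labels"]
    row = payload["data"][0]
    return sum(row[labels.index(dim)] for dim in dimensions)


def gate(payload, dimensions, max_value):
    """Return True (pass) if the summed dimensions stay at or below max_value."""
    return total_of_dimensions(payload, dimensions) <= max_value


# Hand-written sample in the assumed response shape: [time, user, system].
sample = {"labels": ["time", "user", "system"], "data": [[1700000000, 41.5, 12.25]]}

url = build_query("system.cpu", 60)
passed = gate(sample, ["user", "system"], max_value=80.0)
print(url)
print("experiment passed" if passed else "experiment failed")
```

In a pipeline step, you would call `fetch(build_query("system.cpu", 60))` after the attack finishes and exit with a non-zero status when `gate(...)` returns `False`, so the job fails whenever the metric exceeds the limit you chose.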