
Metric Cardinality in Observability Platforms: Compression and Roll-Up Strategies Compared

A deep dive into how modern TSDBs handle the label explosion problem, and the trade-offs between storage, query speed, and data granularity


It’s a story familiar to any SRE or DevOps engineer. You add a seemingly innocuous label (user_id, request_id, container_id) to a key metric to gain deeper insight. Suddenly, your monitoring bill skyrockets, your Prometheus TSDB instance starts gasping for memory, and dashboards slow to a crawl. You have just triggered a label explosion and run into the single biggest challenge in modern metrics-based observability: metric cardinality.

High cardinality isn’t an edge case; it’s the new normal in a world of microservices, containers, and complex user interactions. As the number of unique time series grows into the millions or even billions, it places immense pressure on monitoring systems, impacting storage costs, query performance, and the fundamental ability to scale.

There is no silver bullet for this problem, but two dominant strategies have emerged in the observability ecosystem: aggressive metric compression and manual roll-ups. This guide will compare these two approaches, exploring how tools like Prometheus and Grafana Mimir implement them, and discuss the critical trade-offs you make with each.

What Is High Cardinality and Why Is It a Problem?

In a time-series database (TSDB), a “time series” is a unique combination of a metric name and a set of key-value pairs called labels. Cardinality refers to the number of these unique combinations.

  • Low Cardinality: http_requests_total{method="GET", status="200", job="api-server"} This metric has a small, finite number of possible label combinations. Its cardinality is low and predictable.

  • High Cardinality: http_requests_total{..., client_ip="1.2.3.4", user_id="u-5678"} By adding labels with many unique values (IP addresses, user IDs, container IDs, session IDs), you create a combinatorial explosion. With 10,000 users making requests from 5,000 IP addresses, this single metric could in the worst case produce 10,000 × 5,000 = 50 million unique time series.

This label explosion creates severe problems for most modern TSDBs:

  1. Massive Index Size: The biggest issue isn’t the raw data points; it’s the index. Every unique time series requires an entry in the database’s index so it can be found quickly. A massive index consumes vast amounts of RAM and disk space, driving up observability costs.
  2. Slow Ingestion: The system struggles to process and index millions of new series arriving via remote_write or scraping, leading to ingestion delays and dropped data.
  3. Slow Queries: Queries that need to aggregate data across millions of series (e.g., sum(http_requests_total)) become painfully slow. The TSDB has to find every series in the index, load its data, and then perform the aggregation, consuming significant CPU and memory.
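
Before choosing a mitigation strategy, it helps to quantify the problem. As a quick sketch, using the example metric and label from above against a stock Prometheus, a handful of PromQL queries will show where the series are coming from:

    # Total number of active series in the head block
    prometheus_tsdb_head_series

    # How many series a single metric name contributes right now
    count(http_requests_total)

    # How many distinct values a suspect label has on that metric
    count(count by (user_id) (http_requests_total))

    # The ten metric names with the most series (expensive; run sparingly)
    topk(10, count by (__name__) ({__name__=~".+"}))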

Strategy 1: Aggressive Compression (The Prometheus & Mimir Model)

Modern TSDBs like the Prometheus TSDB and the horizontally scalable systems built on it (Grafana Mimir, Cortex, Thanos) don’t try to prevent high cardinality. Instead, they are engineered to manage its impact through sophisticated compression techniques.

In-Memory Index and Head Block

When new data arrives, it’s written to an in-memory “head block.” This is where active time series are indexed for fast writes and reads. High cardinality churns this head block rapidly, as new series are constantly being created. This leads to high memory usage, one of the first signs of a TSDB scaling problem.
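
A simple way to spot this churn before it becomes an outage is to watch Prometheus’ own head-block metrics. A quick sketch, assuming Prometheus scrapes itself under job="prometheus":

    # Rate at which brand-new series are being created (series churn)
    rate(prometheus_tsdb_head_series_created_total[5m])

    # Resident memory of the Prometheus process, which grows with the head block
    process_resident_memory_bytes{job="prometheus"}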

Persistent Storage and Chunk Compression

Periodically, the in-memory data is flushed to disk as immutable blocks, each typically covering a two-hour window. This is where metric compression works its magic:

  • Data Point Compression: Timestamps and values are compressed using delta-of-delta encoding for timestamps and XOR encoding for values, techniques pioneered by Facebook’s Gorilla time-series database. This chunk compression is incredibly efficient for the raw data points themselves.
  • Index Compression: The labels and series metadata are also heavily compressed using dictionaries and other methods to reduce the on-disk footprint.

However, while these techniques drastically reduce storage costs, they do not solve the fundamental problem. The index, even when compressed on disk, is still logically massive. When you run a query, the system still has to decompress and process all that metadata to find the relevant series, which is why query performance degrades as cardinality grows. Compression makes storing the data feasible, but it doesn’t make querying it fast.

Strategy 2: Manual Roll-ups and Recording Rules (The Pre-aggregation Fix)

This is the traditional SRE approach to controlling cardinality. If you can’t afford to store and query the raw high-cardinality metrics, you don’t. You pre-aggregate the data into a new, lower-cardinality metric.

The Power of Recording Rules

The primary tool for this in the Prometheus ecosystem is recording rules. These are queries that run at a regular interval (e.g., every minute), with the result saved as a new time series.

For example, imagine you have a high-cardinality metric for HTTP requests that includes a user_id label. You can write a recording rule that aggregates this high-cardinality label away and saves the result, creating a new, much lower-cardinality metric.
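
A minimal sketch of such a rule (the group, record, and label names here are illustrative, not taken from any particular codebase):

    groups:
      - name: http_request_rollups
        interval: 1m
        rules:
          # Collapse the per-user series into one series per
          # job/method/status combination, following the usual
          # level:metric:operations naming convention.
          - record: job_method_status:http_requests:rate5m
            expr: sum by (job, method, status) (rate(http_requests_total[5m]))

Dashboards and alerts then query job_method_status:http_requests:rate5m, which stays cheap no matter how many distinct users you serve.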

Downsampling and Metrics Retention

This roll-up strategy is often paired with tiered metrics retention policies (a minimal configuration for the pattern is sketched after the list). You might:

  1. Keep the raw, high-cardinality data for a short period (e.g., 24-48 hours) for fine-grained debugging.
  2. Drop the high-cardinality labels after that period.
  3. Keep the rolled-up, low-cardinality metrics for much longer (e.g., 13+ months) for long-term trending.
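
One way to wire this up with stock Prometheus, assuming the recorded series follow the level:metric:operations naming convention from the sketch above (the endpoint URL is a placeholder):

    # Started with --storage.tsdb.retention.time=48h, Prometheus keeps the raw,
    # high-cardinality data locally for two days of fine-grained debugging.
    # remote_write then forwards only recording-rule outputs (metric names
    # containing colons) to the long-term backend.
    remote_write:
      - url: https://mimir.example.com/api/v1/push   # placeholder endpoint
        write_relabel_configs:
          - source_labels: [__name__]
            regex: ".+:.+:.+"
            action: keep

The raw series still age out of local storage after 48 hours; only the aggregated series survive for long-term trending.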

Systems like Grafana Mimir and Cortex provide dedicated ruler and compactor components to evaluate recording rules and manage long-term blocks efficiently at scale.

The Downsides of Manual Roll-ups

While effective at controlling costs and improving query performance, this approach has significant drawbacks:

  • Loss of Granularity: You’ve permanently thrown away the details. If an incident occurs and you need to know which user_id was causing a spike three days ago, you can’t. The raw data is gone. Exemplars can link back to raw traces, but they don’t solve the problem of being unable to query the metric itself.
  • Manual Toil: This process is incredibly brittle. Every time a new service introduces a high-cardinality metric, an SRE needs to manually write, test, and deploy a new recording rule. It’s a constant, error-prone battle to keep cardinality in check.
  • Inflexibility: You must decide today which questions you will want to ask in the future. If you didn’t create a roll-up by customer_id, you will never be able to answer questions about a specific customer’s experience from your long-term metrics.

A Third Way: Adaptive, Real-Time Solutions

The choice between raw, expensive data and cheap, aggregated data is a false dichotomy. Modern observability platforms are moving towards a model of adaptive metrics that aims to provide the best of both worlds.

Netdata, for example, challenges this trade-off by moving intelligence to the edge. The Netdata Agent, running on each node, collects thousands of metrics every second. It can store this high-fidelity data locally for a configurable period, giving you raw, granular data for immediate debugging without any of the remote_write or ingestion bottlenecks.

When this data is streamed to a central backend, it can be intelligently downsampled over time without requiring manual recording rules. The system understands that as you “zoom out” on a chart from the last hour to the last month, you are interested in trends, not individual data points. This adaptive approach means:

  • You retain raw data for debugging when you need it most—in the recent past.
  • You get efficient, long-term storage for trends without manual configuration.
  • You avoid the “pre-aggregation trap”, maintaining the flexibility to explore your data in new ways without being limited by decisions you made months ago.

Other strategies like metric deduplication, used by Mimir and Thanos, are also crucial for scaling TSDBs in high-availability setups, but they solve a different problem than cardinality itself.

Conclusion: Know Your Trade-Offs

There is no perfect solution for metric cardinality. The right strategy depends on your specific needs and budget.

  • The Compression-First Approach (Prometheus): This model gives you full detail but puts the burden on your query engine and your wallet. It’s powerful if you can afford the hardware and tolerate slower queries at scale.
  • The Roll-up Strategy (Mimir, Cortex): This model prioritizes query performance and storage costs but sacrifices data granularity and requires significant manual SRE effort to maintain. It’s a practical choice for large-scale, cost-sensitive environments.
  • The Adaptive, Edge-First Approach (Netdata): This emerging model offers a compelling alternative, providing both high-fidelity raw data and long-term trends without the manual overhead of roll-ups, fundamentally changing the observability cost equation.

Understanding these trade-offs is the first step toward building an observability stack that is not only powerful but also sustainable. As your systems grow, the way you handle cardinality will be the single most important factor determining the success of your monitoring strategy.

Experience a new approach to metrics without the cardinality headache. Try Netdata for free today.