It’s a story familiar to any SRE or DevOps engineer. You add a seemingly innocuous label such as user_id, request_id, or container_id to a key metric to gain deeper insight. Suddenly, your monitoring bill skyrockets, your Prometheus TSDB instance starts gasping for memory, and dashboards slow to a crawl. You have just triggered a label explosion, the single biggest challenge in modern metrics-based observability: metrics cardinality.
High cardinality isn’t an edge case; it’s the new normal in a world of microservices, containers, and complex user interactions. As the number of unique time series grows into the millions or even billions, it places immense pressure on monitoring systems, impacting storage costs, query performance, and the fundamental ability to scale.
There is no silver bullet for this problem, but two dominant strategies have emerged in the observability ecosystem: aggressive metric compression and manual roll-ups. This guide compares the two approaches, explores how tools like Prometheus and Grafana Mimir implement them, and discusses the critical trade-offs you make with each.
What Is High Cardinality and Why Is It a Problem?
In a time-series database (TSDB), a “time series” is a unique combination of a metric name and a set of key-value pairs called labels. Cardinality refers to the number of these unique combinations.
- Low Cardinality: http_requests_total{method="GET", status="200", job="api-server"}. This metric has a small, finite number of possible label combinations. Its cardinality is low and predictable.
- High Cardinality: http_requests_total{..., client_ip="1.2.3.4", user_id="u-5678"}. By adding labels with many unique values (IP addresses, user IDs, container IDs, session IDs), you create a combinatorial explosion. If you have 10,000 users making requests from 5,000 IP addresses, this single metric can generate up to 10,000 × 5,000 = 50 million unique time series in the worst case.
This label explosion creates severe problems for most modern TSDBs:
- Massive Index Size: The biggest issue isn’t the raw data points; it’s the index. Every unique time series requires an entry in the database’s index so it can be found quickly. A massive index consumes vast amounts of RAM and disk space, driving up observability cost.
- Slow Ingestion: The system struggles to process and index millions of new series arriving via remote_write or scraping, leading to ingestion delays and dropped data.
- Slow Queries: Queries that need to aggregate data across millions of series (e.g., sum(http_requests_total)) become painfully slow. The TSDB has to find every series in the index, load its data, and then perform the aggregation, consuming significant CPU and memory.
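If you want to see where that pressure is coming from in your own setup, a few widely used PromQL patterns can surface the worst offenders. These are standard queries (the first one touches every series, so run it sparingly on a busy server); the user_id and http_requests_total names simply mirror the example above:

```promql
# Top 10 metric names by number of active series.
topk(10, count by (__name__) ({__name__=~".+"}))

# How many series a single suspect metric has.
count(http_requests_total)

# How many distinct user_id values are driving that cardinality.
count(count by (user_id) (http_requests_total))
```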
Strategy 1: Aggressive Compression (The Prometheus & Mimir Model)
Modern TSDBs like the Prometheus TSDB and the horizontally scalable systems built on it (Grafana Mimir, Cortex, Thanos) don’t try to prevent high cardinality. Instead, they are engineered to manage its impact through sophisticated compression techniques.
In-Memory Index and Head Block
When new data arrives, it’s written to an in-memory “head block.” This is where active time series are indexed for fast writes and reads. High cardinality churns this head block rapidly, as new series are constantly being created. This leads to high memory usage, one of the first signs of a TSDB scaling problem.
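Prometheus reports the state of its own head block through self-monitoring metrics, so you can watch this pressure build before it becomes an incident. Below is a minimal alerting sketch on active series growth; the 2-million threshold is an arbitrary placeholder to tune against your instance’s memory budget:

```yaml
groups:
  - name: cardinality-guardrails
    rules:
      - alert: ActiveSeriesHigh
        # prometheus_tsdb_head_series is the number of series
        # currently held in the in-memory head block.
        expr: prometheus_tsdb_head_series > 2e6
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Active series count is approaching this instance's memory budget."
```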
Persistent Storage and Chunk Compression
Periodically, the in-memory data is flushed to disk in immutable blocks, each typically covering a two-hour window. This is where metric compression works its magic:
- Data Point Compression: Timestamps and values are compressed using techniques like delta-of-delta and XOR encoding, as pioneered by Facebook’s Gorilla paper. This chunk compression is incredibly efficient for the raw data points themselves.
- Index Compression: The labels and series metadata are also heavily compressed using dictionaries and other methods to reduce the on-disk footprint.
However, while these techniques drastically reduce storage costs, they do not solve the fundamental problem. The index, even when compressed on disk, is still logically massive. When you run a query, the system still has to decompress and process all that metadata to find the relevant series, which is why query performance degrades as cardinality grows. Compression makes storing the data feasible, but it doesn’t make querying it fast.
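You can observe the effect of this chunk compression in Prometheus’s own self-monitoring metrics. The ratio below approximates the average number of bytes needed per stored sample after compaction and, on well-behaved workloads, typically lands in the low single digits (metric names as exposed by recent Prometheus releases; verify them against your version):

```promql
# Average on-disk bytes per sample written by the compactor.
  prometheus_tsdb_compaction_chunk_size_bytes_sum
/
  prometheus_tsdb_compaction_chunk_samples_sum
```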
Strategy 2: Manual Roll-ups and Recording Rules (The Pre-aggregation Fix)
This is the traditional SRE approach to controlling cardinality. If you can’t afford to store and query the raw high-cardinality metrics, you don’t. You pre-aggregate the data into a new, lower-cardinality metric.
The Power of Recording Rules
The primary tool for this in the Prometheus ecosystem is recording rules. These are queries that run at a regular interval (e.g., every minute), with the result saved as a new time series.
For example, imagine you have a high-cardinality metric for HTTP requests that includes a user_id label. You can implement a roll-up strategy with a recording rule that removes this high-cardinality label and saves the aggregated total, creating a new, much lower-cardinality metric.
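Here is a minimal sketch of such a rule, reusing the metric and label names from the example above (the recorded metric name follows the common level:metric:operation convention; aggregate away whichever labels are exploding in your own schema):

```yaml
groups:
  - name: http_request_rollups
    interval: 1m
    rules:
      # Drop the high-cardinality dimensions, keeping only the
      # labels you know you will want to query long-term.
      - record: job:http_requests:rate5m
        expr: sum without (user_id, client_ip) (rate(http_requests_total[5m]))
```

Dashboards and long-term alerts then query job:http_requests:rate5m instead of the raw, high-cardinality series.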
Downsampling and Metrics Retention
This roll-up strategy is often paired with tiered metrics retention policies. You might:
- Keep the raw, high-cardinality data for a short period (e.g., 24-48 hours) for fine-grained debugging.
- Drop the high-cardinality labels after that period.
- Keep the rolled-up, low-cardinality metrics for much longer (e.g., 13+ months) for long-term trending.
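One common way to wire this up is to keep the raw data only on the local Prometheus (for example with --storage.tsdb.retention.time=48h) and forward just the rolled-up series to long-term storage. A sketch of the remote write side, assuming the colon-based recording-rule naming convention and a placeholder endpoint URL:

```yaml
remote_write:
  - url: https://mimir.example.com/api/v1/push  # placeholder endpoint
    write_relabel_configs:
      # Recording-rule outputs contain a colon in their name
      # (e.g. job:http_requests:rate5m); forward only those.
      - source_labels: [__name__]
        regex: '.+:.+'
        action: keep
```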
Systems like Grafana Mimir and Cortex ship a dedicated ruler component to evaluate these rules efficiently at scale, and Thanos adds automatic downsampling of older blocks in its compactor.
The Downsides of Manual Roll-ups
While effective at controlling costs and improving query performance, this approach has significant drawbacks:
- Loss of Granularity: You’ve permanently thrown away the details. If an incident occurs and you need to know which user_id was causing a spike three days ago, you can’t. The raw data is gone. Exemplars can link back to raw traces, but they don’t solve the problem of being unable to query the metric itself.
- Manual Toil: This process is incredibly brittle. Every time a new service introduces a high-cardinality metric, an SRE needs to manually write, test, and deploy a new recording rule. It’s a constant, error-prone battle to keep cardinality in check.
- Inflexibility: You must decide today which questions you will want to ask in the future. If you didn’t create a roll-up by customer ID, you will never be able to answer questions about a specific customer’s experience from your long-term metrics.
A Third Way: Adaptive, Real-Time Solutions
The choice between raw, expensive data and cheap, aggregated data is a false dichotomy. Modern observability platforms are moving towards a model of adaptive metrics that aims to provide the best of both worlds.
Netdata, for example, challenges this trade-off by moving intelligence to the edge. The Netdata Agent, running on each node, collects thousands of metrics every second. It can store this high-fidelity data locally for a configurable period, giving you raw, granular data for immediate debugging without any of the remote-write or ingestion bottlenecks.
When this data is streamed to a central backend, it can be intelligently downsampled over time without requiring manual recording rules. The system understands that as you “zoom out” on a chart from the last hour to the last month, you are interested in trends, not individual data points. This adaptive approach means:
- You retain raw data for debugging when you need it most—in the recent past.
- You get efficient, long-term storage for trends without manual configuration.
- You avoid the “pre-aggregation trap”, maintaining the flexibility to explore your data in new ways without being limited by decisions you made months ago.
Other strategies like metric deduplication, used by Mimir and Thanos, are also crucial for TSDB scaling in high-availability setups, but they solve a different problem than cardinality itself.
Conclusion: Know Your Trade-Offs
There is no perfect solution for metrics cardinality. The right strategy depends on your specific needs and budget.
- The Compression-First Approach (Prometheus): This model gives you full detail but puts the burden on your query engine and your wallet. It’s powerful if you can afford the hardware and tolerate slower queries at scale.
- The Roll-up Strategy (Mimir, Cortex): This model prioritizes query performance and storage costs but sacrifices data granularity and requires significant manual SRE effort to maintain. It’s a practical choice for large-scale, cost-sensitive environments.
- The Adaptive, Edge-First Approach (Netdata): This emerging model offers a compelling alternative, providing both high-fidelity raw data and long-term trends without the manual overhead of roll-ups, fundamentally changing the observability cost equation.
Understanding these trade-offs is the first step toward building an observability stack that is not only powerful but also sustainable. As your systems grow, the way you handle cardinality will be the single most important factor determining the success of your monitoring strategy.
Experience a new approach to metrics without the cardinality headache. Try Netdata for free today.