Cassandra secondary index pitfalls: when 2i scatters across the ring
A query on an indexed column behaves differently in Cassandra than in a relational database. Without a partition key to anchor the request, the coordinator cannot hash the value to a token range and choose the right replicas. It fans out to every node in the cluster, waits for each local index scan to complete, and collates the results. Adding nodes can make indexed queries slower because scatter is linear with cluster size.
Native 2i, SASI, and Storage Attached Indexes (SAI) in Cassandra 5.0 each handle this differently, but only SAI eliminates the unconditional scatter. Most production incidents involving secondary indexes trace back to one of three mistakes: indexing a high-cardinality column, ignoring the tombstone abort limit, or adding ALLOW FILTERING to force a full scan.
What it is and why it matters
A native secondary index (2i) is not a global structure. CREATE INDEX builds a hidden table on each node that maps the indexed column value back to partition keys stored locally on that node. Because Cassandra distributes data by partition key hash, the index is local and uncoordinated.
When a query includes the partition key, the coordinator knows exactly which replicas own the data. When a query filters on an indexed column without the partition key, the coordinator has no token range to target. It must ask every node to scan its local index, even if the answer is zero rows. The coordinator then merges the responses, deduplicates, applies limits, and returns the result set to the client.
This design optimizes for writes. The index is maintained locally as part of the write path with no cross-node coordination. The cost is shifted entirely to read time.
How it works
The request flow for a native 2i query without a partition key qualifier:
- The client sends a CQL query such as
SELECT * FROM users WHERE email = '[email protected]'to the coordinator. - The coordinator parses the query and detects a secondary index restriction with no partition key equality.
- Because the coordinator cannot compute a token range from
email, it forwards the index scan request to every node in the cluster. - Each node consults its local index table, an LSM-backed structure subject to the same compaction, bloom filters, and tombstone rules as any other table.
- Each node returns matching partition keys and row fragments to the coordinator.
- The coordinator collates results, resolves conflicts using timestamps, applies
LIMIT, and streams the final result to the client.
flowchart LR
Client([Client])
Coordinator([Coordinator])
Node1([Node 1])
Node2([Node 2])
NodeN([Node N])
Client -->|SELECT WHERE email = x| Coordinator
Coordinator -->|local index scan| Node1
Coordinator -->|local index scan| Node2
Coordinator -->|local index scan| NodeN
Node1 -->|matches| Coordinator
Node2 -->|matches| Coordinator
NodeN -->|matches| Coordinator
Coordinator -->|collate & limit| ClientHigh-cardinality columns amplify the waste. If email is nearly unique, almost every node scans its entire local index for zero results. Low-cardinality columns (for example, a status field with five values) are the intended use case, but even then the coordinator must touch every node. Ultra-low cardinality, such as a boolean, creates a large index that returns many rows from every node and becomes wasteful.
The index also inherits tombstones from the base table. A delete in the base table writes a tombstone to the hidden index table. When a query scans the index and encounters more than tombstone_failure_threshold cells (default 100,000), Cassandra aborts the query. Index tables are subject to compaction and gc_grace_seconds, but delete-heavy workloads can accumulate tombstones faster than compaction purges them, especially when repairs lag.
Where it shows up in production
High-cardinality lookups. Querying by a nearly unique value, such as a user ID, device serial number, or UUID, is the most common trap. The query returns one row but contacts every node in the cluster to find it.
Coordinator bottlenecks during paging. A query with LIMIT 10 does not ask ten nodes for ten rows each. The coordinator must collect enough rows from all nodes to satisfy the limit, sort them, and discard the excess. In large clusters this collates responses from hundreds of data partitions and can saturate the coordinator heap or network buffers.
Delete-heavy access patterns. Frequently updated or deleted columns accumulate tombstones in the index. When the 100,000 tombstone threshold is reached, the query fails outright. Because index tombstones may outpace compaction if repairs are delayed or gc_grace_seconds is long, the only immediate mitigation is often to drop and rebuild the index, or redesign the table.
ALLOW FILTERING as a workaround. When Cassandra rejects a query for performance reasons, developers sometimes add ALLOW FILTERING. This forces the database to scan the entire cluster and filter rows locally. Query time scales with total data volume, not result set size. A query that returns ten rows may scan ten million. In query logs, any occurrence of ALLOW FILTERING is a paging and latency incident waiting to happen. Use it only when the table is provably small.
Adding nodes makes it worse. Because 2i scatter is linear with cluster size, expanding the ring increases the number of nodes contacted per query. The feature penalizes horizontal scaling.
Tradeoffs and when to use it
The choice between indexing strategies depends on cardinality, version, and willingness to manage application-side denormalization.
| Approach | Routing behavior | Cardinality guidance | Build impact | Version status |
|---|---|---|---|---|
| Native 2i | Fans out to every node | Low only (e.g., status, category) | Single-threaded per column; CPU and I/O intensive | All versions |
| SASI | Experimental local indexes | Prefix, contains, sparse modes | Additional compaction overhead; requires Murmur3Partitioner | Deprecated in 5.0; targeted for removal in 6.0 |
| SAI | Routes to relevant replicas only | Low to moderate; numerics, vectors, text | Does not block non-index queries; segment write buffer defaults to 1024 MB | 5.0+; recommended for new workloads |
| Query table | Exact partition key lookup; no scatter | Any cardinality | Application-managed dual writes | All versions |
Native 2i. Use it only for low-cardinality columns on small tables where scatter latency is acceptable. A 2i index cannot be altered; changing its definition requires dropping and recreating it, which triggers a full rebuild.
SASI. Do not deploy SASI for new workloads. It is deprecated in Cassandra 5.0 and targeted for removal in 6.0. Not-equals and OR predicate support were temporarily removed in 5.0, and it requires the Murmur3Partitioner.
SAI. In Cassandra 5.0, Storage Attached Indexes are the default and recommended choice. SAI indexes are co-located with SSTable data, use a KD-tree for numerics and a byte-ordered trie for strings, and support vector search. The coordinator selects the most selective index and routes requests to relevant replicas only, eliminating the blind scatter. SAI indexes stream zero-copy during bootstrap and decommission. However, do not mix 2i and SAI on the same table; doing so makes performance tuning opaque.
Query tables. For high-cardinality lookups, the safest pattern is a manually maintained query table. The application writes the lookup key as the partition key in a separate table. This avoids scatter entirely and gives predictable, single-node latency at the cost of dual writes and application complexity.
If your cluster uses custom authorization with secondary indexes, patch to 4.0.16+, 4.1.8+, or 5.0.3+ to address privilege escalation and authorization bypass CVEs disclosed in early 2025 that affect index mutation paths.
Signals to watch in production
Monitor the hidden index table via nodetool compactionstats during index builds. A sustained build on a large base table appears as a long-running compaction on the index table and consumes single-threaded CPU. You can also query system_schema.indexes to confirm index definitions and options before correlating with compaction activity.
| Signal | Why it matters | Warning sign |
|---|---|---|
| Coordinator read latency (p99) | Scatter queries inflate tail latency across every participating node | p99 > 3x baseline during indexed query windows |
| Client request timeouts | Fan-out amplifies the probability that at least one replica is slow | Timeout rate rising after index query deployment |
| Thread pool pending tasks (ReadStage) | Index scans consume local read threads on every node | Sustained pending > 0 on multiple nodes concurrently |
| Tombstone scan warnings / aborted reads | Index tables inherit tombstones from base table deletes | TombstoneOverwhelmingException or sustained warnings |
| Dropped messages (READ) | Slow index responses expire in replica queues | Non-zero dropped READ rate correlated with index queries |
| SAI query timeouts (5.0+) | SAI reduces scatter but can still pressure nodes | TotalQueryTimeouts increasing under JMX StorageAttachedIndex |
| SAI index build state (5.0+) | Queries targeting a rebuilding SAI index are rejected | is_queryable=false in system_views.indexes |
| ALLOW FILTERING in query logs | Indicates full scans that bypass even limited index pruning | Any appearance in production requires review |
For SAI-specific capacity planning, also watch DiskPercentageOfBaseTable, SegmentBufferSpaceUsedBytes, and KDTreeChunkCacheMisses under the org.apache.cassandra.metrics:type=StorageAttachedIndex MBean. High cache miss rates on numeric range queries indicate the chunk cache is undersized.
How Netdata helps
- Correlate coordinator read latency spikes with the rate of secondary index queries to confirm scatter-induced pressure.
- Track JVM heap usage and GC pause duration alongside 2i index builds, which are single-threaded and CPU intensive.
- Monitor disk I/O saturation across all nodes during scatter queries, because every node participates in a 2i read.
- Alert on tombstone scan thresholds and dropped READ messages that correlate with hidden index table pressure.
- For Cassandra 5.0 with SAI, surface JMX metrics from
StorageAttachedIndexto detect query timeouts and cache misses before they cascade into client-visible latency.
Related guides
- Cassandra adding and removing nodes safely: vnodes, tokens, and cleanup
- Cassandra compaction strategies: STCS vs LCS vs TWCS vs UCS
- Cassandra consistency levels explained: QUORUM, ONE, LOCAL_QUORUM, and EACH_QUORUM
- Cassandra dropped mutations: silent write loss and load shedding
- Cassandra zombie data resurrection: gc_grace_seconds and unrepaired tombstones
- Cassandra disk space exhaustion: emergency recovery when the data volume fills







