Cassandra secondary index pitfalls: when 2i scatters across the ring

A query on an indexed column behaves differently in Cassandra than in a relational database. Without a partition key to anchor the request, the coordinator cannot hash the value to a token range and choose the right replicas. It fans out to every node in the cluster, waits for each local index scan to complete, and collates the results. Adding nodes can make indexed queries slower because scatter is linear with cluster size.

Native 2i, SASI, and Storage Attached Indexes (SAI) in Cassandra 5.0 each handle this differently, but only SAI eliminates the unconditional scatter. Most production incidents involving secondary indexes trace back to one of three mistakes: indexing a high-cardinality column, ignoring the tombstone abort limit, or adding ALLOW FILTERING to force a full scan.

What it is and why it matters

A native secondary index (2i) is not a global structure. CREATE INDEX builds a hidden table on each node that maps the indexed column value back to partition keys stored locally on that node. Because Cassandra distributes data by partition key hash, the index is local and uncoordinated.

When a query includes the partition key, the coordinator knows exactly which replicas own the data. When a query filters on an indexed column without the partition key, the coordinator has no token range to target. It must ask every node to scan its local index, even if the answer is zero rows. The coordinator then merges the responses, deduplicates, applies limits, and returns the result set to the client.

This design optimizes for writes. The index is maintained locally as part of the write path with no cross-node coordination. The cost is shifted entirely to read time.

How it works

The request flow for a native 2i query without a partition key qualifier:

  1. The client sends a CQL query such as SELECT * FROM users WHERE email = '[email protected]' to the coordinator.
  2. The coordinator parses the query and detects a secondary index restriction with no partition key equality.
  3. Because the coordinator cannot compute a token range from email, it forwards the index scan request to every node in the cluster.
  4. Each node consults its local index table, an LSM-backed structure subject to the same compaction, bloom filters, and tombstone rules as any other table.
  5. Each node returns matching partition keys and row fragments to the coordinator.
  6. The coordinator collates results, resolves conflicts using timestamps, applies LIMIT, and streams the final result to the client.
flowchart LR
    Client([Client])
    Coordinator([Coordinator])
    Node1([Node 1])
    Node2([Node 2])
    NodeN([Node N])

    Client -->|SELECT WHERE email = x| Coordinator
    Coordinator -->|local index scan| Node1
    Coordinator -->|local index scan| Node2
    Coordinator -->|local index scan| NodeN
    Node1 -->|matches| Coordinator
    Node2 -->|matches| Coordinator
    NodeN -->|matches| Coordinator
    Coordinator -->|collate & limit| Client

High-cardinality columns amplify the waste. If email is nearly unique, almost every node scans its entire local index for zero results. Low-cardinality columns (for example, a status field with five values) are the intended use case, but even then the coordinator must touch every node. Ultra-low cardinality, such as a boolean, creates a large index that returns many rows from every node and becomes wasteful.

The index also inherits tombstones from the base table. A delete in the base table writes a tombstone to the hidden index table. When a query scans the index and encounters more than tombstone_failure_threshold cells (default 100,000), Cassandra aborts the query. Index tables are subject to compaction and gc_grace_seconds, but delete-heavy workloads can accumulate tombstones faster than compaction purges them, especially when repairs lag.

Where it shows up in production

High-cardinality lookups. Querying by a nearly unique value, such as a user ID, device serial number, or UUID, is the most common trap. The query returns one row but contacts every node in the cluster to find it.

Coordinator bottlenecks during paging. A query with LIMIT 10 does not ask ten nodes for ten rows each. The coordinator must collect enough rows from all nodes to satisfy the limit, sort them, and discard the excess. In large clusters this collates responses from hundreds of data partitions and can saturate the coordinator heap or network buffers.

Delete-heavy access patterns. Frequently updated or deleted columns accumulate tombstones in the index. When the 100,000 tombstone threshold is reached, the query fails outright. Because index tombstones may outpace compaction if repairs are delayed or gc_grace_seconds is long, the only immediate mitigation is often to drop and rebuild the index, or redesign the table.

ALLOW FILTERING as a workaround. When Cassandra rejects a query for performance reasons, developers sometimes add ALLOW FILTERING. This forces the database to scan the entire cluster and filter rows locally. Query time scales with total data volume, not result set size. A query that returns ten rows may scan ten million. In query logs, any occurrence of ALLOW FILTERING is a paging and latency incident waiting to happen. Use it only when the table is provably small.

Adding nodes makes it worse. Because 2i scatter is linear with cluster size, expanding the ring increases the number of nodes contacted per query. The feature penalizes horizontal scaling.

Tradeoffs and when to use it

The choice between indexing strategies depends on cardinality, version, and willingness to manage application-side denormalization.

ApproachRouting behaviorCardinality guidanceBuild impactVersion status
Native 2iFans out to every nodeLow only (e.g., status, category)Single-threaded per column; CPU and I/O intensiveAll versions
SASIExperimental local indexesPrefix, contains, sparse modesAdditional compaction overhead; requires Murmur3PartitionerDeprecated in 5.0; targeted for removal in 6.0
SAIRoutes to relevant replicas onlyLow to moderate; numerics, vectors, textDoes not block non-index queries; segment write buffer defaults to 1024 MB5.0+; recommended for new workloads
Query tableExact partition key lookup; no scatterAny cardinalityApplication-managed dual writesAll versions

Native 2i. Use it only for low-cardinality columns on small tables where scatter latency is acceptable. A 2i index cannot be altered; changing its definition requires dropping and recreating it, which triggers a full rebuild.

SASI. Do not deploy SASI for new workloads. It is deprecated in Cassandra 5.0 and targeted for removal in 6.0. Not-equals and OR predicate support were temporarily removed in 5.0, and it requires the Murmur3Partitioner.

SAI. In Cassandra 5.0, Storage Attached Indexes are the default and recommended choice. SAI indexes are co-located with SSTable data, use a KD-tree for numerics and a byte-ordered trie for strings, and support vector search. The coordinator selects the most selective index and routes requests to relevant replicas only, eliminating the blind scatter. SAI indexes stream zero-copy during bootstrap and decommission. However, do not mix 2i and SAI on the same table; doing so makes performance tuning opaque.

Query tables. For high-cardinality lookups, the safest pattern is a manually maintained query table. The application writes the lookup key as the partition key in a separate table. This avoids scatter entirely and gives predictable, single-node latency at the cost of dual writes and application complexity.

If your cluster uses custom authorization with secondary indexes, patch to 4.0.16+, 4.1.8+, or 5.0.3+ to address privilege escalation and authorization bypass CVEs disclosed in early 2025 that affect index mutation paths.

Signals to watch in production

Monitor the hidden index table via nodetool compactionstats during index builds. A sustained build on a large base table appears as a long-running compaction on the index table and consumes single-threaded CPU. You can also query system_schema.indexes to confirm index definitions and options before correlating with compaction activity.

SignalWhy it mattersWarning sign
Coordinator read latency (p99)Scatter queries inflate tail latency across every participating nodep99 > 3x baseline during indexed query windows
Client request timeoutsFan-out amplifies the probability that at least one replica is slowTimeout rate rising after index query deployment
Thread pool pending tasks (ReadStage)Index scans consume local read threads on every nodeSustained pending > 0 on multiple nodes concurrently
Tombstone scan warnings / aborted readsIndex tables inherit tombstones from base table deletesTombstoneOverwhelmingException or sustained warnings
Dropped messages (READ)Slow index responses expire in replica queuesNon-zero dropped READ rate correlated with index queries
SAI query timeouts (5.0+)SAI reduces scatter but can still pressure nodesTotalQueryTimeouts increasing under JMX StorageAttachedIndex
SAI index build state (5.0+)Queries targeting a rebuilding SAI index are rejectedis_queryable=false in system_views.indexes
ALLOW FILTERING in query logsIndicates full scans that bypass even limited index pruningAny appearance in production requires review

For SAI-specific capacity planning, also watch DiskPercentageOfBaseTable, SegmentBufferSpaceUsedBytes, and KDTreeChunkCacheMisses under the org.apache.cassandra.metrics:type=StorageAttachedIndex MBean. High cache miss rates on numeric range queries indicate the chunk cache is undersized.

How Netdata helps

  • Correlate coordinator read latency spikes with the rate of secondary index queries to confirm scatter-induced pressure.
  • Track JVM heap usage and GC pause duration alongside 2i index builds, which are single-threaded and CPU intensive.
  • Monitor disk I/O saturation across all nodes during scatter queries, because every node participates in a 2i read.
  • Alert on tombstone scan thresholds and dropped READ messages that correlate with hidden index table pressure.
  • For Cassandra 5.0 with SAI, surface JMX metrics from StorageAttachedIndex to detect query timeouts and cache misses before they cascade into client-visible latency.