Cassandra CorruptSSTableException and FSError: disk failure and recovery

A Cassandra node stops serving traffic or refuses to start. In system.log you see org.apache.cassandra.io.sstable.CorruptSSTableException, FSError, or a JVM shutdown triggered by a filesystem exception. These indicate disk failure, filesystem corruption, or irreversible SSTable damage, not transient application errors that a retry can fix.

Because SSTables are immutable, a corrupt file cannot be patched. The node either stops serving the data or shuts down, depending on disk_failure_policy. Recovery requires at least one healthy replica. Without that, corruption is data loss.

This guide covers confirming the failure, determining whether the node failed at startup or runtime, and replacing damaged SSTables safely.

What this means

CorruptSSTableException signals an internal consistency failure inside an SSTable: checksum mismatch, corrupt block, malformed row, or unexpected EOF. FSError wraps lower-level filesystem errors such as I/O errors, permission denials, or an unavailable disk. Cassandra handles both according to the configured disk_failure_policy.

The default policy is stop. A storage exception shuts down gossip and native transport; the node remains alive but stops serving client traffic. Other values:

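  • die - shut down the JVM immediately.
  • stop_paranoid - like stop, but gossip and native transport are also shut down for errors confined to a single SSTable.
  • best_effort - stop using the failed disk directory and continue on the remaining directories until the node restarts.
  • ignore - log and continue. Requests fail silently. Never use in production.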
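To know which of these behaviors to expect on a given node, it helps to check what its cassandra.yaml actually sets. The following is a minimal sketch that reads the value with a regular expression instead of a YAML parser; the function name and the sample text are illustrative, and it returns None when the key is absent rather than guessing a fallback.

```python
import re

# Values listed in the comments of the stock cassandra.yaml.
KNOWN_POLICIES = {"die", "stop_paranoid", "stop", "best_effort", "ignore"}


def read_disk_failure_policy(yaml_text):
    """Return the disk_failure_policy set in cassandra.yaml text, or None."""
    for line in yaml_text.splitlines():
        # Only an uncommented, top-level key counts; trailing comments are ignored.
        match = re.match(r"^disk_failure_policy\s*:\s*([A-Za-z_]+)", line)
        if match:
            return match.group(1)
    return None


# Illustrative fragment, not copied from a real node.
sample = """\
# disk_failure_policy: die
disk_failure_policy: stop   # illustrative value
commit_failure_policy: stop
"""
policy = read_disk_failure_policy(sample)
print(policy, policy in KNOWN_POLICIES)
```

On a real node you would pass the contents of the node's own cassandra.yaml; its location depends on how Cassandra was installed.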

The JMX counter org.apache.cassandra.metrics:type=Storage,name=Exceptions tracks uncaught storage subsystem errors. Any non-zero rate, or any CorruptSSTableException or FSError in the logs, is a page-level event.
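The log half of that alert can be sketched as a simple scanner. This is an assumption-laden example, not a replacement for a log shipper: the function name is hypothetical, the sample lines are invented for illustration, and the pattern also matches FSReadError and FSWriteError, which are FSError subclasses.

```python
import re

# FSReadError and FSWriteError extend FSError, so match them as well.
STORAGE_ERROR = re.compile(r"\b(CorruptSSTableException|FS(?:Read|Write)?Error)\b")


def find_storage_errors(lines):
    """Yield (line_number, exception_name, line) for each matching log line."""
    for number, line in enumerate(lines, start=1):
        match = STORAGE_ERROR.search(line)
        if match:
            yield number, match.group(1), line.rstrip("\n")


# Invented log lines for illustration only.
sample_log = [
    "INFO  [main] 2024-01-01 10:00:00,000 Example.java:1 - Starting up\n",
    "ERROR [ReadStage-2] 2024-01-01 10:05:00,000 Example.java:2 - "
    "org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: /path/nb-12-big-Data.db\n",
    "ERROR [CompactionExecutor:3] 2024-01-01 10:06:00,000 Example.java:3 - "
    "org.apache.cassandra.io.FSReadError: java.io.IOException: Input/output error\n",
]
hits = list(find_storage_errors(sample_log))
for number, name, _ in hits:
    print(number, name)
```

To scan a real node, pass an open file object for its system.log instead of the sample list.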

flowchart TD
    A[Log shows CorruptSSTableException] --> B{Startup or runtime?}
    B -->|Startup| C[Node fails to join]
    B -->|Runtime| D[Node stops serving traffic]
    C --> E[Find bad SSTable path]
    D --> E
    E --> F[Run nodetool verify]
    F --> G[Remove SSTable and repair]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Disk hardware failure | FSError or IOError in system.log; node exits or stops transport | OS logs (dmesg) and SMART status for disk errors |
| SSTable corruption from failed compaction or bit rot | CorruptSSTableException naming a specific SSTable path during startup or reads | The exact file prefix in the exception message |
| Unexpected shutdown during write or compaction | CorruptSSTableException, EOFException, or checksum errors after a crash | Node uptime, OOM kills, or power events preceding the failure |
| Filesystem or kernel fault | FSError without accompanying SMART errors | Kernel logs and filesystem consistency |

Startup vs runtime behavior

The recovery path depends on when Cassandra encounters the error.

Startup failure. If Cassandra finds a corrupt SSTable during initialization, it logs the path and the JVM exits. This applies to stop as well as die: the cassandra.yaml comments state that stop kills the JVM for errors during startup. The exception stack trace includes the full path to the component file. Note the keyspace, table name, and SSTable generation number.
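Pulling those three values out of the logged path can be sketched as follows. The sketch assumes the directory layout <data_dir>/<keyspace>/<table>-<table_id>/ and file names of the form <version>-<generation>-<format>-<component>, as used by recent Cassandra releases; older releases name files differently, and the sample path is made up.

```python
from pathlib import PurePosixPath


def parse_sstable_path(path):
    """Split an SSTable component path into the fields needed for recovery.

    Assumes <data_dir>/<keyspace>/<table>-<table_id>/<file>, with the file
    named <version>-<generation>-<format>-<component>. Raises ValueError if
    the file name has fewer than four dash-separated parts.
    """
    p = PurePosixPath(path)
    version, generation, fmt, component = p.name.split("-", 3)
    return {
        "keyspace": p.parent.parent.name,
        # Drop the table id suffix that follows the last dash.
        "table": p.parent.name.rsplit("-", 1)[0],
        "version": version,
        "generation": generation,
        "format": fmt,
        "component": component,
        # Every component file of the same SSTable starts with this prefix.
        "prefix": f"{version}-{generation}-{fmt}-",
    }


# Hypothetical path in the shape an exception message would carry.
info = parse_sstable_path(
    "/var/lib/cassandra/data/shop/orders-0a1b2c3d4e5f11eebe560242ac120002/nb-42-big-Data.db"
)
print(info["keyspace"], info["table"], info["generation"], info["prefix"])
```

The trailing dash in the prefix matters: it keeps generation 42 from also matching generation 421.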

Runtime failure. If the error occurs during a read or compaction on a live node, disk_failure_policy determines the result. With stop, the node halts gossip and native transport but remains running. The exception is logged with the SSTable path. Client connections drop, but the JVM stays up, which can aid diagnostics. With die, the JVM exits and the node leaves the ring.

Confirming the failure

Before replacing data, confirm the failure at both the OS and Cassandra layers.

  1. Inspect system.log for the exact exception. A CorruptSSTableException includes the SSTable path. An FSError includes the underlying cause (for example, java.io.IOError or java.io.FileNotFoundException with I/O details).
  2. Check OS-level disk health. Run dmesg for I/O errors, bus resets, or filesystem remounts. Run smartctl against the physical device to check reallocated sectors, pending sectors, or command timeouts. These are read-only checks.
  3. Check filesystem consistency. If the OS reports errors, run a filesystem check. Warning: do not run fsck on a mounted read-write filesystem. Schedule maintenance or boot into recovery mode.
  4. Verify the SSTable. If the JVM is still up (a runtime failure under policy stop), run nodetool verify against the table to force checksum validation. If the JVM is down, nodetool is unavailable.
  5. Check replica availability. Before removing any local data, confirm that other replicas are healthy and current. Run nodetool status from a healthy node and confirm that the nodes holding the other replicas are UN. nodetool status does not report the replication factor, so check it in the keyspace definition. If replication factor is 1 or other replicas are down, removing the SSTable causes data loss.
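The replica check in step 5 can be partly automated by parsing nodetool status output. The sketch below assumes the usual layout, where each node line starts with a two-letter code (Up or Down, then Normal, Leaving, Joining, or Moving) followed by the address. The function name is hypothetical and the sample output is abbreviated and invented.

```python
import re

# Node lines start with a status/state code, then whitespace, then the address.
STATUS_LINE = re.compile(r"^([UD?][NLJM])\s+(\S+)")


def nodes_not_un(status_output):
    """Return (address, code) for every node whose code is not UN."""
    result = []
    for line in status_output.splitlines():
        match = STATUS_LINE.match(line)
        if match and match.group(1) != "UN":
            result.append((match.group(2), match.group(1)))
    return result


# Invented sample; real host IDs are UUIDs.
sample_status = """\
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID  Rack
UN  10.0.0.1   1.1 GiB    16      100.0%            aaaa     rack1
DN  10.0.0.2   1.0 GiB    16      100.0%            bbbb     rack1
UN  10.0.0.3   1.2 GiB    16      100.0%            cccc     rack1
"""
unhealthy = nodes_not_un(sample_status)
print(unhealthy)
```

An empty result only says that every node is up and in the normal state; it does not prove that the replicas for the affected token ranges are current.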

Recovering from corruption

Once you have identified the corrupt SSTable and confirmed healthy replicas exist, quarantine the files and repair the node.

Quarantine the SSTable. Move all files sharing the corrupt SSTable prefix out of the data directory and into a quarantine directory. Do not delete them until the repair completes and you confirm data consistency. The exception message names the base filename; move every file with that prefix. If multiple SSTables are corrupt, quarantine all of them before restarting. Starting the node with a partially removed SSTable (for example, leaving an index or summary file behind) will trigger new errors.

Warning: Do not remove SSTable files while Cassandra is running. If the node is up but transport is stopped, stop Cassandra before moving files to avoid file-handle issues or additional crashes.
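The quarantine step can be sketched as a small helper that moves every file sharing the SSTable prefix, so that no index or summary component is left behind. The function name and directory names are illustrative, the demonstration runs against a temporary directory with empty placeholder files, and the helper is meant to be run only while Cassandra is stopped.

```python
import shutil
import tempfile
from pathlib import Path


def quarantine_sstable(table_dir, prefix, quarantine_dir):
    """Move every file in table_dir whose name starts with prefix.

    Returns the sorted names of the files that were moved.
    """
    table_dir, quarantine_dir = Path(table_dir), Path(quarantine_dir)
    quarantine_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    # sorted() materializes the listing before any file is moved.
    for item in sorted(table_dir.iterdir()):
        if item.is_file() and item.name.startswith(prefix):
            shutil.move(str(item), str(quarantine_dir / item.name))
            moved.append(item.name)
    return moved


# Demonstration against a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    table_dir = Path(tmp) / "orders"
    table_dir.mkdir()
    for name in ["nb-42-big-Data.db", "nb-42-big-Index.db", "nb-421-big-Data.db"]:
        (table_dir / name).touch()
    moved = quarantine_sstable(table_dir, "nb-42-big-", Path(tmp) / "quarantine")
    remaining = sorted(p.name for p in table_dir.iterdir())
    print(moved, remaining)
```

The prefix includes its trailing dash so that quarantining generation 42 leaves generation 421 in place.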

Restart the node. If the node was down, start Cassandra. With the corrupt files removed, it should join the ring. If disk_failure_policy was stop and the node was still running, restart Cassandra to re-enable gossip and native transport. Verify that nodetool status shows the node as UN and that no new exceptions appear.

Repair the data. Run a repair on the affected keyspace so the node streams the missing data from healthy replicas. Use a full repair: incremental repair only compares data that is not yet marked as repaired, so it can skip data that the quarantined SSTable held. Monitor nodetool netstats during the streaming phase to confirm data is moving from the correct replicas. After repair completes, run nodetool verify on the repaired table to confirm the new local SSTables pass validation. Once confirmed, you can safely delete the quarantined files.
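As a rough sketch, the commands for this step can be assembled as argument lists that a runbook script prints or passes to subprocess.run. The function name and the keyspace and table values are placeholders; the sketch only builds the commands and does not execute them.

```python
import shlex


def repair_commands(keyspace, table):
    """Return the nodetool invocations for the repair step."""
    return [
        # Full repair of the affected table, streaming from healthy replicas.
        ["nodetool", "repair", "--full", keyspace, table],
        # Run in a second terminal while the repair is streaming.
        ["nodetool", "netstats"],
        # Validate the local SSTables once the repair has finished.
        ["nodetool", "verify", keyspace, table],
    ]


commands = repair_commands("shop", "orders")
for argv in commands:
    print(shlex.join(argv))
```

Flags vary between releases; some versions may require an extra option before verify will run, so check nodetool help verify on the node first.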

Warning: Repair generates significant cluster load. Run it during low-traffic windows and monitor for streaming errors or compaction backlog.

If replication factor is 1. You have no replica source. If the SSTable is unrecoverable, the data is lost. Attempt filesystem-level recovery or restore from backup before starting Cassandra without the file.

Disk failure policy considerations

The choice of disk_failure_policy changes the recovery steps.

  • stop (default): The node becomes unavailable for traffic but stays running. This limits client-visible errors but requires a restart after you fix the underlying issue.
  • die: The JVM exits immediately. Requests in flight on the node fail, writes not yet flushed to SSTables must be replayed from the commit log on restart, and you must restart the node. This is the safest option if you prefer fast failure over limping.
  • best_effort: Cassandra blacklists the failed directory for the current session and continues using remaining directories. This is only useful when multiple data_file_directories are configured. Be aware that best_effort can mask creeping disk failure if one directory degrades while others appear healthy.
  • ignore: The node logs the error and continues. Reads and writes fail silently or return partial data. Do not use ignore in production.

Monitoring and prevention

Detect these failures before they force a node outage.

  • Alert on the JMX Storage/Exceptions counter. Any sustained increase indicates hardware or filesystem issues.
  • Alert on CorruptSSTableException and FSError strings in system.log.
  • Monitor OS disk health metrics (SMART attributes, filesystem errors, dmesg I/O errors) alongside Cassandra logs. A CorruptSSTableException without a preceding OS error suggests bit rot or a Cassandra-level bug; an FSError usually indicates hardware.
  • Keep disk_failure_policy at stop or die. Use best_effort only when you have multiple independent data directories and understand the failure-isolation behavior.
  • Correlate Cassandra errors with disk I/O metrics. If you are using Netdata, compare the time of the exception against the disk.await chart for the affected volume. Sustained latency spikes or I/O error counters that align with FSError logs confirm hardware degradation.
  • Schedule regular nodetool verify on critical tables during maintenance windows. This catches bit rot and compaction defects before they cause a runtime failure.
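The correlation described in the list above can be sketched as a time-window comparison between an exception timestamp and disk latency samples. Everything here is an assumption for illustration: the function name, the five-minute window, the 100 ms threshold, and the sample values. Real samples would come from your monitoring system's export.

```python
from datetime import datetime, timedelta


def latency_spikes_near(event_time, samples, window=timedelta(minutes=5), threshold_ms=100.0):
    """Return samples within `window` of event_time whose latency exceeds threshold_ms.

    `samples` is an iterable of (timestamp, await_ms) pairs.
    """
    return [
        (ts, await_ms)
        for ts, await_ms in samples
        if abs(ts - event_time) <= window and await_ms > threshold_ms
    ]


# Invented data: one exception and four latency samples.
event = datetime(2024, 1, 1, 10, 5, 0)
samples = [
    (datetime(2024, 1, 1, 9, 50, 0), 250.0),   # spike, but outside the window
    (datetime(2024, 1, 1, 10, 3, 0), 4.0),     # inside the window, normal latency
    (datetime(2024, 1, 1, 10, 4, 0), 180.0),   # inside the window, spike
    (datetime(2024, 1, 1, 10, 6, 0), 320.0),   # inside the window, spike
]
spikes = latency_spikes_near(event, samples)
print([value for _, value in spikes])
```

Spikes clustered around the exception time support a hardware explanation; their absence points back toward bit rot or a software cause.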