MySQL stuck in InnoDB crash recovery: why startup hangs after an unclean shutdown
After an unclean shutdown, mysqld may look healthy: the process is running, port 3306 accepts connections, and SHOW GLOBAL STATUS LIKE 'Uptime' counts up. Yet applications time out, SELECT 1 hangs, and the error log prints InnoDB recovery messages. InnoDB is replaying the redo log to make data files consistent, rolling back incomplete transactions, merging the change buffer, and running purge. Until that finishes, queries are rejected.
Recovery time is proportional to unflushed work at crash time. A small redo log and clean transaction set recovers in seconds; gigabytes of redo, many dirty pages, or a long uncommitted transaction can take minutes or hours. Do not mistake the open TCP port for “MySQL is ready.” Do not restart because queries are slow, and do not confuse a single long recovery with a restart loop.
What this means
InnoDB is a write-ahead logging (WAL) engine. Committed changes go to the redo log first; dirty pages flush to tablespaces later. After a crash, the redo log may contain changes not yet in the data files. On startup, InnoDB automatically discovers tablespaces, applies the redo log, rolls back uncommitted transactions, merges the change buffer, and purges old row versions. See InnoDB recovery.
During redo log application, mysqld may accept TCP connections, but queries are rejected. Uptime counts from process start, so it increases while recovery runs. MySQL 8.0+ accepts connections early after redo application and finishes rollback in the background; until then, the server is not usable. If a tablespace referenced by the redo log is missing, recovery terminates. If the redo log is corrupt or rollback cannot complete, recovery may hang indefinitely.
flowchart TD
A[Unclean shutdown] --> B[Redo log holds unflushed changes]
B --> C[mysqld starts and listens on 3306]
C --> D[InnoDB replays redo log]
D --> E[Queries are rejected]
E --> F{Recovery complete?}
F -->|Yes| G[Queries accepted]
F -->|No| H[Wait, or use force recovery for extraction]Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Large redo log after a write burst | Error log shows log scan progressing slowly through many records; recovery takes minutes | Redo log capacity and checkpoint age before the crash |
| Long-running uncommitted transaction at crash | Error log shows “Rolling back trx id” for one transaction with little other progress | INNODB_TRX history or error log for the transaction ID |
| Missing tablespace referenced in redo | Recovery aborts with “cannot find tablespace” | .ibd file exists and innodb_directories covers all paths |
| Corrupt redo log or data page | Repeated errors, assertion failures, or looped “Out of space in the redo log” | Error log for stack traces and corruption messages |
| Restart loop misidentified as slow recovery | Uptime resets every few minutes; error log repeats crash and recovery | Process start time versus Uptime; systemd/kubelet restart events |
Quick checks
Run these read-only checks before making changes.
# mysqld is up, but recovery may still be running
mysqladmin -u monitor -p status
-- Server uptime
SHOW GLOBAL STATUS LIKE 'Uptime';
# Current recovery phase; adjust path to match log_error
tail -f /var/log/mysql/error.log
-- MySQL 8.0.30+: estimate redo log replay scope
SELECT
(SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME = 'Innodb_redo_log_current_lsn') -
(SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME = 'Innodb_redo_log_checkpoint_lsn')
AS checkpoint_age;
-- Redo log capacity
SHOW GLOBAL STATUS LIKE 'Innodb_redo_log_capacity_resized';
# Confirm whether queries execute or block
time mysql -u monitor -p -e "SELECT 1;"
# Compare process age to Uptime to detect a restart loop
ps -o pid,etime,cmd -p $(pgrep -x mysqld)
-- MySQL 8.0+: transactions rolling back after startup
SELECT COUNT(*) FROM information_schema.INNODB_TRX WHERE trx_state = 'ROLLING BACK';
How to diagnose it
Confirm the symptom. mysqld is running, TCP port 3306 is open, but
SELECT 1hangs or errors.Uptimeis low and climbing. Do not restart the server; a restart replays redo from the beginning.Read the error log from crash time forward. Look for “InnoDB: Starting crash recovery,” “Doing recovery,” “Rolling back trx id,” or “Applying log.” The log tells you whether InnoDB is making progress. Progress can be slow on large redo logs.
Estimate recovery scope. On MySQL 8.0.30+, compute checkpoint age from the Quick checks and compare it to
Innodb_redo_log_capacity_resized. A checkpoint age close to capacity means large replay. On earlier versions, parseSHOW ENGINE INNODB STATUSfor “Log sequence number” and “Last checkpoint at.” There is no built-in ETA; divide the gap by your storage throughput to approximate minutes, not seconds.Identify long rollback work. If the error log shows one transaction rolling back for a long time, that transaction may have run for hours before the crash. Rollback of a large transaction can take longer than the original transaction. The safe options are to wait, or to start with
innodb_force_recoveryand drop the offending table. Killing mysqld does not speed this up.Distinguish a restart loop. Compare the mysqld process start time from
pswithUptime. IfUptimeresets while the process is unchanged, MySQL is crashing and recovering in a loop. If the process is recreated, check systemd, Kubernetes, or your orchestrator for back-to-back restarts. Restart loops have different root causes: corrupt system tables, full disk, or invalid configuration.Handle stuck or corrupt recovery. If the error log shows no progress for a long period, repeated “Out of space in the redo log” messages (MySQL bug #107594), or corruption errors, use
innodb_force_recovery. The scale runs from 1 to 6; levels 4 through 6 place InnoDB in read-only mode. Use these only to extract data withmysqldumporSELECT, not for normal writes. See Forcing InnoDB Recovery.Check tablespace discovery. Recovery must find every tablespace referenced in the redo log. If a file was moved, deleted, or stored outside
datadir, recovery terminates. Verify thatinnodb_directories,datadir,innodb_data_home_dir, andinnodb_undo_directoryinclude the correct paths.
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
Uptime | Detects unexpected restarts | Sudden drop or repeated resets within an hour |
| Checkpoint age / redo log capacity | Scope of redo replay after crash | Gap > 50% of capacity before crash, or large absolute gap |
| Error log recovery phase | Shows whether recovery is progressing | Same phase and trx id stuck for many minutes |
Questions rate | Confirms the server is executing queries | Near zero after TCP is listening while connections exist |
Threads_connected | Misleading during recovery | High value before Uptime stabilizes and queries return |
INNODB_TRX rollback count | Reveals long rollback work | Transactions in ROLLING BACK state persisting |
Fixes
Wait for automatic recovery
If the error log shows steady progress, wait. Restarting mysqld aborts current recovery and restarts redo replay from the beginning, making the outage longer. Monitor the error log for “ready for connections.” Do not schedule backups or run FLUSH commands during recovery.
Start with innodb_force_recovery to extract data
If a runaway rollback blocks recovery, add innodb_force_recovery=3 to my.cnf, start, drop the offending table, then remove the setting and restart. This bypasses normal safety checks. Use it only to extract or remove the blocking object, not to resume production writes. Levels 4 through 6 make InnoDB read-only; plan for a full restore from backup afterward.
Restore missing tablespaces
If recovery terminates because a tablespace is missing, restore the missing .ibd file from backup or place it back in the path recorded in innodb_directories. Tablespace discovery must happen in the original location. If the tablespace is unimportant and has no foreign key relationships, you may be able to discard it after starting with innodb_force_recovery, but that risks data loss.
Break a restart loop
A restart loop is not a slow recovery. Identify the crash trigger from the error log just before each restart: disk full, assertion failure, corrupt data dictionary, or invalid system variable. Fix the underlying condition. Do not raise innodb_force_recovery above 3 unless you are extracting data.
Resize the redo log after recovery
If the root cause was an undersized redo log that repeatedly fills, plan a maintenance window to increase innodb_redo_log_capacity on MySQL 8.0.30+, or innodb_log_file_size and innodb_log_files_in_group on older versions. This does not speed up current recovery, but it reduces future replay size.
Never delete redo logs manually
Deleting #innodb_redo files or old ib_logfile* files to skip recovery causes data loss and can make the server unstartable. Remove redo logs only after a clean shutdown with innodb_fast_shutdown=0 or 1, and only when intentionally reinitializing the instance.
Prevention
- Use clean shutdowns:
mysqladmin shutdown, systemd graceful stops, or theSHUTDOWNSQL statement. Avoidkill -9on mysqld. - Before planned maintenance, set
innodb_fast_shutdown=0to perform a full purge and change buffer merge. This shortens recovery on restart at the cost of a slower shutdown. The default of1defers that work. - Size the redo log for your write rate. On MySQL 8.0.30+, use
innodb_redo_log_capacity. On earlier versions, useinnodb_log_file_sizeandinnodb_log_files_in_group. Monitor checkpoint age and keep peak usage below 75% of capacity. - Keep transactions short. Long uncommitted transactions block purge, locks, and create long rollback times after a crash.
- Do not move the data directory after a crash. Ensure
innodb_directoriesand related path variables cover all tablespaces, undo files, and external tables. - Monitor
Uptime, error log recovery keywords, and process restart events to distinguish a long recovery from a restart loop. - Maintain tested backups.
innodb_force_recoveryis an extraction tool, not a repair tool; final recovery usually involves a restore.
How Netdata helps
- Detects unexpected
Uptimedrops and restart-loop patterns. - Tracks
Innodb_redo_log_current_lsn,Innodb_redo_log_checkpoint_lsn, and capacity on MySQL 8.0.30+, showing replay scope. - Monitors
Questionsrate andThreads_connectedto expose when the server accepts TCP connections but does not execute queries. - Surfaces MySQL error log anomalies, including recovery phase keywords.
- Correlates MySQL signals with OS metrics such as disk I/O, memory pressure, and OOM kills to identify the original cause of the unclean shutdown.
Related guides
- How MySQL actually works in production: a mental model for operators
- MySQL Aborted_connects and Aborted_clients climbing: diagnosis
- MySQL adaptive hash index latch contention: high CPU, low throughput
- MySQL binary logs filling the disk: expiry, lagging replicas, and purge
- MySQL InnoDB buffer pool hit ratio collapse: the cliff edge
- MySQL slow after restart: buffer pool warm-up and the cold cache
- MySQL innodb_buffer_pool_size tuning: 60-80% of RAM and when that breaks
- MySQL Innodb_buffer_pool_wait_free > 0: buffer pool memory pressure
- MySQL InnoDB checkpoint age: the redo log capacity signal nobody watches
- MySQL connection exhaustion: detection, diagnosis, and prevention
- MySQL innodb_deadlock_detect=OFF: when deadlock detection becomes the bottleneck
- MySQL ERROR 1213: Deadlock found when trying to get lock; try restarting transaction







