ClickHouse query error rate high: reading exception codes 241, 252, and 999

What this means

ClickHouse logs failed queries to system.query_log with type = ‘ExceptionWhileProcessing’ and a numeric exception_code. The FailedQuery counter in system.events increments for each failure. Keep the FailedQuery / Query ratio below 1%. A sustained climb above that threshold signals a systemic issue, not a few bad queries.

The codes that dominate production incidents fall into five classes.

  • 241 MEMORY_LIMIT_EXCEEDED
  • 252 TOO_MANY_PARTS
  • 159 read timeout on distributed queries (shard unavailable)
  • 999 / 210 ZooKeeper or network session errors
  • 60 / 81 schema drift or missing objects
flowchart TD
    A[Query error rate above 1 percent] --> B[Query system.query_log by exception_code]
    B --> C{Code 241}
    B --> D{Code 252}
    B --> E{Code 159}
    B --> F{Code 999 or 210}
    B --> G{Code 60 or 81}
    C --> H[Memory pressure or runaway query]
    D --> I[Merge crisis or too many parts]
    E --> J[Shard unavailable or network timeout]
    F --> K[ZooKeeper session loss or coordination failure]
    G --> L[Schema drift or missing table]

Common causes

CauseWhat it looks likeFirst thing to check
Runaway query or cache evictionCode 241 spike; MemoryTracking near server limitsystem.processes ordered by memory_usage
Merge debt or insert micro-batchingCode 252 spike; insert latency climbs before rejectionssystem.parts per partition; system.events RejectedInserts
Shard unavailable or slow replicaCode 159 spike on distributed tablessystem.replicas absolute_delay and queue_size on target shards
ZooKeeper / Keeper session lossCodes 999/210 spike; replicated tables become readonlysystem.zookeeper_connection is_expired; system.replicas is_session_expired
Schema drift after deployCodes 60/81 spike following a migrationsystem.distributed_ddl_queue status; query text in system.query_log

Quick checks

Run these read-only queries to classify the failure.

Failed versus total query ratio:

SELECT event, value
FROM system.events
WHERE event IN ('FailedQuery', 'FailedSelectQuery', 'FailedInsertQuery',
                'Query', 'SelectQuery', 'InsertQuery');

Compare FailedQuery to Query. If the ratio is above 0.01 for more than five minutes, investigate.

Exception codes during the incident window:

SELECT exception_code,
       count() AS count,
       any(exception) AS sample_exception
FROM system.query_log
WHERE type = 'ExceptionWhileProcessing'
  AND event_time > now() - INTERVAL 10 MINUTE
GROUP BY exception_code
ORDER BY count DESC;

Top memory consumers right now:

SELECT query_id, user, memory_usage, peak_memory_usage, elapsed,
       substring(query, 1, 100) AS query_prefix
FROM system.processes
WHERE memory_usage > 0
ORDER BY memory_usage DESC
LIMIT 10;

Part count per partition:

SELECT database, table, partition_id, count() AS active_parts
FROM system.parts
WHERE active = 1
GROUP BY database, table, partition_id
ORDER BY active_parts DESC
LIMIT 20;

Replica and coordination health:

# Replicas that cannot accept writes
SELECT database, table, is_readonly, is_session_expired, queue_size
FROM system.replicas
WHERE is_readonly = 1 OR is_session_expired = 1;

# ZooKeeper / Keeper connection state
SELECT name, host, port, is_expired, session_uptime_elapsed_seconds
FROM system.zookeeper_connection;

Stuck distributed DDL:

SELECT entry, host_name, status, exception_text
FROM system.distributed_ddl_queue
WHERE status != 'Finished'
ORDER BY entry DESC
LIMIT 10;

How to diagnose it

  1. Confirm the error rate. Query system.events for FailedQuery and Query. Calculate the ratio over a five-minute window. Exclude monitoring probes by filtering out known health-check users or query patterns.
  2. Classify by exception code. Query system.query_log for ExceptionWhileProcessing grouped by exception_code over the incident window.
  3. For code 241 (MEMORY_LIMIT_EXCEEDED): Check system.processes for queries with high peak_memory_usage. Check MemoryTracking in system.metrics against max_server_memory_usage. Look for spill-to-disk activity in the server tmp directory.
  4. For code 252 (TOO_MANY_PARTS): Check active part count per partition in system.parts. Check RejectedInserts and DelayedInserts in system.events. Verify that system.merges shows merges running; if not, the pool may be blocked by mutations or I/O starvation.
  5. For code 159 (read timeout): Check system.replicas on each shard for absolute_delay and queue_size. A replica with high lag or a stuck replication_queue entry slows distributed queries. Verify inter-server network latency between the coordinator and target shards.
  6. For codes 999/210 (ZooKeeper / network): Check system.zookeeper_connection for is_expired. Check system.replicas for is_session_expired. Test coordination service responsiveness with echo ruok | nc zk-host 2181 and echo mntr | nc zk-host 2181.
  7. For codes 60/81 (schema drift): Check system.distributed_ddl_queue for entries with status != ‘Finished’. Compare CREATE TABLE statements across shards. Review recent migrations or ALTER operations.
  8. Correlate with deployment times. A code change or config rollout preceding the spike suggests schema or query plan regression.

Metrics and signals to monitor

SignalWhy it mattersWarning sign
FailedQuery / Query ratioOverall query health> 1% sustained for five minutes
exception_code distributionIdentifies the failing domainAny code class spiking above baseline
MemoryTrackingMemory pressure precursor> 80% of max_server_memory_usage
MaxPartCountForPartitionDistance from too-many-parts cliff> 50% of your configured parts_to_throw_insert
RejectedInsertsHard insert failureAny non-zero sustained rate
absolute_delayReplica data freshness> 120 seconds sustained
ZooKeeper latencyCoordination health> 10 ms p99 or session expiry
system.distributed_ddl_queue statusSchema propagationAny non-Finished entry persisting > five minutes

Fixes

Group your response by the exception code class dominating the failure.

Memory limit exceeded (241)

Identify the heaviest consumer in system.processes and kill it:

Warning: KILL QUERY aborts running work and returns an error to the client.

KILL QUERY WHERE query_id = '...';

Set or lower per-query memory limits via the user profile max_memory_usage. Enable max_bytes_before_external_group_by to spill heavy GROUP BY operations to disk. If the load is concentrated under specific users, restrict their concurrency via profile limits.

Tradeoff: Spilling to disk saves queries from being killed, but it increases disk I/O load and query latency.

Too many parts (252)

Pause or throttle insert sources immediately. Increase batch sizes to at least 1000 rows per INSERT. Check system.mutations for blocking background work and kill non-critical mutations if they are consuming merge threads.

Warning: KILL MUTATION is disruptive and can leave partial changes in place until the next merge.

Review whether async_insert is appropriate for your write pattern.

Tradeoff: Pausing inserts stops data loss but creates upstream backlog. Recovery can take minutes to hours as merges catch up.

Shard unavailable / timeout (159)

Identify the slow shard by checking system.replicas on each node for replication lag and queue depth. If a specific replica is lagging, route traffic away from it temporarily if your client supports it. For stuck replication entries, SYSTEM RESTART REPLICA can force re-evaluation.

Warning: SYSTEM RESTART REPLICA is disruptive and interrupts replication on that table.

Tradeoff: Rerouting reads reduces load on the struggling shard but concentrates it elsewhere.

ZooKeeper / network (999/210)

Pause all DDL operations immediately to reduce coordination load. Check ensemble health with ruok and mntr. Do NOT restart ClickHouse nodes during a coordination overload; reconnection storms worsen the problem. If using JVM-based ZooKeeper, check GC logs and disk latency for the transaction log. ClickHouse Keeper does not have GC pauses, but disk I/O saturation still hurts it.

Tradeoff: Halting DDL buys recovery time but blocks schema changes.

Schema drift (60/81)

Review system.distributed_ddl_queue for exceptions tied to specific hosts. Align CREATE TABLE statements across shards, or re-run the migration. If one replica missed a DDL entry, apply the change locally.

Tradeoff: Fixing drift may require brief table unavailability on affected shards.

Prevention

  • Monitor active part count per partition, not just per table. Alert when any partition crosses 50% of your configured parts_to_delay_insert threshold.
  • Track the merge-to-insert ratio. If part creation rate exceeds merge completion rate for more than 30 minutes, investigate before RejectedInserts appears.
  • Cap per-query memory with max_memory_usage and set max_server_memory_usage leaving at least 20% headroom.
  • Monitor ZooKeeper or Keeper latency as a first-class signal, not just process liveness.
  • Exclude monitoring probes from error-rate calculations by filtering on user or query text pattern.
  • Validate schema changes with distributed DDL and verify completion on every node before declaring a migration done.

How Netdata helps

  • Correlates FailedQuery rate with MemoryTracking, active parts, and background pool utilization.
  • Surfaces exception_code spikes from system.query_log without manual polling.
  • Alerts on MaxPartCountForPartition and RejectedInserts before code 252 errors flood clients.
  • Tracks ZooKeeper session state and replication lag alongside query errors to distinguish code 999 from application issues.
  • Excludes internal health-check traffic from error-rate calculations via user or query pattern filtering.