ClickHouse Too many simultaneous queries: max_concurrent_queries and query storms
You run a query and ClickHouse returns “Too many simultaneous queries.” New connections either queue or fail outright. Queries that completed in seconds yesterday now time out. The server is not down, but it is not usable.
ClickHouse is optimized for fewer, heavier analytical queries. As concurrency rises, CPU, memory, and I/O contention increase non-linearly. A small spike can become a storm because each query is greedy: it fans out across multiple threads and claims large memory buffers and disk bandwidth.
The hard ceiling is max_concurrent_queries. The code default is 0 (unlimited), and the shipped configuration may override this. Production deployments often set it to 100. Once the running query count hits that limit, new queries are rejected or queued. The most common triggers are client retry amplification, a dashboard “refresh all” burst, or a runaway batch job opening many parallel connections.
flowchart TD
A[Dashboard refresh-all or batch job] --> B[Concurrent query count spikes]
B --> C[Slots fill toward max_concurrent_queries]
C --> D[ClickHouse degrades non-linearly]
D --> E[Latency rises sharply]
E --> F[Client retries amplify load]
F --> C
What this means
When concurrent queries reach max_concurrent_queries, ClickHouse stops accepting new work. Depending on configuration and client protocol, new queries may queue briefly or fail immediately with the “Too many simultaneous queries” error.
Each query can consume multiple threads, large memory buffers, and significant disk I/O. Unlike OLTP databases that handle hundreds of lightweight transactions, ClickHouse degrades non-linearly as concurrency increases. A query that uses one slot and 2 GB at low load may hold that slot ten times longer under contention, turning a temporary spike into a sustained pile-up.
Query storms create feedback loops. A client receives an error, retries immediately, and now even more queries compete for the same slots. Background merges and replication fetches share the same CPU, memory, and I/O pools, so query saturation starves the storage engine and worsens the spiral.
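The pile-up described above can be made concrete with Little's law (average concurrency = arrival rate times average time in the system). The sketch below is only an illustration: the arrival rate of 20 queries per second, the latencies, and the limit of 100 are assumed numbers, not measurements from any real server.

```python
def slots_in_use(arrival_rate_qps: float, avg_latency_s: float) -> float:
    """Little's law: average concurrent queries = arrival rate * average latency."""
    return arrival_rate_qps * avg_latency_s


LIMIT = 100  # assumed max_concurrent_queries for this example

# Same arrival rate throughout; only latency grows as contention increases.
for latency_s in (2.0, 4.0, 6.0):
    needed = slots_in_use(20.0, latency_s)
    status = "ok" if needed < LIMIT else "over the limit"
    print(f"20 qps x {latency_s:.0f} s = {needed:.0f} slots ({status})")
```

The traffic never changed in this example. Tripling the latency alone was enough to push demand past the limit, which is why slow queries and rejections tend to appear together.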
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Client retry amplification | Identical queries resubmitted rapidly from the same user after rejections | system.processes for repeating query patterns and user |
| Dashboard refresh-all | Sudden burst of SELECTs from BI tools or monitoring dashboards | system.processes filtered by client_hostname and user |
| Runaway batch job | ETL or analytics job opens many parallel connections | system.processes for long-running queries from batch service accounts |
| Connection pool overshoot | Client pool size exceeds max_concurrent_queries, causing systematic collisions | Client-side pool configuration versus the server limit |
| Latency pile-up | Slow queries hold slots longer, reducing effective throughput and causing backpressure | system.query_log for rising query_duration_ms at constant concurrency |
Quick checks
Run these read-only checks to assess the current concurrency state.
-- Check running and preempted queries against the limit
SELECT metric, value
FROM system.metrics
WHERE metric IN ('Query', 'QueryPreempted');

-- Inspect live queries to find the heaviest consumers
SELECT
    query_id,
    user,
    client_hostname,
    elapsed,
    formatReadableSize(memory_usage) AS mem,
    substring(query, 1, 200) AS query_prefix
FROM system.processes
ORDER BY elapsed DESC
LIMIT 20;

-- Check cumulative failed query counters
SELECT event, value
FROM system.events
WHERE event IN ('FailedQuery', 'FailedSelectQuery', 'FailedInsertQuery');

-- Measure recent tail latency from finished queries
SELECT
    quantile(0.99)(query_duration_ms / 1000) AS p99_sec,
    count() AS query_count
FROM system.query_log
WHERE type = 'QueryFinish'
    AND is_initial_query = 1
    AND event_time > now() - INTERVAL 10 MINUTE;

-- Check server-level memory pressure
SELECT metric, value, formatReadableSize(value) AS readable
FROM system.metrics
WHERE metric = 'MemoryTracking';

-- Count active query execution threads
SELECT value FROM system.metrics WHERE metric = 'QueryThread';

# Check OS-level memory to catch untracked RSS pressure
pid=$(pidof clickhouse-server) && grep -E 'VmRSS|VmSize' /proc/$pid/status
How to diagnose it
1. Confirm you are at the ceiling. Compare the Query metric from system.metrics against max_concurrent_queries. If they are close, you are at the hard limit.
2. Identify who is consuming slots. Query system.processes ordered by elapsed or memory_usage. Look for clusters from the same user, client_hostname, or with similar query prefixes. This reveals whether the load is legitimate traffic or a misbehaving client.
3. Check for immediate query failures. Look at system.events for FailedQuery. If the counter is increasing while concurrency is pinned at the limit, the server is actively rejecting work.
4. Correlate with resource saturation. Check MemoryTracking in system.metrics. If it is approaching max_server_memory_usage, queries start failing on memory limits or slowing down, which increases slot hold time and worsens the storm. Check QueryThread to see if execution threads are saturated.
5. Determine if retry amplification is occurring. In system.query_log, look for bursts of identical queries with type = 'ExceptionWhileProcessing' or type = 'ExceptionBeforeStart' (queries rejected at admission never start, so they are typically logged under the latter) followed by rapid re-execution from the same client host. This pattern confirms a retry loop.
6. Review latency trends. Query system.query_log for P99 query_duration_ms over the last hour. If latency is rising faster than the concurrency increase, ClickHouse has entered the non-linear degradation zone.
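The retry-amplification check in the steps above can be sketched as a small sliding-window scan over rows exported from system.query_log. This is a minimal illustration, not a definitive detector: the row layout (dicts keyed by client_hostname, normalized_query_hash, type, event_time), the 10-second window, and the threshold of 5 failures are all assumptions, and treating ExceptionBeforeStart as a rejection marker is an assumption about how admission failures are logged.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed set of query_log types that indicate a failed attempt.
FAILED_TYPES = {"ExceptionBeforeStart", "ExceptionWhileProcessing"}


def find_retry_bursts(rows, window=timedelta(seconds=10), min_failures=5):
    """Return (client_hostname, query_hash) pairs that have at least
    min_failures failed attempts inside some window of the given length."""
    failures = defaultdict(list)
    for row in rows:
        if row["type"] in FAILED_TYPES:
            key = (row["client_hostname"], row["normalized_query_hash"])
            failures[key].append(row["event_time"])

    bursts = []
    for key, times in failures.items():
        times.sort()
        start = 0
        for end, t in enumerate(times):
            # Advance the left edge until the window spans at most `window`.
            while t - times[start] > window:
                start += 1
            if end - start + 1 >= min_failures:
                bursts.append(key)
                break
    return bursts


# Hypothetical sample: one client failing the same query once per second.
t0 = datetime(2024, 1, 1, 12, 0, 0)
rows = [
    {"client_hostname": "bi-01", "normalized_query_hash": 111,
     "type": "ExceptionBeforeStart", "event_time": t0 + timedelta(seconds=i)}
    for i in range(6)
] + [
    {"client_hostname": "etl-01", "normalized_query_hash": 222,
     "type": "QueryFinish", "event_time": t0},
]
print(find_retry_bursts(rows))  # [('bi-01', 111)]
```

Grouping by client host and query hash keeps legitimate traffic (many clients, many different queries) from being mistaken for a loop.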
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Query count from system.metrics | Active concurrency versus the hard ceiling | Sustained value near max_concurrent_queries |
| QueryPreempted from system.metrics | Queries waiting for resources | Non-zero value indicates queuing pressure |
| FailedQuery rate from system.events | Direct measure of rejections | Sudden spike correlating with concurrency peaks |
| Query latency P99 from system.query_log | Non-linear degradation under load | P99 rising faster than the concurrency increase |
| MemoryTracking from system.metrics | Memory pressure from concurrent heavy queries | Value approaching max_server_memory_usage |
| QueryThread from system.metrics | Active execution threads | Sustained high count indicating CPU saturation |
| Insert latency from system.query_log | Write pipeline pressure from concurrency contention | Rising insert times before hard rejections appear |
Fixes
Immediate relief
Kill the heaviest running queries to free slots. Use query_id from system.processes.
-- WARNING: Killing queries is disruptive to the target workload.
KILL QUERY WHERE query_id = '...';
If a specific batch job or dashboard user is responsible, pause or disable that client before the retry loop restarts.
Client retry amplification
Add exponential backoff to client retry logic. Immediate retries against a saturated server compound the problem. If you control the client, reduce its connection pool size and keep it well below max_concurrent_queries.
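A minimal sketch of exponential backoff with full jitter, assuming your driver surfaces the rejection as an exception you can catch. TooManyQueries is a hypothetical stand-in for that driver-specific error, and the base delay, cap, and attempt count are illustrative defaults rather than recommended values.

```python
import random
import time


class TooManyQueries(Exception):
    """Hypothetical stand-in for the driver error raised on
    'Too many simultaneous queries'."""


def backoff_delays(max_retries=5, base_s=0.5, cap_s=30.0, rng=random):
    """Yield 'full jitter' delays: uniform in [0, min(cap, base * 2**attempt)]."""
    for attempt in range(max_retries):
        yield rng.uniform(0.0, min(cap_s, base_s * 2 ** attempt))


def run_with_backoff(run_query, max_retries=5, base_s=0.5, cap_s=30.0,
                     sleep=time.sleep, rng=random):
    """Call run_query(); on TooManyQueries wait a growing, jittered delay
    and try again. After max_retries retries the last error is re-raised."""
    delays = backoff_delays(max_retries, base_s, cap_s, rng)
    while True:
        try:
            return run_query()
        except TooManyQueries:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: surface the error to the caller
            sleep(delay)
```

The jitter matters as much as the growth: if every client backs off by the same fixed schedule, they all return at the same instant and recreate the burst.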
Dashboard and BI query bursts
Stagger refresh intervals across panels. Route dashboard reads to pre-aggregated tables where possible, or increase client-side cache TTLs to avoid hammering the same expensive queries simultaneously.
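Staggering can be as simple as giving each panel a fixed offset inside the refresh interval. The sketch below assumes you can schedule panels individually; the panel names and the 60-second interval are hypothetical, and real BI tools may expose this differently or not at all.

```python
def staggered_offsets(panel_ids, refresh_interval_s):
    """Spread panel refreshes evenly across one refresh interval.
    Returns {panel_id: seconds after the start of each interval}."""
    if not panel_ids:
        return {}
    step = refresh_interval_s / len(panel_ids)
    return {panel: round(i * step, 3) for i, panel in enumerate(panel_ids)}


panels = ["cpu", "errors", "revenue", "latency"]  # hypothetical panel names
print(staggered_offsets(panels, 60))
# {'cpu': 0.0, 'errors': 15.0, 'revenue': 30.0, 'latency': 45.0}
```

With four panels on a 60-second interval, the server sees one query every 15 seconds instead of four at once.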
Raise the limit (with caution)
If the ceiling is genuinely too low, increase max_concurrent_queries. Only do this if CPU, memory, and I/O metrics show headroom. Raising the limit without headroom pushes ClickHouse deeper into non-linear degradation, turning a query storm into an OOM kill or memory pressure death spiral.
Reduce per-query resource consumption
Set max_memory_usage per user or profile to prevent individual queries from monopolizing memory and holding slots indefinitely. For large aggregations, enable spill-to-disk with max_bytes_before_external_group_by and max_bytes_before_external_sort; this trades memory for I/O and can prevent slot hoarding.
Prevention
- Size connection pools below the ceiling. Ensure all client connection pools sum to less than max_concurrent_queries.
- Monitor P99 latency as an early warning. Rising latency precedes hard rejections by minutes to hours.
- Set per-query timeouts and memory limits. Prevent a single heavy query from occupying a slot indefinitely.
- Alert on sustained high Query count. A threshold at 80% of max_concurrent_queries gives you time to react before failure.
- Review batch job scheduling. Separate large ETL windows from peak BI dashboard hours.
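The pool-sizing and alert-threshold rules from the list above reduce to simple arithmetic, sketched here. The pool names and sizes and the limit of 100 are hypothetical inputs. The 80% figure comes from the guidance above.

```python
def check_pool_budget(pool_sizes, max_concurrent_queries, alert_percent=80):
    """Compare the sum of all client pool sizes with the server limit and
    derive the Query-count value at which an alert should fire."""
    total = sum(pool_sizes.values())
    return {
        "total_pool_size": total,
        "fits_under_limit": total < max_concurrent_queries,
        "alert_threshold": max_concurrent_queries * alert_percent // 100,
    }


pools = {"dashboards": 40, "etl": 50, "adhoc": 20}  # hypothetical client pools
print(check_pool_budget(pools, max_concurrent_queries=100))
# {'total_pool_size': 110, 'fits_under_limit': False, 'alert_threshold': 80}
```

Here the pools add up to 110 connections against a limit of 100, so a simultaneous burst from all three clients can hit the ceiling even if each pool looks reasonable on its own.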
How Netdata helps
- Correlates Query count with host CPU, memory, and disk latency to confirm whether concurrency is the bottleneck.
- Alerts on sustained query counts approaching max_concurrent_queries before clients see rejections.
- Tracks FailedQuery rate spikes to detect the onset of query storms.
- Visualizes query latency P99 degradation, revealing non-linear saturation before the hard limit is reached.
- Surfaces heavy queries by user and host during storms using system.processes dimensions.
Related guides
- ClickHouse active part count growing: reading MaxPartCountForPartition before it pages
- ClickHouse ALTER UPDATE/DELETE overuse: why mutations are not row updates
- ClickHouse async inserts: when async_insert fixes too-many-parts and when it hides it
- ClickHouse mark cache and uncompressed cache: reading low hit rates
- ClickHouse DelayedInserts climbing: the warning before too-many-parts
- ClickHouse detached parts piling up: reading system.detached_parts and reclaiming space
- ClickHouse disk space collapse: why merges need free space and how the spiral starts
- ClickHouse disk space monitoring: free_space, unreserved_space, and the 80% target
- ClickHouse distributed DDL stuck: ON CLUSTER queries that never finish
- ClickHouse distributed query amplification: one coordinator, many shard subqueries
- ClickHouse full table scan: partition pruning failures and the primary key
- ClickHouse insert latency rising: the leading indicator of write-pipeline trouble







