How Kafka actually works in production: a mental model for operators

What it is and why it matters

Each broker is a concurrent system of subsystems sharing disk, network, page cache, and threads. Tail latency spikes and offline partitions usually stem from interactions between these subsystems, not single component failures.

Underneath the distributed commit log abstraction, each broker runs a reactor-pattern server with three saturation layers: network threads, a bounded request queue, and I/O handler threads. Between I/O threads and the log sits a delayed-operation timer wheel called the purgatory. A separate controller maintains cluster metadata in a single-threaded event queue. Consumer groups are managed by a Group Coordinator that hashes group IDs to the __consumer_offsets topic. Log data lives in segment files that rely on OS page cache and zero-copy sendfile. These subsystems compete for the same disk I/O, network bandwidth, file descriptors, and cache. Treating them as independent pipelines leads to unnecessary restarts and missed bottlenecks.

How it works

flowchart LR
    Producer -- produce --> NetworkThreads
    Consumer -- fetch --> NetworkThreads
    NetworkThreads -- enqueue --> RequestQueue
    RequestQueue -- dequeue --> IOThreads
    IOThreads -- append --> LogSegments
    IOThreads -- delay --> Purgatory
    Purgatory -- complete --> IOThreads
    LogSegments -- read --> PageCache
    PageCache -- sendfile --> NetworkThreads
    Controller -- metadata --> EventQueue

The broker request path follows a reactor pattern. A small pool of network threads (num.network.threads, default 3) accepts connections, reads requests, and writes responses. Requests land in a bounded request queue (queued.max.requests, default 500). I/O handler threads (num.io.threads, default 8) pull from this queue and do the work: appending to logs, reading segments, or updating metadata. Kafka can saturate at any of these three layers independently.

For a produce request, the I/O thread appends the batch to the active log segment. The write goes to the OS page cache, not necessarily to disk immediately. Acknowledgment depends on the producer’s acks setting. With acks=all, the request enters purgatory, a timer wheel that holds it until all ISR members acknowledge. If followers are slow, produce requests accumulate in purgatory and producer latency rises as RemoteTimeMs.

For a fetch request, consumers and follower replicas use the same fetch path. The broker resolves the offset, reads from the log segment, and returns batches. When the segment is hot in page cache, the broker serves it via zero-copy sendfile. If the data is not in cache, the read hits disk and LocalTimeMs spikes. Consumer fetches and replica fetches compete for the same I/O and network resources. Fetch requests also enter purgatory when fetch.min.bytes is not satisfied, which adds wait time to LocalTimeMs.

Replication is follower-driven. Each partition has one leader and N-1 followers determined by replication.factor. Followers run fetcher threads that pull data from the leader. Each follower runs a ReplicaFetcherThread per leader broker, so follower load scales with the number of leader brokers it replicates from, not just partition count. The leader maintains the In-Sync Replica set (ISR). Followers remain in the ISR if they keep up within replica.lag.time.max.ms (default 30 seconds). When a follower falls behind, it is removed from ISR. If enough followers leave, min.insync.replicas may not be satisfied, causing acks=all producers to receive NotEnoughReplicasException.

The controller handles partition leadership assignment, ISR changes, topic creation, and broker lifecycle. It processes events sequentially from a queue. Queue depth is a critical health indicator. In large clusters, a broker failure generates thousands of leadership events; if the controller queue backs up, metadata propagation stalls and clients see NOT_LEADER_FOR_PARTITION. Only one broker is the controller at a time; on failure, another broker takes over after an epoch negotiation via ZooKeeper or KRaft.

The Group Coordinator manages consumer groups. One broker acts as coordinator for each group, determined by hashing the group ID to the __consumer_offsets partition leader. It handles joins, heartbeats, rebalances, and offset commits. Rebalances pause consumption and are a major source of client-visible disruption.

Log segments organize each partition as a directory of segment files (default 1 GiB). The active segment receives writes. Older segments are eligible for retention or compaction. Each segment has an offset index and a time index; if these are evicted from page cache, offset lookups hit disk and tail latency spikes even for cached data. A periodic task checks for expiration every log.retention.check.interval.ms (default 5 minutes). For compacted topics, log cleaner threads run in the background. If cleaner threads cannot keep up, disk usage grows silently.

Where it shows up in production

The most common degradation is ISR shrink. A follower’s disk slows or a network blip occurs. The leader removes it from ISR. If a second broker hiccups before the first recovers, min.insync.replicas cannot be met and writes are rejected. This is the cascade pattern: one sick broker creates a durability window that turns a minor second event into an outage.

Page cache eviction causes a latency cliff that looks like disk failure. A bulk consumer reading historical data evicts hot segments from the OS page cache. Tail consumers now hit disk. FetchConsumer LocalTimeMs jumps from sub-millisecond to tens of milliseconds. The disk itself is healthy; the cache is cold. A similar symptom appears when a local disk actually slows: I/O threads stall on await for one log directory, but the delay propagates to unrelated partitions sharing those threads. Operators who chase disk replacements here waste days.

Request queue saturation creates a positive feedback loop. Network threads accept requests faster than I/O threads process them. The queue fills. Network threads block. Clients time out and retry, adding more load. RequestHandlerAvgIdlePercent drops below 0.3, then 0.1. The broker is now in active overload even though the underlying disk may be idle. Adding more I/O threads does not relieve disk saturation; it increases context switching while await remains high.

Consumer rebalance storms look like broker problems but are client-side. A slow consumer misses max.poll.interval.ms. It gets evicted. The group rebalances. Consumption pauses. Other consumers now have more work, miss their deadlines, and also get evicted. The group oscillates between PreparingRebalance and Stable without settling. Lag grows while broker metrics look normal.

Controller queue backup during large-scale failures is especially dangerous. If multiple brokers fail, the controller must elect new leaders for thousands of partitions. The event queue grows continuously. Partitions are offline, but OfflinePartitionsCount may under-report because the controller has not yet processed the events. Restarting more brokers adds more events and deepens the crisis.

Tradeoffs and common misuses

Zero-copy sendfile only works when consumer and broker compression codecs match and no message format down-conversion occurs. If an old consumer joins, the broker decompresses and recompresses on the JVM heap, burning CPU and memory. Expose this with MessageConversionsPerSec.

Page cache reliance means Kafka is stateful in practice. After a broker restart, the page cache is empty. Expect elevated LocalTimeMs and disk read I/O for 10-60 minutes until the working set warms. Treating the broker as stateless and restarting it aggressively prolongs the pain. Kafka does not fsync every write by default; it relies on replication to multiple page caches for durability, not disk persistence.

The ISR contract trades availability for durability. With unclean.leader.election.enable=false (default since 0.11.0.0), a partition with no ISR members goes offline. With enable=true, a behind follower becomes leader and acknowledged data is silently lost. There is no free lunch; monitor UncleanLeaderElectionsPerSec even if it is always zero.

Purgatory size is meaningless if all producers use acks=1 or acks=0. Teams monitor produce purgatory without checking producer configurations, then waste time investigating replication lag that does not affect their workload.

Signals to watch in production

SignalWhy it mattersWarning sign
RequestHandlerAvgIdlePercentBest single indicator of broker processing capacitySustained below 0.3; active overload below 0.1
RequestQueueSizePressure gauge between network and I/O threadsConsistently above 50% of queued.max.requests (default 500)
UnderReplicatedPartitionsDurability guarantee is degraded; window for data loss is openNonzero for more than 5 minutes outside maintenance
IsrShrinksPerSecVelocity of replicas falling out of syncSustained above zero outside maintenance windows
ControllerEventQueueSizeMetadata plane backlog from sequential controller processingGrowing continuously without draining
FetchConsumer LocalTimeMsPage cache effectiveness for consumer readsSpikes from near-zero to disk latency values
PurgatorySize (Produce)acks=all requests stalled waiting for replicationGreater than 2x baseline sustained for more than 5 minutes
NetworkProcessorAvgIdlePercentNetwork thread saturation affecting all client connectivitySustained below 0.3

How Netdata helps

Netdata correlates Kafka JMX metrics with OS-level metrics such as disk await and pgmajfault on the same charts to distinguish disk failure from page cache eviction.

It surfaces the produce request latency breakdown (Queue, Local, Remote, Response) to identify whether the bottleneck is disk, replication, or thread saturation without manual JMX queries.

It tracks consumer lag and group state transitions alongside broker BytesOutPerSec to determine if lag is caused by a slow consumer or a broker read-path bottleneck.

It monitors the controller event queue and ISR shrink rates together to warn of metadata-plane pressure before partitions go offline.

It alerts on combinations such as UnderReplicatedPartitions rising while RequestHandlerAvgIdlePercent drops, reducing false positives from single-metric noise.