Kafka RecordTooLargeException / MESSAGE_TOO_LARGE: message size limits across the path
Two errors surface: the producer client throws RecordTooLargeException before the request hits the wire, or the broker returns MESSAGE_TOO_LARGE in the produce response. Both mean a record batch exceeds a limit somewhere in the path, but the fix depends on exactly which limit and where the size is measured. These settings are not a single knob: max.request.size, message.max.bytes, max.message.bytes, replica.fetch.max.bytes, fetch.max.bytes, max.partition.fetch.bytes, and socket.request.max.bytes must all align. Raise one limit without raising the others and you can wedge replication, strand consumers, or, on affected versions, silently lose data on the next large message.
What this means
Kafka enforces size limits at multiple points between producer and consumer. The producer’s max.request.size caps the full produce request, including the compressed batch. The broker’s message.max.bytes is a global ceiling on the compressed record batch. A topic-level max.message.bytes overrides the broker default per topic, so the effective ceiling is the topic value if set; otherwise it falls back to the broker value. The broker also enforces socket.request.max.bytes as a hard socket-level ceiling. If a request exceeds that, the connection is closed instead of returning a graceful error.
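As a minimal sketch of the fallback rule described above, the shell helper below returns the topic override when one is set and the broker value otherwise. The function name `effective_limit` and the sample values are illustrative assumptions, not part of Kafka or its tooling.

```shell
# Hypothetical helper, not a Kafka tool.
# $1 = broker message.max.bytes
# $2 = topic max.message.bytes (empty string if the topic has no override)
effective_limit() {
  local broker_limit="$1"
  local topic_limit="${2:-}"
  if [ -n "$topic_limit" ]; then
    echo "$topic_limit"
  else
    echo "$broker_limit"
  fi
}

# Illustrative values only.
effective_limit 1048588 ""        # no override: the broker value applies
effective_limit 1048588 5242880   # override set: the topic value applies
```

In practice the two inputs would come from the broker configuration and from the topic description, both of which are covered in the quick checks.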
On the read side, consumers use fetch.max.bytes and max.partition.fetch.bytes to limit how much data they receive in one fetch. Follower replicas use replica.fetch.max.bytes. Kafka returns the first record batch even when it exceeds a fetch limit, but a limit that is too small degrades each fetch to a single oversized batch, which can show up as under-replication or consumer lag.
Compression applies to the whole batch, not individual records. A compressed batch can be much smaller than the sum of its records, but the producer, broker, and consumer see different sizes depending on where they measure. A batch that would fit when compressed can still exceed a limit if the producer sends it uncompressed, or if linger.ms and batch.size combine to accumulate an oversized batch.
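To get a rough feel for how much a payload shrinks, you can compress a sample with the `gzip` command-line tool. This is only a back-of-the-envelope estimate under stated assumptions: the producer compresses whole record batches with its own codec and settings, so real on-the-wire sizes will differ. The function names are hypothetical.

```shell
# Hypothetical helpers for a rough size estimate only.
# Each takes a file path and prints a byte count.
raw_size() { wc -c < "$1" | tr -d ' '; }
compressed_size() { gzip -c "$1" | wc -c | tr -d ' '; }

# Self-contained demo on a highly compressible sample file.
sample="$(mktemp)"
head -c 100000 /dev/zero > "$sample"
echo "raw=$(raw_size "$sample") gzip=$(compressed_size "$sample")"
rm -f "$sample"
```

Run it against a file that holds a representative set of real records rather than the synthetic sample, since compressibility depends heavily on the payload.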
flowchart LR
A[Producer batch<br/>max.request.size] -->|produce request| B[Broker socket<br/>socket.request.max.bytes]
B -->|validate batch| C[Broker/topic record batch limit<br/>message.max.bytes / max.message.bytes]
C -->|accept to| D[Leader log]
D -->|replicate| E[Replica fetch<br/>replica.fetch.max.bytes]
D -->|serve| F[Consumer fetch<br/>fetch.max.bytes /<br/>max.partition.fetch.bytes]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Producer batch exceeds broker or topic limit | MESSAGE_TOO_LARGE in broker logs; Java producers on affected versions can enter a split-retry loop (KAFKA-8350) | Compare producer max.request.size against broker message.max.bytes and topic max.message.bytes |
| Topic max.message.bytes is lower than both the broker limit and the producer limit | MESSAGE_TOO_LARGE only on one topic | kafka-configs.sh --describe --entity-type topics --entity-name <topic> |
| replica.fetch.max.bytes is smaller than the effective message limit | Under-replicated partitions and ISR shrinks after large messages start arriving | grep replica.fetch.max.bytes /path/to/server.properties on every broker |
| Consumer fetch limit is smaller than the message size | Consumer stalls or lag grows after a producer change; no producer errors | Consumer config for fetch.max.bytes and max.partition.fetch.bytes |
| Compression is disabled or a single batch is aggregated too large | Batch size jumps after a producer config change or upgrade to Kafka 4.0 | Producer compression.type, batch.size, and linger.ms |
| Managed Kafka hard ceiling | Errors persist even after all cluster configs are raised | Cloud provider limit for the instance tier |
| Client library or connector with a hardcoded limit | RecordTooLargeException from a client that ignores broker config | Client source or connector docs, e.g. Trino Kafka Event Listener, MirrorMaker 2, OpenTelemetry exporter |
Quick checks
Run these read-only checks before changing any config.
# Find the exact error in broker logs
grep -E "MESSAGE_TOO_LARGE|RecordTooLargeException" /var/log/kafka/server.log
# Inspect the topic-level message size limit
kafka-configs.sh --bootstrap-server localhost:9092 --describe \
--entity-type topics --entity-name <topic>
# Inspect broker-level limits in server.properties
grep -E "message.max.bytes|replica.fetch.max.bytes|socket.request.max.bytes" \
/etc/kafka/server.properties
# List under-replicated partitions to spot replica fetch issues
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions
# Check consumer lag if reads are stalling
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group-id>
If you have JMX access, pull the produce error rate directly:
# Failed produce requests per second
echo "get -b kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec OneMinuteRate" \
| java -jar jmxterm.jar -l localhost:9999
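The grep on server.properties above matches loosely: the dots act as regex wildcards and commented-out lines also match. As a sketch, a small helper can read one exact key instead. The name `get_prop` is hypothetical, and the parsing assumes simple `key=value` lines with no line continuations.

```shell
# Hypothetical helper: print the value of an exact key from a
# Java-style properties file. Commented lines are skipped and the
# last occurrence wins. Assumes plain key=value lines.
# $1 = file path, $2 = key
get_prop() {
  awk -v key="$2" '
    /^[[:space:]]*#/ { next }
    {
      pos = index($0, "=")
      if (pos == 0) next
      k = substr($0, 1, pos - 1)
      v = substr($0, pos + 1)
      gsub(/^[[:space:]]+|[[:space:]]+$/, "", k)
      gsub(/^[[:space:]]+|[[:space:]]+$/, "", v)
      if (k == key) val = v
    }
    END { if (val != "") print val }
  ' "$1"
}

# Demo against a throwaway file with illustrative values.
demo="$(mktemp)"
printf '%s\n' '#message.max.bytes=1' 'message.max.bytes = 8388608' > "$demo"
get_prop "$demo" message.max.bytes
rm -f "$demo"
```

A key that is absent prints nothing, which usually means the broker is running with the built-in default for that setting.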
How to diagnose it
- Identify which side rejected the batch. RecordTooLargeException in the producer client means the producer refused to send it. MESSAGE_TOO_LARGE in broker logs or the producer response means the broker refused it. MirrorMaker 2 and some connectors can shift the error from source to target.
- Read the actual batch size. Producer logs often include the size of the record batch that failed. If not, derive the approximate uncompressed size from BytesInPerSec / MessagesInPerSec and factor in compression.
- Find the effective broker/topic ceiling. The effective limit is the topic max.message.bytes if configured; otherwise it falls back to the broker message.max.bytes. Check both. Do not assume the global default is the active limit.
- Check the socket-level ceiling. If the whole produce request exceeds socket.request.max.bytes, the connection is closed. This looks like a network error, not a clear MESSAGE_TOO_LARGE. The default is 100 MiB.
- Verify replication can pull the batch. On every broker, replica.fetch.max.bytes must be at least as large as the effective message limit. If it is smaller, followers cannot replicate large messages and the ISR shrinks.
- Verify consumers can fetch the batch. Consumers need fetch.max.bytes and max.partition.fetch.bytes large enough for the batch. Forward-progress logic means the first batch is still returned, but a too-small limit causes inefficient fetches and lag.
- Review producer batching and compression. Check compression.type, batch.size, and linger.ms. If batch.size exceeds the broker’s effective message.max.bytes, the producer can enter an infinite split-and-retry loop on versions affected by KAFKA-8350.
- Check for silent data loss if you are on an affected version. KAFKA-19479 caused Kafka Streams with processing.guarantee=at_least_once to commit offsets while dropping messages that hit MESSAGE_TOO_LARGE. If you see committed offsets advancing without corresponding downstream output, patch the broker and client.
- If you use managed Kafka, confirm the provider ceiling. Some tiers have non-configurable limits. Raising message.max.bytes inside the cluster will not override the provider’s hard cap.
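The comparisons in the steps above can be sketched as a single check once you have collected the values. The helper name `check_chain` and every number below are illustrative assumptions. It only compares integers you pass in and does not query a cluster.

```shell
# Hypothetical helper: compare downstream limits against the effective
# message limit. $1 = effective limit in bytes; remaining arguments are
# name=value pairs. Prints one WARN line per limit that is too small and
# returns non-zero if any were found.
check_chain() {
  local limit="$1" status=0 pair name value
  shift
  for pair in "$@"; do
    name="${pair%%=*}"
    value="${pair#*=}"
    if [ "$value" -lt "$limit" ]; then
      echo "WARN $name=$value is below the effective message limit ($limit)"
      status=1
    fi
  done
  return "$status"
}

# Illustrative values: an 8 MiB effective limit with a 1 MiB replica fetch size.
check_chain 8388608 \
  replica.fetch.max.bytes=1048576 \
  max.partition.fetch.bytes=8388608 \
  fetch.max.bytes=52428800 \
  socket.request.max.bytes=104857600 || true
```

Passing the limits as name=value pairs keeps the check generic, so a provider ceiling or a connector-specific cap can be added to the same call.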
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| FailedProduceRequestsPerSec | Direct measure of producer-visible errors including MESSAGE_TOO_LARGE | Sustained OneMinuteRate above 0 |
| BytesInPerSec / MessagesInPerSec | Gives average message/batch size; sudden jump indicates a producer behavior change | Ratio increases sharply after a deployment |
| UnderReplicatedPartitions | Shows whether large messages are replicating; rises when replica.fetch.max.bytes is too small | Nonzero outside maintenance windows |
| IsrShrinksPerSec | Velocity of replicas falling out of sync | Sustained shrinks after large-message traffic starts |
| FetchFollower TotalTimeMs | Replication path latency; high values mean followers are being served slowly | p99 approaching replica.lag.time.max.ms |
| Consumer lag | Confirms consumers can read the large batches; lag grows if fetch limits are too tight or processing slows | Lag increasing monotonically |
| RequestQueueTimeMs / LocalTimeMs | Distinguishes broker overload from disk latency if large messages saturate I/O threads | p99 spikes in either component |
| Page cache pressure (pgmajfault rate) | Large messages can evict hot records and shift reads to disk | Major fault rate doubles from baseline |
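The BytesInPerSec / MessagesInPerSec ratio in the table is a simple division. The sketch below takes two rate samples, read however you collect JMX values, and prints the average bytes per message. The function name is hypothetical and the sample numbers are made up.

```shell
# Hypothetical helper: average bytes per message from two rate samples.
# $1 = BytesInPerSec, $2 = MessagesInPerSec (integers or decimals).
# Prints "n/a" when there is no message traffic.
avg_bytes_per_message() {
  awk -v bytes="$1" -v msgs="$2" 'BEGIN {
    if (msgs > 0) printf "%.0f\n", bytes / msgs
    else print "n/a"
  }'
}

# Made-up samples taken before and after a deployment.
echo "before: $(avg_bytes_per_message 1000000 500) bytes/message"
echo "after:  $(avg_bytes_per_message 6000000 500) bytes/message"
```

The result is an average across the sampled traffic, so a single oversized record can hide behind many small ones. Treat a sharp rise as a prompt to inspect producer logs, not as proof of a limit breach.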
Fixes
If the broker or topic limit is too low
Raise the topic limit first. This avoids a global change and a rolling restart:
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
--entity-type topics --entity-name <topic> \
--add-config max.message.bytes=<new-limit>
If many topics need the same limit, raise message.max.bytes in server.properties on every broker and perform a rolling restart. Set replica.fetch.max.bytes to at least the effective message limit on every broker. If you raise message.max.bytes to 8 MiB, set replica.fetch.max.bytes to 8 MiB or higher everywhere.
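These settings take raw byte counts, so it helps to convert once and reuse the number. The sketch below turns the 8 MiB figure from the example above into bytes and prints matching property lines. The helper name `mib_to_bytes` is hypothetical, and the output is meant for review, not to be applied blindly.

```shell
# Hypothetical helper: convert MiB to bytes for Kafka size settings.
mib_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

# Print candidate server.properties lines for an 8 MiB limit,
# keeping replica.fetch.max.bytes at least as large as the message limit.
limit_bytes="$(mib_to_bytes 8)"
echo "message.max.bytes=$limit_bytes"
echo "replica.fetch.max.bytes=$limit_bytes"
```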
If the producer is creating oversized batches
Increase max.request.size on the producer to match the broker limit, but also tune batch.size and linger.ms so batches stay under the limit the broker will accept. Splitting an oversized batch is done by the producer, not the broker, so keep batch.size below the broker’s effective message.max.bytes to avoid the KAFKA-8350 retry loop on affected versions. Enable compression unless your workload is already CPU-bound; compression.type=lz4 or zstd can reduce batch size substantially for compressible payloads.
If consumers cannot fetch
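Increase fetch.max.bytes and max.partition.fetch.bytes on the consumer. You do not need to match the producer limit exactly, but the per-partition limit should be at least as large as the largest batch you expect. A smaller value still makes progress, because the first batch is always returned, but each fetch then carries only that one batch.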
If the managed service caps you
You cannot override the provider’s ceiling. Options: split large payloads across multiple records, store the payload in object storage and pass a reference through Kafka, or move to a tier with a higher limit.
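As a minimal sketch of the first option, the standard `split` utility can cut a payload file into fixed-size chunks that each fit under the cap. The function name, chunk size, and file names are illustrative assumptions. Producing the chunks as records, preserving their order, and reassembling them on the consumer are left out.

```shell
# Hypothetical helper: split a payload into chunks of at most
# $2 bytes under directory $3, then print how many chunks were written.
# $1 = payload file, $2 = chunk size in bytes, $3 = output directory
split_payload() {
  local file="$1" chunk_bytes="$2" out_dir="$3"
  mkdir -p "$out_dir"
  split -b "$chunk_bytes" "$file" "$out_dir/chunk_"
  find "$out_dir" -type f -name 'chunk_*' | wc -l | tr -d ' '
}

# Demo: a 2.5 MB synthetic payload split into chunks of up to 1 MB.
work="$(mktemp -d)"
head -c 2500000 /dev/zero > "$work/payload.bin"
echo "chunks: $(split_payload "$work/payload.bin" 1000000 "$work/chunks")"
rm -rf "$work"
```

Leave headroom below the provider limit when choosing the chunk size, since record keys, headers, and batch overhead also count toward the measured size.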
If a client library hardcodes its own limit
Some clients and connectors ignore the broker config. For MirrorMaker 2, use producer.override.max.request.size in the connector config. For other clients, check the client source or docs for an explicit cap and patch or configure accordingly.
Prevention
- Document the end-to-end size limit chain for every cluster: producer max.request.size, broker message.max.bytes, topic max.message.bytes, broker replica.fetch.max.bytes, consumer fetch.max.bytes / max.partition.fetch.bytes, and broker socket.request.max.bytes.
- After raising any message limit, verify replica.fetch.max.bytes is at least as large on every broker. This is the most commonly forgotten step.
- Alert on FailedProduceRequestsPerSec and UnderReplicatedPartitions so size-related regressions do not go unnoticed.
- Track BytesInPerSec / MessagesInPerSec to detect producer batching or message-size changes early.
- Test large-message workloads in staging with the same compression and linger.ms settings used in production.
- Keep brokers patched to avoid KAFKA-8350 and KAFKA-19479 if your versions are affected.
- Do not enable unclean.leader.election.enable to work around replication issues caused by large messages. That trades a size config problem for confirmed data loss.
How Netdata helps
Netdata surfaces the broker JMX metrics that expose message-size failures and their side effects:
- FailedProduceRequestsPerSec shows when producer-visible errors start.
- Overlay UnderReplicatedPartitions and IsrShrinksPerSec to detect replica-fetch limits that are too small.
- Correlate FetchFollower latency with RequestQueueTimeMs and disk await to determine whether large messages are saturating I/O threads or evicting page cache.
- Track consumer lag and per-topic BytesInPerSec / MessagesInPerSec to spot the average batch-size increase that precedes a limit breach.
- Use OS-level metrics such as pgmajfault rate and disk latency to catch the secondary latency cliff that large messages can cause.
Related guides
- How Kafka actually works in production: a mental model for operators
- Kafka enable.auto.commit data loss: committed offsets that outrun processing
- Kafka broker out of disk: log.dirs full, the cliff-edge shutdown, and recovery
- Kafka CommitFailedException: rebalanced-out consumers and poll loop timeouts
- Kafka consumer group stuck Empty or Dead: no members consuming
- Kafka consumer group lag growing: detection, lag-as-time, and root causes
- Kafka consumer group rebalancing too often: heartbeats, session timeout, and assignors
- Kafka __consumer_offsets growing huge: compaction failure on the offsets topic
- Kafka consumer rebalance storm: stuck in PreparingRebalance and max.poll.interval.ms
- Kafka controller event queue backing up: overwhelmed controller and stalled metadata
- Kafka disk I/O latency high: await, LocalTimeMs, and the slow-disk broker
- Kafka disk space planning: retention, replication, and runway estimation







