Kafka RecordTooLargeException / MESSAGE_TOO_LARGE: message size limits across the path

Two errors surface: the producer client throws RecordTooLargeException before the request hits the wire, or the broker returns MESSAGE_TOO_LARGE in the produce response. Both mean a record or record batch exceeds a limit somewhere in the path, but the fix depends on which limit was hit and where the size is measured. There is no single knob: max.request.size, message.max.bytes, max.message.bytes, replica.fetch.max.bytes, fetch.max.bytes, max.partition.fetch.bytes, and socket.request.max.bytes must all align. Raise one limit without the others and you can wedge replication, strand consumers, or silently lose data on the next large message.

What this means

Kafka enforces size limits at multiple points between producer and consumer. The producer’s max.request.size caps the full produce request, including the compressed batch; the Java client also checks each record’s serialized, uncompressed size against it before batching, which is where a client-side RecordTooLargeException comes from. The broker’s message.max.bytes is a global ceiling on the compressed record batch. A topic-level max.message.bytes overrides the broker default per topic, so the effective ceiling is the topic value if set; otherwise it falls back to the broker value. The broker also enforces socket.request.max.bytes as a hard socket-level ceiling. If a request exceeds that, the broker closes the connection instead of returning a graceful error.
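The fallback rule for the effective batch ceiling can be sketched as a small shell helper. This is only an illustration: `effective_limit` is a hypothetical function name, and the values passed in are examples (1048588 is the message.max.bytes default in recent Kafka releases; verify it for your version).

```shell
# effective_limit TOPIC_VALUE BROKER_VALUE
# Prints the ceiling applied to a record batch for one topic:
# the topic-level max.message.bytes if set, else the broker message.max.bytes.
effective_limit() {
  local topic_value="$1" broker_value="$2"
  if [ -n "$topic_value" ]; then
    echo "$topic_value"
  else
    echo "$broker_value"
  fi
}

# Topic override present: the topic value wins, even when it is lower.
effective_limit 524288 1048588
# No topic override: fall back to the broker value.
effective_limit "" 1048588
```

The point of the sketch is that a lower topic override silently tightens the limit for that topic, which is why one topic can fail while the rest of the cluster accepts the same batch.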

On the read side, consumers use fetch.max.bytes and max.partition.fetch.bytes to limit how much data they receive in one fetch. Follower replicas use replica.fetch.max.bytes. Kafka returns the first record batch even when it exceeds a fetch limit, so progress is still possible, but a limit that is too small reduces every fetch to a single oversized batch. That shows up as under-replication or consumer lag.

Compression applies to the whole batch, not individual records. A compressed batch can be much smaller than the sum of its records, but the producer, broker, and consumer see different sizes depending on where they measure. A batch that is small when compressed can still exceed a limit if the producer uses no compression or if linger.ms and batch.size combine into an oversized batch.

flowchart LR
    A["Producer batch<br/>max.request.size"] -->|produce request| B["Broker socket<br/>socket.request.max.bytes"]
    B -->|validate batch| C["Broker/topic record batch limit<br/>message.max.bytes / max.message.bytes"]
    C -->|accept to| D[Leader log]
    D -->|replicate| E["Replica fetch<br/>replica.fetch.max.bytes"]
    D -->|serve| F["Consumer fetch<br/>fetch.max.bytes /<br/>max.partition.fetch.bytes"]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Producer batch exceeds broker or topic limit | MESSAGE_TOO_LARGE in broker logs; Java producers on affected versions can enter a split-retry loop (KAFKA-8350) | Compare producer max.request.size against broker message.max.bytes and topic max.message.bytes |
| Topic max.message.bytes is lower than the broker limit and the producer | MESSAGE_TOO_LARGE only on one topic | kafka-configs.sh --describe --entity-type topics --entity-name <topic> |
| replica.fetch.max.bytes is smaller than the effective message limit | Under-replicated partitions and ISR shrinks after large messages start arriving | grep replica.fetch.max.bytes /path/to/server.properties on every broker |
| Consumer fetch limit is smaller than the message size | Consumer stalls or lag grows after a producer change; no producer errors | Consumer config for fetch.max.bytes and max.partition.fetch.bytes |
| Compression is disabled or a single batch is aggregated too large | Batch size jumps after a producer config change or upgrade to Kafka 4.0 | Producer compression.type, batch.size, and linger.ms |
| Managed Kafka hard ceiling | Errors persist even after all cluster configs are raised | Cloud provider limit for the instance tier |
| Client library or connector with a hardcoded limit | RecordTooLargeException from a client that ignores broker config | Client source or connector docs, e.g. Trino Kafka Event Listener, MirrorMaker 2, OpenTelemetry exporter |

Quick checks

Run these read-only checks before changing any config.

# Find the exact error in broker logs
grep -E "MESSAGE_TOO_LARGE|RecordTooLargeException" /var/log/kafka/server.log

# Inspect the topic-level message size limit
kafka-configs.sh --bootstrap-server localhost:9092 --describe \
  --entity-type topics --entity-name <topic>

# Inspect broker-level limits in server.properties
grep -E "message.max.bytes|replica.fetch.max.bytes|socket.request.max.bytes" \
  /etc/kafka/server.properties

# List under-replicated partitions to spot replica fetch issues
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions

# Check consumer lag if reads are stalling
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group-id>

If you have JMX access, pull the produce error rate directly:

# Failed produce requests per second
echo "get -b kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec OneMinuteRate" \
  | java -jar jmxterm.jar -l localhost:9999

How to diagnose it

  1. Identify which side rejected the batch. RecordTooLargeException in the producer client means the producer refused to send it. MESSAGE_TOO_LARGE in broker logs or the producer response means the broker refused it. MirrorMaker 2 and some connectors can shift the error from source to target.
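  2. Read the actual batch size. Producer logs often include the size of the record batch that failed. If not, BytesInPerSec / MessagesInPerSec gives the average on-wire bytes per message; scale that by your compression ratio to estimate the uncompressed size. This is an average, so the largest batches can be well above it.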
  3. Find the effective broker/topic ceiling. The effective limit is the topic max.message.bytes if configured; otherwise it falls back to the broker message.max.bytes. Check both. Do not assume the global default is the active limit.
  4. Check the socket-level ceiling. If the whole produce request exceeds socket.request.max.bytes, the connection is closed. This looks like a network error, not a clear MESSAGE_TOO_LARGE. The default is 100 MiB.
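  5. Verify replication can pull the batch. On every broker, replica.fetch.max.bytes should be at least as large as the effective message limit. If it is smaller, followers can fetch oversized batches only one at a time (and not at all on brokers that predate the forward-progress behavior), so replication falls behind and the ISR shrinks.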
  6. Verify consumers can fetch the batch. Consumers need fetch.max.bytes and max.partition.fetch.bytes large enough for the batch. Forward-progress logic means the first batch is still returned, but a too-small limit causes inefficient fetches and lag.
  7. Review producer batching and compression. Check compression.type, batch.size, and linger.ms. If batch.size exceeds the effective broker or topic limit, the producer can enter an infinite split-and-retry loop on versions affected by KAFKA-8350.
  8. Check for silent data loss if you are on an affected version. KAFKA-19479 caused Kafka Streams with processing.guarantee=at_least_once to commit offsets while dropping messages that hit MESSAGE_TOO_LARGE. If you see committed offsets advancing without corresponding downstream output, patch the broker and client.
  9. If you use managed Kafka, confirm the provider ceiling. Some tiers have non-configurable limits. Raising message.max.bytes inside the cluster will not override the provider’s hard cap.
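Once you have collected the four write-path numbers from the steps above, the comparison itself is mechanical. The following is a minimal sketch, assuming you pass the values in by hand; `check_chain` is a hypothetical helper, not a Kafka tool. It treats a producer max.request.size above the effective message limit as a warning rather than a hard error.

```shell
# check_chain PRODUCER_MAX_REQUEST EFFECTIVE_MSG_LIMIT REPLICA_FETCH_MAX SOCKET_REQUEST_MAX
# Prints one WARN line per mismatch; returns 0 if consistent, 1 otherwise.
check_chain() {
  local producer="$1" msg_limit="$2" replica="$3" socket="$4" rc=0
  if [ "$producer" -gt "$msg_limit" ]; then
    echo "WARN: max.request.size ($producer) > effective message limit ($msg_limit)"
    rc=1
  fi
  if [ "$replica" -lt "$msg_limit" ]; then
    echo "WARN: replica.fetch.max.bytes ($replica) < effective message limit ($msg_limit)"
    rc=1
  fi
  if [ "$producer" -gt "$socket" ]; then
    echo "WARN: max.request.size ($producer) > socket.request.max.bytes ($socket)"
    rc=1
  fi
  if [ "$rc" -eq 0 ]; then
    echo "OK: limits are consistent"
  fi
  return "$rc"
}

# Example: message limit raised to 8 MiB, replica fetch left at 1 MiB.
check_chain 8388608 8388608 1048576 104857600 \
  || echo "fix the limits above before sending large messages"
```

The example call models the most common mistake: the message limit was raised but replica.fetch.max.bytes was not.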

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| FailedProduceRequestsPerSec | Direct measure of producer-visible errors including MESSAGE_TOO_LARGE | Sustained OneMinuteRate above 0 |
| BytesInPerSec / MessagesInPerSec | Gives average message/batch size; sudden jump indicates a producer behavior change | Ratio increases sharply after a deployment |
| UnderReplicatedPartitions | Shows whether large messages are replicating; rises when replica.fetch.max.bytes is too small | Nonzero outside maintenance windows |
| IsrShrinksPerSec | Velocity of replicas falling out of sync | Sustained shrinks after large-message traffic starts |
| FetchFollower TotalTimeMs | Replication path latency; high values mean followers are being served slowly | p99 approaching replica.lag.time.max.ms |
| Consumer lag | Confirms consumers can read the large batches; lag grows if fetch limits are too tight or processing slows | Lag increasing monotonically |
| RequestQueueTimeMs / LocalTimeMs | Distinguishes broker overload from disk latency if large messages saturate I/O threads | p99 spikes in either component |
| Page cache pressure (pgmajfault rate) | Large messages can evict hot records and shift reads to disk | Major fault rate doubles from baseline |
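The BytesInPerSec / MessagesInPerSec ratio is simple to compute once you have the two rates. The helper below is a sketch with a hypothetical name (`avg_bytes_per_message`); it assumes you have already read both rates from JMX or your monitoring system and pass them in as plain numbers.

```shell
# avg_bytes_per_message BYTES_IN_PER_SEC MESSAGES_IN_PER_SEC
# Prints the average on-wire bytes per message, rounded to a whole number,
# or "n/a" when there is no message traffic to divide by.
avg_bytes_per_message() {
  awk -v b="$1" -v m="$2" 'BEGIN {
    if (m <= 0) { print "n/a"; exit }
    printf "%.0f\n", b / m
  }'
}

# Example: 50,000,000 bytes/s over 10,000 messages/s.
avg_bytes_per_message 50000000 10000
```

Because this is an average of on-wire bytes, it understates the largest batches. Use it to spot a trend after a deployment, not to prove that every batch fits under a limit.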

Fixes

If the broker or topic limit is too low

Raise the topic limit first. This avoids a global change and a rolling restart:

kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name <topic> \
  --add-config max.message.bytes=<new-limit>

If many topics need the same limit, raise message.max.bytes in server.properties on every broker and perform a rolling restart. Set replica.fetch.max.bytes to at least the effective message limit on every broker. If you raise message.max.bytes to 8 MiB, set replica.fetch.max.bytes to 8 MiB or higher everywhere.
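A quick way to confirm the pairing on each broker is to read both keys from server.properties and compare them. This is a rough sketch: `prop_value` and `check_replica_fetch` are hypothetical helpers, they only understand plain `key=value` lines with no spaces around `=`, and they do not see dynamic configs set through kafka-configs.sh.

```shell
# prop_value FILE KEY
# Prints the last value assigned to KEY in a simple key=value properties file.
prop_value() {
  awk -F= -v key="$2" '$1 == key { print $2 }' "$1" | tail -n 1
}

# check_replica_fetch FILE
# Returns 0 if replica.fetch.max.bytes >= message.max.bytes,
# 1 if it is smaller, 2 if either key is not set explicitly.
check_replica_fetch() {
  local file="$1" msg replica
  msg="$(prop_value "$file" message.max.bytes)"
  replica="$(prop_value "$file" replica.fetch.max.bytes)"
  if [ -z "$msg" ] || [ -z "$replica" ]; then
    echo "message.max.bytes or replica.fetch.max.bytes is not set explicitly; broker defaults apply"
    return 2
  fi
  if [ "$replica" -lt "$msg" ]; then
    echo "replica.fetch.max.bytes=$replica is below message.max.bytes=$msg"
    return 1
  fi
  echo "replica.fetch.max.bytes=$replica covers message.max.bytes=$msg"
}

# Example (run on each broker):
#   check_replica_fetch /etc/kafka/server.properties
```

For the 8 MiB case described above, both keys would need to be at least 8388608 (8 × 1024 × 1024) for the check to pass.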

If the producer is creating oversized batches

Increase max.request.size on the producer to match the broker limit, but also tune batch.size and linger.ms so batches do not grow beyond what the broker accepts. Keep batch.size below the effective broker or topic limit to avoid KAFKA-8350 on affected versions. Enable compression unless your workload is already CPU-bound; compression.type=lz4 or zstd can shrink batches substantially when the payload is compressible.
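As a sketch of how those producer settings relate to the broker limit, the helper below writes a small override file derived from one number. `write_producer_overrides` is a hypothetical function, and the one-quarter ratio for batch.size is an arbitrary illustrative choice, not a recommendation from Kafka; pick a value that suits your workload.

```shell
# write_producer_overrides LIMIT_BYTES OUT_FILE
# Writes producer settings derived from the effective broker/topic limit.
write_producer_overrides() {
  local limit="$1" out="$2"
  cat > "$out" <<EOF
# Cap requests at the effective broker/topic batch limit.
max.request.size=$limit
# Keep batch.size below the limit (one quarter here, an arbitrary choice).
batch.size=$((limit / 4))
# Compress whole batches; zstd is the other option named in the text.
compression.type=lz4
EOF
}

# Example:
#   write_producer_overrides 8388608 producer-overrides.properties
```

Deriving batch.size from the limit, instead of setting it independently, keeps the two from drifting apart when the limit changes later.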

If consumers cannot fetch

Increase fetch.max.bytes and max.partition.fetch.bytes on the consumer. You do not need to match the producer limit exactly, but the per-partition limit should be at least as large as the largest batch you expect, so fetches are not reduced to one oversized batch at a time.
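The consumer-side comparison can be sketched the same way. `check_consumer_fetch` is a hypothetical helper; the example values 1048576 and 52428800 are the consumer defaults for max.partition.fetch.bytes and fetch.max.bytes in recent Kafka clients, which you should confirm for your client version.

```shell
# check_consumer_fetch LARGEST_BATCH MAX_PARTITION_FETCH FETCH_MAX
# Returns 0 if both consumer limits cover the largest expected batch, else 1.
check_consumer_fetch() {
  local batch="$1" per_partition="$2" fetch_max="$3" rc=0
  if [ "$per_partition" -lt "$batch" ]; then
    echo "WARN: max.partition.fetch.bytes ($per_partition) < largest batch ($batch)"
    rc=1
  fi
  if [ "$fetch_max" -lt "$batch" ]; then
    echo "WARN: fetch.max.bytes ($fetch_max) < largest batch ($batch)"
    rc=1
  fi
  if [ "$rc" -eq 0 ]; then
    echo "OK: consumer fetch limits cover a ${batch}-byte batch"
  fi
  return "$rc"
}

# Example: 8 MiB batches against default-sized consumer limits.
check_consumer_fetch 8388608 1048576 52428800 \
  || echo "raise max.partition.fetch.bytes on the consumer"
```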

If the managed service caps you

You cannot override the provider’s ceiling. Options: split large payloads across multiple records, store the payload in object storage and pass a reference through Kafka, or move to a tier with a higher limit.
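The first option, splitting a payload across multiple records, can be prototyped on the command line before touching application code. This sketch only covers the chunking and reassembly; `chunk_payload` and `reassemble_payload` are hypothetical helpers, and how you publish each chunk (for example with a shared key and a sequence number in a header) is left to your producer.

```shell
# chunk_payload FILE CHUNK_BYTES OUT_DIR
# Splits FILE into CHUNK_BYTES-sized parts in OUT_DIR and prints the part count.
# OUT_DIR should be empty or new. split's default two-letter suffix allows
# at most 676 parts.
chunk_payload() {
  local file="$1" chunk_bytes="$2" out_dir="$3"
  mkdir -p "$out_dir"
  split -b "$chunk_bytes" "$file" "$out_dir/part-"
  ls "$out_dir" | wc -l | tr -d ' '
}

# reassemble_payload OUT_DIR DEST
# Concatenates the parts in suffix order (aa, ab, ...) back into one file.
reassemble_payload() {
  cat "$1"/part-* > "$2"
}

# Example:
#   chunk_payload big-payload.bin 900000 ./chunks
#   reassemble_payload ./chunks restored.bin
```

Choose a chunk size comfortably below the provider's cap so that record headers and batch overhead still fit.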

If a client library hardcodes its own limit

Some clients and connectors ignore the broker config. For MirrorMaker 2, use producer.override.max.request.size in the connector config. For other clients, check the client source or docs for an explicit cap and patch or configure accordingly.

Prevention

  • Document the end-to-end size limit chain for every cluster: producer max.request.size, broker message.max.bytes, topic max.message.bytes, broker replica.fetch.max.bytes, consumer fetch.max.bytes / max.partition.fetch.bytes, and broker socket.request.max.bytes.
  • After raising any message limit, verify replica.fetch.max.bytes is at least as large on every broker. This is the most commonly forgotten step.
  • Alert on FailedProduceRequestsPerSec and UnderReplicatedPartitions so size-related regressions do not go unnoticed.
  • Track BytesInPerSec / MessagesInPerSec to detect producer batching or message-size changes early.
  • Test large-message workloads in staging with the same compression and linger.ms settings used in production.
  • Keep brokers patched to avoid KAFKA-8350 and KAFKA-19479 if your versions are affected.
  • Do not enable unclean.leader.election.enable to work around replication issues caused by large messages. That trades a size config problem for a real risk of data loss.

How Netdata helps

Netdata surfaces the broker JMX metrics that expose message-size failures and their side effects:

  • FailedProduceRequestsPerSec shows when producer-visible errors start.
  • Overlay UnderReplicatedPartitions and IsrShrinksPerSec to detect replica-fetch limits that are too small.
  • Correlate FetchFollower latency with RequestQueueTimeMs and disk await to determine whether large messages are saturating I/O threads or evicting page cache.
  • Track consumer lag and per-topic BytesInPerSec / MessagesInPerSec to spot the average batch-size increase that precedes a limit breach.
  • Use OS-level metrics such as pgmajfault rate and disk latency to catch the secondary latency cliff that large messages can cause.