Kubernetes Job and CronJob troubleshooting: history, backoff, and missed runs

You deployed a CronJob to run every minute, but the last successful run was three hours ago. Or a critical data-processing Job failed with BackoffLimitExceeded after six silent retries, leaving a trail of failed Pods and no clear signal about what broke. Batch workloads fail differently from long-running services: they are time-bound, retry-sensitive, and leave debris in etcd if you do not clean them up. Read the failure signals, distinguish retry storms from control plane delays, and fix the root cause without guessing.

What this means

A Job creates Pods and retries failures until enough succeed. A CronJob wraps a Job template with a schedule. Three mechanisms cause most operator confusion:

  • backoffLimit: The number of Pod failures allowed before the Job is marked Failed with reason BackoffLimitExceeded. The default is 6. Retries are counted differently depending on restartPolicy. With OnFailure, container restarts within the same Pod count. With Never, each failed Pod counts as one retry.
  • History limits: successfulJobsHistoryLimit (default 3) and failedJobsHistoryLimit (default 1) control how many finished Jobs a CronJob retains; older Jobs are deleted automatically. Leaving the fields unset applies the defaults, so accumulation usually comes from limits set too high or from Jobs created outside a CronJob without ttlSecondsAfterFinished. Either way, finished Jobs and their Pods pile up and bloat etcd.
  • Missed runs and startingDeadlineSeconds: The CronJob controller counts missed schedules from the last scheduled time until now, or only within the last startingDeadlineSeconds when that field is set. If the count exceeds 100, the controller does not start the Job and logs a warning. The startingDeadlineSeconds field defines a catch-up window; values below 10 seconds often cause silent skips because the controller reconciles on roughly 10-second intervals.
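
To show where these fields live in a manifest, here is a minimal sketch of a CronJob. The name example-report, the busybox image, the schedule, and the 120-second deadline are all hypothetical choices for illustration; the history limits and backoffLimit are simply the defaults written out explicitly.

```shell
# Write a hypothetical CronJob manifest; every name and value is illustrative.
cat > cronjob-example.yaml <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-report            # hypothetical name
spec:
  schedule: "*/5 * * * *"
  startingDeadlineSeconds: 120    # catch-up window; keep it well above 10s
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3   # same as the default
  failedJobsHistoryLimit: 1       # same as the default
  jobTemplate:
    spec:
      backoffLimit: 6             # same as the default
      template:
        spec:
          restartPolicy: Never    # each failed Pod counts as one retry
          containers:
          - name: report
            image: busybox:1.36
            command: ["sh", "-c", "echo generating report"]
EOF

# Validate without creating anything (needs cluster access):
# kubectl apply --dry-run=server -f cronjob-example.yaml
```

Writing the defaults out makes the retention and retry behavior visible in review, which helps when a templating layer might otherwise change them unnoticed.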

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| backoffLimit exhausted | Job status Failed with reason BackoffLimitExceeded; multiple failed Pods with increasing restart counts | Pod logs and events for the most recent failure |
| History accumulation | Hundreds of completed Jobs or Pods in the namespace; etcd database size growing | successfulJobsHistoryLimit and failedJobsHistoryLimit on the CronJob |
| Missed runs due to control plane latency | CronJob status shows no recent active Jobs; events mention “Cannot determine if job needs to be started” | API server and etcd latency; node readiness |
| startingDeadlineSeconds too tight | CronJob with a sub-10-second window never creates Jobs | The CronJob spec.startingDeadlineSeconds value |
| concurrencyPolicy Forbid masking slowness | CronJob skips runs silently because the previous Job is still active | Duration of the active Job versus the CronJob schedule interval |
| Disruption consuming retry budget | Pods evicted or preempted count toward backoffLimit, exhausting retries before the workload runs | Pod status and events for DisruptionTarget or eviction |
| Missing cleanup | Completed Jobs and Pods remain indefinitely; node disk or etcd pressure increases | ttlSecondsAfterFinished and activeDeadlineSeconds settings |

Quick checks

Run these checks in order. All are read-only and safe to run during an incident.

# List Jobs and their status
kubectl get jobs -n <namespace> -o wide

# Inspect a specific Job for conditions and events
kubectl describe job <job-name> -n <namespace>

# Check CronJob schedule, history limits, and last run times
kubectl describe cronjob <cronjob-name> -n <namespace>

# View logs from the most recent failed Pod (use --previous only if the container restarted)
kubectl logs -n <namespace> <pod-name>

# Check for backoff or deadline events
kubectl get events -n <namespace> --field-selector reason=BackoffLimitExceeded

# Verify if CronJob active Jobs are blocking new runs
kubectl get cronjob <cronjob-name> -n <namespace> -o jsonpath='{.status.active}'

# Check cluster-wide Job count to spot accumulation
kubectl get jobs --all-namespaces --no-headers | wc -l

# Check etcd database size if history limits are high or unset
# Adjust certificate paths to match your cluster (example paths below are for kubeadm)
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint status --cluster -w table

How to diagnose it

Use this flow to narrow down whether the issue is retry exhaustion, scheduling delay, or control plane backlog.

  1. Determine the Job status. Run kubectl get job <name> -o jsonpath='{.status.conditions}'. A condition with type Failed and reason BackoffLimitExceeded means the retry budget is spent. A condition with type Complete means the Job finished but may not have been cleaned up. If there are no conditions and .status.active shows Pods, the Job is still running.
  2. Read the most recent Pod failure. Find the newest Pod owned by the Job (kubectl get pods --selector=job-name=<job-name>). Check its container status state.terminated.reason (or lastState.terminated.reason if the container restarted). OOMKilled, Error, or ContainerCannotRun point to workload issues. A Pod-level status.reason of Evicted, or a DisruptionTarget entry in status.conditions, points to cluster-level pressure rather than the workload.
  3. Check for missed run semantics on CronJobs. Look at the CronJob status fields lastScheduleTime and lastSuccessfulTime. If lastScheduleTime is far behind the current time and no Job exists, the controller may have skipped the run. Check events for the message “Cannot determine if job needs to be started. Too many missed start times (>100).” This appears when the controller has given up on catch-up.
  4. Validate startingDeadlineSeconds. If .spec.startingDeadlineSeconds is set, ensure it is at least 10 seconds. Values below 10 seconds are known to cause silent skips because the CronJob controller reconciles on approximately 10-second intervals.
  5. Check concurrency and active Job count. For a CronJob with concurrencyPolicy: Forbid, a long-running Job blocks all subsequent scheduled runs. Check the ACTIVE column of kubectl get cronjob, then compare the DURATION of the owned Job in kubectl get jobs against the schedule interval. If a Job has been active longer than the interval, subsequent runs were silently skipped.
  6. Inspect control plane health if runs are missing cluster-wide. Slow API server writes or high etcd fsync latency can delay CronJob creation. Check API server latency and etcd leader stability. See How the Kubernetes control plane works and Kubernetes API server etcd latency.
  7. Audit history and retention. Check successfulJobsHistoryLimit and failedJobsHistoryLimit. If they are set to high values, or if Jobs are created by something other than a CronJob (a custom controller or pipeline) without a TTL, completed Jobs accumulate. Check the total Job count across namespaces. High counts correlate with slow LIST operations and etcd bloat.

flowchart TD
    A[CronJob missed or Job failed] --> B{Job status condition}
    B -->|Failed: BackoffLimitExceeded| C[Inspect latest pod failure reason]
    B -->|Complete but retained| D[Check ttlSecondsAfterFinished and history limits]
    B -->|Active / no condition| E{Is it a CronJob?}
    E -->|Yes| F{Check missed runs}
    F -->|Last schedule far behind| G[Check startingDeadlineSeconds and control plane latency]
    F -->|Blocked by active job| H[Check concurrencyPolicy and job duration]
    E -->|No| I[Check resource pressure and pod logs]
    C --> J{Failure type}
    J -->|Resource or app error| K[Fix workload spec or dependencies]
    J -->|Disruption / eviction| L[Add podFailurePolicy or reduce churn]
    G --> M[Recreate CronJob if >100 missed schedules]
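
The first branch of this flow can be sketched as a small shell helper. classify_job is a hypothetical function name, not part of kubectl, and the messages are just shorthand for the steps above; the commented lines show one way it might be fed from a live cluster.

```shell
# classify_job TYPE REASON: map a Job condition to the next diagnostic step.
# Hypothetical helper for illustration only.
classify_job() {
  case "$1:$2" in
    Failed:BackoffLimitExceeded) echo "retry budget spent: read the newest Pod failure" ;;
    Failed:DeadlineExceeded)     echo "activeDeadlineSeconds hit: check runtime and deadline" ;;
    Complete:*)                  echo "finished: check TTL and history limits" ;;
    :*)                          echo "no terminal condition: Job is still active or pending" ;;
    *)                           echo "unrecognized condition: run kubectl describe job" ;;
  esac
}

# Against a live cluster (placeholders as in the quick checks; this only
# covers the failure branch, because it reads the Failed condition alone):
# reason=$(kubectl get job <job-name> -n <namespace> \
#   -o jsonpath='{.status.conditions[?(@.type=="Failed")].reason}')
# classify_job "${reason:+Failed}" "$reason"
```

An empty type and reason fall through to the "still active" branch, which matches the case where the Job has no terminal condition yet.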

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Job completion duration | Long-running Jobs block CronJob schedules and consume cluster resources | p99 completion duration exceeds the CronJob interval |
| CronJob missed schedules | Controller could not create a Job on time | Increase in missed schedule counter or stale lastScheduleTime |
| Pod restart count for Jobs | Repeated restarts consume the backoffLimit budget quickly | Restart count approaching backoffLimit before completion |
| API server mutating request latency | Slow writes delay Job and CronJob object creation | p99 latency > 1s sustained |
| etcd WAL fsync latency | Slow etcd causes the API server to stall, which delays CronJob reconciliation | p99 fsync > 100ms |
| Controller workqueue depth | Backlog in the CronJob or Job controller means reconciliation is falling behind | Depth > 0 sustained |
| Cluster Job object count | Accumulated Jobs increase etcd size and LIST latency | Total Jobs growing without bound over days |
| Pod eviction events | Evicted batch Pods waste retries and can trigger backoffLimit exhaustion | Eviction events correlated with Job Pod names |

Fixes

If the cause is backoff exhaustion

Increase backoffLimit only if the workload is legitimately retryable, for example when it depends on a network service that fails transiently. If the Job fails because of a code bug or a missing ConfigMap, raising the limit only creates noise. Set activeDeadlineSeconds to cap total runtime and prevent runaway retries. If your cluster supports it, add a podFailurePolicy that ignores Pods carrying the DisruptionTarget condition, so that node pressure or preemption does not exhaust the retry budget. A podFailurePolicy requires restartPolicy: Never in the Pod template.
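
A minimal sketch of such a Job follows, assuming a cluster version where podFailurePolicy is available. The name example-batch, the image, the backoffLimit of 4, and the 30-minute deadline are hypothetical values chosen for illustration.

```shell
# Write a hypothetical Job manifest that ignores disruption-caused Pod failures.
cat > job-example.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: example-batch             # hypothetical name
spec:
  backoffLimit: 4                 # illustrative retry budget
  activeDeadlineSeconds: 1800     # cap total runtime at 30 minutes
  podFailurePolicy:
    rules:
    - action: Ignore              # do not count these failures toward backoffLimit
      onPodConditions:
      - type: DisruptionTarget
  template:
    spec:
      restartPolicy: Never        # required when podFailurePolicy is set
      containers:
      - name: worker
        image: busybox:1.36
        command: ["sh", "-c", "echo processing"]
EOF

# Validate without creating anything (needs cluster access):
# kubectl apply --dry-run=server -f job-example.yaml
```

With the Ignore rule, a Pod that fails because of eviction or preemption is replaced without spending a retry, while application failures still count toward backoffLimit.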

If the cause is missed runs or schedule skew

Set startingDeadlineSeconds to a value between 10 and 300 seconds, depending on your tolerance for catch-up. Never set it below 10 seconds. If the CronJob has accumulated more than 100 missed schedules, first set or lower startingDeadlineSeconds: the controller then counts missed schedules only within that window, which usually brings the count back under the threshold. If the CronJob still does not recover, delete and recreate the CronJob resource, because the controller may not recover from this state on its own. If your workload must not overlap runs, use concurrencyPolicy: Forbid, but monitor the Job completion time and alert if it approaches the schedule interval.
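
The 100-missed-schedule threshold is easier to reason about with a little arithmetic. The helper below is a rough sketch: missed_schedules is a hypothetical function that assumes a fixed-interval schedule and approximates the count as window divided by interval. Per the Kubernetes documentation, the window is the time since the last scheduled run, or the last startingDeadlineSeconds when that field is set.

```shell
# missed_schedules INTERVAL_SECONDS WINDOW_SECONDS
# Approximate how many runs of a fixed-interval schedule fall inside a window.
# Hypothetical helper; real cron expressions need not be evenly spaced.
missed_schedules() {
  if [ "$1" -le 0 ]; then
    echo "interval must be positive" >&2
    return 1
  fi
  echo $(( $2 / $1 ))
}

# Every-minute schedule, nothing scheduled for three hours, no deadline set:
missed_schedules 60 10800   # 180, above the 100 threshold

# Same schedule with startingDeadlineSeconds: 200, so only 200s are counted:
missed_schedules 60 200     # 3, well under the threshold
```

This is why an every-minute CronJob that stalls for a few hours is the classic case for hitting the threshold, and why a modest startingDeadlineSeconds keeps the count small.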

If the cause is history accumulation

Set ttlSecondsAfterFinished on the Job template or on standalone Jobs so the TTL controller deletes them automatically. For CronJobs, lower successfulJobsHistoryLimit to 1 or 2 if you only need the last run for debugging, and set failedJobsHistoryLimit to 2 or 3 to retain enough context without hoarding objects. If you use a GitOps workflow, ensure your templating does not silently reset these fields on every sync: a value of zero discards all history, and an unset field falls back to the default.
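
One way to apply these retention settings to an existing CronJob is a merge patch. The sketch below writes a hypothetical patch file; the values (2 successful, 3 failed, a one-day TTL) are illustrative, and the CronJob name and namespace are placeholders.

```shell
# Write a hypothetical merge patch with retention settings.
cat > retention-patch.yaml <<'EOF'
spec:
  successfulJobsHistoryLimit: 2
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 86400   # delete finished Jobs after one day
EOF

# Preview the change before applying it (needs cluster access):
# kubectl patch cronjob <cronjob-name> -n <namespace> --type merge \
#   --patch-file retention-patch.yaml --dry-run=server -o yaml
```

The TTL and the history limits act independently, so a finished Job is removed by whichever applies first. The patch affects Jobs created after it is applied; existing Jobs keep the TTL they were created with.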

If the cause is control plane latency

Fixing the batch workload spec will not help if the API server or etcd is the bottleneck. Follow the control plane latency troubleshooting path. Reduce etcd object churn by cleaning up completed Jobs and events, and verify that the CronJob controller is not being throttled by API Priority and Fairness queues.

Prevention

  • Set TTL and history limits by default. Every CronJob manifest should specify successfulJobsHistoryLimit and failedJobsHistoryLimit in its spec, and ttlSecondsAfterFinished in jobTemplate.spec. Every standalone Job should specify ttlSecondsAfterFinished.
  • Monitor schedule skew. Alert when lastScheduleTime on a CronJob is older than 1.5 * schedule_interval. This catches silent skips before they become an outage.
  • Test control plane maintenance windows. During API server upgrades or etcd compaction, CronJob creation can lag. Know whether your critical batch workloads can tolerate a 1-5 minute delay.
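  • Align resource requests with peak batch usage. A Pod that requests 100m CPU but briefly spikes to 2 cores can be throttled if a CPU limit is set, or starved on a busy node. A Pod that uses more memory than it requests is an early candidate for eviction under node memory pressure, which wastes retries. Size requests based on observed peak usage.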
  • Use Pod failure policies to ignore non-application failures. Explicitly exclude DisruptionTarget and other cluster-level conditions so node maintenance does not burn the retry budget.
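
The schedule-skew rule above (lastScheduleTime older than 1.5 times the schedule interval) can be sketched as a shell check. is_stale is a hypothetical helper that takes epoch seconds; the commented usage assumes GNU date for parsing the timestamp and a CronJob that runs every 300 seconds.

```shell
# is_stale LAST_EPOCH NOW_EPOCH INTERVAL_SECONDS
# Succeeds (exit 0) when the last schedule is older than 1.5x the interval.
# Hypothetical helper for illustration only.
is_stale() {
  age=$(( $2 - $1 ))
  # Compare 2*age > 3*interval to stay in integer arithmetic.
  [ $(( age * 2 )) -gt $(( $3 * 3 )) ]
}

# Against a live cluster (placeholders as in the quick checks; GNU date assumed):
# last=$(kubectl get cronjob <cronjob-name> -n <namespace> \
#   -o jsonpath='{.status.lastScheduleTime}')
# is_stale "$(date -d "$last" +%s)" "$(date +%s)" 300 && echo "schedule skew detected"
```

In practice this comparison usually lives in an alerting rule rather than a script, but the arithmetic is the same.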

How Netdata helps

  • Correlate CronJob missed runs with API server write latency and etcd fsync duration to determine whether the control plane is the bottleneck.
  • Monitor per-node CPU throttling and memory pressure that cause batch Pod evictions, which in turn consume backoffLimit retries.
  • Track controller workqueue depth and Job object counts to spot accumulation trends before etcd bloat becomes critical.
  • Alert on sustained increases in Pod restart counts for Jobs, providing an early signal that a batch workload is failing before backoffLimit is reached.