Kubernetes Job and CronJob troubleshooting: history, backoff, and missed runs

You deployed a CronJob to run every minute, but the last successful run was three hours ago. Or a critical data-processing Job failed with BackoffLimitExceeded after six silent retries, leaving a trail of failed Pods and no clear signal about what broke. Batch workloads fail differently from long-running services: they are time-bound, retry-sensitive, and leave debris in etcd if you do not clean them up. Read the failure signals, distinguish retry storms from control plane delays, and fix the root cause without guessing.

What this means

A Job creates Pods and retries failures until enough succeed. A CronJob wraps a Job template with a schedule. Three mechanisms cause most operator confusion:

backoffLimit: The number of Pod failures allowed before the Job is marked Failed with reason BackoffLimitExceeded. The default is 6. Retries are counted differently depending on restartPolicy. With OnFailure, container restarts inside the same Pod count toward the limit. With Never, each failed Pod counts as one retry.
History limits: successfulJobsHistoryLimit (default 3) and failedJobsHistoryLimit (default 1) control how many finished Jobs a CronJob retains; older Jobs are deleted automatically. These limits apply only to Jobs owned by a CronJob. If they are set to high values, or if Jobs are created by hand or by another controller without ttlSecondsAfterFinished, finished Jobs and their Pods accumulate and bloat etcd.
Missed runs and startingDeadlineSeconds: The CronJob controller evaluates missed schedules from the last scheduled time until now. If more than 100 missed schedules accumulate, the controller stops starting Jobs entirely and logs a warning. The startingDeadlineSeconds field defines a catch-up window; values below 10 seconds often cause silent skips because the controller reconciles on roughly 10-second intervals.
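The missed-run arithmetic can be sketched as follows. This is a simplified model, not the controller's code: it assumes a fixed interval in seconds instead of parsing a cron expression, and the function name missed_schedules is hypothetical. It illustrates why an every-minute CronJob that last fired three hours ago is past the 100-schedule limit, and why a startingDeadlineSeconds window keeps the count small.

```shell
#!/bin/sh
# Estimate how many schedules a fixed-interval CronJob has missed.
# Arguments (integers, in seconds):
#   $1 last schedule time (epoch), $2 current time (epoch),
#   $3 schedule interval, $4 optional startingDeadlineSeconds.
# Assumption: a fixed interval stands in for the real cron expression.
missed_schedules() {
  last=$1; now=$2; interval=$3; deadline=${4:-}
  start=$last
  # With a deadline set, only schedules inside the last <deadline> seconds count.
  if [ -n "$deadline" ] && [ $((now - deadline)) -gt "$start" ]; then
    start=$((now - deadline))
  fi
  echo $(( (now - start) / interval ))
}

# Every-minute schedule, last scheduled three hours (10800 s) ago.
echo "without deadline: $(missed_schedules 0 10800 60)"        # 180, above the limit of 100
echo "with a 200 s deadline: $(missed_schedules 0 10800 60 200)"  # 3
```

Without a deadline the count grows with the length of the outage; with one, it is bounded by the window, which is why setting the field keeps a CronJob below the limit.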

Common causes

Cause	What it looks like	First thing to check
backoffLimit exhausted	Job status Failed with reason BackoffLimitExceeded; multiple failed Pods with increasing restart counts	Pod logs and events for the most recent failure
History accumulation	Hundreds of completed Jobs or Pods in the namespace; etcd database size growing	`successfulJobsHistoryLimit` and `failedJobsHistoryLimit` on the CronJob
Missed runs due to control plane latency	CronJob status shows no recent active Jobs; events mention “Cannot determine if job needs to be started”	API server and etcd latency; node readiness
startingDeadlineSeconds too tight	CronJob with a sub-10-second window never creates Jobs	The CronJob `spec.startingDeadlineSeconds` value
concurrencyPolicy Forbid masking slowness	CronJob skips runs silently because the previous Job is still active	Duration of the active Job versus the CronJob schedule interval
Disruption consuming retry budget	Pods evicted or preempted count toward backoffLimit, exhausting retries before the workload runs	Pod status and events for `DisruptionTarget` or eviction
Missing cleanup	Completed Jobs and Pods remain indefinitely; node disk or etcd pressure increases	`ttlSecondsAfterFinished` and `activeDeadlineSeconds` settings

Quick checks

Run these checks in order. All are read-only and safe to run during an incident.

# List Jobs and their status
kubectl get jobs -n <namespace> -o wide

# Inspect a specific Job for conditions and events
kubectl describe job <job-name> -n <namespace>

# Check CronJob schedule, history limits, and last run times
kubectl describe cronjob <cronjob-name> -n <namespace>

# View logs from the most recent failed Pod (use --previous only if the container restarted)
kubectl logs -n <namespace> <pod-name>

# Check for backoff or deadline events
kubectl get events -n <namespace> --field-selector reason=BackoffLimitExceeded

# Verify if CronJob active Jobs are blocking new runs
kubectl get cronjob <cronjob-name> -n <namespace> -o jsonpath='{.status.active}'

# Check cluster-wide Job count to spot accumulation
kubectl get jobs --all-namespaces --no-headers | wc -l

# Check etcd database size if history limits are high or unset
# Adjust certificate paths to match your cluster (example paths below are for kubeadm)
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint status --cluster -w table

How to diagnose it

Use this flow to narrow down whether the issue is retry exhaustion, scheduling delay, or control plane backlog.

Determine the Job status. Run kubectl get job <name> -o jsonpath='{.status.conditions}'. A condition with type Failed and reason BackoffLimitExceeded means the retry budget is spent. A condition with type Complete means the Job finished but may not have been cleaned up. If there are no conditions and .status.active shows Pods, the Job is still running.
Read the most recent Pod failure. Find the newest Pod owned by the Job (kubectl get pods --selector=job-name=<job-name> --sort-by=.metadata.creationTimestamp). Check its container status state.terminated.reason (or lastState.terminated.reason if the container restarted). OOMKilled, Error, or ContainerCannotRun point to workload issues. A Pod-level status.reason of Evicted, or a DisruptionTarget entry in status.conditions, points to cluster-level pressure.
Check for missed run semantics on CronJobs. Look at the CronJob status fields lastScheduleTime and lastSuccessfulTime. If lastScheduleTime is far behind the current time and no Job exists, the controller may have skipped the run. Check events for the message “Cannot determine if job needs to be started. Too many missed start times (>100).” This appears when the controller has given up on catch-up.
Validate startingDeadlineSeconds. If .spec.startingDeadlineSeconds is set, ensure it is at least 10 seconds. Values below 10 seconds are known to cause silent skips because the CronJob controller reconciles on approximately 10-second intervals.
Check concurrency and active Job count. For a CronJob with concurrencyPolicy: Forbid, a long-running Job blocks all subsequent scheduled runs. Run kubectl get cronjob and inspect the ACTIVE column, then list the Jobs owned by the CronJob and compare the DURATION of the active Job with the schedule interval. If a Job has been active longer than the interval, the runs scheduled in the meantime were silently skipped.
Inspect control plane health if runs are missing cluster-wide. Slow API server writes or high etcd fsync latency can delay CronJob creation. Check API server latency and etcd leader stability. See How the Kubernetes control plane works and Kubernetes API server etcd latency.
Audit history and retention. Check successfulJobsHistoryLimit and failedJobsHistoryLimit. If they are set to high values or omitted on custom controllers, completed Jobs accumulate. Check the total Job count across namespaces. High counts correlate with slow LIST operations and etcd bloat.

flowchart TD
    A[CronJob missed or Job failed] --> B{Job status condition}
    B -->|Failed: BackoffLimitExceeded| C[Inspect latest pod failure reason]
    B -->|Complete but retained| D[Check ttlSecondsAfterFinished and history limits]
    B -->|Active / no condition| E{Is it a CronJob?}
    E -->|Yes| F{Check missed runs}
    F -->|Last schedule far behind| G[Check startingDeadlineSeconds and control plane latency]
    F -->|Blocked by active job| H[Check concurrencyPolicy and job duration]
    E -->|No| I[Check resource pressure and pod logs]
    C --> J{Failure type}
    J -->|Resource or app error| K[Fix workload spec or dependencies]
    J -->|Disruption / eviction| L[Add podFailurePolicy or reduce churn]
    G --> M[Recreate CronJob if >100 missed schedules]
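
The first two branches of this flow can be sketched as a pair of shell helpers. The function names job_next_step and job_pod_reasons are hypothetical, and the sketch assumes kubectl is on the PATH and that Pods carry the job-name label used elsewhere in this guide. It only reads cluster state.

```shell
#!/bin/sh
# Map a Job's true conditions to the next diagnostic step.
# Usage: job_next_step <job-name> <namespace>
job_next_step() {
  # Collect the types of all conditions whose status is "True" (space-separated).
  conditions=$(kubectl get job "$1" -n "$2" \
    -o jsonpath='{.status.conditions[?(@.status=="True")].type}')
  case " $conditions " in
    *" Failed "*)   echo "failed: inspect the newest Pod's termination reason" ;;
    *" Complete "*) echo "complete: check ttlSecondsAfterFinished and history limits" ;;
    *)              echo "no terminal condition: check .status.active and Pod logs" ;;
  esac
}

# List the Job's Pods, oldest first, with Pod-level and container-level reasons.
# Only the first container is shown; missing values print as <none>.
# Usage: job_pod_reasons <job-name> <namespace>
job_pod_reasons() {
  kubectl get pods -n "$2" --selector="job-name=$1" \
    --sort-by=.metadata.creationTimestamp \
    -o custom-columns='POD:.metadata.name,POD_REASON:.status.reason,CONTAINER_REASON:.status.containerStatuses[0].state.terminated.reason'
}
```

For example, `job_next_step <job-name> <namespace>` followed by `job_pod_reasons <job-name> <namespace>` covers the Failed branch; the last row of the second command is the newest Pod.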

Metrics and signals to monitor

Signal	Why it matters	Warning sign
Job completion duration	Long-running Jobs block CronJob schedules and consume cluster resources	p99 completion duration exceeds the CronJob interval
CronJob missed schedules	Controller could not create a Job on time	Increase in missed schedule counter or stale `lastScheduleTime`
Pod restart count for Jobs	Repeated restarts consume the backoffLimit budget quickly	Restart count approaching backoffLimit before completion
API server mutating request latency	Slow writes delay Job and CronJob object creation	p99 latency > 1s sustained
etcd WAL fsync latency	Slow etcd causes the API server to stall, which delays CronJob reconciliation	p99 fsync > 100ms
Controller workqueue depth	Backlog in the CronJob or Job controller means reconciliation is falling behind	Depth > 0 sustained
Cluster Job object count	Accumulated Jobs increase etcd size and LIST latency	Total Jobs growing without bound over days
Pod eviction events	Evicted batch Pods waste retries and can trigger backoffLimit exhaustion	Eviction events correlated with Job Pod names

Fixes

If the cause is backoff exhaustion

Increase backoffLimit only if the workload is legitimately retryable, such as transient network dependencies. If the Job fails because of a code bug or missing ConfigMap, raising the limit will only create noise. Set activeDeadlineSeconds to cap total runtime and prevent runaway retries. If supported in your cluster, add a podFailurePolicy to exclude DisruptionTarget conditions from the backoff count, preventing node pressure or preemption from exhausting retries.
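
A minimal sketch of a Job spec that combines these settings is shown below. The Job name, image, command, and numeric values are placeholders chosen for illustration, and the manifest assumes a cluster version where podFailurePolicy is available. The script only writes the file; nothing is applied to a cluster.

```shell
#!/bin/sh
# Write a hypothetical Job manifest to a temporary file for review.
manifest="${TMPDIR:-/tmp}/example-batch-job.yaml"
cat > "$manifest" <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: example-batch              # placeholder name
spec:
  backoffLimit: 4                  # failures allowed before BackoffLimitExceeded
  activeDeadlineSeconds: 1800      # cap on total runtime, retries included
  podFailurePolicy:
    rules:
      - action: Ignore             # these failures do not count toward backoffLimit
        onPodConditions:
          - type: DisruptionTarget
  template:
    spec:
      restartPolicy: Never         # podFailurePolicy requires Never
      containers:
        - name: worker
          image: registry.example.com/batch-worker:1.0   # placeholder image
          command: ["/bin/run-batch"]                     # placeholder command
EOF
echo "wrote $manifest"
# Validate before applying, for example:
#   kubectl apply --dry-run=server -f "$manifest"
```

With the Ignore rule in place, a Pod removed by preemption or node drain is replaced without spending one of the four retries, so the budget is reserved for genuine workload failures.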

If the cause is missed runs or schedule skew

Set startingDeadlineSeconds to a value between 10 and 300 seconds, depending on your tolerance for catch-up. Never set it below 10 seconds. If the CronJob has accumulated more than 100 missed schedules, delete and recreate the CronJob resource. The controller does not recover automatically from this state. If your workload must not overlap runs, use concurrencyPolicy: Forbid, but monitor the Job completion time and alert if it approaches the schedule interval.
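
As an illustration of these settings, the sketch below writes a CronJob manifest with a catch-up window and non-overlapping runs. The name, schedule, image, and values are assumptions for the example, not recommendations for a specific workload.

```shell
#!/bin/sh
# Write a hypothetical CronJob manifest to a temporary file for review.
manifest="${TMPDIR:-/tmp}/example-report-cronjob.yaml"
cat > "$manifest" <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-report             # placeholder name
spec:
  schedule: "*/5 * * * *"          # every 5 minutes
  startingDeadlineSeconds: 120     # catch-up window, well above the 10 s floor
  concurrencyPolicy: Forbid        # never overlap runs
  jobTemplate:
    spec:
      activeDeadlineSeconds: 240   # below the 300 s interval, so a stuck run cannot block the next one
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: report
              image: registry.example.com/report:1.0   # placeholder image
EOF
echo "wrote $manifest"
# Validate before applying, for example:
#   kubectl apply --dry-run=server -f "$manifest"
```

Pairing Forbid with an activeDeadlineSeconds shorter than the schedule interval is one design choice for bounding how long a slow run can hold back later runs; whether killing the slow run is acceptable depends on the workload.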

If the cause is history accumulation

Set ttlSecondsAfterFinished on the Job template or on standalone Jobs so the TTL controller deletes them automatically. For CronJobs, lower successfulJobsHistoryLimit to 1 or 2 if you only need the last run for debugging, and set failedJobsHistoryLimit to 2 or 3 to retain enough context without hoarding objects. If you use a GitOps workflow, ensure your templating does not override these fields to zero or unset on every sync.
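
For an existing CronJob, the retention fields can be set with a merge patch. The sketch below uses hypothetical helper names (retention_patch, apply_retention) and example values; only retention_patch runs here, and it just prints the patch document.

```shell
#!/bin/sh
# Build a JSON merge patch that sets CronJob retention fields.
# Arguments: $1 successfulJobsHistoryLimit, $2 failedJobsHistoryLimit,
#            $3 ttlSecondsAfterFinished for Jobs created from the template.
retention_patch() {
  printf '{"spec":{"successfulJobsHistoryLimit":%d,"failedJobsHistoryLimit":%d,"jobTemplate":{"spec":{"ttlSecondsAfterFinished":%d}}}}\n' \
    "$1" "$2" "$3"
}

# Apply the patch to a CronJob (defined but not called in this sketch).
# Usage: apply_retention <cronjob-name> <namespace> <successful> <failed> <ttl-seconds>
apply_retention() {
  kubectl patch cronjob "$1" -n "$2" --type=merge \
    -p "$(retention_patch "$3" "$4" "$5")"
}

# Keep 2 successful and 3 failed Jobs, and delete any finished Job after one day.
retention_patch 2 3 86400
```

The TTL and the history limits act independently, so whichever removes a Job first wins. In a GitOps workflow, make the same change in the source manifest; a live patch is reverted on the next sync.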

If the cause is control plane latency

Fixing the batch workload spec will not help if the API server or etcd is the bottleneck. Follow the control plane latency troubleshooting path. Reduce etcd object churn by cleaning up completed Jobs and events, and verify that the CronJob controller is not being throttled by API Priority and Fairness queues.

Prevention

Set TTL and history limits by default. Every CronJob manifest should specify ttlSecondsAfterFinished, successfulJobsHistoryLimit, and failedJobsHistoryLimit. Every standalone Job should specify ttlSecondsAfterFinished.
Monitor schedule skew. Alert when lastScheduleTime on a CronJob is older than 1.5 * schedule_interval. This catches silent skips before they become an outage.
Test control plane maintenance windows. During API server upgrades or etcd compaction, CronJob creation can lag. Know whether your critical batch workloads can tolerate a 1-5 minute delay.
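Align resource requests with peak batch usage. A Job that requests 100m CPU but briefly spikes to 2 cores can be throttled by its CPU limit or starved under contention, and a Pod that uses more memory than it requested is an early eviction candidate under node memory pressure. Both waste retries. Size requests based on observed peak usage.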
Use Pod failure policies to ignore non-application failures. Explicitly exclude DisruptionTarget and other cluster-level conditions so node maintenance does not burn the retry budget.
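
The schedule skew rule above can be sketched as a small check. The function name schedule_is_stale is hypothetical, and the inputs are assumed to be epoch seconds; converting status.lastScheduleTime to epoch is left to the caller (with GNU date, `date -u -d "<timestamp>" +%s` is one option).

```shell
#!/bin/sh
# Succeed (exit 0) when the last schedule is older than 1.5x the interval.
# Arguments: $1 lastScheduleTime (epoch s), $2 current time (epoch s),
#            $3 schedule interval (s).
schedule_is_stale() {
  age=$(( $2 - $1 ))
  # Integer form of "age > 1.5 * interval": compare 2*age with 3*interval.
  [ $(( age * 2 )) -gt $(( $3 * 3 )) ]
}

# A */5 schedule (300 s) whose last run was 500 s ago is past the 450 s threshold.
if schedule_is_stale 1000 1500 300; then
  echo "stale"
else
  echo "ok"
fi
```

The 1.5 multiplier leaves room for normal controller delay while still firing before a second consecutive run is missed.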

How Netdata helps

Correlate CronJob missed runs with API server write latency and etcd fsync duration to determine whether the control plane is the bottleneck.
Monitor per-node CPU throttling and memory pressure that cause batch Pod evictions, which in turn consume backoffLimit retries.
Track controller workqueue depth and Job object counts to spot accumulation trends before etcd bloat becomes critical.
Alert on sustained increases in Pod restart counts for Jobs, providing an early signal that a batch workload is failing before backoffLimit is reached.