Kubernetes API server audit logging: policy, backends, and forensics

Kubernetes API server audit logging is the authoritative record of every request that reaches the control plane. It captures the identity of the caller, the resource and verb, the timestamp, the stage, and the outcome. Without it, a security investigation into unauthorized access, a compliance audit, or a postmortem into a failed certificate rotation is built on inference rather than evidence.

This article walks through enabling and configuring audit logging for a self-managed cluster. It covers writing an audit policy that balances signal and noise, configuring the file and webhook backends, verifying that events are captured correctly, and running common forensic queries against the resulting logs. It does not cover managed control planes where the provider controls the API server flags.

What this enables

Audit logging produces a structured JSON event for each request that the policy selects, at each stage that is not omitted. At the Metadata level, you get the user, source IP, resource, verb, and response code. At RequestResponse, you also get the request and response bodies. This is the data source you use to answer questions such as who created this cluster-admin binding, who read a Secret outside business hours, or why mass authentication failures started at 02:00. Compliance frameworks such as CIS Kubernetes Benchmark v1.10 also require it for sensitive resources.

Prerequisites

  • Administrative access to control plane nodes to edit the kube-apiserver static pod manifest.
  • Kubernetes 1.27 or later. The stable audit policy API is audit.k8s.io/v1; the v1beta1 variant was removed in Kubernetes 1.24 and is no longer accepted.
  • Sufficient disk capacity on control plane nodes if using the file backend, or a reachable HTTPS endpoint if using the webhook backend.
  • A maintenance window if you run a single API server instance, because applying the policy requires restarting the API server. HA deployments can be rolled without downtime.

Procedure

1. Write the audit policy

Create a policy file that the API server will read at startup. The policy uses apiVersion: audit.k8s.io/v1 and kind: Policy. Rules are evaluated in order; the first matching rule wins.

A minimal production policy looks like this:

apiVersion: audit.k8s.io/v1
kind: Policy
omitStages:
  - "RequestReceived"
omitManagedFields: true
rules:
  - level: None
    users: ["system:kube-proxy"]
    verbs: ["watch"]
    resources:
      - group: ""
        resources: ["endpoints", "services"]
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets"]
  - level: RequestResponse
    resources:
      - group: "rbac.authorization.k8s.io"
        resources: ["roles", "rolebindings", "clusterroles", "clusterrolebindings"]
  - level: RequestResponse
    resources:
      - group: ""
        resources: ["pods", "serviceaccounts"]
      - group: "apps"   # Deployments live in the apps group, not the core group
        resources: ["deployments"]
  - level: Metadata
    omitStages:
      - "ResponseStarted"

Key decisions in this file:

  • omitStages: Suppress RequestReceived globally. This stage generates an event as soon as the audit handler receives the request, before the request is processed, so it carries no response status and mostly duplicates the later ResponseComplete event.
  • omitManagedFields: Set to true to strip server-side apply field manager metadata from events. This significantly reduces log volume on clusters with many controllers.
  • Levels: Use Metadata for Secrets and other resources whose bodies are sensitive. Using RequestResponse on Secrets would write secret values into the log, which is a security vulnerability. Use RequestResponse for mutating operations and RBAC changes where the full body is needed for forensics.
  • Order: The None rule for kube-proxy watch traffic prevents endpoint watch floods from drowning out important events. Place the most specific rules first.
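
Because the first matching rule wins, it can help to review the rule order at a glance before deploying a policy. The following is a minimal sketch, assuming the policy is formatted like the example above, with every rule starting on a "- level:" line. The function name list_policy_levels is hypothetical.

```shell
# Print each rule's position and audit level, in evaluation order.
# Assumes every rule begins with a "- level: <Level>" line, as in the example.
list_policy_levels() {
  grep -E '^[[:space:]]*-[[:space:]]+level:' "$1" |
    awk '{ print NR ": " $3 }'
}
```

Run against the example policy (for instance, list_policy_levels /etc/kubernetes/audit-policy.yaml), this lists the None rule first and the catch-all Metadata rule last, which is the order in which the API server evaluates them.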

2. Configure the file backend

Mount the policy into the kube-apiserver static pod and add the following flags:

  • --audit-policy-file=/etc/kubernetes/audit-policy.yaml
  • --audit-log-path=/var/log/kubernetes/audit.log
  • --audit-log-maxsize=100
  • --audit-log-maxbackup=10
  • --audit-log-maxage=30

Setting --audit-log-path=- writes audit events to stdout, which is useful if your logging pipeline already collects the API server container's output. If you write to a file, ensure the directory exists and the API server process has write permissions.

The maxsize (megabytes per file), maxbackup (number of rotated files to keep), and maxage (days to keep rotated files) flags control rotation. Without rotation, a busy cluster can fill a disk in hours.
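
The rotation flags also bound how much history you keep. As a rough worked example, assuming rotated files are not compressed and one active file exists alongside the backups, the worst-case footprint is maxsize × (maxbackup + 1). The function names and the 500 MB/hour write rate below are illustrative assumptions.

```shell
# Worst-case on-disk footprint of the file backend, in megabytes.
# Assumes uncompressed backups: one active file plus maxbackup rotated
# files, each up to maxsize MB.
rotation_footprint_mb() {
  maxsize_mb=$1
  maxbackup=$2
  echo $(( maxsize_mb * (maxbackup + 1) ))
}

# Whole hours of history retained at a given write rate (MB per hour).
retention_hours() {
  footprint_mb=$1
  rate_mb_per_hour=$2
  echo $(( footprint_mb / rate_mb_per_hour ))
}

# With the flags from this section: 100 MB x (10 backups + 1 active file).
rotation_footprint_mb 100 10   # prints 1100

# At a hypothetical 500 MB/hour, that is about two hours of history.
retention_hours 1100 500       # prints 2
```

On a busy cluster the size limits, not maxage, usually decide how far back the local log reaches, which is one reason to ship logs off the node.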

3. Configure the webhook backend (optional)

You can send events to an external SIEM or log aggregator simultaneously with the file backend. Add:

  • --audit-webhook-config-file=/etc/kubernetes/audit-webhook.yaml

The referenced file is a kubeconfig-style document that points to the remote HTTPS endpoint and includes the CA bundle for verifying the remote server. The webhook backend buffers events before sending. If the destination is unreachable, events can be dropped when the buffer overflows, so treat the file backend as your durable source of truth.

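In the default batch mode (--audit-webhook-mode=batch), events are buffered and sent asynchronously, so a slow or unreachable webhook does not stall API requests, although it can cause dropped events. If you switch to blocking or blocking-strict mode, the webhook is called in the request path and a slow endpoint adds latency to every audited request. In either case, monitor the delivery path.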
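
The kubeconfig-style file described above can be sketched as follows. This is an illustrative sketch rather than a definitive configuration: the function name write_audit_webhook_config, the cluster, user, and context names, and the use of a client certificate to authenticate to the endpoint are all assumptions.

```shell
# Write a kubeconfig-style audit webhook config.
# Arguments: output path, HTTPS endpoint, CA bundle, client cert, client key.
# All names inside the generated file (audit-sink, kube-apiserver, webhook)
# are hypothetical placeholders.
write_audit_webhook_config() {
  out=$1
  server=$2
  ca=$3
  cert=$4
  key=$5
  cat > "$out" <<EOF
apiVersion: v1
kind: Config
clusters:
  - name: audit-sink
    cluster:
      server: ${server}
      certificate-authority: ${ca}
users:
  - name: kube-apiserver
    user:
      client-certificate: ${cert}
      client-key: ${key}
contexts:
  - name: webhook
    context:
      cluster: audit-sink
      user: kube-apiserver
current-context: webhook
EOF
}
```

You would call it with the path passed to --audit-webhook-config-file as the first argument, followed by your own endpoint URL and certificate paths. Those paths must also be mounted into the API server container.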

4. Mount and restart the API server

Add hostPath volumes for the policy file and webhook config to the kube-apiserver static pod manifest, then mount them into the container at the paths referenced by the flags.

There is no dynamic reload for audit policy or webhook configuration. Changing the policy requires restarting the API server pod. In a single-instance control plane, schedule this during a maintenance window. In an HA deployment, restart instances one at a time and confirm /readyz passes before proceeding to the next.
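
Before restarting, a quick check that the manifest carries the flags from step 2 can catch typos. This is a minimal sketch, assuming the flags appear in the manifest in --flag=value form; the function name check_audit_flags is hypothetical, and the manifest path in the usage note is the kubeadm default, which may differ on your cluster.

```shell
# Report which file-backend audit flags are missing from a manifest.
# Prints one line per missing flag and returns non-zero if any is absent.
check_audit_flags() {
  manifest=$1
  missing=0
  for flag in --audit-policy-file --audit-log-path --audit-log-maxsize \
              --audit-log-maxbackup --audit-log-maxage; do
    # -e keeps grep from treating the leading dashes as an option.
    if ! grep -q -e "$flag=" "$manifest"; then
      echo "missing: $flag"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, check_audit_flags /etc/kubernetes/manifests/kube-apiserver.yaml prints nothing and returns 0 when all five flags are present. It only inspects the file on disk, so it complements rather than replaces checking the running pod.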

Verifying it works

After restart, generate a test event and inspect the output.

  1. Create a test resource:

    kubectl create configmap audit-test --from-literal=key=value
    
  2. Read the audit log:

    grep '"objectRef":{"resource":"configmaps"' /var/log/kubernetes/audit.log | tail -1 | jq .
    
  3. Confirm the expected fields are present: user.username, verb, requestURI, responseStatus.code, and stageTimestamp.

  4. If you enabled RequestResponse for configmaps, confirm that requestObject contains the ConfigMap data. If you restricted Secrets to Metadata, confirm that no Secret values appear in events for Secret requests.

  5. Check for audit annotations that record the authorization decision for each request:

    grep 'authorization.k8s.io/decision' /var/log/kubernetes/audit.log | head -5
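
The field check in step 3 can be scripted. The following is a minimal sketch that looks for each key by name in a single event line; it assumes compact JSON with one event per line, and the function name check_event_fields is hypothetical.

```shell
# Verify that one audit event line contains the fields listed in step 3.
# Matches on key names only ("code" stands for responseStatus.code).
check_event_fields() {
  event=$1
  rc=0
  for key in username verb requestURI code stageTimestamp; do
    case $event in
      *"\"$key\":"*) ;;                               # key present
      *) echo "missing field: $key"; rc=1 ;;
    esac
  done
  return "$rc"
}
```

A typical call passes the most recent matching event, for example check_event_fields "$(grep '"resource":"configmaps"' /var/log/kubernetes/audit.log | tail -1)".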
    

Common pitfalls

  • Secret leakage in RequestResponse logs: At RequestResponse level, the full request and response bodies are logged. If a Secret is created, updated, or read at that level, its values appear verbatim in the audit log. Use Metadata level for all Secret operations, not only reads.
  • Log volume and disk exhaustion: A busy cluster can generate gigabytes of audit data per hour. Pair rotation with host-level log shipping or a sidecar that moves logs to cold storage before deletion.
  • Performance impact on slow disks: Writing high-volume audit logs to a shared or network-attached disk can stall the API server. Place the audit log on a local SSD or fast volume.
  • Truncation disabled by default: Large payloads from large ConfigMaps or custom resources can produce oversized events. Truncation is off unless you set --audit-log-truncate-enabled (or --audit-webhook-truncate-enabled for the webhook backend). Without it, oversized events are passed to the backend at full size, where a downstream receiver may reject or drop them. The audit.k8s.io/truncated annotation is only emitted when truncation is active.
  • No hot reload: Operators sometimes edit the policy file and forget to restart the API server. Verify the running pod spec matches the intended configuration after any change.
  • Cloud provider defaults vary: EKS, GKE, and AKS each ship default audit policies that are typically less verbose than a hardened policy. On managed control planes you cannot inspect the API server pod spec, so verify the effective coverage through the provider's audit logging settings and documentation rather than assuming it.

Forensics

Once logging is active, use jq or grep to answer common investigation questions.

Find gaps in the log that might indicate API server unavailability or backend failure:

# Requires GNU date (for -d). Scans the last 1000 events for gaps over 60s.
tail -1000 /var/log/kubernetes/audit.log | \
  jq -r '.stageTimestamp' | \
  while IFS= read -r ts; do
    cur=$(date -d "$ts" +%s)   # event time as epoch seconds
    if [ -n "$prev" ] && [ $((cur - prev)) -gt 60 ]; then
      echo "Gap of $((cur - prev))s between $prev_ts and $ts"
    fi
    prev=$cur
    prev_ts=$ts
  done

Find anonymous requests:

grep '"username":"system:anonymous"' /var/log/kubernetes/audit.log | tail -20

Find RBAC privilege escalation:

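# Separate filters: the verb field precedes objectRef in each event line,
# so a single ordered pattern (resource, then verb) would never match.
grep '"resource":"clusterrolebindings"' /var/log/kubernetes/audit.log | \
  grep -E '"verb":"(create|update|patch)"' | grep -i "cluster-admin" | tail -20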

Find secret access across namespaces:

grep '"resource":"secrets"' /var/log/kubernetes/audit.log | tail -50
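
To see who is touching Secrets rather than scanning raw lines, the same filter can be summarized per user and verb. This is a minimal sketch using only grep and awk; it assumes one compact JSON event per line, with the top-level verb and user.username appearing before any nested fields of the same name. The function name secret_access_by_user is hypothetical.

```shell
# Count Secret requests per user and verb, most frequent first.
# Takes the leftmost "verb" and "username" keys on each line, which are
# the top-level verb and user.username in API server audit events.
secret_access_by_user() {
  grep '"resource":"secrets"' "$1" |
    awk '{
      user = "-"; verb = "-"
      if (match($0, /"verb":"[^"]*"/))
        verb = substr($0, RSTART + 8, RLENGTH - 9)
      if (match($0, /"username":"[^"]*"/))
        user = substr($0, RSTART + 12, RLENGTH - 13)
      count[user " " verb]++
    }
    END { for (k in count) print count[k], k }' |
    sort -rn
}
```

Calling secret_access_by_user /var/log/kubernetes/audit.log prints lines of the form "count user verb", which makes an unexpected service account or an unusual volume of list calls easier to spot.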

Find deprecated API usage:

grep 'k8s.io/deprecated' /var/log/kubernetes/audit.log | \
  jq -r '.annotations["k8s.io/removed-release"]' | sort | uniq -c

Signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| Audit log gaps > 60s | Unexplained gaps indicate API server downtime or audit backend failure | Any gap during active cluster hours |
| Anonymous request rate | Anonymous requests may indicate misconfiguration or probing | Spike above baseline or successful access to non-public resources |
| Audit log disk usage | Full disk stops audit logging and can crash the API server | Disk > 80% of capacity on the audit volume |
| RBAC modification rate | Unexpected bindings may indicate compromise | Any cluster-admin binding outside change management |
| Webhook delivery failures | Dropped events break the forensics trail | Failed webhook delivery rate > 0 |
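
The disk usage signal can be approximated with a small check on the control plane node. This is a minimal sketch, assuming the 80% threshold from the table and a filesystem name without spaces; both function names are hypothetical.

```shell
# Print the used percentage (integer) of the filesystem holding a directory.
audit_disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn and return non-zero when usage exceeds the limit (80 by default).
check_audit_disk() {
  dir=$1
  limit=${2:-80}
  used=$(audit_disk_used_pct "$dir")
  if [ "$used" -gt "$limit" ]; then
    echo "WARNING: $dir is at ${used}% (limit ${limit}%)"
    return 1
  fi
  echo "OK: $dir is at ${used}%"
}
```

For example, check_audit_disk /var/log/kubernetes could run from cron or a node-level health check, with the non-zero exit status used to raise an alert.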

How Netdata helps

  • Correlate audit log gaps with the apiserver_request_total error rate and etcd health metrics to distinguish API server crashes from backend disk failures.
  • Track control plane node disk utilization to receive advance warning before audit logs fill the disk.
  • Monitor API server request latency spikes that coincide with webhook backend latency, identifying when the audit pipeline is adding synchronous overhead.