Validating admission webhooks are powerful and also a great way to give yourself a very bad day. Our cluster froze for 40 minutes one evening because I had deployed a webhook whose own pods could not be rescheduled without the webhook running. That is a sentence that should make any k8s operator wince.

The webhook

The webhook validated pod specs against a security policy. Nothing fancy. Standard Go code, running in a Deployment, backed by a Service, fronted by a ValidatingWebhookConfiguration:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: policy-validator
webhooks:
  - name: policy.example.com
    rules:
      - operations: ["CREATE", "UPDATE"]
        apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["pods"]
    failurePolicy: Fail
    admissionReviewVersions: ["v1"]
    sideEffects: None
    clientConfig:
      service:
        name: policy-validator
        namespace: kube-system
        path: "/validate"
      caBundle: <redacted>

Note failurePolicy: Fail. If the webhook can't be reached, pod creation fails. That's often what you want for a security policy: a paranoid default. The problem is the recursion.

The incident

The webhook Deployment ran two replicas behind a PDB with minAvailable=1. Both replicas had landed on the same node (an old, broken toleration let them onto a tainted node), and a reboot of that node killed them both at once. The PDB was no help: it only guards against voluntary evictions like drains, not reboots. The ReplicaSet controller then tried to create replacement pods, which required the kube-apiserver to admit the pod CREATEs, which required calling the webhook, which required a running webhook pod. No webhook pod. Failed call. failurePolicy: Fail. No pod created. No recovery.

kubectl get events -A --sort-by=.lastTimestamp | tail -20
# 4m   Warning  FailedCreate  replicaset/policy-validator-68c    Error creating: Internal error ... failed calling webhook
# 4m   Warning  FailedCreate  replicaset/policy-validator-68c    Error creating: Internal error ... failed calling webhook

Kubernetes event storm. The only thing keeping the rest of the cluster alive was that existing pods kept running; admission only gates API writes, not workloads already on nodes. But no new pod could be created anywhere. Nodes coming back from the rolling reboot could not start their DaemonSet pods, so CNI never came up on them. Classic.

How I recovered

The standard recipe for unbricking a webhook-locked cluster is to make the API server skip the webhook for the webhook’s own namespace. You can edit the ValidatingWebhookConfiguration to add a namespaceSelector:

namespaceSelector:
  matchExpressions:
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values: ["kube-system"]

But editing the config means calling the API server, which is fine: the webhook's rules only match pods, so writes to ValidatingWebhookConfiguration objects never go through the webhook. So:

kubectl edit validatingwebhookconfigurations policy-validator

Add the namespaceSelector. Save. Now the webhook Deployment’s pods in kube-system can be created without the webhook. They come up. The webhook starts responding. The rest of the cluster recovers.
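If you'd rather not fumble through an interactive editor mid-incident, a non-interactive JSON patch does the same thing; the path and selector here assume the config shown earlier, with a single webhook in the list.

```shell
# Add the exclusion to the first (and only) webhook in the config.
kubectl patch validatingwebhookconfiguration policy-validator \
  --type=json \
  -p='[{"op":"add","path":"/webhooks/0/namespaceSelector","value":{"matchExpressions":[{"key":"kubernetes.io/metadata.name","operator":"NotIn","values":["kube-system"]}]}}]'
```

Worth keeping in the runbook verbatim, since you will not want to compose it under pressure.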

In the heat of the moment I actually deleted the ValidatingWebhookConfiguration entirely. Faster and brute force, but less safe: while it was gone, nothing enforced the policy anywhere. I reapplied it once the webhook pods were up.

What I should have done from the start

There is a well-known set of patterns for writing webhooks that do not wedge the cluster:

  1. Always exclude the webhook’s own namespace from the webhook’s own selector. Use namespaceSelector to skip kube-system or wherever the webhook runs.
  2. Pin the webhook pods to nodes via nodeSelector or affinity, and have a redundancy story that does not depend on scheduling new pods (which is exactly what the webhook can block).
  3. Use failurePolicy: Ignore until you are very sure the webhook is reliable.
  4. Scope the webhook narrowly with objectSelector so that it only fires on resources you really want to validate. Fewer rules means fewer ways to brick yourself.
  5. Set a short timeoutSeconds so that a hanging webhook doesn’t hang the API server. Default is 10, we use 3.
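A concrete sketch of pattern 2: required pod anti-affinity, so two replicas can never share a node and a single reboot can't take both out. This assumes the Deployment's pod template carries an app: policy-validator label (label name is mine, not from the real manifest).

```yaml
# Pod template fragment for the webhook Deployment (label name assumed).
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: policy-validator
        topologyKey: kubernetes.io/hostname
```

With two replicas and this in place, the incident's trigger (both pods on one rebooting node) can't happen.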

I knew most of these. I thought I was safe because I excluded kube-system from my object selector. But namespaceSelector and objectSelector do different things: objectSelector matches labels on the object being admitted, while namespaceSelector matches labels on the object's namespace. I had the wrong one, so my exclusion never matched anything. That was the real bug.
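For the record, the broken exclusion looked roughly like this (reconstructed). Pods never carry the kubernetes.io/metadata.name label, the Namespace object does, so the NotIn requirement was satisfied for every pod and excluded nothing:

```yaml
# Wrong: objectSelector tests the pod's own labels, which never
# include kubernetes.io/metadata.name.
objectSelector:
  matchExpressions:
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values: ["kube-system"]
```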

The new config

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: policy-validator
webhooks:
  - name: policy.example.com
    namespaceSelector:
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values: ["kube-system", "kube-public", "kube-node-lease"]
    rules:
      - operations: ["CREATE", "UPDATE"]
        apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["pods"]
    failurePolicy: Fail
    timeoutSeconds: 3
    admissionReviewVersions: ["v1"]
    sideEffects: None
    matchPolicy: Equivalent
    clientConfig:
      service:
        name: policy-validator
        namespace: kube-system
        path: "/validate"
      caBundle: <redacted>

I also moved the webhook Deployment out of kube-system into a dedicated policy-system namespace. That namespace is also excluded from the webhook. Defense in depth: even if the kube-system exclusion changes someday, the webhook still cannot block its own replacement pods.

Chaos test

I set up a test: delete all webhook pods, wait for them to be recreated. It works now. The next chaos test is draining a node the webhook runs on; the PDB stops both replicas from being evicted at once (a drain is a voluntary disruption, the case PDBs actually cover), and the replacement comes up cleanly.
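The first test as commands, assuming the policy-system namespace and the app: policy-validator label (label name assumed):

```shell
# Kill every webhook replica at once, then wait for replacements.
kubectl delete pods -n policy-system -l app=policy-validator --wait=false
kubectl wait --for=condition=Ready pod \
  -n policy-system -l app=policy-validator --timeout=120s

# Prove admission works end to end with a throwaway pod.
kubectl run chaos-probe --image=busybox:1.36 --restart=Never -- sleep 5
kubectl delete pod chaos-probe
```

If the namespaceSelector is ever wrong again, the `kubectl wait` times out and the test fails before anything reaches production.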

Reflection

Webhooks are one of the places in Kubernetes where the difference between “works” and “works safely” is subtle. The defaults are tolerant enough that you can get away with a lot of sloppy configuration, right up until you cannot. I keep a checklist now for any webhook I deploy. If your platform team does not have that checklist, borrow the one from the kubernetes.io docs and adapt it.

Related: see my posts on CRD design mistakes I made early on and on the reconcile loop that would not quit.