An admission webhook that crashed my cluster
Validating admission webhooks are powerful and also a great way to give yourself a very bad day. Our cluster froze for 40 minutes one evening because I had deployed a webhook whose own pods could not be rescheduled without the webhook running. That is a sentence that should make any k8s operator wince.
The webhook
The webhook validated pod specs against a security policy. Nothing fancy. Standard Go code, running in a Deployment, backed by a Service, fronted by a ValidatingWebhookConfiguration:
```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: policy-validator
webhooks:
- name: policy.example.com
  rules:
  - operations: ["CREATE", "UPDATE"]
    apiGroups: [""]
    apiVersions: ["v1"]
    resources: ["pods"]
  failurePolicy: Fail
  admissionReviewVersions: ["v1"]
  sideEffects: None
  clientConfig:
    service:
      name: policy-validator
      namespace: kube-system
      path: "/validate"
    caBundle: <redacted>
```
Note failurePolicy: Fail. If the webhook can’t be reached, pod creates fail. That is often what you want for a security policy: a paranoid, fail-closed default. The problem is the recursion.
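The Go side really was unremarkable, but it helps to see the shape. Here is a minimal sketch using only the standard library, with a stand-in policy (deny privileged containers) since the real policy is not shown; validatePod, the inline types, and the sample payload are illustrative, not the production code:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// validatePod applies the policy to a raw pod object and reports whether
// it is allowed. The policy here (deny privileged containers) is a
// stand-in for the real one.
func validatePod(object []byte) (bool, string) {
	var pod struct {
		Spec struct {
			Containers []struct {
				Name            string `json:"name"`
				SecurityContext struct {
					Privileged bool `json:"privileged"`
				} `json:"securityContext"`
			} `json:"containers"`
		} `json:"spec"`
	}
	if err := json.Unmarshal(object, &pod); err != nil {
		return false, err.Error() // fail closed on malformed input
	}
	for _, c := range pod.Spec.Containers {
		if c.SecurityContext.Privileged {
			return false, fmt.Sprintf("container %q is privileged", c.Name)
		}
	}
	return true, ""
}

// review turns an AdmissionReview request body into an AdmissionReview
// response body, echoing the request UID as the API server requires.
func review(body []byte) ([]byte, error) {
	var ar struct {
		Request struct {
			UID    string          `json:"uid"`
			Object json.RawMessage `json:"object"`
		} `json:"request"`
	}
	if err := json.Unmarshal(body, &ar); err != nil {
		return nil, err
	}
	allowed, msg := validatePod(ar.Request.Object)
	return json.Marshal(map[string]any{
		"apiVersion": "admission.k8s.io/v1",
		"kind":       "AdmissionReview",
		"response": map[string]any{
			"uid":     ar.Request.UID,
			"allowed": allowed,
			"status":  map[string]any{"message": msg},
		},
	})
}

func main() {
	// In the real service, review sits behind an HTTPS handler on
	// /validate, serving the certificate that matches caBundle above.
	sample := []byte(`{"request":{"uid":"u1","object":{"spec":{"containers":[{"name":"app","securityContext":{"privileged":true}}]}}}}`)
	resp, err := review(sample)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(resp)) // prints a deny response for the sample pod
}
```

Nothing in that function can save you from the recursion, which is entirely a property of the configuration around it.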
The incident
The webhook Deployment had a PDB with minAvailable=1 and two replicas. A node reboot kicked both replicas off at once (both had landed on the same node, thanks to an old broken toleration for that node's taints). The ReplicaSet controller went to create the replacement pods, which required the API server to admit the CREATE, which required the webhook to validate it, which required a running webhook pod. No webhook pod. Failure. No pod created. No recovery.
kubectl get events -A --sort-by=.lastTimestamp | tail -20
# 4m Warning FailedCreate replicaset/policy-validator-68c Error creating: Internal error ... failed calling webhook
# 4m Warning FailedCreate replicaset/policy-validator-68c Error creating: Internal error ... failed calling webhook
A Kubernetes event storm. The only thing keeping the rest of the cluster alive was that existing pods were fine. The problem was that no new pod anywhere could be admitted. The nodes that tried to come back up after the rolling reboot could not start their DaemonSet pods. CNI did not come up on a new node. Classic.
How I recovered
The standard recipe for unbricking a webhook-locked cluster is to make the API server skip the webhook for the webhook’s own namespace. You can edit the ValidatingWebhookConfiguration to add a namespaceSelector:
```yaml
namespaceSelector:
  matchExpressions:
  - key: kubernetes.io/metadata.name
    operator: NotIn
    values: ["kube-system"]
```
But to edit the config you need to call the API server, and that works: the webhook's rules only match pods, so editing a ValidatingWebhookConfiguration does not itself go through the webhook. So:
kubectl edit validatingwebhookconfigurations policy-validator
Add the namespaceSelector. Save. Now the webhook Deployment’s pods in kube-system can be created without the webhook. They come up. The webhook starts responding. The rest of the cluster recovers.
In the heat of the moment I actually deleted the ValidatingWebhookConfiguration entirely. Faster, brute force, and less safe. I reapplied it once the webhook pods were up.
What I should have done from the start
There is a well-known set of patterns for writing webhooks that do not wedge the cluster:
- Always exclude the webhook's own namespace from the webhook's own selector. Use namespaceSelector to skip kube-system or wherever the webhook runs.
- Pin the webhook pods to nodes via nodeSelector or affinity and have a redundancy story that does not depend on scheduling.
- Use failurePolicy: Ignore until you are very sure the webhook is reliable.
- Scope the webhook narrowly with objectSelector so that it only fires on resources you really want to validate. Fewer rules mean fewer ways to brick yourself.
- Set a short timeoutSeconds so that a hanging webhook doesn't hang the API server. The default is 10 seconds; we use 3.
I knew most of these. I thought I was safe because I excluded kube-system from my object selector. But namespaceSelector and objectSelector do different things, and I had the wrong one. That was the real bug.
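The distinction, concretely: namespaceSelector matches labels on the namespace that contains the object, while objectSelector matches labels on the object being admitted. I no longer have the broken selector, but the failure mode looks like this:

```yaml
# namespaceSelector matches labels on the namespace. Namespaces get
# kubernetes.io/metadata.name set automatically, so this reliably
# skips everything in kube-system:
namespaceSelector:
  matchExpressions:
  - key: kubernetes.io/metadata.name
    operator: NotIn
    values: ["kube-system"]

# objectSelector matches labels on the pod itself. Pods do not carry
# kubernetes.io/metadata.name, and NotIn matches objects that lack the
# key entirely, so this selector matches every pod, kube-system included:
objectSelector:
  matchExpressions:
  - key: kubernetes.io/metadata.name
    operator: NotIn
    values: ["kube-system"]
```

The second selector looks like an exclusion and excludes nothing.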
The new config
```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: policy-validator
webhooks:
- name: policy.example.com
  namespaceSelector:
    matchExpressions:
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values: ["kube-system", "kube-public", "kube-node-lease"]
  rules:
  - operations: ["CREATE", "UPDATE"]
    apiGroups: [""]
    apiVersions: ["v1"]
    resources: ["pods"]
  failurePolicy: Fail
  timeoutSeconds: 3
  admissionReviewVersions: ["v1"]
  sideEffects: None
  matchPolicy: Equivalent
  clientConfig:
    service:
      name: policy-validator
      namespace: kube-system
      path: "/validate"
    caBundle: <redacted>
```
I also moved the webhook Deployment out of kube-system into a dedicated policy-system namespace, and added that namespace to the webhook's exclusion list. Belt and suspenders, but defense in depth is the point.
Chaos test
I set up a chaos test: delete all webhook pods and wait for them to be recreated. It passes now. A second test drains a node the webhook runs on; the PDB prevents both replicas from being evicted at once, and the replacement comes up cleanly.
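For reference, the PDB is the standard shape. A sketch, assuming the Deployment's pods carry an app: policy-validator label (the label and namespace here are my guesses, not the production manifest):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: policy-validator
  namespace: policy-system
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: policy-validator  # assumed pod label
```

Worth remembering: a PDB only guards voluntary disruptions like kubectl drain. It did nothing for the original node reboot, which is exactly why both replicas went down at once.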
Reflection
Webhooks are one of the places in Kubernetes where the difference between “works” and “works safely” is subtle. The defaults are tolerant enough that you can get away with a lot of sloppy configuration, right up until you cannot. I keep a checklist now for any webhook I deploy. If your platform team does not have that checklist, borrow the one from the kubernetes.io docs and adapt it.
Related: see my posts on CRD design mistakes I made early on and on the reconcile loop that would not quit.