An operator reconcile loop that wouldn't quit
I wrote a small operator for managing a specific in-house CRD. It worked. It passed tests. Then I deployed it to staging and watched the API server log show 300 writes per second from a single controller. The reconcile loop had gone berserk.
The shape of the problem
The CRD was simple: a Thing custom resource that reconciled into a Deployment plus a Service. The controller-runtime reconciler was the usual pattern: fetch the Thing, build the desired state, server-side apply, update status. Not rocket surgery.
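The skeleton, with illustrative names standing in for the real code (buildDeployment, the appv1 types) and assuming the reconciler embeds controller-runtime's client.Client:

func (r *ThingReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    var thing appv1.Thing
    if err := r.Get(ctx, req.NamespacedName, &thing); err != nil {
        // The Thing may have been deleted; nothing left to do.
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // Build the desired Deployment (and, the same way, a Service) from the spec.
    desired := buildDeployment(&thing)

    // Server-side apply, with this controller as the field manager.
    if err := r.Patch(ctx, desired, client.Apply,
        client.FieldOwner("thing-controller"), client.ForceOwnership); err != nil {
        return ctrl.Result{}, err
    }

    // ...update the Thing's status, then return without requeueing.
    return ctrl.Result{}, nil
}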
In staging I saw:
kubectl top pods -n thing-system
# NAME                 CPU     MEMORY
# thing-controller-0   1800m   280Mi
1.8 cores burned on a reconciler that had nothing to do. kubectl get events showed the thrash I expected:
kubectl get events -n app --field-selector involvedObject.kind=Deployment --watch
# 0s Normal ScalingReplicaSet Deployment/demo Scaled up replica set demo-abc
# 1s Normal ScalingReplicaSet Deployment/demo Scaled up replica set demo-abc
# 1s Normal ScalingReplicaSet Deployment/demo Scaled up replica set demo-abc
Scaled up. Then scaled up again. Without an intervening scale down. Same revision. The Deployment itself was content, but something kept touching it.
What I was writing
My apply logic set an annotation to help humans debug:
obj.SetAnnotations(map[string]string{
    "thing.example.com/last-reconciled": time.Now().Format(time.RFC3339),
    "thing.example.com/source-hash":     hashSpec(spec),
})
You can see the problem. time.Now() is different every reconcile. Server-side apply with a different annotation value constitutes a change. A change bumps resourceVersion. An informer watches for resourceVersion changes. A change event triggers a reconcile. Which sets a new time.Now(). Welcome to the feedback loop.
It would have been less catastrophic if the apply had not also been setting field ownership on the annotation. With SSA, the thing-controller took ownership of the annotation. The annotation change did not bump metadata.generation (only spec changes do that), but it did bump resourceVersion and rewrite metadata.managedFields. The informer's cache saw the new version, fired its event handlers, and the reconciler re-ran unconditionally.
The telltale signs
I had wired up the standard controller-runtime metrics. Two of them told the story:
kubectl -n thing-system port-forward pod/thing-controller-0 8080
curl -s localhost:8080/metrics | grep controller_runtime_reconcile_time_seconds
P99 reconcile time was 3 ms and P50 was 2 ms. That is too fast. A healthy reconcile for this CRD should take 20-60 ms because of the Deployment and Service writes. Reconciles finishing in 2 ms meant the loop was doing its write, coming back on the event that write generated, and fast-pathing through an early return. That is not a feature of controller-runtime; that is me being sloppy.
I also looked at the client QPS:
curl -s localhost:8080/metrics | grep rest_client_requests_total
# rest_client_requests_total{code="200",method="GET"} 42311
# rest_client_requests_total{code="200",method="PATCH"} 18402
18k PATCHes in an hour. This is what eventually tips you off.
The fix
Two changes:
Drop the timestamp annotation. If I want to know when a reconcile happened, I have logs and metrics. I do not need to write that information into the managed resource.
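If I ever do need that timestamp, the structured log line already carries one; something like this (log is sigs.k8s.io/controller-runtime/pkg/log) records it without touching the object:

log.FromContext(ctx).Info("reconciled",
    "thing", req.NamespacedName,
    "specHash", hashSpec(spec))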
Keep the source hash, but record it in status on the owner resource (the Thing), not as an annotation on the managed Deployment. And only write it when the hash actually changes:
if newHash := hashSpec(spec); thing.Status.SpecHash != newHash {
    thing.Status.SpecHash = newHash
    // patch the status subresource, not spec
}
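Two things that snippet leaves implicit: hashSpec has to be deterministic (a random or time-dependent hash would rebuild the same loop), and the write has to go through the status subresource. A sketch of both, assuming the reconciler embeds controller-runtime's client.Client and using crypto/sha256, encoding/hex, and encoding/json:

// One plausible hashSpec: hash a canonical JSON encoding of the spec, so the
// value only changes when the spec does (encoding/json sorts map keys).
func hashSpec(spec appv1.ThingSpec) string {
    raw, err := json.Marshal(spec)
    if err != nil {
        return "" // a serializable spec should never fail to marshal
    }
    sum := sha256.Sum256(raw)
    return hex.EncodeToString(sum[:])
}

// Write the hash through the status subresource. The managed Deployment is
// never touched, and because the hash only changes when the spec changes,
// this writes at most once per spec change instead of once per reconcile.
func (r *ThingReconciler) recordSpecHash(ctx context.Context, thing *appv1.Thing, newHash string) error {
    if thing.Status.SpecHash == newHash {
        return nil // nothing changed, nothing to write
    }
    base := thing.DeepCopy()
    thing.Status.SpecHash = newHash
    return r.Status().Patch(ctx, thing, client.MergeFrom(base))
}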
Also, I added a predicate to the controller builder so that Deployment changes owned by my Thing did not wake up the reconciler unless their spec changed:
ctrl.NewControllerManagedBy(mgr).
    For(&appv1.Thing{}).
    Owns(&appsv1.Deployment{}, builder.WithPredicates(predicate.GenerationChangedPredicate{})).
    Complete(r)
GenerationChangedPredicate is a nice filter: it only passes update events when metadata.generation changed, and generation only moves when the spec changes. Annotation, label, and status churn on the owned Deployments don't wake us up anymore.
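For the curious, a minimal equivalent of that predicate (the shape of it, not the library source) looks like this:

// Pass update events only when metadata.generation moved; create and delete
// events still get through, which is what you want for owned objects.
var specChanged = predicate.Funcs{
    UpdateFunc: func(e event.UpdateEvent) bool {
        if e.ObjectOld == nil || e.ObjectNew == nil {
            return false
        }
        return e.ObjectOld.GetGeneration() != e.ObjectNew.GetGeneration()
    },
}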
Reflection
Operator loops with bad idempotence are embarrassingly easy to write. The controller-runtime scaffolding protects you from a lot, but you can defeat it if you are not careful about exactly what you write to a managed resource. My rule of thumb now is: if a field will have a different value on every reconcile, that field does not belong in the object. Status, metrics, logs, or tracing are better homes for runtime observations.
I also learned that the QPS metric is the first place to look when a controller seems happy but the cluster feels stressed. Our platform team now has a dashboard panel per controller namespace with alerts at >10 PATCH/sec sustained.
Related: see my post on CRD design mistakes I made early on for more self-inflicted controller pain.