Terraform state locks and the S3 bucket that wouldn't let go
Terraform’s state locking is one of those features you do not think about until it bites you. A CI runner got OOM-killed mid-`terraform apply` and left a stale DynamoDB lock behind. I did what every tutorial told me and ran `terraform force-unlock`. That was, in retrospect, not the right call.
The situation
Our Terraform state lives in S3 with DynamoDB for locking. Standard pattern. The backend config:
```hcl
terraform {
  backend "s3" {
    bucket         = "example-tf-state-prod"
    key            = "networking/main.tfstate"
    region         = "us-east-1"
    dynamodb_table = "example-tf-locks"
    encrypt        = true
  }
}
```
The CI job ran `terraform apply -auto-approve` as usual. Partway through provisioning a set of VPC route tables, it was killed by the cgroup OOM killer: someone had added a lot of `for_each` loops over a large `locals` map and memory spiked. Fine, we have seen that before. Restart the job.
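For flavor, the shape of the offending code was roughly this. Every name here is illustrative rather than our actual module, but the pattern, a big generated map fanned out through `for_each`, is the point:

```hcl
locals {
  # A generated map with one key per AZ x destination pair. With enough
  # inputs this expands to thousands of entries, all held in memory at
  # plan time.
  routes = {
    for pair in setproduct(var.availability_zones, var.destination_cidrs) :
    "${pair[0]}:${pair[1]}" => {
      az   = pair[0]
      cidr = pair[1]
    }
  }
}

resource "aws_route" "generated" {
  for_each               = local.routes
  route_table_id         = aws_route_table.public[each.value.az].id
  destination_cidr_block = each.value.cidr
  transit_gateway_id     = var.transit_gateway_id
}
```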
On retry:
```
terraform plan
# Error: Error acquiring the state lock
# ID: 6d8c2e2d-9fde-ab17-cafe-d3a9b8f2c1de
# Path: example-tf-state-prod/networking/main.tfstate
# Operation: OperationTypePlan
# Who: ci-runner-7b2@runner-pool
# Created: 2024-06-24 03:11:42.234 +0000 UTC
```
The lock is owned by the dead CI job. The standard recipe is `terraform force-unlock <id>`. I ran it. It succeeded. I reran the plan. It was massive.
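Concretely, with the ID from the lock error above:

```sh
terraform force-unlock 6d8c2e2d-9fde-ab17-cafe-d3a9b8f2c1de
```

It prompts for confirmation, deletes the lock entry, and does nothing else. Which, as it turns out, is exactly the problem.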
The massive plan
The plan had about 300 resources to create or modify. We had not changed anything. The reason was that the previous apply had gotten partway through, created real resources in AWS, and been killed before it could write the state file back to S3. So AWS had resources that Terraform’s state did not know about. On the next plan, Terraform read the .tf code, saw that the state did not include those resources, and wanted to create them all. For resources with a unique name or identifier, that create would at least have failed loudly with an “already exists” error on apply; but about 20 of the 300 were resources that AWS happily creates duplicates of, so an apply would have silently doubled them. We avoided apply, which is good.
This is the trap with `force-unlock`. The lock is a symptom. The real problem is that the state file in S3 does not reflect AWS reality. `force-unlock` makes it possible to run commands again, but it does nothing about the drift.
What I should have done
In retrospect the right recipe is:
- Verify no one is actively running a plan or apply.
- Look at the CloudTrail events for the lock owner’s IAM role in the time range of the abandoned run. This tells you what Terraform actually did in AWS before it got killed.
- If the answer is “nothing”, force-unlock is safe.
- If the answer is “something”, manually import the created resources into state, or revert them in AWS, before running another plan. (A quick state-mtime check that helps with this decision is sketched below.)
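A cheap way to support that decision: compare the state object’s last-modified time in S3 against the lock’s Created timestamp. This is the “state mtime” check that later made it into our runbook. Bucket, key, and table names below are the ones from the backend config; the `LockID` value follows the standard `bucket/key` layout the S3 backend uses.

```sh
# When was the state last written? If this is after the lock's Created
# time, the dying apply managed at least one state push.
aws s3api head-object \
  --bucket example-tf-state-prod \
  --key networking/main.tfstate \
  --query LastModified

# What does the lock record itself say?
aws dynamodb get-item \
  --table-name example-tf-locks \
  --key '{"LockID": {"S": "example-tf-state-prod/networking/main.tfstate"}}'
```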
We did none of that. I just ran force-unlock and prayed.
How I actually recovered
After discovering the giant plan, I aborted. Then:
```sh
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=Username,AttributeValue=terraform-ci \
  --start-time 2024-06-24T02:30:00Z \
  --end-time 2024-06-24T03:20:00Z \
  --max-items 100 > /tmp/events.json

jq -r '.Events[] | "\(.EventTime) \(.EventName) \(.Resources[0].ResourceName // "")"' /tmp/events.json | sort
```
I got a time-ordered list of what had been created. About 40 resources. I imported each into state:
```sh
terraform import 'aws_route_table.public["us-east-1a"]' rtb-0abc1234
terraform import 'aws_route_table.public["us-east-1b"]' rtb-0def5678
# ... etc
```
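Typing 40 of those got old quickly. The IDs can be pulled straight out of the CloudTrail records already sitting in /tmp/events.json. A sketch, assuming EC2 `CreateRouteTable` events, where the new ID appears under `responseElements.routeTable.routeTableId` in the raw record; mapping each ID back to the right resource address still took a human and the tags on each table:

```sh
# Each lookup-events entry carries the full CloudTrail record as a JSON
# string in .CloudTrailEvent; unpack it and keep the route table creates.
jq -r '.Events[]
       | .CloudTrailEvent | fromjson
       | select(.eventName == "CreateRouteTable")
       | .responseElements.routeTable.routeTableId' /tmp/events.json
```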
For the resources that had been partially configured (some routes but not all), I let Terraform plan and apply the missing parts. Total recovery time: about two hours.
What we changed afterwards
- The CI runner for terraform now has `memory.max` set to 8 GB with `OOMScoreAdjust=-500`. If it is the only thing on the runner, it should not be the first thing killed. Also, we refactored the loops that caused the memory spike.
- Our terraform wrapper script now traps `SIGTERM` and `SIGKILL` (well, `SIGTERM`; you cannot trap `SIGKILL`) and attempts a clean exit before the job is killed. It is best-effort. The real fix is not getting killed.
- We added a pre-apply plan check:

  ```sh
  terraform plan -out plan.bin
  terraform show -json plan.bin | jq '.resource_changes | length'
  ```

  If the count is more than some threshold and no one approved the big change, we refuse to apply in CI. A 300-resource plan should not happen silently again. (A wrapper sketch combining this gate with the signal handling follows this list.)
- `terraform force-unlock` is removed from our runbook as a first step. The new runbook says: check CloudTrail. Check the state mtime. Decide whether force-unlock is safe. Then and only then run it.
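For the curious, here is roughly what the wrapper’s gate-and-trap section looks like. Treat it as a sketch: `PLAN_CHANGE_THRESHOLD` and `PLAN_APPROVED` are illustrative names rather than a published interface, and the signal plumbing assumes a Terraform recent enough to treat `SIGTERM` as a graceful interrupt.

```sh
#!/usr/bin/env bash
set -euo pipefail

threshold="${PLAN_CHANGE_THRESHOLD:-50}"   # illustrative knob

terraform plan -out plan.bin

# Count the non-no-op resource changes in the saved plan.
changes="$(terraform show -json plan.bin \
  | jq '[.resource_changes // [] | .[] | select(.change.actions != ["no-op"])] | length')"

if (( changes > threshold )) && [[ "${PLAN_APPROVED:-}" != "yes" ]]; then
  echo "Refusing to apply: $changes resource changes (threshold $threshold)." >&2
  exit 1
fi

# Run apply in the background so this shell can still receive SIGTERM,
# then forward the signal and wait for terraform's graceful shutdown.
terraform apply plan.bin &
tf_pid=$!
trap 'kill -TERM "$tf_pid" 2>/dev/null || true' TERM
wait "$tf_pid" || wait "$tf_pid"   # second wait: let terraform finish cleanup after the trap fires
```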
Reflection
Terraform’s state model is a cache of AWS reality with a lock protecting the cache. When the lock is dirty, it is usually because the cache is also dirty. Treating the lock as the only thing to fix is a mistake I have now made in public twice and will not make a third time. Related: see my post “Ansible at scale: where it breaks” for more “the tool is fine, my workflow was the bug” content.