CI was green. Prod was broken. Those are the two sentences I never want to type in the same post-incident doc.

What we saw

A PR that refactored some internal library code went through review, tests passed, deploy went out, and within ten minutes half our background jobs were erroring out with import errors. Classic “module moved, import didn’t update” — except the tests had absolutely imported the module and passed.

We rolled back in about 4 minutes. Then we spent 6 hours figuring out how CI had been lying.

The actual cause

GitHub Actions’ actions/cache had cached our Python wheels directory. The cache key was based on the content of requirements.txt. Our refactor did not change requirements.txt. It changed internal packages, and wheels for those packages were already sitting in the cache from a previous build.

When the new PR ran CI:

  1. Cache restore put old wheels (including old versions of our internal package) into ~/.cache/pip.
  2. pip install did no real work: every wheel it needed was already there.
  3. Tests ran against the old wheels, because what landed in site-packages was installed from the cache.
  4. The tests that imported the refactored module imported the cached old version and passed.

In prod, the deploy built fresh and actually loaded the new module, which had import errors.
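
The failure mode is easy to reproduce in miniature. The sketch below is a hypothetical stand-in for hashFiles, not GitHub's implementation: cache_key simply hashes the names and contents of whichever files you hand it, and the file names are made up. With only requirements.txt in the key, editing a source file leaves the key unchanged, so the stale cache is still a hit.

```python
from __future__ import annotations

import hashlib
import tempfile
from pathlib import Path


def cache_key(prefix: str, files: list[Path]) -> str:
    """Hypothetical stand-in for hashFiles(): hash file names and contents in sorted order."""
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.name.encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return f"{prefix}-{digest.hexdigest()[:16]}"


def demo() -> dict[str, bool]:
    """Edit a source file and report which key styles notice the change."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        reqs = root / "requirements.txt"
        src = root / "module.py"  # hypothetical internal module
        reqs.write_text("requests==2.31.0\n")
        src.write_text("def old_name(): ...\n")

        narrow_before = cache_key("pip", [reqs])
        wide_before = cache_key("pip", [reqs, src])

        # The refactor: internal code changes, requirements.txt does not.
        src.write_text("def new_name(): ...\n")

        return {
            "requirements_only_key_changed": cache_key("pip", [reqs]) != narrow_before,
            "requirements_plus_source_key_changed": cache_key("pip", [reqs, src]) != wide_before,
        }


if __name__ == "__main__":
    print(demo())
```

The requirements-only key stays the same across the edit, which is exactly the situation CI was in.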

Why the test suite didn’t catch it anyway

You’d think some test that specifically exercised the new code would fail. None did: no test ever hit an import error, because the tests themselves were importing the old cached module too, not the source.

The new module had a new function that was supposed to be imported by a caller. The caller still worked because it was also in the cache in its old form. The tests were only testing old code against itself. Perfect green.
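
One cheap guard against testing old code against itself is to check where an import actually resolves before the suite runs. This is a minimal sketch, assuming the source lives under a known directory; resolves_under is a hypothetical helper and demo_internal_mod is a made-up module name, not something from our codebase.

```python
from __future__ import annotations

import importlib.util
import sys
import tempfile
from pathlib import Path


def resolves_under(module_name: str, source_root: Path) -> bool:
    """Return True if module_name would be imported from a file under source_root."""
    spec = importlib.util.find_spec(module_name)
    if spec is None or spec.origin is None:
        return False
    origin = Path(spec.origin)
    if not origin.is_file():  # built-in and frozen modules have no file on disk
        return False
    return source_root.resolve() in origin.resolve().parents


def demo() -> tuple[bool, bool]:
    """Check one module that lives in a source dir and one that does not."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "src"
        src.mkdir()
        (src / "demo_internal_mod.py").write_text("VALUE = 1\n")
        sys.path.insert(0, str(src))
        try:
            importlib.invalidate_caches()
            ours = resolves_under("demo_internal_mod", src)
            stdlib = resolves_under("json", src)  # lives in the stdlib, not in src
        finally:
            sys.path.remove(str(src))
        return ours, stdlib


if __name__ == "__main__":
    print(demo())
```

If a module you believe you just edited resolves somewhere other than your checkout, the tests are not exercising your change.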

The fix

The cache key needs to encode “we might have changed internal code too.” A few options:

  1. Include the hash of the whole repo in the cache key. Correct, but nearly useless — the cache gets invalidated on every commit.

  2. Include the hash of specific source directories:

    - uses: actions/cache@v4
      with:
        path: ~/.cache/pip
        key: pip-${{ hashFiles('requirements.txt', 'src/**/*.py') }}
    
  3. Don’t cache installed packages at all; cache only the pip wheel cache. Reinstall from wheels on every build, which is fast.

We went with option 3. Fetching from the package index on a cache miss is the expensive part; unpacking wheels is fast. This avoids the whole “site-packages is stale” problem.

    - uses: actions/cache@v4
      with:
        path: ~/.cache/pip
        key: pip-wheels-${{ runner.os }}-${{ hashFiles('requirements.txt') }}

    # --no-deps skips dependency resolution, so requirements.txt must pin everything
    - run: pip install --no-deps -r requirements.txt

Internal packages are installed with an editable install (pip install -e .), so they always come from the current checkout.

Secondary fixes

Two more things we did:

  • Added a CI smoke check that does python -c "import ourapp.newly_renamed_module" at the end, failing the build if the import is broken. This is dumb and trivial, and it would have caught the original issue.
  • Changed our deploy to fail fast on import errors rather than marking jobs as “started, then crashed” and relying on monitoring to catch it. The deploy now runs python -c "import ourapp.main" before handoff.
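
A single hard-coded import only covers the module you already know about. A slightly broader version of the same smoke check walks a package and tries to import every submodule. This is a sketch rather than what we run: find_broken_imports is a hypothetical helper, and demo_smoke_pkg is a throwaway package built just for the demonstration.

```python
from __future__ import annotations

import importlib
import pkgutil
import sys
import tempfile
from pathlib import Path


def find_broken_imports(package_name: str) -> dict[str, str]:
    """Import package_name and every submodule; map each failing module to its error."""
    failures: dict[str, str] = {}
    try:
        package = importlib.import_module(package_name)
    except Exception as exc:
        return {package_name: repr(exc)}

    # Packages have __path__; a plain module has no submodules to walk.
    search_path = getattr(package, "__path__", [])
    walker = pkgutil.walk_packages(
        search_path, prefix=f"{package_name}.", onerror=lambda name: None
    )
    for info in walker:
        try:
            importlib.import_module(info.name)
        except Exception as exc:
            failures[info.name] = repr(exc)
    return failures


def demo() -> dict[str, str]:
    """Build a tiny package where one module imports a name that no longer exists."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "demo_smoke_pkg"  # hypothetical package
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "good.py").write_text("def helper():\n    return 1\n")
        # The caller still imports a name that the refactor removed.
        (pkg / "bad.py").write_text("from demo_smoke_pkg.good import old_helper\n")
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            return find_broken_imports("demo_smoke_pkg")
        finally:
            sys.path.remove(tmp)


if __name__ == "__main__":
    for name, error in demo().items():
        print(f"{name}: {error}")
```

In CI, a wrapper could call find_broken_imports("ourapp") and exit non-zero when the result is non-empty. Note that this only helps if the package is installed from the checkout; run against a stale wheel, it would pass for the same reason the tests did.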

GitHub Actions caches are scoped by branch, with fallback to the default branch. A workflow run can restore caches created on its own branch or on the default branch, and a pull request run can also restore caches created on its base branch; it cannot restore caches from sibling branches. Even so, the interaction with concurrency groups, PR merges, and branch deletion can create scenarios where the cache you restore came from a very different build than the one you are running, especially right after a merge.

There’s a great debugging trick: add echo "cache key: $CACHE_KEY" (CACHE_KEY here is an env var you set yourself to the key expression; it is not built in) and ls -la ~/.cache/pip/wheels/ to your CI, and you can see exactly what you got. We caught two more latent issues this way.
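
The raw ls output is a tree of hashed directories, which is hard to scan. A small script can flatten it into package names and versions. This is a sketch under the assumption that the cache holds standard wheel files, whose names begin with {distribution}-{version}-; list_cached_wheels is a hypothetical helper and the wheel names in the demo are invented.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def list_cached_wheels(cache_dir: Path) -> list[tuple[str, str]]:
    """Return sorted (distribution, version) pairs for every .whl file under cache_dir."""
    found: set[tuple[str, str]] = set()
    for wheel in cache_dir.rglob("*.whl"):
        # Wheel filenames start with {distribution}-{version}-...
        parts = wheel.stem.split("-")
        if len(parts) >= 2:
            found.add((parts[0], parts[1]))
    return sorted(found)


def demo() -> list[tuple[str, str]]:
    """Fake a nested wheel cache and list what is in it."""
    with tempfile.TemporaryDirectory() as tmp:
        nested = Path(tmp) / "ab" / "cd"  # stand-in for pip's hashed directories
        nested.mkdir(parents=True)
        (nested / "ourapp-1.0.0-py3-none-any.whl").touch()
        (nested / "requests-2.31.0-py3-none-any.whl").touch()
        return list_cached_wheels(Path(tmp))


if __name__ == "__main__":
    for name, version in demo():
        print(f"{name}=={version}")
```

On a runner you would point it at Path.home() / ".cache" / "pip" / "wheels". If an internal package shows up in that list, the cache holds something the key does not account for.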

Reflection

CI caches are one of those features where the happy path is so fast and smooth that you forget the correctness assumptions they make. The main one: “if the cache key is the same, the cached content is interchangeable with a fresh build.” That’s only true if you encode everything that affects the output in the cache key. Everyone under-encodes.

If you use GitHub Actions cache, take 20 minutes and re-audit your cache keys with this lens. You will probably find at least one that’s lying to you.

Related: Flaky tests triage workflow.