A workflow for flaky tests that doesn't involve retry-until-green
Flaky tests are a tax on every engineer in the company. Not just the person whose PR hit the flake — everyone, because everyone eventually sees the red CI and has to decide whether to retry, investigate, or ignore. The decision itself is the cost.
We’d been living with 4-6 known flakes in our suite at any given time, for years. The tax was adding up. Six months ago we implemented a proper triage workflow, and the number is now roughly zero.
Here’s what we did.
The key shift: flakes are bugs
This sounds obvious, but for us it wasn’t. Culturally, we treated flakes as “the test is bad, rerun it, move on.” That’s wrong. A flake is often the test finding a real concurrency or ordering bug that happens in production too, just rarely. The test is doing its job. The response shouldn’t be “make the test quieter.”
Once we accepted that, the workflow followed.
The workflow
Step 1: auto-capture. CI marks any test that fails once and then passes on retry as a “flake.” We collect the full logs and any captured artifacts (screenshots, DB dumps, whatever the test produced) and post them to a dedicated channel.
# pseudo-pytest-plugin
def pytest_runtest_logreport(report):
    # look only at the test body, not setup/teardown
    if report.when == "call" and report.failed and is_retry():
        # is_retry() stands in for whatever your retry plugin exposes
        capture_artifacts_and_post(report)
Step 2: each flake gets a ticket. No “we’ll investigate later.” A ticket is created automatically with the test name, commit SHA, and link to the artifacts. The ticket is assigned to whoever owns the test.
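The filing step is a small piece of glue. A minimal sketch, assuming a create_ticket wrapper around whatever tracker you use and an owner_of lookup (both hypothetical helpers here, e.g. backed by CODEOWNERS):

import subprocess

def file_flake_ticket(test_name: str, artifacts_url: str) -> None:
    sha = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    create_ticket(  # hypothetical wrapper around your tracker's API
        title=f"Flaky test: {test_name}",
        body=f"Commit: {sha}\nArtifacts: {artifacts_url}",
        assignee=owner_of(test_name),  # hypothetical: e.g. parsed from CODEOWNERS
        labels=["flake"],
    )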
Step 3: the test is quarantined. The test is moved to a separate “quarantine” suite that runs on every PR but doesn’t block the merge. This prevents the flake from blocking unrelated work while the owner investigates.
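One cheap way to implement quarantine is a pytest marker plus a split CI job, rather than physically moving files. A sketch (the marker name is ours; the -m filters are standard pytest):

import pytest

@pytest.mark.quarantine  # register in pytest.ini so a typo can't silently no-op
def test_checkout_updates_inventory():
    ...

The blocking job runs pytest -m "not quarantine"; a second, non-blocking job runs pytest -m quarantine, so the flake keeps executing and producing artifacts while the owner investigates.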
Step 4: investigation, with a deadline. The ticket has a 14-day SLA. If it’s not resolved in 14 days, the test gets deleted. Not skipped — deleted. The code coverage drop shows up on the dashboard. This is the forcing function.
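Enforcement can be a daily job rather than a human chore. A sketch, assuming list_open_tickets, open_deletion_pr, and remind helpers for your tracker and repo (hypothetical, like create_ticket above):

from datetime import datetime, timedelta, timezone

SLA = timedelta(days=14)

def enforce_flake_sla() -> None:
    for ticket in list_open_tickets(label="flake"):  # hypothetical tracker call
        age = datetime.now(timezone.utc) - ticket.created_at
        if age > SLA:
            # not a silent delete: open a PR removing the test, so the
            # coverage drop is visible and the owner reviews the removal
            open_deletion_pr(ticket.test_name, reviewer=ticket.assignee)
        elif age > SLA - timedelta(days=3):
            remind(ticket.assignee, ticket)  # hypothetical nudge before the deadline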
Step 5: post-mortem lite. When the flake is fixed, the owner posts a one-paragraph “what it was” to the team channel. No formal doc. Just a habit of sharing.
What “fix” usually looks like
Over 6 months, we worked through about 40 flakes. The breakdown:
- ~40% were legitimate race conditions in the code. The test exposed them; fixing the code fixed the test.
- ~30% were test-setup races: the test didn’t wait for something it should have. Usually the fix was a missing await or a proper wait-for-condition instead of a sleep. These revealed an opportunity to generalize a wait_for_stable_state helper (sketched at the end of this section).
- ~15% were environmental: shared state between tests, reliance on network resources, timezone assumptions. These often led to bigger refactors of the test fixtures.
- ~10% were actually bad tests that needed to be rewritten.
- ~5% were declared “can’t repro, can’t justify engineering time” and deleted.
That last category is important. Some flakes genuinely aren’t worth the investment to fix, and the right call is to delete the test. The workflow has to allow for this without guilt, or the SLA becomes a source of shame-driven ugly fixes.
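The wait_for_stable_state helper mentioned in the breakdown is, at heart, poll-with-timeout. A minimal sketch of the shape (the name, signature, and defaults here are illustrative, not our exact helper):

import time

def wait_for_condition(predicate, timeout=5.0, interval=0.05):
    # Poll until predicate() is truthy; fail loudly instead of flaking.
    # Unlike a bare sleep, this waits only as long as needed and turns
    # "sometimes too slow" into a clear, debuggable timeout.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout}s")

In a test it reads like wait_for_condition(lambda: outbox_count() == 1), with the condition spelled out instead of a guessed delay.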
What we stopped doing
Auto-retry as a general policy. We still retry once for real infrastructure failures (DNS, network blips, etc.) but not for test failures. A test failure is a test failure.
Shared mutable state in fixtures. A lot of our flakes were “test A left DB row X, test B assumed X didn’t exist.” We moved aggressively to fixtures that create their own state with randomized IDs.
import pytest
from uuid import uuid4

# User is the project's model; db is pytest-django's database fixture
@pytest.fixture
def user(db):
    # randomized email so no two tests can collide on the same row
    u = User.objects.create(email=f"test-{uuid4()}@example.com")
    yield u
    u.delete()  # clean up our own state on the way out
Tests that sleep. time.sleep(0.5) is never the right answer. Either poll with a timeout or use an event. We wrote a codemod to find and flag all time.sleep calls in tests.
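The flagging half of that codemod is small enough to sketch with the standard ast module (ours did a bit more, e.g. catching from time import sleep; this version only reports):

import ast
import sys
from pathlib import Path

def find_sleeps(path: str) -> list[int]:
    # return line numbers of time.sleep(...) calls in a file
    tree = ast.parse(Path(path).read_text(), filename=path)
    hits = []
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "sleep"
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == "time"
        ):
            hits.append(node.lineno)
    return hits

if __name__ == "__main__":
    for path in sys.argv[1:]:
        for lineno in find_sleeps(path):
            print(f"{path}:{lineno}: time.sleep in a test -- poll or use an event")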
The metrics that matter
On a dashboard:
- Flake rate (percentage of builds with at least one quarantined-test failure).
- Time-to-resolution for flake tickets (median and p95).
- Number of tests in quarantine at any given time.
- Deletions per month.
The targets are a flake rate near zero, resolution time under 5 days, and a quarantine count under 3.
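All four are derivable from plain CI build records. A sketch for the first, assuming each record carries a count of quarantined-test failures (the Build shape is illustrative):

from dataclasses import dataclass

@dataclass
class Build:
    quarantined_failures: int  # illustrative record shape, not a real CI API

def flake_rate(builds: list[Build]) -> float:
    # percentage of builds with at least one quarantined-test failure
    if not builds:
        return 0.0
    flaky = sum(1 for b in builds if b.quarantined_failures > 0)
    return 100.0 * flaky / len(builds)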
Reflection
The thing that surprised me most was that most flakes were real bugs. The framing of “flakes are noise” had been hiding actual concurrency problems in our code. After fixing the first dozen, we noticed production incidents for related issues also dropped. The tests had been trying to tell us something for years.
If you’re tolerating flakes because “we’ll fix them someday,” try the 14-day SLA. The forcing function of “fix it or delete it” is clarifying in a way that endless backlog triage is not.
Related: GitHub Actions cache lied to us.