TestForge | Aidevops | 📊 Plogger ✍️ Blog 📚 Docs
plogger

AI DevOps Korea

Turn AI service development and operations into one improvement loop

Aidevops.kr covers LLMOps, RAG, agents, observability, evaluation, and cost-performance optimization for production AI services.

Flaky Test Triage Playbook

· Updated Apr 28

Flaky tests are not just annoying. Once the team stops trusting failures, the whole CI pipeline loses its ability to protect production.

First classify the failure

Not every flaky test has the same cause. Common classes include:

  • timing and async race conditions
  • shared test data contamination
  • environment instability
  • weak selectors or brittle assertions

Classification matters because the fix strategy is different for each one.

Triage before you “fix”

When a flaky test appears, answer these questions first:

  • how often does it fail
  • does it block deploys
  • is the underlying product behavior also unstable
  • can the failure be isolated to one environment or browser

That prevents teams from spending a day on a low-value symptom while a more dangerous flaky path remains in CI.

Contain blast radius quickly

If a flaky test is high-noise and low-signal, contain it:

  • quarantine it with visibility
  • lower its release-gate weight temporarily
  • assign an owner and due date

The mistake is leaving it unowned while everyone silently clicks rerun.

Fix the root cause, not the retry count

Retries are sometimes useful for diagnosis, but repeated retries should not become the permanent solution. Strong fixes usually involve:

  • deterministic test data
  • explicit readiness checks
  • narrower assertions
  • removing hidden cross-test state

Track flakiness as an operational metric

Measure:

  • top failing tests
  • rerun frequency
  • quarantine age
  • build time lost to retries

A reliable pipeline is built by treating test trust as a product, not as a side effect.

Continue Reading

Related posts

Next Path

Keep exploring this topic as a system