I Had 93% Test Coverage. Then I Ran Mutation Testing.
The number you trust to say "safe to refactor" might be measuring the wrong thing.
Line coverage measures execution: what percentage of your code runs when the tests run. 93.1% means 93.1% of lines execute. That number is correct. It says nothing about whether any test catches a failure.
Execution vs. Detection
A test that calls calculateTax(100) and asserts result !== null gives you full line coverage of that function. It will not catch a wrong tax rate, a sign error, or a rounding failure. Change the return value to anything non-null and the test still passes.
That test covers the line. It does not test the behavior.
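A minimal sketch of that gap, using a hypothetical `calculateTax` with a 20% rate (both are illustrative, not from the paper). The function below carries the kind of bug a mutation tool introduces, and the presence check passes anyway:

```javascript
// Hypothetical calculateTax with a deliberate sign error --
// exactly the kind of change a mutation tool would make.
function calculateTax(amount) {
  return amount * -0.2; // mutant: rate should be 0.2
}

// Weak assertion: presence check only. Passes despite the bug.
const weakPasses = calculateTax(100) !== null; // true -- mutant survives

// Strong assertion: correctness check. Fails on the mutant.
const strongPasses = calculateTax(100) === 20; // false -- mutant killed

console.log(weakPasses, strongPasses);
```

The weak test exercises every line of the function and detects nothing; only the exact-value assertion notices the sign flip.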
What the Experiment Found
In the multi-agent adversarial experiment from the Generative Specification white paper (https://doi.org/10.5281/zenodo.19073543), the treatment project produced a test suite with 93.1% line coverage.
Then Stryker was applied to the services layer: 116 mutants, each a realistic code change (flipped boolean, swapped operator, removed return value). For each mutant, Stryker reruns the suite: if any test fails, the mutant is killed; if every test passes, it survives.
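The three mutant kinds named above can be sketched against a hypothetical eligibility check (`isEligible` is illustrative, not from the paper):

```javascript
// Original function under test.
function isEligible(age) {
  return age >= 18;
}

// Flipped boolean: the condition is negated.
function mutantFlipped(age) {
  return !(age >= 18);
}

// Swapped operator: >= becomes >.
function mutantSwapped(age) {
  return age > 18;
}

// Removed return value: the function now returns undefined.
function mutantNoReturn(age) {
  age >= 18;
}

// A test using only a typical value (age 30) kills the flipped and
// no-return mutants, but the swapped-operator mutant survives it --
// only the boundary value 18 tells the two apart.
const survivesTypical = mutantSwapped(30) === isEligible(30); // true: survives
const killedAtBoundary = mutantSwapped(18) === isEligible(18); // false: killed

console.log(survivesTypical, killedAtBoundary);
```

This is why surviving mutants point so precisely at missing assertions: each survivor names a specific change no test can see.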
Baseline mutation score: 58.62% MSI (mutation score indicator: the percentage of mutants killed).
A 34-point gap. That is the fraction of the codebase where a bug can be introduced while every test still passes and CI goes green.
Three rounds of targeted assertion improvements followed, guided by the surviving mutants. Each round replaced presence checks with correctness checks and added boundary conditions.
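The two kinds of strengthening can be sketched side by side, assuming a hypothetical `applyDiscount` function (the name, signature, and values are illustrative, not from the paper):

```javascript
// Hypothetical function under test.
function applyDiscount(price, percent) {
  if (percent < 0 || percent > 100) {
    throw new RangeError("percent out of range");
  }
  return price - (price * percent) / 100;
}

// Presence check replaced by a correctness check:
// before: applyDiscount(200, 10) !== null  -- kills almost nothing
const exactValue = applyDiscount(200, 10) === 180; // kills operator mutants

// Boundary conditions added: 0% and 100% pin the edges of the range.
const zeroEdge = applyDiscount(200, 0) === 200;
const fullEdge = applyDiscount(200, 100) === 0;

console.log(exactValue, zeroEdge, fullEdge);
```

Each surviving mutant in the report maps to one missing assertion of this kind, which is what makes the rounds targeted rather than a rewrite.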
After three rounds: 93.10% MSI, matching line coverage exactly.
When both numbers converge, every covered line is verified. The tests no longer just execute the code; they detect when it breaks. The line coverage number turned out to be exactly the quality level the tests needed to reach.
The same pattern appeared in the Shattered Stars case study in the paper: line coverage 80%, mutation score 58%, a 22-point gap. High line coverage alongside a low mutation score is the signature of tests written to satisfy a metric.
Why This Happens
This is not unique to AI-generated tests. Human-written suites built after the implementation show the same pattern. Tests optimized for a coverage gate optimize for the gate.
The structural fix is TDD: write the test first, watch it fail, then write the code that makes it pass. A test written before the implementation cannot pass until the implementation is correct. The model can follow TDD instructions; it does not do so spontaneously.
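The red-green ordering can be sketched in a few lines, reusing the hypothetical `calculateTax` with a 20% rate (both are illustrative assumptions, not from the paper):

```javascript
// Step 1 -- red: the test exists before the implementation,
// so it fails against the stub.
function calculateTaxStub(amount) {
  return null; // pre-implementation placeholder
}
const red = calculateTaxStub(100) === 20; // false: the test fails first

// Step 2 -- green: write the minimal code that makes the test pass.
function calculateTax(amount) {
  return amount * 0.2;
}
const green = calculateTax(100) === 20; // true: now it passes

console.log(red, green);
```

Because the assertion is written against behavior that does not exist yet, it has to be a correctness check; a presence check like `!== null` would never have gone red in the first place.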
The Action
Find the mutation tool for your stack. The Stryker family covers JavaScript/TypeScript (StrykerJS) and .NET (Stryker.NET); other languages have their own tools.
| Stack | Tool | Command |
| --- | --- | --- |
| JavaScript / TypeScript | Stryker | `npx stryker run` |
| Python | mutmut | `mutmut run` |
| Java / Kotlin | PIT | `mvn org.pitest:pitest-maven:mutationCoverage` |
| C# / .NET | Stryker.NET | `dotnet stryker` |
| Go | go-mutesting | `go-mutesting ./...` |
| Ruby | mutant | `mutant run` |
| Rust | cargo-mutants | `cargo mutants` |
Run it on your services layer. Look at the gap between your line coverage and your mutation score.
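For StrykerJS, scoping the run to the services layer is a matter of the `mutate` globs in the config file. A minimal sketch (the paths and test runner are assumptions about your project, not prescriptions):

```json
{
  "mutate": ["src/services/**/*.ts"],
  "testRunner": "jest",
  "reporters": ["clear-text", "html"]
}
```

Starting from one layer keeps the first run fast and the surviving-mutant list short enough to act on.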
If both numbers are close, be proud. Most codebases do not start there. Every covered line breaks a test when it changes, and that is what a test suite is for.
If there is a gap, give your AI assistant the list of surviving mutants and ask it to strengthen the assertions. The same model that wrote weak tests can write better ones; it needs the mutation report to know what “better” means.
