<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[JC Ghiringhelli]]></title><description><![CDATA[I build tools and methodologies for AI-assisted software engineering. In 2026 I published Generative Specification — a programming discipline for the stateless reader. Based in the US, Spanish citizen, born in Uruguay. genspec.dev]]></description><link>https://ambientengineer.dev</link><image><url>https://substackcdn.com/image/fetch/$s_!xkPt!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a933ca-09c0-47c1-9507-e2438b3412bc_144x144.png</url><title>JC Ghiringhelli</title><link>https://ambientengineer.dev</link></image><generator>Substack</generator><lastBuildDate>Mon, 11 May 2026 08:12:01 GMT</lastBuildDate><atom:link href="https://ambientengineer.dev/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Juan Carlos Ghiringhelli]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[ambientengineer@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[ambientengineer@substack.com]]></itunes:email><itunes:name><![CDATA[JC Ghiringhelli]]></itunes:name></itunes:owner><itunes:author><![CDATA[JC Ghiringhelli]]></itunes:author><googleplay:owner><![CDATA[ambientengineer@substack.com]]></googleplay:owner><googleplay:email><![CDATA[ambientengineer@substack.com]]></googleplay:email><googleplay:author><![CDATA[JC Ghiringhelli]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The Flea Game]]></title><description><![CDATA[The game was her idea. The correctness was free]]></description><link>https://ambientengineer.dev/p/the-flea-game</link><guid isPermaLink="false">https://ambientengineer.dev/p/the-flea-game</guid><dc:creator><![CDATA[JC Ghiringhelli]]></dc:creator><pubDate>Mon, 27 Apr 2026 22:43:49 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xkPt!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a933ca-09c0-47c1-9507-e2438b3412bc_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I built a flea game for my daughter. The game was her idea, the implementation was mine. </p><p>The game is this: a dog is sitting on the floor. Fleas are moving through its fur. The player has three tools. Tweezers &#8212; precise, one flea at a time; miss, and the flea scatters, pulling nearby fleas with it. A comb &#8212; sweep a straight line through the fur and catch everything in its path, but only straight, only so many at once. A spray &#8212; freeze the fleas temporarily, create a window to work. Each scenario: more fleas, darker fur, faster movement. The game escalates in difficulty.</p><p>It took a few hours of the weekend. It might never be seen by anyone outside the family.</p><p>Now ask yourself: should this program be correct?</p><p>Not &#8220;good enough.&#8221; Not &#8220;works on my machine.&#8221; Correct. Free of state corruption when a frozen flea&#8217;s timer expires mid-sweep. Free of race conditions when the scatter effect fires while the comb is still resolving. 
Free of invalid transitions when the pause button is pressed during the frame a flea is caught.</p><p>The answer, for most of the history of software, was: <em>of course not. That&#8217;s for pacemakers, ATMs and aircraft. This is a flea game.</em></p><p>That was a reasonable answer. It was also a choice that shaped everything that followed.</p><div><hr></div><h2><strong>The Agreement Nobody Made</strong></h2><p>There is a body of work &#8212; built over roughly 2,376 years, from Aristotle&#8217;s logic through to the formal methods community of the late twentieth century &#8212; that tells you exactly how to make programs correct. Not probably correct. Not correct for the test cases you thought of. Provably, structurally, mathematically correct.</p><p>Tony Hoare gave us contracts: preconditions, postconditions, invariants. If you state what must be true before a function runs and what must be true after, the machine can check whether the implementation honors those promises.</p><p>Robin Milner gave us types that prevent whole categories of error from existing. Not catching them &#8212; preventing them. A well-typed program cannot have certain bugs. Not in the way a tested program has fewer bugs. In the way a circle cannot have corners.</p><p>Kohei Honda gave us session types: protocols that guarantee two communicating processes will always be in compatible states. No deadlock. No protocol violation. Not as a property to be tested but as a structural guarantee.</p><p>Jean-Yves Girard gave us linear types: the ability to say that a resource must be used exactly once &#8212; not zero times, not twice, once. A ticket that cannot be duplicated. A payment that cannot be reversed after it clears.</p><p>Dorothy Denning gave us information flow analysis: the ability to prove that sensitive data can only travel along permitted paths. Not by checking every line of code for leaks but by making unauthorized flows structurally impossible to express.</p><p>Every one of these tools was a genuine breakthrough. And none of them ever reached the flea game.</p><p>Because deploying them required the engineer to have read the papers. To understand the type theory. To annotate the code with the right labels in the right places. To maintain those annotations as the code evolved. The annotation burden could not be amortized at human scale. A weekend project could not absorb the cost of a graduate education in formal methods.</p><p>So a global decision was made &#8212; not in any meeting, not by any committee, not consciously &#8212; that formal correctness would be reserved for systems where lives were at stake. Everything else would ship on convention and hope.</p><p>That decision is now over.</p><div><hr></div><h2><strong>What Changed</strong></h2><p>The AI assistant has read the papers.</p><p>Not metaphorically. These systems trained on the formal methods literature, on every implementation of Hoare logic and linear types and session type theory ever published. They have internalized the patterns well enough to apply them correctly to new situations.</p><p>What they are missing is not the theory. They have the theory.</p><p>What they are missing is <em>your system</em>. They do not know what a flea is in your program. They do not know what frozen means in your animation model, what scatter means in your physics engine, what the comb&#8217;s line constraint means for your hit-detection. Every session starts from scratch. The theory is universal; your system is not.</p><p>Generative Specification is the bridge.</p><p>Write a specification &#8212; a precise description of what your system must be. What a flea is. What frozen means. What the game must never allow, and what it must always guarantee. Not code. Not implementation. The territory, named precisely.</p><p>The AI reads that specification at the start of every session. It holds the theory. You hold the domain knowledge. The specification is the handshake between them.</p><p>When you specify that the game session has exactly five valid states &#8212; menu, scenario select, playing, paused, and ended &#8212; and that the transition from playing to ended can only happen by catching all fleas or by the timer expiring, the AI does not just write a switch statement. It builds a declarative state machine with typed transitions and no reachable invalid state. In TypeScript, the machine refuses to process an undefined event &#8212; invalid transitions are practically impossible. In Rust, with Honda&#8217;s session types, they would be formally impossible: the compiler refuses to compile the invalid transition. The specification is the same. The formal weight behind it scales with the language.</p>
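<p>A minimal TypeScript sketch of that shape, with illustrative names rather than the game&#8217;s actual code: the transition table is the single source of truth, and an event a state does not declare has no path to follow.</p><pre><code>type GameState = "menu" | "scenarioSelect" | "playing" | "paused" | "ended";
type GameEvent = "start" | "scenarioChosen" | "pause" | "resume" | "allFleasCaught" | "timerExpired";

// Every legal transition, and only the legal ones. The playing-to-ended
// edge exists exactly twice: all fleas caught, or the timer expiring.
const transitions: { [S in GameState]: { [E in GameEvent]?: GameState } } = {
  menu:           { start: "scenarioSelect" },
  scenarioSelect: { scenarioChosen: "playing" },
  playing:        { pause: "paused", allFleasCaught: "ended", timerExpired: "ended" },
  paused:         { resume: "playing" },
  ended:          { start: "scenarioSelect" },
};

// An event the current state does not declare is ignored, not improvised.
function step(state: GameState, event: GameEvent): GameState {
  return transitions[state][event] ?? state;
}</code></pre><p>Pausing from the menu has no entry to follow, so <code>step</code> returns the state unchanged; there is no code path that reaches a state the table does not name.</p>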
<p>When you specify that a frozen flea carries an expiry timestamp and must transition back to active exactly once, the AI builds a component that carries the countdown and a system that removes it when it expires. In TypeScript, this is a convention: a dedicated system decrements the timer, removes the component, and a test suite verifies it happens exactly once. In Rust, the same specification draws on Girard&#8217;s linear types: the compiler rejects any second use of the consumed resource, not by convention but by the type system. You named the constraint. The language determines how strictly it is held.</p>
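<p>The convention-level TypeScript version is small enough to show whole. A sketch with hypothetical names, not the game&#8217;s actual component system: the expiry timestamp is the resource, and the one system that thaws fleas consumes it at the moment it fires.</p><pre><code>interface Flea {
  id: number;
  state: "active" | "frozen";
  frozenUntil: number | null; // expiry timestamp while frozen, null otherwise
}

// The only system allowed to thaw. Clearing frozenUntil is the
// "consume exactly once": after it fires, there is nothing left to fire on.
function thawExpired(fleas: Flea[], now: number): void {
  for (const flea of fleas) {
    if (flea.state !== "frozen") continue;
    if (flea.frozenUntil === null) continue;
    if (now >= flea.frozenUntil) {
      flea.state = "active";
      flea.frozenUntil = null;
    }
  }
}</code></pre><p>A test derived from the specification advances the clock past the expiry twice and asserts the transition happened once. In the Rust version, the timestamp would be a moved value, and a second thaw would not compile.</p>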
<p>When you specify that a miss with tweezers triggers a scatter effect that moves nearby fleas to new positions, and that the first miss in the tutorial scenario does not trigger scatter, the AI builds a system that checks those preconditions before acting and produces a bounded, deterministic outcome. In TypeScript, a test verifies the postcondition holds. In Rust, the same specification lets you state the postcondition formally &#8212; what Hoare formalized &#8212; and have a verifier check it. You did not need to know these names. You named the territory. The AI built structures that express the same guarantees those theories formalize. The language determines how strictly they hold.</p>
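<p>Here is the contract shape in TypeScript, again an illustrative sketch rather than the game&#8217;s real code. The precondition is an explicit guard; the outcome is a pure function of its inputs; the postcondition (no flea moves farther than the scatter radius) is a property a test can assert over every returned move.</p><pre><code>interface MissContext {
  scenario: string;
  missCount: number;       // misses so far in this scenario, before this one
  nearbyFleaIds: number[]; // fleas within scatter range of the missed grab
}

const SCATTER_RADIUS = 40; // assumed bound on displacement, in pixels

// Precondition from the spec: the first tutorial miss never scatters.
function shouldScatter(ctx: MissContext): boolean {
  if (ctx.scenario === "tutorial") {
    if (ctx.missCount === 0) return false;
  }
  return ctx.nearbyFleaIds.length > 0;
}

// Deterministic given its inputs, and bounded by construction: every
// displacement lies on a circle of radius SCATTER_RADIUS.
function scatter(ctx: MissContext, seed: number): { id: number; dx: number; dy: number }[] {
  if (!shouldScatter(ctx)) return [];
  return ctx.nearbyFleaIds.map((id, i) => {
    const angle = ((seed + i * 97) % 360) * (Math.PI / 180);
    return { id, dx: Math.cos(angle) * SCATTER_RADIUS, dy: Math.sin(angle) * SCATTER_RADIUS };
  });
}</code></pre><p>The derived test is a property, not a happy path: for any seed, the tutorial first-miss case returns an empty list, and every returned displacement has length SCATTER_RADIUS.</p>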
<div><hr></div><h2><strong>What This Means</strong></h2><p>The flea game was built in a few hours under Generative Specification discipline: a specification written before a line of code, behavioral contracts for every tool and state transition, and a test harness derived from those contracts and running against the live game. The specification took longer to write than the code took to generate, and even that was assisted by the AI.</p><p>That is not a boast about AI speed. It is a claim about where the work went. The specification carries the complexity now. The implementation follows from it.</p><p>There is no minimum project size for correctness anymore. The inventory tool for a small shop. The scheduling script for a weekend sports league. The game an engineer builds for her daughter on a Saturday afternoon. All of them can now have the structural guarantees that were previously reserved for systems where people would die if they failed.</p><p>This is not a claim about AI being magical. It is a claim about where the cost went. The annotation burden &#8212; the thing that made formal methods inaccessible at human scale &#8212; was the cost of <em>learning the theory and applying it correctly by hand</em>. That cost has moved. The AI carries the theory. The practitioner carries the domain. The specification connects them.</p><p>The deal that software engineering made &#8212; correctness for important things, convention and hope for everything else &#8212; was made because the alternative was too expensive. The alternative is no longer expensive.</p><div><hr></div><h2><strong>The Practitioner Who Doesn&#8217;t Know She Is One</strong></h2><p>There is a second thing the flea game story reveals.</p><p>I wasn&#8217;t thinking about formal methods when I was building it. I was thinking about what the game should do. My daughter wasn&#8217;t thinking about formal methods when she described it. She was thinking about what would be fun. What she gave me was a spec: dog, grass, fleas, three tools with distinct constraints, escalating scenarios. The naming of the territory. That is all GS requires.</p><p>That clarity &#8212; the ability to say precisely what a system must be &#8212; is the only skill that the new model requires. Not Hoare. Not Milner. Not Honda or Girard or Denning. The ability to name the territory.</p><p>That is a human skill. It always was. The reason it was not sufficient before was that between naming the territory and building the system, there was a vast mechanical layer that required specialized training to cross. That layer has been crossed. The naming is what matters.</p><p>The formal tradition spent 2,376 years building the machinery that sits on the other side of the specification. All of it waiting for a practitioner who could describe what she wanted clearly enough for the machinery to engage.</p><p>The same principle &#8212; that naming the territory precisely is sufficient for the machinery to engage &#8212; turns out to apply beyond correctness. It applies to maintainability, to reviewability, to the ability to change a system without breaking what it promised. TDD, SOLID, hexagonal architectures. That is a different essay.</p><p>The flea game is enough.</p><div><hr></div>]]></content:encoded></item><item><title><![CDATA[The Ambient Engineer]]></title><description><![CDATA[The monitor screen is the wrong interface for this.]]></description><link>https://ambientengineer.dev/p/the-ambient-engineer</link><guid isPermaLink="false">https://ambientengineer.dev/p/the-ambient-engineer</guid><dc:creator><![CDATA[JC Ghiringhelli]]></dc:creator><pubDate>Thu, 19 Mar 2026 04:53:21 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xkPt!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a933ca-09c0-47c1-9507-e2438b3412bc_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I have not written code in months. Not application code, not test code, not deployment code, not command-line scripts.</p><p>I have fifteen to twenty projects open at any time, six or seven active in any given session, the others paused at a deploy, a test run, a decision I have not made yet. When one pauses, I move to the next. What I do now is specify: I describe what should be built, why, to what standard, in what shape, with what constraints and failure modes. 
The model builds it, executes the commands, writes the configuration files, calls the APIs.</p><p>The monitor screen is the wrong interface for this.</p><p>Not because screens are bad. The role that remains in AI-assisted development is an orchestrator who holds intent and decides what comes next, and it does not require a workstation. It requires attention, judgment, and the ability to communicate precisely. I spent twenty years learning to speak the machine&#8217;s language: high-level languages, type systems, compiler flags, a new language every two years, a new syntax every project, 30,000-line XSLTs I had to read end to end just to understand what transformation was happening. The machine could not meet me where I think. I had to go to where it lived. That constraint is dissolving.</p><p>What replaces it is three physical layers: a thin display in the field of view, a pocket unit for routing and bridging, and compute wherever it lives. Mini PCs, cloud instances, laptops. Incoming signals arrive already summarized to significance, responses go back by voice without switching physical context. The hardware already exists. What does not exist yet is the orchestration layer: the specification surface that determines which signals rise to attention, at what frequency, prioritized by urgency, clustered and ordered by what actually needs a human decision.</p><p>Imagine gesture-minimizable overlays showing messages alongside summaries and recommended actions, and replies whose content and tone you shape just by voicing them. Dynamic calls with real-time captions, translations, automatic action items. Diagrams and decisions saved to folders you summon on command. An ordered queue of the inputs your many AI assistants need from you as they build your projects: occasional but precise guidance. Charts, slides, art, music, video, articles, podcasts, each shaped by describing and constraining the outcome across iterations until it matches what you desire.</p><p>This is the same discipline I apply to software, now applied to the surface that matters most: my own attention. It lets me step away from the screen, move through the world, socialize, exercise, think, while critical items arrive immediately and everything else waits in organized flows rather than noise. It enables the dispersed, asynchronous work that is natural when AI is executing your specification and needs occasional direction, without foreclosing the periods of deep focus that now resolve on much faster cycles.</p><p>When execution cost approaches zero, the limiting constraint on what gets built shifts from effort to the ability to specify correctly and the quality of your attention. That is a more democratic bottleneck. Not perfectly democratic, because specifying correctly is a real skill and skills are unevenly distributed, but the barrier becomes epistemic rather than economic. Understanding replaces capital as the gate.</p><p>I published the theory behind this discipline this week: Generative Specification (https://doi.org/10.5281/zenodo.19073543). The ambient interface is where it leads.</p><p>This is the year the interface catches up. Sixty years of screen, keyboard and mouse. It all resolves to gesture and specification.</p><p>The methodology behind this: Generative Specification (https://doi.org/10.5281/zenodo.19073543). The tool that implements it: forgecraft-mcp (https://github.com/jghiringhelli/forgecraft-mcp). The workshop for teams: forgeworkshop.dev (https://forgeworkshop.dev).</p>]]></content:encoded></item><item><title><![CDATA[I Had 93% Test Coverage. Then I Ran Mutation Testing.]]></title><description><![CDATA[The number you trust to say "safe to refactor" might be measuring the wrong thing.]]></description><link>https://ambientengineer.dev/p/i-had-93-test-coverage-then-i-ran</link><guid isPermaLink="false">https://ambientengineer.dev/p/i-had-93-test-coverage-then-i-ran</guid><dc:creator><![CDATA[JC Ghiringhelli]]></dc:creator><pubDate>Thu, 19 Mar 2026 03:32:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xkPt!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a933ca-09c0-47c1-9507-e2438b3412bc_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Line coverage measures execution: what percentage of your code runs when the tests run. 93.1% means 93.1% of lines execute. That number is correct. It says nothing about whether any test catches a failure.</p><div><hr></div><h2><strong>Execution vs. Detection</strong></h2><p>A test that calls <code>calculateTax(100)</code> and asserts <code>result !== null</code> gives you full line coverage of that function. It will not catch a wrong tax rate, a sign error, or a rounding failure. Change the return value to anything non-null and the test still passes.</p><p>That test covers the line. It does not test the behavior.</p>
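<p>The difference is one assertion. A Jest-style sketch, assuming <code>calculateTax</code> is in scope and an illustrative 8.25% rate (not a figure from the experiment):</p><pre><code>// Executes the line. Passes for any non-null return, including a wrong rate.
test("calculateTax runs", () => {
  expect(calculateTax(100)).not.toBeNull();
});

// Detects failures. A flipped sign, a wrong rate, or a broken rounding rule
// in calculateTax now breaks the suite.
test("calculateTax is correct, including the boundary", () => {
  expect(calculateTax(100)).toBeCloseTo(8.25); // assumed 8.25% rate
  expect(calculateTax(0)).toBe(0);             // boundary: no tax on zero
});</code></pre>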
<div><hr></div><h2><strong>What the Experiment Found</strong></h2><p>In the multi-agent adversarial experiment from the Generative Specification white paper (<a href="https://doi.org/10.5281/zenodo.19073543">https://doi.org/10.5281/zenodo.19073543</a>), the treatment project produced a test suite with 93.1% line coverage.</p><p>Then Stryker was applied to the services layer: 116 mutants, each a realistic code change (flipped boolean, swapped operator, removed return value). Stryker checks whether any test fails when the code is mutated.</p><p>Baseline mutation score: <strong>58.62% MSI</strong>.</p><p>A 34-point gap. That is the fraction of the codebase where a bug can be introduced, your tests will pass, and CI will go green.</p><p>Three rounds of targeted assertion improvements followed, guided by the surviving mutants. Each round replaced presence checks with correctness checks and added boundary conditions.</p><p>After three rounds: <strong>93.10% MSI</strong>, matching line coverage exactly.</p><p>When the two numbers converge, the covered code is actually verified. The tests no longer just execute the code; they detect when it breaks. The line coverage number turned out to be exactly the quality level the tests needed to reach.</p><p>The same pattern appeared in the Shattered Stars case study in the paper: line coverage 80%, mutation score 58%, a 22-point gap. High line coverage alongside a low mutation score is the signature of tests written to satisfy a metric.</p><div><hr></div><h2><strong>Why This Happens</strong></h2><p>This is not unique to AI-generated tests. Human-written suites built after the implementation show the same pattern. Tests optimized for a coverage gate optimize for the gate.</p><p>The structural fix is TDD: write the test first, make it fail, write the code that makes it pass. A test written before the implementation cannot pass until the implementation is correct. The model can follow TDD instructions. It does not do so spontaneously.</p><div><hr></div><h2><strong>The Action</strong></h2><p>Find the mutation tool for your stack. <strong>Stryker is JavaScript and TypeScript only</strong> (Stryker.NET, below, is a separate port). Pick the right one for your language.</p><p>   - JS / TypeScript &#8212; Stryker (https://stryker-mutator.io/) &#8212; npx stryker run</p><p>   - Python &#8212; mutmut (https://mutmut.readthedocs.io/) &#8212; mutmut run</p><p>   - Java / Kotlin &#8212; PIT (https://pitest.org/) &#8212; mvn pitest:mutationCoverage</p><p>   - C# / .NET &#8212; Stryker.NET (https://stryker-mutator.io/docs/stryker-net/introduction/) &#8212; dotnet stryker</p><p>   - Go &#8212; go-mutesting (https://github.com/zimmski/go-mutesting) &#8212; go-mutesting ./...</p><p>   - Ruby &#8212; mutant (https://github.com/mbj/mutant) &#8212; mutant run</p><p>   - Rust &#8212; cargo-mutants (https://mutants.rs/) &#8212; cargo mutants</p>
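<p>For the JavaScript/TypeScript case, a minimal configuration looks like the sketch below. The glob, the Jest runner, and the thresholds are assumptions to adapt, not the experiment&#8217;s actual setup.</p><pre><code>// stryker.conf.js -- run with: npx stryker run
module.exports = {
  mutate: ["src/services/**/*.ts"],  // scope mutants to the services layer
  testRunner: "jest",                // requires @stryker-mutator/jest-runner
  reporters: ["clear-text", "html"], // console summary plus a browsable report
  thresholds: { high: 90, low: 70, break: 60 }, // exit non-zero below 60% MSI
};</code></pre>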
<p>Run it on your services layer. Look at the gap between your line coverage and your mutation score.</p><p><strong>If both numbers are close, be proud.</strong> Most codebases do not start there. Every covered line breaks a test when it changes. That is what a test suite is for.</p><p>If there is a gap, give your AI assistant the list of surviving mutants and ask it to strengthen the assertions. The same model that wrote weak tests can write better ones; it needs the mutation report to know what &#8220;better&#8221; means.</p><p>The full experiment data &#8212; all eight conditions, mutation scores, raw results &#8212; is in the Generative Specification paper (https://doi.org/10.5281/zenodo.19073543). The replication is fully reproducible: Docker, an Anthropic key, and 20 minutes.</p><p>genspec.dev (https://genspec.dev)</p>]]></content:encoded></item><item><title><![CDATA[Why Your AI Coding Assistant Produces Incoherent Architecture (And What to Do About It)]]></title><description><![CDATA[It's not the model. It's the absence of a discipline that constrains what a stateless reader can derive.]]></description><link>https://ambientengineer.dev/p/why-your-ai-coding-assistant-produces</link><guid isPermaLink="false">https://ambientengineer.dev/p/why-your-ai-coding-assistant-produces</guid><dc:creator><![CDATA[JC Ghiringhelli]]></dc:creator><pubDate>Wed, 18 Mar 2026 13:01:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xkPt!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a933ca-09c0-47c1-9507-e2438b3412bc_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;ve been building software for almost twenty years. For most of that time, the job was understanding the problem, then writing the code that solves it.</p><p>Sometime in the last two years, the second half of that sentence disappeared. I haven&#8217;t written a single line of code in over nine months.</p><p>I published a paper this week that explains why &#8212; not as a productivity story, but as a formal one. The argument runs through Chomsky&#8217;s grammar hierarchy, Martin&#8217;s sequence of programming paradigm shifts, and six production systems I built under the methodology I&#8217;m proposing.</p><p>The core claim: this is the first programming discipline of the pragmatic dimension. Every prior discipline constrained what was permitted (syntax) or what was communicated to a reader with context (semantics). Generative Specification is the first one that constrains what a stateless reader, a model with no prior context, can derive. The discipline is not about AI features. It&#8217;s about what you have to externalize for AI to work correctly.</p><p>The failure mode it addresses is drift: architecturally incoherent output generated at AI speed, propagating across every session that inherits the corrupted context. Every person I&#8217;ve spoken with about this in the last year has seen it. The discipline that addresses it is available now.</p><p>Paper: https://doi.org/10.5281/zenodo.19073543</p><p>If you&#8217;re building with AI-assisted tools and the architecture is getting harder to control &#8212; not easier &#8212; this is the argument that explains why. The methodology is documented, the experiments are reproducible, and the tool that implements it is open source.</p><p>Start at genspec.dev (https://genspec.dev).</p>]]></content:encoded></item></channel></rss>