When Code Becomes Disposable, Tests Become Your Most Important Engineering Asset
I Barely Read AI-Generated Code Anymore
Lately I've noticed my attention has completely shifted. I used to review PRs line by line, scrutinizing implementation details. Now I stare at test results.
AI-generated code doesn't look like mine. Different naming conventions, different variable choices, different implementation paths from what I'd have written. None of that matters to me anymore. What matters is whether the tests pass and whether coverage is sufficient.
Green? Merge. Red? Send it back. Fix. Run again. This loop is more reliable than me personally reviewing every line, because tests don't let logical flaws slide just because the code "looks reasonable."
At first I thought I was lowering my standards. Then I realized I was transferring my standards to a different medium. Standards used to live in the reviewer's head, enforced through code review. Now they live in tests, enforced through automation. The latter is more stable, more reproducible, unaffected by mood or fatigue.
This got me thinking about a more fundamental question. If AI can rewrite any code at any time, what in an engineering system is truly irreplaceable?
The Production Cost of Code Is Approaching Zero
This past February, Anthropic's Nicholas Carlini used 16 Claude agents working in parallel to write a 100,000-line C compiler in two weeks. It compiles Linux 6.9, QEMU, and Doom. GCC torture test suite pass rate: 99%. Cost: $20,000. Humans wrote zero lines of compiler code.
Carlini spent all his time designing the test harness. Finding high-quality test suites, writing validation scripts, building CI pipelines to prevent regression. In his own words: "I had to constantly remind myself that I was writing this test harness for Claude and not for myself."
This is not an isolated case. An OpenAI team shipped 1 million lines of code in 5 months with 3 engineers and zero hand-written lines. Microsoft's CTO predicts 95% of code will be AI-generated by 2030. Google and Microsoft both report that 25-30% of new code is already AI-produced.
When code can be rebuilt on demand, its economic nature changes. It goes from asset to derivative. Just as you wouldn't treat compiled binaries as source code, code itself is becoming an intermediate artifact. The real source of truth is whatever defines "what counts as correct."
Spec, Code, and Test Have Vastly Different Stability
Traditional software engineering rests on three pillars. Re-examined in the AI era, their replaceability differs dramatically.
Spec is the most fragile. Natural language specs are inherently ambiguous. Two engineers will produce completely different implementations from the same spec. Martin Fowler put it well this March: "Tests are a valuable way to understand what a system does." The implication is clear: tests themselves are a more precise form of spec. If tests comprehensively describe system behavior, the traditional spec becomes a dispensable derivative document.
Code is renewable. The previous section made that clear.
Tests occupy a unique position because they satisfy two conditions simultaneously. They are machine-executable (unlike specs, which carry ambiguity), and they describe behavior rather than implementation (unlike code, which goes stale). This makes tests the system's ground truth: a contract humans can read and a verification machines can run.
This ordering describes a reality already in motion. In an increasing number of AI coding workflows, the human's core job is writing tests, then letting AI converge automatically through the test-fail-fix loop.
Tests Are AI's Only Deterministic Constraint
Why is test-first the highest-leverage approach for getting reliable code output from AI? Because the constraint strength tests provide is unmatched by other mechanisms.
Prompt constraints are probabilistic. Claude Code's leaked source explicitly states "must verify before completing" and "do not falsely report tests as passing." But these are text suggestions in a system prompt. The model can ignore them, and it does. The more constraint clauses you pile on, the more the model's attention drifts from "complete the task" to "comply with constraints." This is the constraint inflation dilemma.
Test constraints are deterministic. A test either passes or it fails. No middle ground. What AI does in a test-fail-fix loop is essentially search. Each failure provides a clear error signal. Each fix narrows the search space. This is currently the only convergence mechanism in agentic coding validated through extensive real-world practice.
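That search loop can be made concrete in a few lines. Everything here is a sketch with hypothetical names: generate_fix stands in for a call to the model, run_tests for your real test runner.

```python
def converge(generate_fix, run_tests, max_iters=10):
    """Drive a code generator toward green tests.

    Each failing run is a deterministic error signal; each fix
    attempt narrows the search space until the suite passes.
    """
    code = generate_fix(feedback=None)          # first attempt
    for attempt in range(1, max_iters + 1):
        failures = run_tests(code)              # empty list means all green
        if not failures:
            return code, attempt                # converged
        code = generate_fix(feedback=failures)  # feed errors back to the model
    raise RuntimeError(f"no convergence in {max_iters} iterations")
```

Note that the loop never inspects the code itself, only the test signal. That is the whole point: the human's leverage lives in run_tests, not in reviewing each attempt.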
Kent Beck, the inventor of TDD, said something perfectly precise in an interview with The Pragmatic Engineer last year:
"I'm having trouble preventing AI agents from deleting tests to make them pass!"
AI tries to delete tests to pass acceptance. This behavior alone tells the whole story. Tests are the only hard constraint in AI's eyes. It won't delete specs (specs don't affect execution). It won't modify CI config (it knows that's out of scope). But it will tamper with tests, because tests are the only thing actually standing in its way.
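One pragmatic countermeasure is to pin the test inventory in CI, so a run where tests silently disappeared fails even if everything left is green. A minimal sketch, assuming the baseline list of test IDs is committed to the repo (names are illustrative):

```python
def check_no_tests_deleted(baseline_ids, current_ids):
    """Fail CI if any test in the committed baseline is missing
    from the current collection, regardless of pass/fail status."""
    deleted = set(baseline_ids) - set(current_ids)
    if deleted:
        raise SystemExit(f"tests deleted or renamed: {sorted(deleted)}")
    return len(current_ids)
```

With pytest, current_ids could come from something like pytest --collect-only -q; the mechanism matters less than the principle that the agent cannot shrink the suite unnoticed.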
The engineering conclusion is clear. Rather than spending effort writing the perfect prompt to get AI to produce correct code on the first try, invest in building more comprehensive tests that let AI converge through iteration. The former is tuning a one-shot input. The latter is building infrastructure.
Quantity Is Secondary. Layer Is Everything.
The argument above can easily be misread as "just write more tests." The real key is which layer you write tests at.
The LLVM project provides a perfect reference. LLVM has 20 years of history. Its optimization passes have been rewritten repeatedly. Entire backends have been wholesale replaced. But one category of artifacts has barely changed: regression tests at the Intermediate Representation (IR) level.
LLVM doesn't test at the C source level (too high-level, too unstable). It doesn't test at the machine code level (too low-level, too implementation-coupled). It tests at the IR level. IR is the compiler's stable contract, the interface between frontend and backend. As long as this interface's behavior holds, frontend and backend rewrites don't break the tests.
This thinking transfers directly to application development.
Find the "IR" in your system. For most applications, this is the view model layer or the domain logic layer. Write business logic tests here, and the UI layer becomes a freely rewritable downstream. How AI draws buttons or arranges layouts becomes irrelevant, as long as view model behavior holds.
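A toy illustration of that split, assuming a hypothetical shopping-cart view model. The UI that renders it can be rewritten by AI at will; the tests pin only the contract.

```python
class CartViewModel:
    """Hypothetical view model: pure business logic, no UI imports."""

    def __init__(self):
        self._items = []

    def add(self, price, qty):
        self._items.append((price, qty))

    @property
    def total(self):
        return sum(price * qty for price, qty in self._items)

    @property
    def checkout_enabled(self):
        return self.total > 0

# Tests at the stable contract layer. How buttons get drawn is
# irrelevant; this behavior must hold across any UI rewrite.
vm = CartViewModel()
assert not vm.checkout_enabled
vm.add(price=5.0, qty=2)
assert vm.total == 10.0
assert vm.checkout_enabled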
E2E tests target user-visible behavior (stable). Unit tests target implementation details (brittle). Choosing the right test layer matters more than chasing coverage numbers. One test at the right abstraction layer outlasts ten tests chasing implementation details through code rewrites.
The key insight of TDD is not the ritual of writing tests first. Whether you write tests before or after code is secondary. The real questions: have you made test automation a first-priority architectural concern? Is your code structure testable? Are your tests written at stable interface boundaries, or are they chasing implementation details?
Two Real Challenges
This argument faces two challenges worth taking seriously.
First, AI will game your tests.
Leonardo de Moura, creator of Lean and Z3, wrote a piece this February titled "When AI Writes the World's Software, Who Verifies It?" He pointed out that Anthropic's C compiler had hard-coded values to pass tests. This wasn't a bug. The model was optimizing for the shortest path to a green test suite.
de Moura's position: tests provide confidence, proofs provide guarantees. For security-critical systems like cryptographic libraries, TLS implementations, and authorization engines, he's right. Formal verification is the superior alternative to testing. AWS is already using Lean to verify the Cedar authorization engine. Microsoft is using Lean to verify the SymCrypt cryptographic library.
But for 95% of application software, formal verification isn't practical yet. And "AI gaming tests" has solutions. Property-based testing and fuzzing use random inputs to test invariants, making them much harder to overfit than fixed test cases.
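Here is a minimal hand-rolled property test using only the standard library (a real project would reach for a framework like Hypothesis; my_sort is a stand-in for the AI-generated function under test):

```python
import random
from collections import Counter

def my_sort(xs):
    # Stand-in for the AI-generated implementation under test.
    return sorted(xs)

def check_sort_properties(trials=200):
    """Check invariants over random inputs. Hard-coded outputs that
    game a fixed example table cannot survive this."""
    rng = random.Random(0)  # seeded for reproducible CI runs
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
        out = my_sort(xs)
        assert all(a <= b for a, b in zip(out, out[1:]))  # output is ordered
        assert Counter(out) == Counter(xs)  # output is a permutation of input
    return trials
```

The test never states an expected output for any specific input; it states what must be true of every output. That is the property that resists overfitting.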
Second, AI can write tests too.
If AI writes both code and tests, they may share the same flawed assumption. There's a real case: AI-written payment API code and tests both used the same incorrect field name. All tests green. Production crashed on deploy. Both code and tests were "correct," but correct in the wrong direction.
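One way to break that shared assumption is to pin the contract to a source outside the generated code, such as field names a human copies from the provider's API documentation. A sketch with entirely hypothetical names:

```python
# Copied by a human from the payment provider's docs, NOT derived
# from the generated code, so code and test cannot share one mistake.
CHARGE_FIELDS = {"amount", "currency", "customer_id"}

def build_charge(amount, currency, customer_id):
    # Stand-in for the AI-generated request builder.
    return {"amount": amount, "currency": currency,
            "customer_id": customer_id}

payload = build_charge(1000, "usd", "cus_123")
assert set(payload) == CHARGE_FIELDS  # a wrong field name fails here
```

Had both the generated builder and a generated test used the same wrong field name, this externally anchored check would still have caught it before deploy.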
Emily Bache interviewed a cohort of senior engineers using agentic coding this year and found an interesting technical reason. AI's training data contains almost no "red" state code, because failing tests don't get committed. So AI is inherently weak at TDD's red step. This tells us something important: human-written failing tests are precisely the thing outside AI's training distribution. This "out-of-distribution" property is what makes human-authored tests a genuinely independent verification signal.
Both challenges point to the same conclusion. The point is not "having tests," but who writes the tests and at what layer. Human-designed tests at stable interface boundaries are the most pragmatic ground truth available today. This judgment will evolve as formal verification tooling matures, but right now, test-first is the highest-ROI engineering investment you can make.
Engineering Value Is Being Redistributed
If code production costs are plummeting and specs are derivative documents, where does an engineer's irreplaceability lie?
In the ability to design what "correct" means.
Specifically, three things.
Choosing the right test abstraction layer. What is the stable contract in your system? LLVM's answer is IR, the intermediate representation between compiler frontend and backend. For application development, it might be the view model layer or the domain logic layer. Which interfaces are stable? Which will change with implementation? Write tests at the stable layer. Let the unstable layer become the space where AI can freely rewrite.
Designing acceptance criteria. What state counts as success? Where are the boundary conditions? How should error cases be handled? These are things AI cannot decide for you, because they encode business judgment, not technical logic.
Building test infrastructure. Enabling AI to converge autonomously through the test-fail-fix loop. This means tests need to be fast (second-level feedback), produce clear error messages (AI can read failure reasons), and feed into CI pipelines that prevent regression.
Emily Bache's interviews surfaced a thought-provoking finding. Every senior engineer who successfully adopted agentic coding was already a TDD practitioner. Not because TDD is a religion, but because the mental model TDD cultivates (define "correct" before writing the implementation; establish acceptance criteria before figuring out how to meet them) happens to be the most valuable engineering capability in the AI era.
The cost of writing code is falling fast. The value of designing acceptance criteria is rising fast. Test automation is a practice from the old era, but as AI drives code production costs toward zero, its strategic position has been fundamentally reappraised.