[The Null Hypothesis]

Testing in the Age of AI Agents

AI agents don’t just write code faster. They make rewriting trivial. Your codebase becomes fluid, reshaping itself as fast as you can describe what you want.

But when everything is in flux, how do you know the features still work?

Something needs to hold the shape while everything inside it moves. That something is your tests. Not tests that document how the code works today, but tests that define what it must always do.

The Contract Principle

The obvious purpose of tests is “catching bugs.” But that’s incomplete.

Tests define what “correct” means. They’re a contract: this is what the system must do. Everything else is negotiable.

Kent Beck captured this precisely: tests should be “sensitive to behaviour changes and insensitive to structure changes.” A test that breaks when behaviour changes is valuable. A test that breaks when implementation changes, but behaviour stays the same, is actively punishing you for improving your code.

The difference is stark in practice:

// Testing implementation - breaks when you refactor
test('calls navItemVariants with correct params', () => {
  const spy = vi.spyOn(styles, 'navItemVariants');
  render(<NavItem to="/orders">Orders</NavItem>);
  expect(spy).toHaveBeenCalledWith({ active: false });
});

// Testing contract - survives any rewrite
test('renders as a link to the specified route', () => {
  render(<NavItem to="/orders">Orders</NavItem>);
  const link = screen.getByRole('link', { name: /orders/i });
  expect(link).toHaveAttribute('href', '/orders');
});

The first test knows the component uses a function called navItemVariants. Tomorrow, you might rename that function or eliminate it entirely. The test breaks. The component still works.

The second test knows only what matters: there’s a link, it goes to /orders, it says “Orders.” Rewrite the entire component. Swap the styling system. As long as users can click a link to their orders, the test passes.

$ npm test

❌ FAIL src/components/NavItem.test.tsx
  ✕ calls navItemVariants with correct params
  ✕ passes active prop to styling function
  ✕ renders with expected className

3 tests failed. The component works perfectly. 🙃

The tests haven’t caught a bug. The behaviour is identical. You’re just paying a tax on change.

The Black Box

Treat every module like a black box. You know what goes in. You know what should come out. What happens inside is none of your tests’ business.

This clarifies what to mock. External systems (APIs, databases, third-party services) exist outside your black box. Mock those. Your own modules exist inside. Don’t mock those; let them run.

// Mock external systems - they're outside your control
vi.mock('~/api/client', () => ({
  fetchUser: vi.fn().mockResolvedValue({ name: 'Test User' }),
}));

// Don't mock your own code - let it run
// ❌ vi.mock('~/components/ui/NavItem');
// ❌ vi.spyOn(myModule, 'internalHelper');

When you mock your own code, you’re encoding the current implementation into your tests. When the implementation changes, your mocks become lies. They describe a structure that no longer exists, and your tests pass while your code breaks.
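
A hedged sketch of how that plays out, using hypothetical ~/cart modules and vitest: the test mocks the project's own applyDiscount helper, so the assertion is really exercising the mock, not the pricing code.

import { test, expect, vi } from 'vitest';
import { getDiscountedTotal } from '~/cart/getDiscountedTotal'; // hypothetical module

// ❌ Mocking your own module freezes today's internal structure into the test
vi.mock('~/cart/applyDiscount', () => ({
  applyDiscount: vi.fn().mockReturnValue(90),
}));

test('applies the SAVE10 coupon', () => {
  // If applyDiscount's real rules change - or it stops being called at all -
  // this still sees 90, because the mock answers instead of the code
  expect(getDiscountedTotal({ subtotal: 100, coupon: 'SAVE10' })).toBe(90);
});

Delete the vi.mock call and the very same assertion becomes a contract test: it only passes if the real pricing code actually produces 90.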

A useful heuristic: before committing a test, imagine handing the module’s specification to a developer who’d never seen your code. They implement it from scratch, differently. Would your tests pass? If yes, you’ve tested the contract. If no, go back and fix the test.
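
A minimal sketch of that heuristic, assuming a hypothetical slugify spec: two developers implement the same one-line requirement differently, and one contract test is satisfied by both.

import { describe, test, expect } from 'vitest';

// Spec: lowercase the title and join the words with hyphens.

// Implementation A: regex-based
const slugifyA = (title: string) =>
  title.toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/^-+|-+$/g, '');

// Implementation B: split/filter/join - built from scratch, differently
const slugifyB = (title: string) =>
  title.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean).join('-');

// The same contract test passes against either, so it tests the spec, not the code
describe.each([
  ['regex version', slugifyA],
  ['split version', slugifyB],
])('slugify (%s)', (_name, slugify) => {
  test('turns a title into a lowercase, hyphen-separated slug', () => {
    expect(slugify('Testing in the Age of AI Agents')).toBe(
      'testing-in-the-age-of-ai-agents'
    );
  });
});

If the test had instead spied on an internal helper, only one of the two implementations could ever pass it.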

The Circular Verification Problem

Here’s where AI changes everything.

Tests exist to verify that code is correct. If AI writes both the code and the tests, what verifies what? The test was supposed to catch AI mistakes. But AI wrote the test. You’ve created a loop with no external reference point.

AI writes code → AI writes tests → tests pass → “correct”?

Black box tests break this circularity because they’re human-auditable. When a test says “there’s a link that goes to /orders,” you can read that assertion and verify it matches the requirement. You don’t need to understand implementation details.

Implementation-coupled tests aren’t auditable this way. To verify the test is correct, you’d need to understand the implementation it’s coupled to. You’re back to trusting AI about AI’s work.

This suggests specific rules:

Treat assertions as immutable. AI can refactor how a test runs: the setup, the helpers, the structure. AI should not change what a test asserts without explicit human approval. The assertion is the contract.

// AI can change this (setup)
const user = await setupTestUser({ role: 'admin' });

// AI should NOT change this (assertion) without approval
expect(user.canAccessDashboard()).toBe(true);

Failing behaviour tests require human attention. When a contract-level test fails, AI shouldn’t auto-fix it. The failure is information. A human must decide: is this a real bug, or did requirements change?

Separate creation from modification. AI drafting new tests for new features is relatively safe. AI modifying existing tests is riskier. New tests add coverage. Modified tests might silently remove it.
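
These rules can stay convention, but they're easy to back with a mechanical check. Below is a sketch of a CI guard, under the assumptions that tests live in *.test.ts / *.test.tsx files and that the base branch is reachable as origin/main (or a BASE_REF variable): it fails the build whenever an existing expect(...) line is edited or deleted, so the change has to go past a human reviewer.

// scripts/check-assertions.ts - a rough guard, not a hardened tool
import { execSync } from 'node:child_process';

const base = process.env.BASE_REF ?? 'origin/main'; // assumed CI environment variable
const diff = execSync(
  `git diff ${base} -- '*.test.ts' '*.test.tsx'`,
  { encoding: 'utf8' },
);

// Removed lines ("-") containing an assertion mean an existing contract was
// touched; brand-new assertions ("+") pass through untouched.
const changedAssertions = diff
  .split('\n')
  .filter((line) => line.startsWith('-') && !line.startsWith('---'))
  .filter((line) => line.includes('expect('));

if (changedAssertions.length > 0) {
  console.error('Existing assertions were changed or removed:');
  for (const line of changedAssertions) console.error(`  ${line}`);
  console.error('Contract changes need explicit human approval.');
  process.exit(1);
}

It's crude (a reworded assertion isn't always a weakened one), but it turns "don't let AI silently rewrite contracts" from a habit into a gate.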

What Not to Test

Simple, obvious code doesn’t need tests. A component that renders a string as a heading doesn’t need a test proving it renders a heading. A utility that concatenates paths doesn’t need a test for every combination.

Test complex logic. Test edge cases. Test error handling. Test anything where a bug would be non-obvious or expensive to find later.
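
For contrast, here's the kind of test that earns its keep, a sketch assuming a hypothetical parseDuration utility: the bugs it guards against are the ones you'd otherwise meet in production.

import { test, expect } from 'vitest';
import { parseDuration } from '~/utils/parseDuration'; // hypothetical utility

// Edge cases and error paths are where the non-obvious bugs live
test('throws on malformed input instead of returning NaN', () => {
  expect(() => parseDuration('ninety minutes')).toThrow();
});

test('handles compound values like "1h30m"', () => {
  expect(parseDuration('1h30m')).toBe(90 * 60 * 1000);
});

Compare that with a test of the trivially obvious: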

// Congratulations, you've tested JavaScript
test('banana equals banana', () => {
  expect('🍌').toBe('🍌'); // ✅ PASS
});

Don’t test that React renders React components. Don’t test that TypeScript types are correct. Your test suite isn’t a proof of correctness; it’s a net that catches bugs that matter.

This restraint has a benefit: a smaller, focused test suite is easier to audit. When every test has a clear purpose, you can review what AI wrote and verify it matches intent.

The Coverage Trap

Coverage measures execution, not intent. A test that executes a line of code isn’t necessarily testing that the line does what it should.

Worse, coverage as a target incentivises exactly the wrong kind of tests. Need to hit 80%? Write tests that spy on every function, assert on every intermediate value. You’ll hit your number. You’ll also create a test suite that breaks whenever anyone improves the code.

// Written for coverage, not for value
test('increases coverage', () => {
  const result = processOrder(mockOrder);
  expect(processOrder).toHaveBeenCalled(); // So what?
  expect(result).toBeDefined(); // Still nothing
});

// Written for behaviour
test('completed orders update inventory', () => {
  const initialStock = getInventory('ABC');
  const order = createOrder({ items: [{ sku: 'ABC', quantity: 2 }] });
  processOrder(order);
  expect(getInventory('ABC')).toBe(initialStock - 2);
});

The real question isn’t “how much code did my tests execute?” It’s “would my tests catch a bug that matters?”

A Philosophy for Flux

Tests are how you know code is correct. When both code and tests are fluid, when AI can change either at will, you lose the ability to verify anything. The test that passed yesterday means nothing if it was rewritten to match today’s code.

The philosophy is simple:

Test what the code does, not how it does it.

Tests become specifications, not surveillance. They define what matters, not document what exists. And because they encode observable behaviour rather than internal structure, they remain human-auditable even when AI writes them.

When code is in constant flux, tests are your fixed point. They’re stable not because change is expensive, but because they define what “correct” means. Without that fixed point, you have no way to know if your fluid code is flowing in the right direction.