TDD with AI: Claude Writes Tests First, Then the Implementation
What is TDD with AI?
TDD with AI is a test-driven development workflow where an AI assistant like Claude Code generates the failing test suite first, based on a natural-language description of the module, and then implements the code to make those tests pass. This removes the main friction of classic TDD — designing APIs and edge cases before any implementation exists.
TL;DR
- TDD stalls because writing tests first requires knowing the API before the implementation exists — AI removes this block
- Workflow: describe the module in a prompt → Claude generates failing tests → Claude implements against them → refactor
- Key prompt rule: always include 'Do not write the implementation' — otherwise Claude skips TDD and writes code directly
- AI-generated tests tend to cover edge cases developers skip: null, empty arrays, concurrent access, timeouts
- The test file becomes the specification — review it before the implementation, not after
Most teams know TDD works. Few actually practice it. The friction is real: writing a test before the implementation means thinking through the API, edge cases, and module contract before a single line of logic exists. That’s cognitively expensive.
An AI assistant removes this barrier. Claude Code turns test-first from a discipline exercise into a natural development workflow. Tests become the specification, and the implementation is generated against them.
Why TDD Stalls Without AI
The classic TDD cycle: Red → Green → Refactor. Write a failing test, make it pass with minimal code, then refactor. In practice, the cycle breaks at the first step.
A developer sits down to write a test for a new module. But writing the test requires defining the interface, which requires understanding the architecture, which requires at least mentally sketching the implementation. The dependency is circular.
Three concrete problems:
Designing an API from scratch. A test requires calling a function that doesn’t exist yet. What arguments does it take? What does it return? What errors does it throw? Without a prototype implementation, these questions have no concrete answers.
Edge cases in the dark. Good tests cover boundary conditions. But boundary conditions only become visible once you understand the implementation. Empty arrays, null, timeouts, concurrent access — these are all much easier to spot when looking at code rather than a blank test file.
Motivation. A test that fails with “function not found” gives no sense of progress. The developer switches to the implementation “just to quickly verify the idea” — and never returns to test-first.
AI removes all three barriers. Claude generates a test specification from a description of the task, suggests edge cases based on data types and patterns, and creates an implementation strictly against the test contract.
Workflow: Test-First with Claude Code
The workflow has four steps. Each step has a concrete prompt and expected outcome.
Step 1. Specification via Tests
Claude Code prompt:
```
Write tests for the parseSchedule module.
Input: a free-form schedule string ("every Monday at 10:00", "daily at 9am").
Output: an object { dayOfWeek?: number, hour: number, minute: number, recurring: boolean }.
Cover: valid strings in EN/RU, invalid input, empty string, null.
Use vitest. Do not write the implementation.
```
Claude generates the test file:
```typescript
import { describe, it, expect } from 'vitest';
import { parseSchedule } from './parseSchedule';

describe('parseSchedule', () => {
  describe('english input', () => {
    it('parses "every Monday at 10:00"', () => {
      const result = parseSchedule('every Monday at 10:00');
      expect(result).toEqual({
        dayOfWeek: 1,
        hour: 10,
        minute: 0,
        recurring: true,
      });
    });

    it('parses "daily at 9am"', () => {
      const result = parseSchedule('daily at 9am');
      expect(result).toEqual({
        hour: 9,
        minute: 0,
        recurring: true,
      });
    });

    it('parses "Friday 14:30"', () => {
      const result = parseSchedule('Friday 14:30');
      expect(result).toEqual({
        dayOfWeek: 5,
        hour: 14,
        minute: 30,
        recurring: false,
      });
    });
  });

  describe('russian input', () => {
    it('parses "каждый понедельник в 10:00"', () => {
      const result = parseSchedule('каждый понедельник в 10:00');
      expect(result).toEqual({
        dayOfWeek: 1,
        hour: 10,
        minute: 0,
        recurring: true,
      });
    });

    it('parses "ежедневно в 9:00"', () => {
      const result = parseSchedule('ежедневно в 9:00');
      expect(result).toEqual({
        hour: 9,
        minute: 0,
        recurring: true,
      });
    });
  });

  describe('edge cases', () => {
    it('throws on empty string', () => {
      expect(() => parseSchedule('')).toThrow('Invalid schedule format');
    });

    it('throws on null', () => {
      expect(() => parseSchedule(null as any)).toThrow();
    });

    it('throws on unrecognized format', () => {
      expect(() => parseSchedule('maybe sometime')).toThrow(
        'Invalid schedule format'
      );
    });

    it('handles extra whitespace', () => {
      const result = parseSchedule(' daily at 9am ');
      expect(result).toEqual({
        hour: 9,
        minute: 0,
        recurring: true,
      });
    });
  });
});
```
All tests fail — parseSchedule doesn’t exist. That’s expected: the Red phase is complete.
What just happened: the tests locked in the contract. Input type, output structure, behavior on edge cases, supported formats. This file is already a full specification for the module.
Step 2. Minimal Implementation
Prompt:
```
Write an implementation of parseSchedule that passes all tests in parseSchedule.test.ts.
Minimal code. No extra features. Only what the tests require.
```
Claude reads the test file, extracts the expectations, and generates the implementation:
```typescript
const EN_DAYS: Record<string, number> = {
  sunday: 0, monday: 1, tuesday: 2, wednesday: 3,
  thursday: 4, friday: 5, saturday: 6,
};

const RU_DAYS: Record<string, number> = {
  воскресенье: 0, понедельник: 1, вторник: 2, среда: 3,
  четверг: 4, пятница: 5, суббота: 6,
};

interface Schedule {
  dayOfWeek?: number;
  hour: number;
  minute: number;
  recurring: boolean;
}

export function parseSchedule(input: string): Schedule {
  if (!input || typeof input !== 'string') {
    throw new Error('Invalid schedule format');
  }

  const normalized = input.trim().replace(/\s+/g, ' ').toLowerCase();
  if (!normalized) throw new Error('Invalid schedule format');

  const recurring =
    normalized.startsWith('every') ||
    normalized.startsWith('каждый') ||
    normalized.startsWith('каждая') ||
    normalized.startsWith('каждое') ||
    normalized.includes('daily') ||
    normalized.includes('ежедневно');

  const timeMatch = normalized.match(/(\d{1,2}):(\d{2})/);
  const amPmMatch = normalized.match(/(\d{1,2})(am|pm)/);

  let hour: number;
  let minute: number;

  if (timeMatch) {
    hour = parseInt(timeMatch[1], 10);
    minute = parseInt(timeMatch[2], 10);
  } else if (amPmMatch) {
    hour = parseInt(amPmMatch[1], 10);
    minute = 0;
    if (amPmMatch[2] === 'pm' && hour !== 12) hour += 12;
    if (amPmMatch[2] === 'am' && hour === 12) hour = 0;
  } else {
    throw new Error('Invalid schedule format');
  }

  const allDays = { ...EN_DAYS, ...RU_DAYS };
  let dayOfWeek: number | undefined;
  for (const [name, num] of Object.entries(allDays)) {
    if (normalized.includes(name)) {
      dayOfWeek = num;
      break;
    }
  }

  const result: Schedule = { hour, minute, recurring };
  if (dayOfWeek !== undefined) result.dayOfWeek = dayOfWeek;
  return result;
}
```
Run the tests: all green. Green phase complete.
Step 3. Refactoring Under Test Coverage
Prompt:
```
Refactor parseSchedule. Extract time and day parsing into separate functions.
Tests must not change. All must remain green.
```
Claude performs the refactoring and runs tests after each change. The tests are the safety net: if the refactoring breaks something, it’s immediately visible.
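One possible shape for the extracted helpers — a sketch, not Claude's actual output. The names parseTime and parseDay are illustrative, and only the English day table is shown for brevity:

```typescript
// Hypothetical refactoring sketch: time and day parsing pulled out of
// parseSchedule into focused helpers. Names are illustrative.
const EN_DAYS: Record<string, number> = {
  sunday: 0, monday: 1, tuesday: 2, wednesday: 3,
  thursday: 4, friday: 5, saturday: 6,
};

// Extracts "10:00"-style or "9am"/"9pm"-style times from a normalized string.
function parseTime(normalized: string): { hour: number; minute: number } {
  const timeMatch = normalized.match(/(\d{1,2}):(\d{2})/);
  if (timeMatch) {
    return { hour: parseInt(timeMatch[1], 10), minute: parseInt(timeMatch[2], 10) };
  }
  const amPmMatch = normalized.match(/(\d{1,2})(am|pm)/);
  if (amPmMatch) {
    let hour = parseInt(amPmMatch[1], 10);
    if (amPmMatch[2] === 'pm' && hour !== 12) hour += 12;
    if (amPmMatch[2] === 'am' && hour === 12) hour = 0;
    return { hour, minute: 0 };
  }
  throw new Error('Invalid schedule format');
}

// Returns the day-of-week number if a known day name occurs in the string.
function parseDay(normalized: string): number | undefined {
  for (const [name, num] of Object.entries(EN_DAYS)) {
    if (normalized.includes(name)) return num;
  }
  return undefined;
}
```

With helpers like these, parseSchedule shrinks to normalization plus a few calls, and the unchanged test suite confirms that nothing observable moved.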
Step 4. Extending Through New Tests
Need to add support for the “every 2 hours” format? Write the test first:
```typescript
it('parses interval "every 2 hours"', () => {
  const result = parseSchedule('every 2 hours');
  expect(result).toEqual({
    intervalHours: 2,
    recurring: true,
  });
});
```
Test fails. Claude prompt: “Make this test pass without breaking the existing ones.” The cycle repeats.
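A minimal sketch of the delta that could make the new test pass. The intervalHours field comes from the test; the helper name parseInterval and the widened Schedule type are assumptions:

```typescript
// Sketch of the new branch only; the rest of parseSchedule is assumed unchanged.
interface Schedule {
  dayOfWeek?: number;
  hour?: number;          // now optional: interval schedules carry no fixed time
  minute?: number;
  intervalHours?: number;
  recurring: boolean;
}

// Returns an interval schedule for strings like "every 2 hours", or null
// so the caller can fall through to the existing time/day parsing.
function parseInterval(normalized: string): Schedule | null {
  const m = normalized.match(/^every (\d{1,2}) hours?$/);
  if (!m) return null;
  return { intervalHours: parseInt(m[1], 10), recurring: true };
}
```

parseSchedule would call parseInterval first and return its result when non-null, leaving all existing branches untouched.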
Claude Code Prompt Templates for Test-First
Generating Tests for a New Module
```
Write tests for [module/function].
Context: [what the module does, what data it processes].
Input: [types, examples].
Expected output: [structure, types].
Edge cases: [specific cases or "suggest them yourself"].
Framework: [vitest/jest/pytest].
DO NOT write the implementation.
```
Generating Implementation from Tests
```
Write an implementation of [module] that passes all tests in [file.test.ts].
Minimal code. No unnecessary abstractions. Only what the tests require.
```
Expanding Coverage
```
Analyze [module] and its tests.
What scenarios are not covered? Add tests for missing edge cases.
Do not modify existing tests.
```
Refactoring
```
Refactor [module]. Tests must not change.
Goal: [readability / performance / extensibility].
Run tests after changes.
```
Comparison: Test-First vs Code-First with AI
In practice, developers use AI in two ways: ask it to generate code first, then tests (code-first), or the other way around (test-first). The difference in outcomes is significant.
| Criterion | Code-first + AI | Test-first + AI |
|---|---|---|
| Test quality | Tests are retrofitted to the implementation and miss bugs | Tests reflect requirements and catch real problems |
| Edge case coverage | AI tests what it wrote, not what was needed | Edge cases are defined from the specification |
| Refactoring | Risky to change: tests may break | Safe: tests are bound to the contract, not the implementation |
| Speed | Faster at the start | Faster in the long run |
| API quality | API forms organically | API is designed through tests |
The key difference: with code-first, AI generates tests that check what the code does. With test-first, AI generates tests that check what the code should do. The difference becomes critical during refactoring and feature expansion.
Real Example: A Validator with LLM-as-Judge
A practical task: an AI response validation module using LLM-as-Judge. The goal is to check that an LLM’s response meets quality criteria.
Tests are written first:
```typescript
describe('ResponseValidator', () => {
  it('accepts response meeting all criteria', async () => {
    const result = await validator.validate({
      response: 'Paris is the capital of France.',
      criteria: {
        factual: true,
        maxLength: 100,
        language: 'en',
      },
    });
    expect(result.passed).toBe(true);
    expect(result.scores.factual).toBeGreaterThan(0.8);
  });

  it('rejects response failing factual check', async () => {
    const result = await validator.validate({
      response: 'Berlin is the capital of France.',
      criteria: { factual: true },
    });
    expect(result.passed).toBe(false);
    expect(result.failures).toContain('factual');
  });

  it('handles LLM provider timeout', async () => {
    const result = await validator.validate({
      response: 'test',
      criteria: { factual: true },
      timeout: 100,
    });
    expect(result.passed).toBe(false);
    expect(result.error).toBe('timeout');
  });
});
```
The tests defined: the validate() interface, the criteria structure, the result format, and behavior on timeout. This is a complete specification. Claude implements the module strictly against the contract, including error handling that’s easy to forget with a code-first approach.
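A hedged sketch of what such a validator could look like with the LLM call injected as a dependency. The factory name createValidator, the Judge type, and the 0.8 pass threshold are assumptions derived from the tests above, not a real library API:

```typescript
// Hypothetical sketch. The judge function stands in for the LLM-as-Judge call;
// the 0.8 threshold mirrors the `toBeGreaterThan(0.8)` expectation in the tests.
type Criteria = { factual?: boolean; maxLength?: number; language?: string };

interface ValidationResult {
  passed: boolean;
  scores: Record<string, number>;
  failures: string[];
  error?: string;
}

// A judge scores one criterion for a response, 0..1.
type Judge = (response: string, criterion: string) => Promise<number>;

// Rejects if the wrapped promise does not settle within `ms`.
function withTimeout<T>(p: Promise<T>, ms?: number): Promise<T> {
  if (ms === undefined) return p;
  return Promise.race([
    p,
    new Promise<T>((_, reject) => setTimeout(() => reject(new Error('timeout')), ms)),
  ]);
}

export function createValidator(judge: Judge) {
  return {
    async validate(opts: {
      response: string;
      criteria: Criteria;
      timeout?: number;
    }): Promise<ValidationResult> {
      const result: ValidationResult = { passed: true, scores: {}, failures: [] };
      try {
        for (const criterion of Object.keys(opts.criteria)) {
          const score = await withTimeout(judge(opts.response, criterion), opts.timeout);
          result.scores[criterion] = score;
          if (score <= 0.8) {
            result.passed = false;
            result.failures.push(criterion);
          }
        }
      } catch {
        return { passed: false, scores: {}, failures: [], error: 'timeout' };
      }
      return result;
    },
  };
}
```

Injecting the judge keeps the tests fast and deterministic: a stub judge exercises the contract without a real LLM call.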
If your project already has multi-agent code review in place, tests go through review alongside the code. The agents verify that tests cover all declared scenarios and that the implementation doesn’t go beyond the contract.
Patterns and Anti-Patterns
What Works Well
One test, one behavior. Claude generates a precise implementation when each test checks exactly one aspect. “Parses a date in ISO format” — good. “Parses a date and validates and formats” — bad.
Typed expectations. Instead of expect(result).toBeTruthy() — expect(result).toEqual({ ... }). The more precise the expectation, the more precise the implementation.
Tests as documentation. The test name describes behavior: it('returns empty array when no items match filter'). Claude uses it to understand intent.
What Doesn’t Work
Tests on internal implementation. expect(cache.store).toHaveBeenCalledWith('key', 'value') — binding to the implementation. The test will break on refactoring even though behavior didn’t change.
Too many mocks. If a test needs five dependencies mocked, the problem is in the architecture, not the tests. Claude will create a working mock, but the implementation will be brittle.
Generating tests and code in one prompt. “Write a function and tests for it” — that’s code-first with the illusion of test-first. The tests will be retrofitted to the implementation. Always split into two steps.
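The first anti-pattern, made concrete with a hypothetical in-memory cache (the Cache class is invented for this example):

```typescript
// Illustrative contrast; the Cache class is made up for this example.
class Cache {
  private store = new Map<string, string>();
  set(key: string, value: string): void { this.store.set(key, value); }
  get(key: string): string | undefined { return this.store.get(key); }
}

// Implementation-bound (brittle): asserts HOW data is stored. Breaks the
// moment `store` becomes an LRU list, even though behavior is identical:
//   expect(storeSpy).toHaveBeenCalledWith('key', 'value');

// Behavior-bound (robust): asserts WHAT the cache guarantees.
function roundTripsValue(): boolean {
  const cache = new Cache();
  cache.set('key', 'value');
  return cache.get('key') === 'value';
}
```

The behavior-bound version survives any internal refactoring because it only touches the public contract.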
TDD for Infrastructure Code
Test-first works not just for business logic. For infrastructure modules, like a circuit breaker for edge functions, tests are especially valuable.
```typescript
describe('CircuitBreaker', () => {
  it('stays closed after successful calls', async () => {
    const breaker = new CircuitBreaker({ failureThreshold: 3 });
    await breaker.call(() => Promise.resolve('ok'));
    await breaker.call(() => Promise.resolve('ok'));
    expect(breaker.state).toBe('closed');
  });

  it('opens after reaching failure threshold', async () => {
    const breaker = new CircuitBreaker({ failureThreshold: 2 });
    await ignoreError(() => breaker.call(() => Promise.reject('fail')));
    await ignoreError(() => breaker.call(() => Promise.reject('fail')));
    expect(breaker.state).toBe('open');
  });

  it('rejects calls immediately when open', async () => {
    const breaker = openCircuitBreaker();
    await expect(breaker.call(() => Promise.resolve('ok')))
      .rejects.toThrow('Circuit is open');
  });

  it('transitions to half-open after cooldown', async () => {
    vi.useFakeTimers();
    const breaker = openCircuitBreaker({ cooldownMs: 5000 });
    vi.advanceTimersByTime(5000);
    expect(breaker.state).toBe('half-open');
    vi.useRealTimers();
  });
});
```
A state machine circuit breaker is fully described through tests: state transitions, thresholds, timers. Claude generates an implementation that correctly handles all transitions because every transition is pinned in a test.
Metrics: What to Measure
Defect escape rate. How many bugs slip past tests into production. With test-first, this number drops because edge cases are covered before any code is written.
Refactoring time. With tests bound to the contract, refactoring is safe. Without them, every change requires manual verification.
Coverage on the first pass. With a code-first approach, coverage after the first iteration is usually 40–60%. With test-first — 80–90%, because the tests already define all the main scenarios.
Number of iterations to green. With well-written tests, Claude usually makes all of them pass on the first generation. If it takes more than two iterations, the tests are likely too vague or contradictory.
Getting Started
There’s no need to convert an entire project to TDD in one day. Starting with a single new module is enough.
1. Pick an isolated module. A utility function, a validator, a parser. A module without heavy dependencies where inputs and outputs are easy to define.
2. Write tests with a prompt. Use the template from the section above. Describe the input, expected output, and edge cases. Let Claude generate the test file.
3. Confirm the tests fail. Run the tests. All should be red. If something is green without an implementation — the test is incorrect.
4. Generate the implementation. In a separate prompt, ask Claude to write code that passes all the tests. Don’t add “and also do X” — only what’s in the tests.
5. Run, refactor, repeat. Green tests — refactor. New requirements — new tests. The Red → Green → Refactor cycle runs without friction when AI handles the generation.
AI doesn’t replace TDD. AI makes TDD practical. The cognitive load of designing APIs through tests drops, the speed of the cycle increases, and the quality of the contract stays on par with manual design.