TDD with AI: Claude Writes Tests First, Then the Implementation
What is TDD with AI?
TDD with AI is a test-driven development workflow where an AI assistant like Claude Code generates the failing test suite first, based on a natural-language description of the module, and then implements the code to make those tests pass. This removes the main friction of classic TDD — designing APIs and edge cases before any implementation exists.
TL;DR
- TDD stalls because writing tests first requires knowing the API before the implementation exists — AI removes this block
- Workflow: describe the module in a prompt → Claude generates failing tests → Claude implements against them → refactor
- Key prompt rule: always include 'Do not write the implementation' — otherwise Claude skips TDD and writes code directly
- AI-generated tests tend to cover edge cases developers skip: null, empty arrays, concurrent access, timeouts
- The test file becomes the specification — review it before the implementation, not after
Most teams know TDD works. Few actually practice it. The friction is real: writing a test before the implementation means thinking through the API, edge cases, and module contract before a single line of logic exists. That’s cognitively expensive.
An AI assistant removes this barrier. Claude Code turns test-first from a discipline exercise into a natural development workflow. Tests become the specification, and the implementation is generated against them.
Why TDD Stalls Without AI
The classic TDD cycle: Red → Green → Refactor. Write a failing test, make it pass with minimal code, then refactor. In practice, the cycle breaks at the first step.
A developer sits down to write a test for a new module. But writing the test requires defining the interface, which requires understanding the architecture, which requires at least mentally sketching the implementation. The dependency is circular.
Three concrete problems:
Designing an API from scratch. A test requires calling a function that doesn’t exist yet. What arguments does it take? What does it return? What errors does it throw? Without a prototype implementation, these questions have no concrete answers.
Edge cases in the dark. Good tests cover boundary conditions. But boundary conditions only become visible once you understand the implementation. Empty arrays, null, timeouts, concurrent access — these are all much easier to spot when looking at code rather than a blank test file.
Motivation. A test that fails with “function not found” gives no sense of progress. The developer switches to the implementation “just to quickly verify the idea” — and never returns to test-first.
AI removes all three barriers. Claude generates a test specification from a description of the task, suggests edge cases based on data types and patterns, and creates an implementation strictly against the test contract.
Workflow: Test-First with Claude Code
The workflow has four steps. Each step has a concrete prompt and expected outcome.
Step 1. Specification via Tests
Claude Code prompt:
```
Write tests for the parseSchedule module.
Input: a free-form schedule string ("every Monday at 10:00", "daily at 9am").
Output: an object { dayOfWeek?: number, hour: number, minute: number, recurring: boolean }.
Cover: valid strings in EN/RU, invalid input, empty string, null.
Use vitest. Do not write the implementation.
```
Claude generates the test file:
```typescript
import { describe, it, expect } from 'vitest';
import { parseSchedule } from './parseSchedule';

describe('parseSchedule', () => {
  describe('english input', () => {
    it('parses "every Monday at 10:00"', () => {
      const result = parseSchedule('every Monday at 10:00');
      expect(result).toEqual({
        dayOfWeek: 1,
        hour: 10,
        minute: 0,
        recurring: true,
      });
    });

    it('parses "daily at 9am"', () => {
      const result = parseSchedule('daily at 9am');
      expect(result).toEqual({
        hour: 9,
        minute: 0,
        recurring: true,
      });
    });

    it('parses "Friday 14:30"', () => {
      const result = parseSchedule('Friday 14:30');
      expect(result).toEqual({
        dayOfWeek: 5,
        hour: 14,
        minute: 30,
        recurring: false,
      });
    });
  });

  describe('russian input', () => {
    it('parses "каждый понедельник в 10:00"', () => {
      const result = parseSchedule('каждый понедельник в 10:00');
      expect(result).toEqual({
        dayOfWeek: 1,
        hour: 10,
        minute: 0,
        recurring: true,
      });
    });

    it('parses "ежедневно в 9:00"', () => {
      const result = parseSchedule('ежедневно в 9:00');
      expect(result).toEqual({
        hour: 9,
        minute: 0,
        recurring: true,
      });
    });
  });

  describe('edge cases', () => {
    it('throws on empty string', () => {
      expect(() => parseSchedule('')).toThrow('Invalid schedule format');
    });

    it('throws on null', () => {
      expect(() => parseSchedule(null as any)).toThrow();
    });

    it('throws on unrecognized format', () => {
      expect(() => parseSchedule('maybe sometime')).toThrow(
        'Invalid schedule format'
      );
    });

    it('handles extra whitespace', () => {
      const result = parseSchedule(' daily at 9am ');
      expect(result).toEqual({
        hour: 9,
        minute: 0,
        recurring: true,
      });
    });
  });
});
```
All tests fail — parseSchedule doesn’t exist. That’s expected: the Red phase is complete.
What just happened: the tests locked in the contract. Input type, output structure, behavior on edge cases, supported formats. This file is already a full specification for the module.
Step 2. Minimal Implementation
Prompt:
```
Write an implementation of parseSchedule that passes all tests in parseSchedule.test.ts.
Minimal code. No extra features. Only what the tests require.
```
Claude reads the test file, extracts the expectations, and generates the implementation:
```typescript
const EN_DAYS: Record<string, number> = {
  sunday: 0, monday: 1, tuesday: 2, wednesday: 3,
  thursday: 4, friday: 5, saturday: 6,
};

const RU_DAYS: Record<string, number> = {
  воскресенье: 0, понедельник: 1, вторник: 2, среда: 3,
  четверг: 4, пятница: 5, суббота: 6,
};

interface Schedule {
  dayOfWeek?: number;
  hour: number;
  minute: number;
  recurring: boolean;
}

export function parseSchedule(input: string): Schedule {
  if (!input || typeof input !== 'string') {
    throw new Error('Invalid schedule format');
  }

  const normalized = input.trim().replace(/\s+/g, ' ').toLowerCase();
  if (!normalized) throw new Error('Invalid schedule format');

  const recurring =
    normalized.startsWith('every') ||
    normalized.startsWith('каждый') ||
    normalized.startsWith('каждая') ||
    normalized.startsWith('каждое') ||
    normalized.includes('daily') ||
    normalized.includes('ежедневно');

  const timeMatch = normalized.match(/(\d{1,2}):(\d{2})/);
  const amPmMatch = normalized.match(/(\d{1,2})(am|pm)/);

  let hour: number;
  let minute: number;

  if (timeMatch) {
    hour = parseInt(timeMatch[1], 10);
    minute = parseInt(timeMatch[2], 10);
  } else if (amPmMatch) {
    hour = parseInt(amPmMatch[1], 10);
    minute = 0;
    if (amPmMatch[2] === 'pm' && hour !== 12) hour += 12;
    if (amPmMatch[2] === 'am' && hour === 12) hour = 0;
  } else {
    throw new Error('Invalid schedule format');
  }

  const allDays = { ...EN_DAYS, ...RU_DAYS };
  let dayOfWeek: number | undefined;
  for (const [name, num] of Object.entries(allDays)) {
    if (normalized.includes(name)) {
      dayOfWeek = num;
      break;
    }
  }

  const result: Schedule = { hour, minute, recurring };
  if (dayOfWeek !== undefined) result.dayOfWeek = dayOfWeek;
  return result;
}
```
Run the tests: all green. Green phase complete.
Step 3. Refactoring Under Test Coverage
Prompt:
```
Refactor parseSchedule. Extract time and day parsing into separate functions.
Tests must not change. All must remain green.
```
Claude performs the refactoring and runs tests after each change. The tests are the safety net: if the refactoring breaks something, it’s immediately visible.
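One possible shape for the extracted helpers — a sketch, not Claude's actual output. The names parseTime and parseDay are illustrative, and only the English day table is shown for brevity:

```typescript
// Hypothetical refactoring sketch: time and day parsing pulled out of
// parseSchedule into focused helpers. Names are illustrative.
const EN_DAYS: Record<string, number> = {
  sunday: 0, monday: 1, tuesday: 2, wednesday: 3,
  thursday: 4, friday: 5, saturday: 6,
};

// Extracts "10:00"-style or "9am"/"9pm"-style times from a normalized string.
function parseTime(normalized: string): { hour: number; minute: number } {
  const timeMatch = normalized.match(/(\d{1,2}):(\d{2})/);
  if (timeMatch) {
    return { hour: parseInt(timeMatch[1], 10), minute: parseInt(timeMatch[2], 10) };
  }
  const amPmMatch = normalized.match(/(\d{1,2})(am|pm)/);
  if (amPmMatch) {
    let hour = parseInt(amPmMatch[1], 10);
    if (amPmMatch[2] === 'pm' && hour !== 12) hour += 12;
    if (amPmMatch[2] === 'am' && hour === 12) hour = 0;
    return { hour, minute: 0 };
  }
  throw new Error('Invalid schedule format');
}

// Returns the day-of-week number if a known day name occurs in the string.
function parseDay(normalized: string): number | undefined {
  for (const [name, num] of Object.entries(EN_DAYS)) {
    if (normalized.includes(name)) return num;
  }
  return undefined;
}
```

With helpers like these, parseSchedule shrinks to normalization plus a few calls, and the unchanged test suite confirms that nothing observable moved.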
Step 4. Extending Through New Tests
Need to add support for the “every 2 hours” format? Write the test first:
```typescript
it('parses interval "every 2 hours"', () => {
  const result = parseSchedule('every 2 hours');
  expect(result).toEqual({
    intervalHours: 2,
    recurring: true,
  });
});
```
Test fails. Claude prompt: “Make this test pass without breaking the existing ones.” The cycle repeats.
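A minimal sketch of the delta that could make the new test pass. The intervalHours field comes from the test; the helper name parseInterval and the widened Schedule type are assumptions:

```typescript
// Sketch of the new branch only; the rest of parseSchedule is assumed unchanged.
interface Schedule {
  dayOfWeek?: number;
  hour?: number;          // now optional: interval schedules carry no fixed time
  minute?: number;
  intervalHours?: number;
  recurring: boolean;
}

// Returns an interval schedule for strings like "every 2 hours", or null
// so the caller can fall through to the existing time/day parsing.
function parseInterval(normalized: string): Schedule | null {
  const m = normalized.match(/^every (\d{1,2}) hours?$/);
  if (!m) return null;
  return { intervalHours: parseInt(m[1], 10), recurring: true };
}
```

parseSchedule would call parseInterval first and return its result when non-null, leaving all existing branches untouched.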
Claude Code Prompt Templates for Test-First
Generating Tests for a New Module
```
Write tests for [module/function].
Context: [what the module does, what data it processes].
Input: [types, examples].
Expected output: [structure, types].
Edge cases: [specific cases or "suggest them yourself"].
Framework: [vitest/jest/pytest].
DO NOT write the implementation.
```
Generating Implementation from Tests
```
Write an implementation of [module] that passes all tests in [file.test.ts].
Minimal code. No unnecessary abstractions. Only what the tests require.
```
Expanding Coverage
```
Analyze [module] and its tests.
What scenarios are not covered? Add tests for missing edge cases.
Do not modify existing tests.
```
Refactoring
```
Refactor [module]. Tests must not change.
Goal: [readability / performance / extensibility].
Run tests after changes.
```
Comparison: Test-First vs Code-First with AI
In practice, developers use AI in two ways: ask it to generate code first, then tests (code-first), or the other way around (test-first). The difference in outcomes is significant.
| Criterion | Code-first + AI | Test-first + AI |
|---|---|---|
| Test quality | Tests are retrofitted to the implementation and miss bugs | Tests reflect requirements and catch real problems |
| Edge case coverage | AI tests what it wrote, not what was needed | Edge cases are defined from the specification |
| Refactoring | Risky to change: tests may break | Safe: tests are bound to the contract, not the implementation |
| Speed | Faster at the start | Faster in the long run |
| API quality | API forms organically | API is designed through tests |
The key difference: with code-first, AI generates tests that check what the code does. With test-first, AI generates tests that check what the code should do. The difference becomes critical during refactoring and feature expansion.
Real Example: A Validator with LLM-as-Judge
A practical task: an AI response validation module using LLM-as-Judge. The goal is to check that an LLM’s response meets quality criteria.
Tests are written first:
```typescript
describe('ResponseValidator', () => {
  it('accepts response meeting all criteria', async () => {
    const result = await validator.validate({
      response: 'Paris is the capital of France.',
      criteria: {
        factual: true,
        maxLength: 100,
        language: 'en',
      },
    });
    expect(result.passed).toBe(true);
    expect(result.scores.factual).toBeGreaterThan(0.8);
  });

  it('rejects response failing factual check', async () => {
    const result = await validator.validate({
      response: 'Berlin is the capital of France.',
      criteria: { factual: true },
    });
    expect(result.passed).toBe(false);
    expect(result.failures).toContain('factual');
  });

  it('handles LLM provider timeout', async () => {
    const result = await validator.validate({
      response: 'test',
      criteria: { factual: true },
      timeout: 100,
    });
    expect(result.passed).toBe(false);
    expect(result.error).toBe('timeout');
  });
});
```
The tests defined: the validate() interface, the criteria structure, the result format, and behavior on timeout. This is a complete specification. Claude implements the module strictly against the contract, including error handling that’s easy to forget with a code-first approach.
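A hedged sketch of what such a validator could look like with the LLM call injected as a dependency. The factory name createValidator, the Judge type, and the 0.8 pass threshold are assumptions derived from the tests above, not a real library API:

```typescript
// Hypothetical sketch. The judge function stands in for the LLM-as-Judge call;
// the 0.8 threshold mirrors the `toBeGreaterThan(0.8)` expectation in the tests.
type Criteria = { factual?: boolean; maxLength?: number; language?: string };

interface ValidationResult {
  passed: boolean;
  scores: Record<string, number>;
  failures: string[];
  error?: string;
}

// A judge scores one criterion for a response, 0..1.
type Judge = (response: string, criterion: string) => Promise<number>;

// Rejects if the wrapped promise does not settle within `ms`.
function withTimeout<T>(p: Promise<T>, ms?: number): Promise<T> {
  if (ms === undefined) return p;
  return Promise.race([
    p,
    new Promise<T>((_, reject) => setTimeout(() => reject(new Error('timeout')), ms)),
  ]);
}

export function createValidator(judge: Judge) {
  return {
    async validate(opts: {
      response: string;
      criteria: Criteria;
      timeout?: number;
    }): Promise<ValidationResult> {
      const result: ValidationResult = { passed: true, scores: {}, failures: [] };
      try {
        for (const criterion of Object.keys(opts.criteria)) {
          const score = await withTimeout(judge(opts.response, criterion), opts.timeout);
          result.scores[criterion] = score;
          if (score <= 0.8) {
            result.passed = false;
            result.failures.push(criterion);
          }
        }
      } catch {
        return { passed: false, scores: {}, failures: [], error: 'timeout' };
      }
      return result;
    },
  };
}
```

Injecting the judge keeps the tests fast and deterministic: a stub judge exercises the contract without a real LLM call.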
If your project already has multi-agent code review in place, tests go through review alongside the code. The agents verify that tests cover all declared scenarios and that the implementation doesn’t go beyond the contract.
Patterns and Anti-Patterns
What Works Well
One test, one behavior. Claude generates a precise implementation when each test checks exactly one aspect. “Parses a date in ISO format” — good. “Parses a date and validates and formats” — bad.
Typed expectations. Instead of expect(result).toBeTruthy() — expect(result).toEqual({ ... }). The more precise the expectation, the more precise the implementation.
Tests as documentation. The test name describes behavior: it('returns empty array when no items match filter'). Claude uses it to understand intent.
What Doesn’t Work
Tests on internal implementation. expect(cache.store).toHaveBeenCalledWith('key', 'value') — binding to the implementation. The test will break on refactoring even though behavior didn’t change.
Too many mocks. If a test needs five dependencies mocked, the problem is in the architecture, not the tests. Claude will create a working mock, but the implementation will be brittle.
Generating tests and code in one prompt. “Write a function and tests for it” — that’s code-first with the illusion of test-first. The tests will be retrofitted to the implementation. Always split into two steps.
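The first anti-pattern, made concrete with a hypothetical in-memory cache (the Cache class is invented for this example):

```typescript
// Illustrative contrast; the Cache class is made up for this example.
class Cache {
  private store = new Map<string, string>();
  set(key: string, value: string): void { this.store.set(key, value); }
  get(key: string): string | undefined { return this.store.get(key); }
}

// Implementation-bound (brittle): asserts HOW data is stored. Breaks the
// moment `store` becomes an LRU list, even though behavior is identical:
//   expect(storeSpy).toHaveBeenCalledWith('key', 'value');

// Behavior-bound (robust): asserts WHAT the cache guarantees.
function roundTripsValue(): boolean {
  const cache = new Cache();
  cache.set('key', 'value');
  return cache.get('key') === 'value';
}
```

The behavior-bound version survives any internal refactoring because it only touches the public contract.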
TDD for Infrastructure Code
Test-first works not just for business logic. For infrastructure modules, like a circuit breaker for edge functions, tests are especially valuable.
```typescript
describe('CircuitBreaker', () => {
  it('stays closed after successful calls', async () => {
    const breaker = new CircuitBreaker({ failureThreshold: 3 });
    await breaker.call(() => Promise.resolve('ok'));
    await breaker.call(() => Promise.resolve('ok'));
    expect(breaker.state).toBe('closed');
  });

  it('opens after reaching failure threshold', async () => {
    const breaker = new CircuitBreaker({ failureThreshold: 2 });
    await ignoreError(() => breaker.call(() => Promise.reject('fail')));
    await ignoreError(() => breaker.call(() => Promise.reject('fail')));
    expect(breaker.state).toBe('open');
  });

  it('rejects calls immediately when open', async () => {
    const breaker = openCircuitBreaker();
    await expect(breaker.call(() => Promise.resolve('ok')))
      .rejects.toThrow('Circuit is open');
  });

  it('transitions to half-open after cooldown', async () => {
    vi.useFakeTimers();
    const breaker = openCircuitBreaker({ cooldownMs: 5000 });
    vi.advanceTimersByTime(5000);
    expect(breaker.state).toBe('half-open');
    vi.useRealTimers();
  });
});
```
A state machine circuit breaker is fully described through tests: state transitions, thresholds, timers. Claude generates an implementation that correctly handles all transitions because every transition is pinned in a test.
Metrics: What to Measure
Defect escape rate. How many bugs slip past tests into production. With test-first, this number drops because edge cases are covered before any code is written.
Refactoring time. With tests bound to the contract, refactoring is safe. Without them, every change requires manual verification.
Coverage on the first pass. With a code-first approach, coverage after the first iteration is usually 40–60%. With test-first — 80–90%, because the tests already define all the main scenarios.
Number of iterations to green. With well-written tests, Claude usually makes all of them pass on the first generation. If it takes more than two iterations, the tests are likely too vague or contradictory.
Getting Started
There’s no need to convert an entire project to TDD in one day. Starting with a single new module is enough.
1. Pick an isolated module. A utility function, a validator, a parser. A module without heavy dependencies where inputs and outputs are easy to define.
2. Write tests with a prompt. Use the template from the section above. Describe the input, expected output, and edge cases. Let Claude generate the test file.
3. Confirm the tests fail. Run the tests. All should be red. If something is green without an implementation — the test is incorrect.
4. Generate the implementation. In a separate prompt, ask Claude to write code that passes all the tests. Don’t add “and also do X” — only what’s in the tests.
5. Run, refactor, repeat. Green tests — refactor. New requirements — new tests. The Red → Green → Refactor cycle runs without friction when AI handles the generation.
AI doesn’t replace TDD. AI makes TDD practical. The cognitive load of designing APIs through tests drops, the speed of the cycle increases, and the quality of the contract stays on par with manual design.