AI Code Review Checklist: Correctness, Security, Performance, Readability

What is AI code review?

AI code review is the use of language models to systematically analyze code across four priority-ordered categories — correctness, security, performance, and readability — each in a separate prompt pass to prevent lower-priority findings from obscuring critical bugs. It complements human review by catching edge cases and security issues that reviewers typically miss when starting from the surface.

TL;DR

  • Review in a fixed priority order: Correctness → Security → Performance → Readability; otherwise style comments consume 80% of review time
  • A business logic bug caught before merge costs 10–30x less than one that reaches production
  • Each category gets a separate LLM prompt to prevent security findings from mixing with style notes
  • Security stage checks: injections, data leaks, authorization issues, secrets in logs, dependency vulnerabilities
  • CI integration template included: run correctness + security checks as blocking gates on every PR

Most defects missed in code review are logical errors and edge cases — not formatting issues, not naming conventions. Google’s “Modern Code Review: A Case Study at Google” (Sadowski et al., 2018) examined review practices at scale, and since then the volume of AI-generated code has grown while reviewers still spend the same 15–30 minutes per PR.

Below is how to structure AI code review across four categories: correctness, security, performance, readability. Priority is in exactly that order. For each category: a checklist, an LLM prompt, and real finding examples. At the end — CI pipeline integration.

Why Category Order Matters

A typical code review starts at the surface. The reviewer notices a poorly named variable, suggests a refactor, discusses style. That consumes 80% of the time. Logical errors and security issues go unnoticed.

A fixed order solves this problem:

  1. Correctness — does the code do what it claims? Are edge cases handled?
  2. Security — any injections, data leaks, or authorization issues?
  3. Performance — any O(n²) where O(n) would do? Any unnecessary allocations?
  4. Readability — will the code make sense in six months? Does it match project conventions?

Each subsequent category is less critical. A bug in authorization logic matters more than a poor variable name. The LLM handles each category in a separate pass, keeping security comments separate from style notes.

Stage 1: Correctness — Logic and Edge Cases

The most expensive category of errors. A business logic bug that slips through review and reaches production costs 10–30x more than a bug caught before merge.

Checklist

  • Boundary values: null, empty string, empty array, 0, negative numbers
  • Off-by-one: < vs <=, array indices, pagination
  • Concurrent access: race conditions on parallel requests
  • Error handling: what happens on timeout, 500 error, connection drop
  • Data types: integer overflow, float precision loss, implicit coercions
  • Contracts: is input validated? Does output match the interface?
  • Idempotency: is it safe to call twice?
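
As a quick illustration of the boundary-value and off-by-one items, here is a minimal Python sketch (the paginate helper is hypothetical, not from the article or any library) written to survive the checklist:

```python
def paginate(items: list, page: int, page_size: int) -> list:
    """Return one page of items; `page` is 1-based."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")   # 0 / negative input
    if page < 1:
        raise ValueError("page must be >= 1")
    start = (page - 1) * page_size         # off-by-one: 1-based page -> 0-based index
    return items[start:start + page_size]  # slicing is safe on empty lists

paginate([], 1, 10)        # -> [] : empty input handled
paginate([1, 2, 3], 2, 2)  # -> [3] : last, partial page
```

It is also a pure function, so repeated calls are trivially idempotent, which covers the last checklist item.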

LLM Prompt

Conduct a code review for correctness. Check ONLY logical errors.

Context:
- Language: {language}
- Function purpose: {purpose}
- Called from: {caller context}

Check each item:
1. Boundary values (null, empty collections, 0, negative)
2. Off-by-one errors in loops and conditions
3. Race conditions on concurrent access
4. Error handling (timeouts, network failures, invalid response)
5. Data types (overflow, precision loss, implicit coercions)
6. Idempotency of repeated calls

For each finding, provide:
- Line of code
- What happens with a specific input
- How to fix it (one sentence)

Do not comment on style, naming, or formatting.

Example Finding

Discount calculation function:

def calculate_discount(price: float, quantity: int) -> float:
    if quantity > 10:
        return price * 0.9
    if quantity > 50:
        return price * 0.8
    return price

The LLM catches: the second condition, quantity > 50, is unreachable. Any quantity greater than 50 also satisfies quantity > 10, so the first branch returns first and the 20% discount is never applied. The conditions need to be reversed.

These kinds of bugs pass tests if the test case for quantity=100 only checks for the presence of a discount, not its magnitude.
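
For reference, the corrected version with the conditions reversed, checking the larger threshold first:

```python
def calculate_discount(price: float, quantity: int) -> float:
    if quantity > 50:        # larger threshold checked first
        return price * 0.8   # the 20% discount is now reachable
    if quantity > 10:
        return price * 0.9
    return price

calculate_discount(100.0, 100)  # -> 80.0, not 90.0
```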

Stage 2: Security — Vulnerabilities and Data Leaks

Security comes second because a vulnerability in working code is more dangerous than a bug: bugs surface as errors, vulnerabilities are exploited silently.

Checklist

  • Injections: SQL, NoSQL, command injection, XSS
  • Authentication: is identity verified on every endpoint?
  • Authorization: can a user access only their own data?
  • Secrets: any hardcoded keys, tokens, or passwords?
  • Logging: can PII or secrets end up in logs?
  • Deserialization: is input validated before parsing?
  • Dependencies: any known CVEs in imported packages?
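
The first checklist item, injections, comes down to one rule: user input must reach a query as a value, never as code. A minimal sketch using Python's built-in sqlite3 module (schema and data are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

user_input = "1 OR 1=1"  # hostile input

# Vulnerable: f-string interpolation would splice the input into the SQL:
#   conn.execute(f"SELECT * FROM users WHERE id = {user_input}")

# Safe: a parameterized query treats the input as a value, not SQL.
rows = conn.execute(
    "SELECT * FROM users WHERE id = ?", (user_input,)
).fetchall()
# rows == [] : "1 OR 1=1" matches no id instead of matching every row
```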

LLM Prompt

Conduct a security review of the code. Check ONLY vulnerabilities.

Context:
- Stack: {stack}
- This code is accessible as: {public API / internal / edge function}
- Authentication: {auth method}

Check against OWASP Top 10:
1. Injection (SQL, command, XSS) — does user input reach
   queries or HTML without sanitization?
2. Broken Access Control — can someone fetch another user's data
   by swapping user_id / tenant_id?
3. Cryptographic Failures — secrets in code, weak hashing,
   HTTP instead of HTTPS?
4. Security Misconfiguration — CORS *, debug=True, verbose errors?
5. SSRF — can the server be made to reach an internal resource?

For each finding:
- CWE number (if applicable)
- Severity: Critical / High / Medium / Low
- Exploit scenario in one sentence
- Fix in one sentence

Do not comment on performance, style, or logic.

Example Finding

Supabase Edge Function endpoint:

const { data } = await supabase
  .from('documents')
  .select('*')
  .eq('id', req.params.id);

return new Response(JSON.stringify(data));

The LLM catches: no user_id check. Any authenticated user can retrieve a document by ID, even if it belongs to someone else. IDOR (Insecure Direct Object Reference, CWE-639). Severity: High.

Fix: add .eq('user_id', user.id) or an RLS policy at the database level.
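
The shape of the fix, sketched in Python against an in-memory store (fetch_document and the field names are illustrative, mirroring the example):

```python
DOCUMENTS = [
    {"id": 1, "user_id": "alice", "body": "mine"},
    {"id": 2, "user_id": "bob",   "body": "not mine"},
]

def fetch_document(doc_id, current_user_id):
    # Equivalent of adding .eq('user_id', user.id): every lookup is
    # scoped by the authenticated owner, not just the document id.
    for doc in DOCUMENTS:
        if doc["id"] == doc_id and doc["user_id"] == current_user_id:
            return doc
    return None  # "not found" and "not yours" look identical to the caller

fetch_document(2, "alice")  # -> None : alice cannot read bob's document
```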

For more on how LLMs can automatically evaluate code quality, see LLM-as-Judge: Automated Quality Gate.

Stage 3: Performance — Resources and Scalability

Performance is checked after correctness and security. Fast but incorrect or insecure code is useless.

Checklist

  • Algorithm complexity: any nested loops over the same data?
  • N+1 queries: a loop with a database call inside
  • Unnecessary allocations: object creation in a hot loop
  • Missing caching: the same data requested repeatedly
  • Response size: SELECT * instead of needed fields
  • Indexes: filtering on unindexed fields
  • Memory leaks: subscriptions without cleanup, unclosed connections

LLM Prompt

Conduct a performance review. Check ONLY performance issues.

Context:
- Expected load: {requests per second / dataset size}
- Environment: {serverless / server / edge}
- DB: {database}

Check:
1. Algorithmic complexity — O(n²) or worse?
2. N+1 queries — a loop with a query inside?
3. Unnecessary allocations — objects created in the hot path?
4. SELECT * — returning unneeded fields?
5. Missing indexes — WHERE/ORDER BY on an unindexed field?
6. Leaks — unclosed connections, subscriptions without cleanup?

For each finding:
- Current complexity / cost
- At what data volume this becomes a problem
- Fix in one sentence

Do not comment on correctness, security, or style.

Example Finding

Loading user activity:

const users = await getActiveUsers(); // 500 users

const activity = [];
for (const user of users) {
  const logs = await db.query(
    `SELECT * FROM activity_logs WHERE user_id = $1`,
    [user.id]
  );
  activity.push({ user, logs });
}

The LLM catches: classic N+1. With 500 active users, this executes 501 database queries. On serverless, each query has 2–5ms latency — that’s 1–2.5 seconds just for queries. Solution: one query with WHERE user_id = ANY($1) and grouping on the application side.
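
A sketch of that fix in Python: one batched query, then grouping in application code. fetch_logs_for_users is a hypothetical helper standing in for the single WHERE user_id = ANY($1) query:

```python
from collections import defaultdict

def group_activity(users, fetch_logs_for_users):
    # 1 batched round trip instead of N queries in a loop
    logs = fetch_logs_for_users([u["id"] for u in users])
    by_user = defaultdict(list)
    for log in logs:
        by_user[log["user_id"]].append(log)
    return [{"user": u, "logs": by_user[u["id"]]} for u in users]

# Illustrative usage with a stubbed data source:
users = [{"id": 1}, {"id": 2}]
fake_fetch = lambda ids: [{"user_id": 1, "action": "login"}]
activity = group_activity(users, fake_fetch)
# activity[1]["logs"] == [] : user 2 simply has no rows
```

The grouping pass is linear in the number of log rows, so 500 users cost one round trip instead of 500.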

For edge functions, this is critical. The Circuit Breaker in Deno Edge Functions article describes how timeouts from slow queries cause cascading failures.

Stage 4: Readability — Maintainability and Conventions

The last category. Readability comments should not block a merge if the first three stages pass.

Checklist

  • Naming: do variable and function names reflect their purpose?
  • Function size: more than 30–40 lines — candidate for splitting
  • Comments: explain “why”, not “what”
  • Dead code: unused variables, unreachable branches
  • Duplication: the same logic in multiple places
  • Conventions: matches the project’s style (not generic best practices)
  • Typing: concrete types instead of any / object

LLM Prompt

Conduct a readability review. Check ONLY code maintainability.

Context:
- Project style: {link to conventions or examples}
- Patterns: {DI framework, error handling pattern, etc.}

Check:
1. Naming — are variables and functions understandable without context?
2. Size — functions longer than 40 lines?
3. Dead code — unused imports, variables, branches?
4. Duplication — is logic repeated? Is there an existing helper?
5. Typing — any/unknown/object instead of concrete types?
6. Conventions — does the code match the style of the rest of the project?

For each finding:
- Severity: nit / suggestion / convention-violation
- One sentence on what to improve

Do not comment on logic, security, or performance.

Example Finding

const d = await fetchData(id);
const r = processResult(d);
if (r.s === 'ok') {
  await save(r.d);
}

The LLM catches: single-letter variables d, r, and properties s and d without context. Six months later, it’s impossible to understand that r.s is a status and r.d is processed data. Severity: suggestion.
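
In Python terms, the fix is nothing more than naming; the helpers below are stubs standing in for the original fetchData / processResult / save:

```python
import asyncio

SAVED = []

async def fetch_data(record_id):           # stub for the original fetchData
    return {"record_id": record_id}

def process_result(data):                  # stub for processResult
    return {"status": "ok", "payload": data}

async def save(payload):                   # stub for save
    SAVED.append(payload)

async def save_if_ok(record_id):
    data = await fetch_data(record_id)
    result = process_result(data)          # was: r
    if result["status"] == "ok":           # was: r.s === 'ok'
        await save(result["payload"])      # was: r.d

asyncio.run(save_if_ok(42))
```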

Combining Into a Single Review Pipeline

Four passes produce a lot of comments. A structure is needed for the final report.

Output Format

## Code Review Summary

### Correctness (blocking)
- [C1] line 42: quantity > 50 unreachable — reverse condition order
- [C2] line 87: no null response handling from API

### Security (blocking)
- [S1] line 15: IDOR — missing user_id check (CWE-639, High)

### Performance (warnings)
- [P1] line 23-28: N+1 queries, replace with batch query

### Readability (recommendations)
- [R1] line 5: single-letter variables — rename
- [R2] line 33: dead code — remove unused import

Each finding has a unique ID for PR discussion. Correctness and Security block the merge. Performance and Readability are at the author’s discretion.
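
Because the IDs encode the category, the blocking rule can be enforced mechanically. A sketch that pulls finding IDs out of the summary format above (the summary string is an abridged example):

```python
import re

def blocking_findings(summary: str):
    # C* (Correctness) and S* (Security) block; P*/R* do not.
    ids = re.findall(r"\[([CSPR]\d+)\]", summary)
    return [fid for fid in ids if fid[0] in ("C", "S")]

summary = """
- [C1] line 42: unreachable condition
- [S1] line 15: IDOR
- [P1] line 23-28: N+1 queries
- [R1] line 5: single-letter variables
"""
blocking_findings(summary)  # -> ['C1', 'S1']
```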

Multi-Agent Review

A single LLM misses errors. Two LLMs working independently miss fewer. In practice, a multi-agent approach finds significantly more defects than a single pass.

Practical implementation: the first agent (Claude) runs all four stages. The second agent (GPT or Gemini) checks only Correctness and Security. Findings are compared. If both agents flagged the same issue, confidence is high. If only one did — manual review is needed.
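
The comparison step itself is plain set logic once findings are normalized to comparable keys (here a hypothetical "rule:line" string):

```python
def merge_findings(agent_a: set, agent_b: set) -> dict:
    return {
        "confirmed": agent_a & agent_b,    # both agents agree: high confidence
        "needs_review": agent_a ^ agent_b, # only one agent flagged it
    }

merge_findings({"IDOR:15", "N+1:23"}, {"IDOR:15"})
# confirmed: {'IDOR:15'}, needs_review: {'N+1:23'}
```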

Claude Concilium implements this approach via MCP: parallel requests to multiple LLMs with result merging.

CI/CD Pipeline Integration

Automated review on every PR. Basic architecture:

# .github/workflows/ai-review.yml
name: AI Code Review
on:
  pull_request:
    types: [opened, synchronize]

jobs:
  ai-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Get changed files
        id: diff
        run: |
          echo "files=$(git diff --name-only origin/main...HEAD | grep -E '\.(ts|py|go|rs)$' | tr '\n' ' ')" >> $GITHUB_OUTPUT

      - name: Run AI Review
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          for file in ${{ steps.diff.outputs.files }}; do
            diff=$(git diff origin/main...HEAD -- "$file")
            # Prompt with 4 review stages
            claude --print "Review this diff: $diff" >> review_output.md
          done

      - name: Post review comments
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const review = fs.readFileSync('review_output.md', 'utf8');
            await github.rest.pulls.createReview({
              owner: context.repo.owner,
              repo: context.repo.repo,
              pull_number: context.issue.number,
              body: review,
              event: 'COMMENT'
            });

Key Setup Decisions

What to review. Only changed files. Full repo scans on every PR are too costly. Filter by extension: exclude .md, .json, and config files.

How to manage cost. One diff per API call. With 10 files in a PR, that’s 10 calls at ~2000 tokens each. With claude-sonnet-4 — roughly $0.10–0.15 per average PR. One hour of a human reviewer costs $50–100.

Whether to block merges. At the start — no. AI review as an informational comment. After calibration (2–4 weeks), you can enable blocking for Correctness and Security at severity >= High.

False positives. They will happen. Typical AI code review accuracy is 70–80%. Of 10 comments, 2–3 are irrelevant. This is acceptable if the review saves time on the remaining 7–8. To reduce false positives, add project context to the prompt: conventions, common patterns, architectural decisions.
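
One way to wire that context in is to prepend it to every stage prompt. A sketch (the wording is illustrative, not a tested prompt):

```python
def build_review_prompt(diff: str, conventions: str, stage_prompt: str) -> str:
    # Project context first, so the model can suppress findings that
    # merely deviate from generic best practices but match the project.
    return (
        "Project context (follow these conventions; do not flag code "
        "that matches them):\n"
        f"{conventions}\n\n"
        f"{stage_prompt}\n\n"
        f"Diff to review:\n{diff}"
    )
```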

AI Review Effectiveness Metrics

| Metric | How to measure | Target |
|---|---|---|
| True Positive Rate | Accepted comments / all comments | > 70% |
| Blocked Bugs | Bugs caught by AI before merge | Grows over time |
| Time to Review | Time from PR to first comment | < 5 minutes |
| Cost per PR | API call cost per PR | < $0.10 |
| Human Review Time | Manual review time after AI | Drops by 30–50% |

Track data from day one. Within a month, you’ll know which categories yield the most findings and which prompts need refinement.
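
Computing the first metric is a one-liner once each comment's outcome is recorded (the accepted field is illustrative):

```python
def true_positive_rate(comments: list) -> float:
    # Accepted comments / all comments; 0.0 when there is no data yet.
    if not comments:
        return 0.0
    accepted = sum(1 for c in comments if c["accepted"])
    return accepted / len(comments)

true_positive_rate([
    {"accepted": True}, {"accepted": True}, {"accepted": False},
])  # -> 0.666..., below the 70% target
```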

Getting Started

Week 1. Take the Correctness prompt. Manually run the last 5 PRs from your team through Claude or GPT. Record how many findings matched what the human reviewer caught (or missed).

Week 2. Add the Security prompt. Compare findings with SAST tools (Semgrep, CodeQL). LLMs often catch logical vulnerabilities that SAST misses, but miss pattern-based ones that SAST finds reliably. They complement each other.

Week 3. Set up the CI pipeline. Start with one repository, one language. PR comment without blocking.

Week 4. Calculate metrics. True Positive Rate below 50%? Refine prompts, add project context. Above 70%? Enable blocking for Critical/High.

Four categories in a fixed order. A separate pass for each. A prompt that forbids commenting on other categories. That’s enough to catch most defects before merge.