
LLM-as-Judge: Automated Quality Gate for LLM Outputs in Production

What is LLM-as-Judge?

LLM-as-Judge is a pattern where one language model automatically evaluates the outputs of another against defined quality criteria, acting as a scalable quality gate in production. Unlike traditional monitoring metrics, it assesses the actual content of responses — detecting hallucinations, irrelevance, and toxicity — and agrees with human raters over 80% of the time.

TL;DR

  • LLM-as-Judge agrees with human raters 80%+ of the time — comparable to two humans agreeing with each other (81%)
  • HTTP 200 OK tells you nothing about output quality; a model can hallucinate in 15% of responses with no visible signal
  • Start with 2–3 metrics: faithfulness + answer relevance for RAG, relevance + toxicity for chatbots, task completion for agents
  • DeepEval provides ready-made metric implementations with Pydantic schemas; Langfuse handles production monitoring and alerting
  • A judge prompt must include a scoring rubric with explicit criteria — vague instructions produce inconsistent scores

LLM-as-Judge is a pattern where one language model evaluates another model’s outputs against defined criteria. It acts as an automatic quality gate: every response gets checked before reaching the user, or after the fact, for monitoring. Standard production monitoring metrics (200 OK, latency 340 ms, rate limits within bounds) say nothing about quality — the model can hallucinate in 15% of responses while HTTP status codes stay green.

Manual review doesn’t scale. One person can handle 100 requests a day. At 10,000, nobody can. And quality degradation usually hits at scale: after a prompt update, a model swap, or a silent change on the provider side.

This article covers how LLM-as-Judge works, which metrics to evaluate, and how to plug it into a production pipeline.

How LLM-as-Judge works and its limitations

The judge model receives a prompt with instructions plus the text being evaluated, then returns a score: a number, a category, or structured JSON. The judge doesn’t generate content. It classifies and scores. Models handle this more consistently than generation.

User: "Recommend cafes in downtown Moscow"
          |
          v
+--------------------+
|   LLM Generator    | -> "Here are 5 cafes: Coffemania near Patriarshiye..."
|   (GPT-4o-mini)    |
+--------------------+
          |
          v
+--------------------+
|   LLM Judge        | -> { relevance: 0.9, factuality: 0.7,
|   (Claude Sonnet)  |     toxicity: 0.0, completeness: 0.8 }
+--------------------+
          |
          v
   Score < threshold? -> Alert / Block / Log

Research by Zheng et al. (2023, “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena”) showed that GPT-4 as a judge agreed with human ratings in 80%+ of cases. Two human annotators agreed with each other about 81% of the time. The gap between an LLM judge and a human is roughly the same as the gap between two humans.

LLM output quality metrics: what to evaluate

Metric choice depends on the task. Main categories below.

Metrics for RAG systems

Metric            | What it checks                                       | When you need it
------------------|------------------------------------------------------|--------------------
Faithfulness      | Response is grounded in context, no fabricated facts | Always for RAG
Answer Relevance  | Response matches the question                        | Always
Context Relevance | Retriever returned relevant documents                | Debugging retrieval

Metrics for generative tasks

Metric        | What it checks                            | When you need it
--------------|-------------------------------------------|-------------------------------
Correctness   | Factual accuracy                          | When a reference answer exists
Completeness  | Response covers all aspects of the query  | Complex queries
Toxicity      | No insults, harmful content               | User-facing products
Hallucination | Model doesn’t fabricate facts             | Always

Metrics for agent pipelines

Metric               | What it checks                  | When you need it
---------------------|---------------------------------|------------------
Tool Use Correctness | Right tool with right arguments | Agent pipelines
Task Completion      | End result solves the task      | Always for agents

In practice, start with two or three metrics. For RAG: faithfulness + answer relevance. For a chatbot: relevance + toxicity. For an agent: task completion. Add more as you find specific problems.

Judge prompt structure for evaluation

Evaluation quality comes down to the prompt. A working template for faithfulness:

FAITHFULNESS_JUDGE_PROMPT = """You are an impartial judge evaluating the faithfulness
of an AI assistant's response.

Faithfulness means: every claim in the response is supported by the provided context.
Claims not found in context = unfaithful.

## Input
**User Question:** {question}
**Retrieved Context:** {context}
**AI Response:** {response}

## Task
1. Extract each factual claim from the AI Response
2. For each claim, check if it is supported by the Retrieved Context
3. A claim is SUPPORTED if the context contains evidence for it
4. A claim is UNSUPPORTED if the context does not mention it or contradicts it

## Output (JSON only)
{{
  "claims": [
    {{"claim": "...", "supported": true/false, "evidence": "..."}}
  ],
  "score": <float 0.0-1.0, ratio of supported claims to total claims>,
  "reasoning": "<one sentence summary>"
}}"""

What makes this work:

Specific criteria. “Rate the response quality” doesn’t work. “Check that every fact is backed by context” works. The more specific the instruction, the more stable the scores.

Chain-of-thought. The model first extracts claims, checks each one, then assigns a score. Without intermediate steps, scores are unstable.

Structured output. JSON with a fixed schema, score from 0 to 1, reasoning in one sentence. This makes parsing and aggregation straightforward.
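The fixed schema can also be enforced in code. A minimal sketch with Pydantic, mirroring the JSON template above (`Claim` and `FaithfulnessVerdict` are illustrative names, not a library API):

```python
# Illustrative Pydantic models for the judge's JSON output.
# Field names mirror the template above; the score is constrained to [0, 1].
from pydantic import BaseModel, Field

class Claim(BaseModel):
    claim: str
    supported: bool
    evidence: str

class FaithfulnessVerdict(BaseModel):
    claims: list[Claim]
    score: float = Field(ge=0.0, le=1.0)
    reasoning: str

# Validating the judge's raw JSON catches malformed or out-of-range output:
raw = ('{"claims": [{"claim": "Coffemania is near Patriarshiye", '
       '"supported": true, "evidence": "mentioned in context"}], '
       '"score": 1.0, "reasoning": "All claims supported."}')
verdict = FaithfulnessVerdict.model_validate_json(raw)
```

Validation failures (missing fields, a score outside 0–1) raise immediately instead of silently skewing aggregated scores.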

Implementing LLM-as-Judge: three approaches

1. Python + LLM API

Minimal implementation, no frameworks:

import json
from litellm import completion

def evaluate_faithfulness(question: str, context: str, response: str) -> dict:
    judge_response = completion(
        model="anthropic/claude-sonnet-4-20250514",
        messages=[{
            "role": "user",
            "content": FAITHFULNESS_JUDGE_PROMPT.format(
                question=question,
                context=context,
                response=response,
            )
        }],
        response_format={"type": "json_object"},
        temperature=0,
    )

    result = json.loads(judge_response.choices[0].message.content)
    return result

eval_result = evaluate_faithfulness(
    question="What cafes are in downtown Moscow?",
    context="Coffemania: Patriarshiye Prudy. Syostry: Pokrovka 6.",
    response="I recommend Coffemania at Patriarshiye and Pushkin on Tverskoy Boulevard.",
)
# score: 0.5 (Coffemania is confirmed by the context, Pushkin is not)

Pros: full control, minimal dependencies. Cons: you write every metric yourself, no batch processing. If you work with multiple LLM providers, litellm lets you switch between them through a single interface — more on this in the article about multi-provider LLM architecture.

2. DeepEval

Open-source framework with built-in metrics. Works like pytest for LLM outputs.

from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    FaithfulnessMetric,
    AnswerRelevancyMetric,
    HallucinationMetric,
)

faithfulness = FaithfulnessMetric(threshold=0.7, model="gpt-4o")
relevancy = AnswerRelevancyMetric(threshold=0.7, model="gpt-4o")
hallucination = HallucinationMetric(threshold=0.5, model="gpt-4o")

test_case = LLMTestCase(
    input="What cafes are in downtown Moscow?",
    actual_output="I recommend Coffemania at Patriarshiye...",
    retrieval_context=["Coffemania: Patriarshiye Prudy. Syostry: Pokrovka 6."],
)

results = evaluate([test_case], [faithfulness, relevancy, hallucination])

14+ built-in metrics, pytest integration. LLM quality tests run alongside unit tests:

# test_llm_quality.py
from deepeval import assert_test

def test_travel_recommendations():
    test_case = LLMTestCase(
        input="Cafes in Moscow",
        actual_output=run_my_pipeline("Cafes in Moscow"),
        retrieval_context=get_retrieved_docs("Cafes in Moscow"),
    )
    assert_test(test_case, [faithfulness, relevancy])

3. Langfuse Evaluations

If you already use Langfuse for tracing, evaluations plug in on top. The judge model runs against each trace and attaches a score to it. Scores can be attached to an entire trace or to individual observations. If you haven’t set up an observability stack yet, start with the practical guide to LLM observability with Langfuse.

langfuse.score(
    trace_id="trace-abc-123",
    name="faithfulness",
    value=0.85,
    comment="1 of 7 claims not supported by context",
)

For production monitoring, Langfuse fits better than DeepEval: scores are tied to real traces, visible in the dashboard, with day-over-day quality degradation charts.

Integrating LLM-as-Judge into CI/CD and production pipelines

Pre-deploy: prompt regression testing

Prompt changed? Run a dataset through the judge model before deploying. Score below threshold — deploy blocked.

# .github/workflows/llm-quality.yml
name: LLM Quality Gate
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install deepeval
      - run: deepeval test run test_llm_quality.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Runtime: gate before the response

For high-stakes tasks, evaluate before sending the response:

async def generate_with_quality_gate(question: str) -> str:
    # generate_response returns both the answer and the context it retrieved
    response, retrieved_context = await generate_response(question)

    eval_result = evaluate_faithfulness(  # the judge function defined above
        question=question,
        context=retrieved_context,
        response=response,
    )

    if eval_result["score"] < 0.7:
        return "Sorry, I'm not confident in the accuracy of this answer. Try rephrasing your question."

    return response

An extra LLM call per request. GPT-4o-mini as a judge costs $0.15 per million input tokens. At 10,000 requests per day with a ~500-token prompt: about $0.75/day.
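The arithmetic behind that estimate, as a quick sanity-check sketch (the price is the GPT-4o-mini input rate quoted above):

```python
# Rough daily cost of a runtime judge: requests x tokens x price per token.
def daily_judge_cost(requests_per_day: int, tokens_per_eval: int,
                     price_per_million_tokens: float) -> float:
    return requests_per_day * tokens_per_eval * price_per_million_tokens / 1_000_000

print(f"${daily_judge_cost(10_000, 500, 0.15):.2f}/day")  # -> $0.75/day
```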

Post-hoc: sample-based monitoring

The most common scenario. Evaluation runs asynchronously:

import random

# fetch a larger window, then evaluate a random sample of 100 traces
traces = langfuse.fetch_traces(limit=1000)
sample = random.sample(traces.data, min(100, len(traces.data)))

scores = []
for trace in sample:
    result = evaluate_faithfulness(
        question=trace.input,
        context=trace.metadata.get("context", ""),
        response=trace.output,
    )
    scores.append(result["score"])
    langfuse.score(trace_id=trace.id, name="faithfulness", value=result["score"])

avg_score = sum(scores) / len(scores)
if avg_score < 0.75:
    send_alert(f"Faithfulness degraded: {avg_score:.2f}")

Cheaper than a runtime gate, and it still catches trends. Average faithfulness dropped from 0.88 to 0.71 over a week — something broke: the prompt, the retriever, or a model update on the provider side.

LLM-as-Judge pitfalls and biases

Position bias

Judge models systematically prefer whichever answer appears first in pairwise comparisons. Zheng et al. (2023) measured a shift of up to 10-15%. Fix: run the evaluation twice with swapped order and average the results. Or use pointwise scoring instead of pairwise.
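The swap-and-compare fix can be sketched as follows; `judge_pair` stands in for any pairwise judge call returning "A" or "B" (a hypothetical helper, not a library function):

```python
# Run the pairwise comparison twice with swapped order; accept the verdict
# only when both runs agree, otherwise treat it as a tie.
def debiased_compare(question, answer_a, answer_b, judge_pair):
    first = judge_pair(question, answer_a, answer_b)   # A shown first
    second = judge_pair(question, answer_b, answer_a)  # order swapped
    second_mapped = "A" if second == "B" else "B"      # map back to original labels
    if first == second_mapped:
        return first
    return "tie"  # the judge contradicted itself -> position bias in play
```

A judge that always prefers whichever answer is shown first will contradict itself on the swapped run and land on "tie" instead of producing a biased winner.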

Verbosity bias

Longer answers get higher scores, even when a shorter answer is more accurate. In the judge prompt, explicitly state “response length does not affect the score” and include an example where a short answer receives the highest mark.

Self-enhancement bias

GPT-4 gives higher scores to GPT-4 outputs. Claude prefers Claude outputs. Fix: use a judge model from a different provider than the generator. Generate with GPT-4o, evaluate with Claude Sonnet. Or the other way around. The broader question of trusting LLM outputs is a topic of its own — more on this in TruthGuard: when AI agents lie.

Cost

Every evaluation is an LLM call. A runtime gate on 10,000 requests/day means 10,000 extra calls. Options: a cheap model as judge (GPT-4o-mini, Claude Haiku), sample-based evaluation, caching scores for similar pairs.
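Caching pays off when identical inputs recur, e.g. FAQ-style traffic. A sketch keyed on a hash of the evaluated triple (exact-match deduplication only; catching near-duplicates would need an embedding index):

```python
import hashlib
import json

# In-memory score cache; swap for Redis or similar in production.
_score_cache: dict[str, float] = {}

def cached_score(question: str, context: str, response: str, judge_fn) -> float:
    key = hashlib.sha256(
        json.dumps([question, context, response]).encode()
    ).hexdigest()
    if key not in _score_cache:
        _score_cache[key] = judge_fn(question, context, response)  # one LLM call
    return _score_cache[key]
```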

The judge hallucinates too

A judge model can give a high score to a response full of fabricated facts, if the hallucination sounds plausible. Partial fix: chain-of-thought + structured output. There is no complete fix. This is a fundamental limitation of the approach.

Choosing the right judge model for each scenario

Scenario            | Judge model                 | Why
--------------------|-----------------------------|----------------------------------
Pre-deploy tests    | GPT-4o or Claude Sonnet 4   | Accuracy matters more than speed
Runtime gate        | GPT-4o-mini or Claude Haiku | Cheap and fast
Post-hoc monitoring | GPT-4o-mini                 | Bulk processing

Rule of thumb: the judge model should be at least as capable as the generator. GPT-4o-mini judging GPT-4o-mini works. GPT-4o-mini judging Claude Opus is unreliable.

Set temperature=0 on every judge call: sampling randomness makes scores irreproducible.

Tools for LLM evaluation

Tool            | Focus                 | LLM-as-Judge            | Self-hosted | Price
----------------|-----------------------|-------------------------|-------------|--------------------
DeepEval        | Testing               | 14+ metrics             | Yes (OSS)   | Free
Ragas           | RAG evaluation        | Faithfulness, relevance | Yes (OSS)   | Free
Langfuse        | Observability + evals | Evaluator templates     | Yes (OSS)   | Free (self-hosted)
Phoenix (Arize) | Observability + evals | Hallucination, QA       | Yes (OSS)   | Free
Braintrust      | Evals + logging       | Custom scorers          | Cloud       | Free tier

For a startup: DeepEval for pre-deploy tests + Langfuse for production monitoring. Two open-source tools cover the entire cycle.

Production setup for LLM-as-Judge

+---------------------------------------------------+
|                   CI/CD Pipeline                   |
|                                                    |
|  PR with prompt change                             |
|      |                                             |
|      v                                             |
|  DeepEval: dataset x new prompt -> scores          |
|      |                                             |
|      v                                             |
|  Score < threshold? -> Block merge                 |
+---------------------------------------------------+

+---------------------------------------------------+
|                    Production                      |
|                                                    |
|  User request -> LLM -> Response -> User           |
|                   |                                |
|                   v (async)                        |
|              Langfuse trace                        |
|                   |                                |
|                   v (cron, hourly)                 |
|         Judge evaluation (sample)                  |
|                   |                                |
|                   v                                |
|         Score dashboard + alerts                   |
+---------------------------------------------------+

Where to start: step-by-step plan

  1. Pick one metric. For RAG: faithfulness. For a chatbot: answer relevance.
  2. Collect 20-30 examples by hand: questions, answers, ratings (good/bad). A golden dataset for calibration.
  3. Write a judge prompt, run it against the golden dataset. Agreement with human ratings below 70%? Revise the prompt.
  4. Add DeepEval to CI for tests on prompt changes.
  5. Set up Langfuse evaluations for production monitoring.
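For step 3, the agreement check reduces to thresholding the judge's score and comparing against human labels. A sketch assuming binary good/bad labels and an illustrative 0.7 threshold:

```python
# Fraction of golden-dataset examples where the thresholded judge score
# matches the human good/bad label.
def agreement_rate(judge_scores, human_labels, threshold=0.7):
    judge_labels = ["good" if s >= threshold else "bad" for s in judge_scores]
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)

agreement_rate([0.9, 0.4, 0.8, 0.2], ["good", "bad", "good", "good"])  # -> 0.75
```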

From zero to a working quality gate: two to three days. Golden dataset + judge prompt: a couple of hours.