LLM-as-Judge: Automated Quality Gate for LLM Outputs in Production
What is LLM-as-Judge?
LLM-as-Judge is a pattern where one language model automatically evaluates the outputs of another against defined quality criteria, acting as a scalable quality gate in production. Unlike traditional monitoring metrics, it assesses the actual content of responses — detecting hallucinations, irrelevance, and toxicity — and agrees with human raters over 80% of the time.
TL;DR
- LLM-as-Judge agrees with human raters 80%+ of the time — comparable to two humans agreeing with each other (81%)
- HTTP 200 OK tells you nothing about output quality; a model can hallucinate in 15% of responses with no visible signal
- Start with 2–3 metrics: faithfulness + answer relevance for RAG, relevance + toxicity for chatbots, task completion for agents
- DeepEval provides ready-made metric implementations with Pydantic schemas; Langfuse handles production monitoring and alerting
- A judge prompt must include a scoring rubric with explicit criteria — vague instructions produce inconsistent scores
LLM-as-Judge is a pattern where one language model evaluates another model's outputs against defined criteria. Think of it as an automatic quality gate: every response is checked either before it reaches the user or asynchronously afterwards, for monitoring. Standard production monitoring metrics (200 OK, latency 340 ms, rate limits within bounds) say nothing about quality — the model can hallucinate in 15% of responses while HTTP status codes stay green.
Manual review doesn’t scale. One person can handle 100 requests a day. At 10,000, nobody can. And quality degradation usually hits at scale: after a prompt update, a model swap, or a silent change on the provider side.
This article covers how LLM-as-Judge works, which metrics to evaluate, and how to plug it into a production pipeline.
How LLM-as-Judge works and its limitations
The judge model receives a prompt with instructions plus the text being evaluated, and returns a score: a number, a category, or structured JSON. The judge doesn't generate content; it classifies and scores, which models handle more consistently than open-ended generation.
User: "Recommend cafes in downtown Moscow"
|
v
+--------------------+
| LLM Generator | -> "Here are 5 cafes: Coffeemania near Patriarshiye..."
| (GPT-4o-mini) |
+--------------------+
|
v
+--------------------+
| LLM Judge | -> { relevance: 0.9, factuality: 0.7,
| (Claude Sonnet) | toxicity: 0.0, completeness: 0.8 }
+--------------------+
|
v
Score < threshold? -> Alert / Block / Log
Research by Zheng et al. (2023, “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena”) showed that GPT-4 as a judge agreed with human ratings in 80%+ of cases. Two human annotators agreed with each other about 81% of the time. The gap between an LLM judge and a human is roughly the same as the gap between two humans.
LLM output quality metrics: what to evaluate
Metric choice depends on the task. Main categories below.
Metrics for RAG systems
| Metric | What it checks | When you need it |
|---|---|---|
| Faithfulness | Response is grounded in context, no fabricated facts | Always for RAG |
| Answer Relevance | Response matches the question | Always |
| Context Relevance | Retriever returned relevant documents | Debugging retrieval |
Metrics for generative tasks
| Metric | What it checks | When you need it |
|---|---|---|
| Correctness | Factual accuracy | When a reference answer exists |
| Completeness | Response covers all aspects of the query | Complex queries |
| Toxicity | No insults, harmful content | User-facing products |
| Hallucination | Model doesn’t fabricate facts | Always |
Metrics for agent pipelines
| Metric | What it checks | When you need it |
|---|---|---|
| Tool Use Correctness | Right tool with right arguments | Agent pipelines |
| Task Completion | End result solves the task | Always for agents |
In practice, start with two or three metrics. For RAG: faithfulness + answer relevance. For a chatbot: relevance + toxicity. For an agent: task completion. Add more as you find specific problems.
Judge prompt structure for evaluation
Evaluation quality comes down to the prompt. A working template for faithfulness:
FAITHFULNESS_JUDGE_PROMPT = """You are an impartial judge evaluating the faithfulness
of an AI assistant's response.
Faithfulness means: every claim in the response is supported by the provided context.
Claims not found in context = unfaithful.
## Input
**User Question:** {question}
**Retrieved Context:** {context}
**AI Response:** {response}
## Task
1. Extract each factual claim from the AI Response
2. For each claim, check if it is supported by the Retrieved Context
3. A claim is SUPPORTED if the context contains evidence for it
4. A claim is UNSUPPORTED if the context does not mention it or contradicts it
## Output (JSON only)
{{
  "claims": [
    {{"claim": "...", "supported": true/false, "evidence": "..."}}
  ],
  "score": <float 0.0-1.0, ratio of supported claims to total claims>,
  "reasoning": "<one sentence summary>"
}}"""
What makes this work:
Specific criteria. “Rate the response quality” doesn’t work. “Check that every fact is backed by context” works. The more specific the instruction, the more stable the scores.
Chain-of-thought. The model first extracts claims, checks each one, then assigns a score. Without intermediate steps, scores are unstable.
Structured output. JSON with a fixed schema, score from 0 to 1, reasoning in one sentence. This makes parsing and aggregation straightforward.
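Before trusting the judge's score, it is worth validating the JSON against that schema. A minimal stdlib-only sketch (the field names follow the prompt template above; the recompute-the-ratio policy is this example's choice, not a library feature):

```python
import json

def parse_judge_output(raw: str) -> dict:
    """Parse and sanity-check the judge's JSON before using the score."""
    result = json.loads(raw)
    claims = result.get("claims")
    score = result.get("score")
    if not isinstance(claims, list) or not claims:
        raise ValueError("judge returned no claims")
    if not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score!r}")
    # Recompute the ratio ourselves -- don't trust the judge's arithmetic
    supported = sum(1 for c in claims if c.get("supported") is True)
    result["score"] = supported / len(claims)
    return result
```

In a framework setup, a Pydantic model plays the same role; the point is that a malformed or out-of-range judge response should fail loudly, not silently feed a bad score into your dashboards.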
Implementing LLM-as-Judge: three approaches
1. Python + LLM API
Minimal implementation, no frameworks:
import json
from litellm import completion
def evaluate_faithfulness(question: str, context: str, response: str) -> dict:
    judge_response = completion(
        model="anthropic/claude-sonnet-4-20250514",
        messages=[{
            "role": "user",
            "content": FAITHFULNESS_JUDGE_PROMPT.format(
                question=question,
                context=context,
                response=response,
            )
        }],
        response_format={"type": "json_object"},
        temperature=0,
    )
    result = json.loads(judge_response.choices[0].message.content)
    return result
eval_result = evaluate_faithfulness(
    question="Which cafes are in downtown Moscow?",
    context="Coffeemania: Patriarshiye Prudy. Syostry: Pokrovka 6.",
    response="I recommend Coffeemania at Patriarshiye and Pushkin on Tverskoy Boulevard.",
)
# score: 0.5 (Coffeemania confirmed, Pushkin is not)
Pros: full control, minimal dependencies. Cons: you write every metric yourself, no batch processing. If you work with multiple LLM providers, litellm lets you switch between them through a single interface — more on this in the article about multi-provider LLM architecture.
2. DeepEval
Open-source framework with built-in metrics. Works like pytest for LLM outputs.
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    FaithfulnessMetric,
    AnswerRelevancyMetric,
    HallucinationMetric,
)

faithfulness = FaithfulnessMetric(threshold=0.7, model="gpt-4o")
relevancy = AnswerRelevancyMetric(threshold=0.7, model="gpt-4o")
hallucination = HallucinationMetric(threshold=0.5, model="gpt-4o")

test_case = LLMTestCase(
    input="Which cafes are in downtown Moscow?",
    actual_output="I recommend Coffeemania at Patriarshiye...",
    retrieval_context=["Coffeemania: Patriarshiye Prudy. Syostry: Pokrovka 6."],
)

results = evaluate([test_case], [faithfulness, relevancy, hallucination])
14+ built-in metrics, pytest integration. LLM quality tests run alongside unit tests:
# test_llm_quality.py
from deepeval import assert_test

def test_travel_recommendations():
    test_case = LLMTestCase(
        input="Cafes in Moscow",
        actual_output=run_my_pipeline("Cafes in Moscow"),
        retrieval_context=get_retrieved_docs("Cafes in Moscow"),
    )
    assert_test(test_case, [faithfulness, relevancy])
3. Langfuse Evaluations
If you already use Langfuse for tracing, evaluations plug in on top. The judge model runs against each trace and attaches a score to it. Scores can be attached to an entire trace or to individual observations. If you haven’t set up an observability stack yet, start with the practical guide to LLM observability with Langfuse.
langfuse.score(
    trace_id="trace-abc-123",
    name="faithfulness",
    value=0.85,
    comment="1 of 7 claims not supported by context",
)
For production monitoring, Langfuse fits better than DeepEval: scores are tied to real traces, visible in the dashboard, with day-over-day quality degradation charts.
Integrating LLM-as-Judge into CI/CD and production pipelines
Pre-deploy: prompt regression testing
Prompt changed? Run a dataset through the judge model before deploying. Score below threshold — deploy blocked.
# .github/workflows/llm-quality.yml
name: LLM Quality Gate
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install deepeval
      - run: deepeval test run test_llm_quality.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
Runtime: gate before the response
For high-stakes tasks, evaluate before sending the response:
async def generate_with_quality_gate(question: str) -> str:
    # generate_response and retrieved_context come from your own RAG pipeline
    response = await generate_response(question)
    # sync judge call from the litellm example above
    eval_result = evaluate_faithfulness(
        question=question,
        context=retrieved_context,
        response=response,
    )
    if eval_result["score"] < 0.7:
        return ("Sorry, I'm not confident in the accuracy of this answer. "
                "Try rephrasing your question.")
    return response
An extra LLM call per request. GPT-4o-mini as a judge costs $0.15 per million input tokens. At 10,000 requests per day with a ~500-token prompt: about $0.75/day.
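The arithmetic behind that estimate, as a sketch (the $0.15/M price and token counts are the article's example figures, not live pricing, and output tokens are ignored):

```python
def daily_judge_cost(requests_per_day: int, prompt_tokens: int,
                     usd_per_million_tokens: float) -> float:
    """Input-token cost of running the judge on every request, in USD/day."""
    total_tokens = requests_per_day * prompt_tokens
    return total_tokens / 1_000_000 * usd_per_million_tokens

# GPT-4o-mini as judge: 10,000 requests/day, ~500-token judge prompt
cost = daily_judge_cost(10_000, 500, 0.15)  # about 0.75 USD/day
```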
Post-hoc: sample-based monitoring
The most common scenario. Evaluation runs asynchronously:
import random
traces = langfuse.fetch_traces(limit=100)
sample = random.sample(traces.data, min(100, len(traces.data)))
scores = []
for trace in sample:
result = evaluate_faithfulness(
question=trace.input,
context=trace.metadata.get("context", ""),
response=trace.output,
)
scores.append(result["score"])
langfuse.score(trace_id=trace.id, name="faithfulness", value=result["score"])
avg_score = sum(scores) / len(scores)
if avg_score < 0.75:
send_alert(f"Faithfulness degraded: {avg_score:.2f}")
Cheaper than a runtime gate, and it still catches trends. If average faithfulness drops from 0.88 to 0.71 over a week, something broke: the prompt, the retriever, or a model update on the provider side.
LLM-as-Judge pitfalls and biases
Position bias
Judge models systematically prefer whichever answer appears first in pairwise comparisons. Zheng et al. (2023) measured a shift of up to 10-15%. Fix: run the evaluation twice with swapped order and average the results. Or use pointwise scoring instead of pairwise.
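The swap-and-average fix can be sketched as follows (`judge_prefers_first` is a hypothetical stand-in for whatever pairwise judge call you use):

```python
def debiased_pairwise_score(judge_prefers_first, answer_a: str, answer_b: str) -> float:
    """Score for answer_a in [0, 1], averaged over both presentation orders.

    judge_prefers_first(first, second) -> float in [0, 1]: how strongly
    the judge prefers the answer shown *first*.
    """
    a_shown_first = judge_prefers_first(answer_a, answer_b)
    # Swap the order; invert, since the judge now rates answer_b's position
    a_shown_second = 1.0 - judge_prefers_first(answer_b, answer_a)
    return (a_shown_first + a_shown_second) / 2
```

If a judge had pure position bias and always preferred whichever answer comes first, the two calls would cancel out and the averaged score would land at 0.5 instead of a false win.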
Verbosity bias
Longer answers get higher scores, even when a shorter answer is more accurate. In the judge prompt, explicitly state “response length does not affect the score” and include an example where a short answer receives the highest mark.
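One way to word that instruction, as an addendum appended to the judge prompt (illustrative wording, not a standard rubric):

```python
# Hypothetical anti-verbosity section to concatenate onto the judge prompt
VERBOSITY_GUARD = """
## Length
Response length does NOT affect the score.
Judge only whether each claim is supported by the context.
Example: if the context says "The store opens at 9am", the short response
"It opens at 9am" scores 1.0, while a three-paragraph response containing
one unsupported claim scores lower.
"""
```

Append it to the faithfulness prompt above before formatting, so the rubric explicitly rewards the short-but-correct case.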
Self-enhancement bias
GPT-4 gives higher scores to GPT-4 outputs. Claude prefers Claude outputs. Fix: use a judge model from a different provider than the generator. Generate with GPT-4o, evaluate with Claude Sonnet. Or the other way around. The broader question of trusting LLM outputs is a topic of its own — more on this in TruthGuard: when AI agents lie.
Cost
Every evaluation is an LLM call. A runtime gate on 10,000 requests/day means 10,000 extra calls. Options: a cheap model as judge (GPT-4o-mini, Claude Haiku), sample-based evaluation, caching scores for similar pairs.
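A minimal in-memory sketch of the caching option (exact-match keys only; matching "similar" pairs via embeddings is out of scope here, and the function names are this example's own):

```python
import hashlib

_score_cache: dict[str, float] = {}

def cached_score(question: str, response: str, evaluate) -> float:
    """Reuse a judge score when the exact (question, response) pair repeats."""
    key = hashlib.sha256(f"{question}\x00{response}".encode()).hexdigest()
    if key not in _score_cache:
        _score_cache[key] = evaluate(question, response)  # the expensive LLM call
    return _score_cache[key]
```

In production you would back this with Redis or your trace store rather than a process-local dict, but the shape is the same: hash the pair, skip the judge call on a hit.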
The judge hallucinates too
A judge model can give a high score to a response full of fabricated facts, if the hallucination sounds plausible. Partial fix: chain-of-thought + structured output. There is no complete fix. This is a fundamental limitation of the approach.
Choosing the right judge model for each scenario
| Scenario | Judge model | Why |
|---|---|---|
| Pre-deploy tests | GPT-4o or Claude Sonnet 4 | Accuracy matters more than speed |
| Runtime gate | GPT-4o-mini or Claude Haiku | Cheap and fast |
| Post-hoc monitoring | GPT-4o-mini | Bulk processing |
Rule of thumb: the judge model should be at least as capable as the generator. GPT-4o-mini judging GPT-4o-mini works. GPT-4o-mini judging Claude Opus is unreliable.
Set temperature=0 for all judge calls. Evaluation must be reproducible; with nonzero temperature, the same input can receive different scores on different runs.
Tools for LLM evaluation
| Tool | Focus | LLM-as-Judge | Self-hosted | Price |
|---|---|---|---|---|
| DeepEval | Testing | 14+ metrics | Yes (OSS) | Free |
| Ragas | RAG evaluation | Faithfulness, relevance | Yes (OSS) | Free |
| Langfuse | Observability + evals | Evaluator templates | Yes (OSS) | Free (self-hosted) |
| Phoenix (Arize) | Observability + evals | Hallucination, QA | Yes (OSS) | Free |
| Braintrust | Evals + logging | Custom scorers | Cloud | Free tier |
For a startup: DeepEval for pre-deploy tests + Langfuse for production monitoring. Two open-source tools cover the entire cycle.
Production setup for LLM-as-Judge
+---------------------------------------------------+
| CI/CD Pipeline |
| |
| PR with prompt change |
| | |
| v |
| DeepEval: dataset x new prompt -> scores |
| | |
| v |
| Score < threshold? -> Block merge |
+---------------------------------------------------+
+---------------------------------------------------+
| Production |
| |
| User request -> LLM -> Response -> User |
| | |
| v (async) |
| Langfuse trace |
| | |
| v (cron, hourly) |
| Judge evaluation (sample) |
| | |
| v |
| Score dashboard + alerts |
+---------------------------------------------------+
Where to start: step-by-step plan
- Pick one metric. For RAG: faithfulness. For a chatbot: answer relevance.
- Collect 20-30 examples by hand: questions, answers, ratings (good/bad). A golden dataset for calibration.
- Write a judge prompt, run it against the golden dataset. Agreement with human ratings below 70%? Revise the prompt.
- Add DeepEval to CI for tests on prompt changes.
- Set up Langfuse evaluations for production monitoring.
From zero to a working quality gate: two to three days. Golden dataset + judge prompt: a couple of hours.
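The calibration check from step 3 can be sketched as a simple agreement rate between the judge's pass/fail verdicts and the human good/bad labels (the 0.7 threshold is this article's example value):

```python
def judge_agreement(judge_scores: list[float], human_labels: list[bool],
                    threshold: float = 0.7) -> float:
    """Fraction of golden-dataset examples where the judge's verdict
    (score >= threshold) matches the human good/bad label."""
    assert len(judge_scores) == len(human_labels)
    matches = sum((score >= threshold) == label
                  for score, label in zip(judge_scores, human_labels))
    return matches / len(judge_scores)

# Agreement below 0.7 -> revise the judge prompt before trusting it
```

Run it on the 20-30 hand-labeled examples; if the rate is below 0.7, the judge prompt needs work before it gates anything.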