Prompt Engineering System: Managing 50+ Prompts in Production
What is a prompt management system?
A prompt management system is infrastructure for storing, versioning, testing, and deploying LLM prompts independently from application code. It enables teams to iterate on prompts without full application deployments, run A/B tests across prompt versions, and monitor quality regressions in production.
TL;DR
- At 50+ prompts, storing them in code breaks iteration speed — each change requires a full app deployment
- A prompt management system has 4 layers: Registry, Testing, Deploy, Monitor — minimum viable is Registry + Deploy
- Langfuse provides versioning, labels (production/staging), and A/B routing without touching application code
- Eval pipeline: test dataset + scoring function (exact match or LLM-as-Judge) + CI gate blocks regressions
- Without per-version metrics, you can't link a quality drop to a specific prompt change
The average LLM project in production uses 20–50 prompts. Classification, summarization, data extraction, response generation, quality evaluation. Each prompt requires iteration, and each iteration can break something that was working. At 50 prompts, managing them manually becomes chaos: who changed the classifier prompt? Why did summarizer accuracy drop? Which version is in production right now?
This article covers how to build a prompt management system that scales from 5 to 500 prompts.
Why You Can’t Store Prompts in Code
A prompt looks like a string. Developers store it in code, next to the call logic. This works fine when there are only a few prompts and iterations are infrequent.
Problems start at scale:
Changing a prompt requires deploying the app. The prompt is hardcoded. To fix a single word in a system prompt, you need a PR, review, merge, deploy. Iteration cycle: hours instead of minutes.
No versioning. Git stores history, but a diff on a 2,000-character prompt is unreadable. There’s no fast path to roll back a prompt to a previous version without rolling back the entire app.
No link between version and metrics. Prompt changed, quality dropped. Connecting a specific prompt version to specific metrics is manual work when the prompt lives in code.
Cross-team chaos. The product manager wants to adjust the tone. The ML engineer is optimizing tokens. The developer is refactoring the template. All three are editing the same file, and the outcome is unpredictable.
Anatomy of a Prompt Engineering System
A mature prompt management system has four layers:
```
┌─────────────────────────────────────────────────┐
│            Prompt Engineering System            │
├────────────┬────────────┬────────────┬──────────┤
│  Registry  │  Testing   │   Deploy   │ Monitor  │
├────────────┼────────────┼────────────┼──────────┤
│  Storage   │ Pre-deploy │  Canary /  │ Metrics  │
│ + versions │    eval    │ A/B rollout│ + alerts │
└────────────┴────────────┴────────────┴──────────┘
```
Registry — a centralized prompt store with versioning, metadata, and access control.
Testing — automated quality evaluation of a prompt against test datasets before deploying to production.
Deploy — a mechanism to push a new prompt version to production without deploying the application.
Monitor — tracking quality metrics tied to specific prompt versions.
You don’t need to build all four layers at once: a minimum viable system is Registry + Deploy. But treat Testing and Monitor as the next step rather than optional extras — without them you’re flying blind.
Registry: Centralized Prompt Storage
The registry solves the basic problem: a single source of truth for all prompts. Two approaches.
Approach 1: Langfuse Prompt Management
Langfuse provides prompt management out of the box. Each prompt is a named entity with versions, labels, and variables.
```python
from langfuse import Langfuse

langfuse = Langfuse()

# Get the production version of a prompt
prompt = langfuse.get_prompt(
    name="ticket-classifier",
    label="production"  # or "staging", "latest"
)

# Prompt with variables
system_message = prompt.compile(
    categories="billing,technical,general,urgent",
    language="en"
)
```
Prompt structure in Langfuse:
| Field | Purpose | Example |
|---|---|---|
| name | Unique identifier | ticket-classifier |
| version | Auto-increment | 14 |
| label | Environment / status | production, staging |
| type | Format | text or chat |
| config | Model parameters | {"model": "gpt-4o-mini", "temperature": 0} |
The prompt is decoupled from code. A product manager edits the prompt in the UI, assigns the staging label, tests it, and switches to production. The application code stays the same.
Approach 2: Prompts-as-Code
For teams that prefer Git as the single source of truth:
```
prompts/
├── ticket-classifier/
│   ├── prompt.yaml
│   ├── config.yaml
│   └── tests/
│       ├── dataset.jsonl
│       └── eval.py
├── summarizer/
│   ├── prompt.yaml
│   ├── config.yaml
│   └── tests/
└── prompt_registry.py
```
```yaml
# prompts/ticket-classifier/prompt.yaml
name: ticket-classifier
type: chat
model: gpt-4o-mini
temperature: 0
messages:
  - role: system
    content: |
      You are a support ticket classifier.
      Categories: {{categories}}.
      Return JSON: {"category": "...", "confidence": 0.0-1.0, "reasoning": "..."}
      Response language: {{language}}.
  - role: user
    content: "{{ticket_text}}"
variables:
  categories: "billing,technical,general,urgent"
  language: "en"
```
```python
# prompt_registry.py
import yaml
from pathlib import Path

class PromptRegistry:
    def __init__(self, prompts_dir: str = "prompts"):
        self.prompts_dir = Path(prompts_dir)
        self._cache = {}

    def get(self, name: str) -> dict:
        if name not in self._cache:
            prompt_path = self.prompts_dir / name / "prompt.yaml"
            with open(prompt_path) as f:
                self._cache[name] = yaml.safe_load(f)
        return self._cache[name]

    def compile(self, name: str, **variables) -> list[dict]:
        prompt = self.get(name)
        messages = []
        for msg in prompt["messages"]:
            content = msg["content"]
            # Explicit call-site variables override the defaults from the YAML
            for key, value in {**prompt.get("variables", {}), **variables}.items():
                content = content.replace(f"{{{{{key}}}}}", str(value))
            messages.append({"role": msg["role"], "content": content})
        return messages
```
Both approaches support a hybrid variant: prompts live in Git, and CI/CD syncs them to Langfuse on every merge to main.
```python
# ci/sync_prompts.py — called in CI pipeline
from langfuse import Langfuse
from prompt_registry import PromptRegistry

langfuse = Langfuse()
registry = PromptRegistry()

for prompt_name in ["ticket-classifier", "summarizer", "response-generator"]:
    prompt_data = registry.get(prompt_name)
    langfuse.create_prompt(
        name=prompt_name,
        prompt=prompt_data["messages"],
        config={"model": prompt_data["model"], "temperature": prompt_data["temperature"]},
        labels=["production"],
    )
```
Testing: Eval Before Deploying a Prompt
A prompt without tests is a gamble. Every change can silently break edge cases. Automated evaluation before deployment catches regressions before they reach users.
Datasets: The Gold Standard
Every prompt needs a test dataset. Minimum size: 20–30 examples covering the main scenarios and edge cases.
```jsonl
{"input": "Can't process payment, card is being declined", "expected": {"category": "billing", "confidence_min": 0.8}}
{"input": "App crashes when opening the chat", "expected": {"category": "technical", "confidence_min": 0.8}}
{"input": "I want to delete my account and all my data", "expected": {"category": "general", "confidence_min": 0.7}}
{"input": "URGENT! Server is down, customers can't log in", "expected": {"category": "urgent", "confidence_min": 0.9}}
```
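A cheap structural check before an eval run catches malformed datasets early. A minimal sketch, assuming the JSONL format above (the `validate_dataset` helper and its thresholds are illustrative, not part of any library):

```python
import json

REQUIRED_KEYS = {"input", "expected"}

def validate_dataset(lines: list[str], min_size: int = 20) -> list[str]:
    """Return a list of problems found in a JSONL eval dataset."""
    problems = []
    if len(lines) < min_size:
        problems.append(f"only {len(lines)} examples, need >= {min_size}")
    categories = set()
    for i, line in enumerate(lines, 1):
        try:
            example = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {i}: invalid JSON")
            continue
        missing = REQUIRED_KEYS - example.keys()
        if missing:
            problems.append(f"line {i}: missing keys {sorted(missing)}")
        else:
            categories.add(example["expected"].get("category"))
    if len(categories) < 2:
        problems.append("dataset covers fewer than 2 categories")
    return problems
```

Run it as the first step of the eval pipeline and fail fast on a non-empty result, so a broken dataset never masquerades as a prompt regression.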
Dataset sources:
- Production logs. Real requests with labeled responses. The most valuable source.
- Manual labeling. For new prompts with no production data yet.
- Synthetic data. An LLM generates variations of existing examples. Useful for expanding edge case coverage.
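For synthetic expansion, the generation step can be as simple as one paraphrase request per seed example. A hedged sketch (the function name and exact prompt wording are illustrative; the commented-out call assumes the same OpenAI client used elsewhere in this article):

```python
def build_augmentation_prompt(example: dict, n_variants: int = 3) -> str:
    """Ask an LLM for paraphrases of a seed example, keeping the label fixed."""
    return (
        f"Rewrite the following support ticket {n_variants} different ways.\n"
        f"Keep the meaning so the category stays "
        f"'{example['expected']['category']}'.\n"
        f"Vary length, tone, and vocabulary. Return one variant per line.\n\n"
        f"Ticket: {example['input']}"
    )

# Hypothetical usage (not executed here):
# response = client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[{"role": "user", "content": build_augmentation_prompt(seed)}],
# )
```

Because the category is pinned in the instruction, the generated variants inherit the seed's label and can go straight into the dataset after a quick human scan.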
Eval Pipeline
```python
import json
from openai import OpenAI
from prompt_registry import PromptRegistry

client = OpenAI()
registry = PromptRegistry()

def evaluate_prompt(prompt_name: str, dataset_path: str, threshold: float = 0.85):
    """Evaluate a prompt against a dataset. Return pass/fail."""
    with open(dataset_path) as f:
        examples = [json.loads(line) for line in f]

    correct = 0
    total = len(examples)
    failures = []

    for example in examples:
        messages = registry.compile(prompt_name, ticket_text=example["input"])
        response = client.chat.completions.create(
            model=registry.get(prompt_name)["model"],
            messages=messages,
            temperature=0,
        )
        result = json.loads(response.choices[0].message.content)

        if result["category"] == example["expected"]["category"]:
            if result["confidence"] >= example["expected"]["confidence_min"]:
                correct += 1
            else:
                failures.append({
                    "input": example["input"],
                    "reason": f"low confidence: {result['confidence']}",
                })
        else:
            failures.append({
                "input": example["input"],
                "reason": f"wrong category: {result['category']}",
            })

    accuracy = correct / total
    passed = accuracy >= threshold
    return {
        "accuracy": accuracy,
        "threshold": threshold,
        "passed": passed,
        "failures": failures,
    }
```
For complex cases, LLM-as-Judge fits well. A judge model evaluates response quality against defined criteria: relevance, completeness, tone.
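The judge side can stay small: a rubric prompt plus a tolerant parser for the verdict. A sketch under the assumption that the judge is asked to return `{"score": 1-5, "reasoning": "..."}` (the rubric text and helper names are illustrative):

```python
import json

JUDGE_RUBRIC = (
    "You are a strict evaluator. Rate the answer on a 1-5 scale for "
    "relevance, completeness, and tone. "
    'Return JSON: {"score": <1-5>, "reasoning": "..."}'
)

def build_judge_messages(question: str, answer: str) -> list[dict]:
    """Build the chat messages sent to the judge model."""
    return [
        {"role": "system", "content": JUDGE_RUBRIC},
        {"role": "user", "content": f"Question:\n{question}\n\nAnswer:\n{answer}"},
    ]

def parse_verdict(raw: str) -> dict:
    """Extract the first JSON object from the judge output, tolerating extra text."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError(f"no JSON object in judge output: {raw!r}")
    verdict = json.loads(raw[start : end + 1])
    if not 1 <= verdict.get("score", 0) <= 5:
        raise ValueError(f"score out of range: {verdict}")
    return verdict
```

The tolerant parser matters in practice: judge models occasionally wrap the JSON in pleasantries, and a brittle `json.loads` on the raw string turns that into a spurious eval failure.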
CI/CD Integration
```yaml
# .github/workflows/prompt-eval.yml
name: Prompt Evaluation
on:
  pull_request:
    paths:
      - 'prompts/**'

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: pip install openai langfuse pyyaml
      - name: Run prompt evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python ci/eval_prompts.py --changed-only
      - name: Comment PR with results
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('eval_results.json'));
            let body = '## Prompt Eval Results\n\n';
            body += '| Prompt | Status | Accuracy | Threshold |\n|---|---|---|---|\n';
            for (const [name, result] of Object.entries(results)) {
              const status = result.passed ? '✅' : '❌';
              body += `| ${name} | ${status} | ${result.accuracy.toFixed(2)} | ${result.threshold} |\n`;
            }
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body
            });
```
Every PR touching prompts automatically runs the eval pipeline and posts results as a comment.
Deploy: Shipping Prompts Without Deploying Code
Three strategies for delivering a new prompt version to production.
Instant Switch
The simplest option. Flip the production label to a new prompt version.
```python
# In the Langfuse UI: assign the label "production" to prompt v14.
# The app picks it up automatically on the next request.
prompt = langfuse.get_prompt(
    name="ticket-classifier",
    label="production",
    cache_ttl_seconds=300,  # 5-minute cache
)
```
Good for non-critical prompts and quick fixes. Risk: 100% of traffic immediately hits the new version.
Canary Deploy
Gradual traffic shift: 5% → 25% → 50% → 100%.
```python
import random

def get_prompt_with_canary(
    name: str,
    canary_percentage: int = 10,
) -> tuple[dict, str]:
    """Return a prompt and its version (production or canary)."""
    if random.randint(1, 100) <= canary_percentage:
        prompt = langfuse.get_prompt(name=name, label="canary")
        return prompt, "canary"
    prompt = langfuse.get_prompt(name=name, label="production")
    return prompt, "production"
```
Canary and production metrics are compared in real time. If the canary degrades, it is rolled back automatically.
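The rollback decision itself is worth keeping as a small, pure function so it can be unit-tested without any LLM calls. A sketch (the thresholds and function name are illustrative):

```python
def should_rollback(
    canary_scores: list[float],
    production_scores: list[float],
    max_drop: float = 0.05,
    min_samples: int = 50,
) -> bool:
    """Roll back when the canary's mean score trails production by more than max_drop.

    A minimum sample count prevents a handful of early requests
    from triggering a rollback on noise.
    """
    if len(canary_scores) < min_samples:
        return False  # not enough data yet
    canary_mean = sum(canary_scores) / len(canary_scores)
    prod_mean = sum(production_scores) / len(production_scores)
    return (prod_mean - canary_mean) > max_drop
```

A proper statistical test would be stricter, but even this mean-difference guard with a sample floor catches the obvious regressions while the canary slice is small.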
Feature Flags
For teams with an existing feature flag system (LaunchDarkly, Unleash, or homegrown):
```python
def get_prompt_version(name: str, user_id: str) -> str:
    """Determine the prompt version via feature flag."""
    flag = feature_flags.get(f"prompt_{name}_version")
    if flag.is_enabled(user_id):
        return flag.get_variant(user_id)  # "v14", "v15"
    return "production"
```
You can also target specific users, segments, or regions.
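One detail worth getting right regardless of the flag provider: assignment should be sticky, so a user doesn't flip between prompt versions across requests mid-conversation. A hash-based sketch (the salt value and function name are illustrative):

```python
import hashlib

def assign_variant(user_id: str, variants: list[str], salt: str = "prompt-exp-1") -> str:
    """Deterministically bucket a user into a prompt variant.

    The same user_id always lands in the same bucket; changing the
    salt reshuffles all users for a new experiment.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]
```

Hosted flag systems do this bucketing for you; the sketch matters mainly for the homegrown case, where `random.randint` per request would silently break per-user consistency.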
Monitor: Tying Metrics to Prompt Versions
Monitoring without version context is useless. Quality dropped — but what broke: the prompt, the model, the data?
Tracing with Prompt Version
Every LLM call should include the prompt version in metadata:
```python
trace = langfuse.trace(
    name="ticket-classification",
    metadata={
        "prompt_name": "ticket-classifier",
        "prompt_version": prompt.version,  # 14
        "prompt_label": "production",
        "model": "gpt-4o-mini",
    },
)
generation = trace.generation(
    name="classify",
    model="gpt-4o-mini",
    prompt=prompt,  # Langfuse automatically links the version
    input=messages,
    output=response,
)
```
Version Dashboard
Key metrics to monitor:
| Metric | What it shows | Alert when |
|---|---|---|
| Accuracy | Fraction of correct responses | < threshold for prompt |
| Latency p95 | Response time | > 2x baseline |
| Token usage | Token consumption | > 1.5x vs previous version |
| Error rate | Fraction of invalid responses | > 5% |
| Cost per request | Cost per call | > budget |
```python
# Example: automatic comparison of two prompt versions
def compare_prompt_versions(
    prompt_name: str,
    version_a: int,
    version_b: int,
    metric: str = "accuracy",
) -> dict:
    """Compare metrics for two prompt versions from Langfuse."""
    traces_a = langfuse.fetch_traces(
        name=f"{prompt_name}-eval",
        metadata={"prompt_version": version_a},
        limit=1000,
    )
    traces_b = langfuse.fetch_traces(
        name=f"{prompt_name}-eval",
        metadata={"prompt_version": version_b},
        limit=1000,
    )
    scores_a = [t.scores[metric] for t in traces_a if metric in t.scores]
    scores_b = [t.scores[metric] for t in traces_b if metric in t.scores]
    mean_a = sum(scores_a) / len(scores_a)
    mean_b = sum(scores_b) / len(scores_b)
    return {
        "version_a": {"version": version_a, "mean": mean_a},
        "version_b": {"version": version_b, "mean": mean_b},
        "diff": mean_b - mean_a,
    }
```
Regression Alerts
```python
# Check metrics every 15 minutes (cron job or Langfuse webhook)
def check_prompt_regression(prompt_name: str):
    current_version = langfuse.get_prompt(name=prompt_name, label="production").version
    recent_scores = get_recent_scores(prompt_name, current_version, hours=1)
    baseline = get_baseline_scores(prompt_name, current_version)

    if recent_scores["accuracy"] < baseline["accuracy"] * 0.9:  # > 10% degradation
        alert(
            channel="slack",
            message=f"Regression detected: {prompt_name} v{current_version}. "
                    f"Accuracy: {recent_scores['accuracy']:.2f} "
                    f"(baseline: {baseline['accuracy']:.2f})",
        )
        # Automatic rollback to the previous version
        rollback_prompt(prompt_name, to_version=current_version - 1)
```
Prompt Organization Patterns
Composition Over Monoliths
A 3,000-token monolithic prompt is hard to test and maintain. Break it into components:
```yaml
# prompts/components/output-format.yaml
name: output-format-json
content: |
  Respond STRICTLY in JSON. No text before or after the JSON.
  If you cannot determine the answer, return {"error": "unable to classify"}.
```

```yaml
# prompts/components/language-rules.yaml
name: language-rules
content: |
  Response language: {{language}}.
  Do not translate proper nouns or technical terms.
```
```python
def compose_prompt(*component_names: str, **variables) -> str:
    """Assemble a prompt from components."""
    parts = []
    for name in component_names:
        component = registry.get(f"components/{name}")
        content = component["content"]
        for key, value in variables.items():
            content = content.replace(f"{{{{{key}}}}}", str(value))
        parts.append(content)
    return "\n\n".join(parts)

# Usage
system_prompt = compose_prompt(
    "ticket-classifier-core",
    "output-format-json",
    "language-rules",
    categories="billing,technical,general",
    language="en",
)
```
Naming Convention
At 50+ prompts, consistent naming matters:
```
{domain}-{task}-{variant}

ticket-classifier-v2
ticket-classifier-multilingual
order-summarizer-short
order-summarizer-detailed
response-generator-formal
response-generator-casual
quality-judge-relevance
quality-judge-toxicity
```
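The convention can be enforced in CI with a one-line check so drift never accumulates. A sketch (the regex encodes the `{domain}-{task}-{variant}` pattern above; tighten it to your own vocabulary):

```python
import re

# lowercase words separated by single hyphens: domain-task or domain-task-variant
NAME_PATTERN = re.compile(r"^[a-z0-9]+(-[a-z0-9]+){1,3}$")

def check_prompt_names(names: list[str]) -> list[str]:
    """Return the names that violate the naming convention."""
    return [n for n in names if not NAME_PATTERN.match(n)]
```

Run it over the registry listing in the same CI job as the evals and fail the build on a non-empty result.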
Prompt Metadata
Each prompt should carry metadata for auditing:
```yaml
name: ticket-classifier
metadata:
  owner: ml-team
  created: 2026-01-15
  last_tested: 2026-03-20
  model_compatibility:
    - gpt-4o-mini
    - claude-3-5-sonnet-20241022
  avg_tokens: 450
  cost_per_call_usd: 0.002
  test_accuracy: 0.92
  dataset_size: 150
```
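Metadata like this is only useful if it stays fresh, which a small audit script can enforce. A sketch assuming the YAML layout above with `last_tested` as an ISO date (the 90-day threshold is an illustrative policy, not a standard):

```python
from datetime import date, timedelta

def find_stale_prompts(
    prompts: dict[str, dict],
    today: date,
    max_age_days: int = 90,
) -> list[str]:
    """Return prompt names whose last_tested date is older than max_age_days."""
    stale = []
    for name, meta in prompts.items():
        last_tested = date.fromisoformat(str(meta["metadata"]["last_tested"]))
        if today - last_tested > timedelta(days=max_age_days):
            stale.append(name)
    return stale
```

Wired into a weekly cron, the output becomes a to-do list: every stale prompt either gets re-run through the eval pipeline or explicitly retired.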
Scaling: From 5 to 500 Prompts
How the system evolves as the number of prompts grows:
| Scale | Registry | Testing | Deploy | Monitor |
|---|---|---|---|---|
| 5–10 prompts | YAML in Git | Manual eval | Instant switch | Logs |
| 10–50 prompts | Langfuse + Git sync | CI eval pipeline | Canary | Version dashboard |
| 50–200 prompts | Langfuse + RBAC | CI + LLM-as-Judge | Feature flags | Alerts + auto-rollback |
| 200+ prompts | Custom registry | Eval platform | Progressive rollout | ML monitoring |
Key thresholds:
10 prompts — you need a registry. Prompts in code become unmanageable.
30 prompts — you need CI eval. Manual testing doesn’t scale; regressions slip through.
50 prompts — you need RBAC. Different teams own different prompts; access control becomes non-optional.
100 prompts — you need auto-rollback. Humans can’t respond to regressions fast enough in real time.
Prompt Management Tools
| Tool | Type | Strengths |
|---|---|---|
| Langfuse | Open-source | Prompt management + tracing + evals in one. Self-hostable |
| PromptLayer | SaaS | Specialized in prompt management. Good UI |
| Humanloop | SaaS | Prompt management + eval + annotation. Enterprise |
| Pezzo | Open-source | Prompt management. Lightweight |
| Custom | Custom | Git + YAML + CI scripts. Maximum control |
Langfuse covers most scenarios: registry with versioning, prompt-to-trace linking, dataset-based evals, MCP server for IDE management. Detailed walkthrough in the Langfuse guide.
Common Mistakes
Prompts in .env or config files. No versioning, no testing, no connection to metrics. Fine for prototypes, falls apart in production.
Testing on three examples. The prompt passes three tests and ships to production. A week later you discover it breaks on long inputs or edge case categories.
No baseline. The new prompt version “works well.” Without a baseline, there’s nothing to compare against. The previous version may have been better.
Optimizing tokens at the expense of quality. Prompt reduced from 800 to 300 tokens. Cost drops 60%. Accuracy drops from 0.94 to 0.81. Saving $50/month costs dozens of wrong responses every day.
Context Engineering for Prompts
A prompt doesn’t exist in isolation. Quality depends on what’s fed alongside it: context engineering determines which data enters the context window and in what order.
Three rules for production prompts:
- Variables instead of hardcoded values. Anything that might change (categories, languages, formats) goes into variables. The prompt stays stable.
- Few-shot examples at the end. Models “see” the end of the context more clearly. Placing examples after instructions improves accuracy.
- Minimal context. Every extra token in the prompt dilutes the model’s attention. If an instruction doesn’t affect quality — remove it.
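The rules above reduce to a simple assembly order: instructions first, few-shot examples last, nothing else. A minimal sketch (the function and its format are illustrative, not a library API):

```python
def assemble_system_prompt(instructions: str, examples: list[tuple[str, str]]) -> str:
    """Put instructions first and few-shot examples at the end of the prompt."""
    parts = [instructions.strip()]
    if examples:
        parts.append("Examples:")
        for inp, out in examples:
            parts.append(f"Input: {inp}\nOutput: {out}")
    return "\n\n".join(parts)
```

Keeping assembly in one function also makes the "minimal context" rule testable: drop a section, re-run the eval, and keep the shorter prompt only if accuracy holds.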
Where to Start
Week 1. Inventory. Collect all prompts from your codebase into one place — YAML files in Git or Langfuse. Standardize the format: name, version, model, messages, variables.
Week 2. Datasets. For each prompt, collect 20–30 test examples from production logs. Label the expected output.
Week 3. Eval pipeline. A script that runs the prompt against the dataset and outputs accuracy. Triggered in CI when prompts change.
Week 4. Monitoring. Prompt version in every trace’s metadata. Dashboard with metrics per version. Alert on > 10% degradation.
After a month — a working system where every prompt change is tested, versioned, and monitored. No chaos, no regressions, no “who changed this prompt?”
Frequently Asked Questions
Why shouldn't I store prompts in code?
At scale, prompts in code create three problems: every change requires a deployment, there's no versioning or rollback, and non-engineers can't iterate on prompts. A prompt management system like Langfuse decouples prompt changes from code deployments.
How do you version prompts in production?
Each prompt gets a unique name and version number. The application fetches the active version at runtime from a prompt registry (e.g., Langfuse). This allows A/B testing, instant rollback, and change tracking without touching application code.
How do you test prompts before deploying to production?
Build an eval pipeline: a dataset of test inputs with expected outputs, a scoring function (exact match, LLM-as-Judge, or custom metrics), and a CI step that runs the eval on every prompt change. Block deployment if accuracy drops below threshold.