Tags: DSPy, LLM, Prompt Optimization, Python, AI, Machine Learning, RAG, Stanford, LLM Programming, Compiler

DSPy — Systematic Prompt Optimization and Declarative LLM Programming

A practical guide to DSPy for building and optimizing LLM programs without manual prompt engineering: Signatures as typed I/O specifications that let DSPy infer prompt format and instructions instead of handcrafted strings, the three core Modules — dspy.Predict for direct completion, dspy.ChainOfThought adding a rationale reasoning step before the final output, and dspy.ReAct for tool-using agents that interleave reasoning and action steps — each parameterised by a Signature, composing Modules into Programs by extending dspy.Module with named sub-module attributes that the compiler discovers and optimizes independently, dspy.Example dataset construction with with_inputs() declarations separating input fields from ground-truth labels, metric functions as Python callables returning a float in [0,1] that the optimizer maximises across the development set, LLM-as-judge metric patterns using a ChainOfThought judge signature for generation tasks where exact match is insufficient, BootstrapFewShot for fast first-iteration optimization that runs the program on training examples, collects successful traces, and uses them as demonstrations without additional LLM calls, MIPROv2 as the recommended production optimizer that additionally searches over instruction text using a meta-program generating and evaluating candidate instructions with auto='medium' budget control and num_trials configuration, multi-hop RAG pipeline construction with a ChromadbRM retriever, a query refinement ChainOfThought, and a GenerateAnswer reader all compiled together so the optimizer tunes both retrieval and synthesis prompts for the target corpus, TypedPredictor wrapping any Signature with Pydantic model validation and max_retries backtracking that feeds validation errors back to the model as correction prompts, TypedChainOfThought for reasoning-before-structured-output patterns, dspy.Assert for hard security constraints like SELECT-only SQL that trigger backtracking retries on violation, 
dspy.Suggest for soft quality constraints that degrade gracefully without raising exceptions, compiled program serialization to JSON artifacts as versioned prompt stores committed alongside code, FastAPI serving patterns loading the compiled program once at startup, and CI evaluation gates scoring sampled predictions against a metric threshold before merging new compiled artifacts.

2026-06-28

Why Hand-Written Prompts Break at Scale

A carefully crafted prompt that achieves 78% accuracy on your test set today will degrade silently when the underlying model is updated, the input distribution shifts, or a colleague edits the wording for clarity. Prompt engineering has no feedback loop — you iterate manually, there is no reproducibility guarantee, and the optimisation is invisible to version control beyond a string diff. The result is LLM applications that are fragile by default.

DSPy, developed at Stanford NLP, reframes the problem. Instead of writing prompts, you declare the task as a typed Signature: a mapping from named input fields to named output fields with optional descriptions. Instead of prompt strings, you compose Modules that turn Signatures into parameterised LLM calls. And instead of manually tuning, you run a Compiler (optimizer) that searches over prompt instructions and few-shot examples to maximise a metric you define. Traditional prompt engineering for enterprise focuses on structured output schemas, tool use, and guardrail design; DSPy adds a systematic search layer on top, turning the art of prompting into an optimization problem with measurable progress.

Declarative

Define what you want (Signature), not how to say it (prompt). DSPy infers the prompt format, instruction text, and few-shot examples automatically during compilation.

Composable

Build complex pipelines by chaining Modules — Predict, ChainOfThought, ReAct, Retrieve — each with its own Signature. Swap components without rewriting downstream logic.

Optimizable

Compilers like BootstrapFewShot and MIPROv2 search for the best prompt instructions and demonstrations for your metric. Recompile when the model or task changes.

Installation and LM Configuration

DSPy is installed from PyPI and requires Python 3.10+. It supports all major LLM providers through a unified dspy.LM interface backed by LiteLLM, so the same code runs against OpenAI, Anthropic, local Ollama models, or any OpenAI-compatible endpoint by changing a single model string.

pip install dspy

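# Local Ollama models are reached through LiteLLM, so no extra package should be needed:
# run an Ollama server and use an "ollama/..." model string as shown below.

import dspy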
import os

# OpenAI
lm = dspy.LM(
    model="openai/gpt-4o-mini",
    api_key=os.environ["OPENAI_API_KEY"],
    max_tokens=2048,
    temperature=0.0,  # deterministic for optimization passes
)

# Anthropic Claude
lm = dspy.LM(
    model="anthropic/claude-sonnet-4-6",
    api_key=os.environ["ANTHROPIC_API_KEY"],
    max_tokens=4096,
)

# Local Ollama (no API key required)
lm = dspy.LM(
    model="ollama/llama3.2",
    api_base="http://localhost:11434",
    max_tokens=2048,
)

# Set globally — all modules use this unless overridden
dspy.configure(lm=lm)

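# Scoped override for a different temperature profile.
# dspy.configure() is global, so use a context manager instead of reconfiguring.
creative_lm = dspy.LM(
    model="openai/gpt-4o-mini",
    api_key=os.environ["OPENAI_API_KEY"],
    temperature=0.7,
)
with dspy.context(lm=creative_lm):
    pass  # module calls made inside this block use creative_lm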

Note

Set temperature=0.0 during compilation (optimization passes) to make bootstrap example selection deterministic. Once compilation is complete and you have a saved program, restore the temperature appropriate for your production use case. Stochastic temperatures during optimization cause the compiler to make inconsistent choices between iterations.

Signatures — Declarative I/O Specifications

A Signature is the type contract of an LLM operation. It specifies what fields go in, what fields come out, and optionally describes what each field contains. DSPy uses signatures to generate the prompt structure — field names, descriptions, and formatting — without you writing a single prompt word. Signatures can be defined inline as a string (for simple cases) or as a class (for production use with field descriptions).

import dspy

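# Inline string signature: "inputs -> outputs", with commas separating multiple fields
quick_qa = dspy.Predict("question -> answer")

# Class-based signature: field names become prompt labels, the docstring becomes the instruction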
class BasicQA(dspy.Signature):
    """Answer questions with short factual responses."""
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="A concise, factual answer in one sentence.")

# Multi-input, multi-output signature
class SentimentAndSummary(dspy.Signature):
    """Analyze a product review for sentiment and key points."""
    review_text: str = dspy.InputField(desc="Raw product review from the customer.")
    product_category: str = dspy.InputField(desc="Product category (e.g. Electronics, Clothing).")
    sentiment: str = dspy.OutputField(desc="One of: positive, negative, neutral.")
    summary: str = dspy.OutputField(desc="Two-sentence summary of the main points raised.")
    confidence: float = dspy.OutputField(desc="Confidence score from 0.0 to 1.0.")

# Classification with a constrained output
class IntentClassifier(dspy.Signature):
    """Classify customer support message intent."""
    message: str = dspy.InputField()
    intent: str = dspy.OutputField(
        desc="One of: billing, technical_support, account_management, general_inquiry, complaint"
    )
    urgency: str = dspy.OutputField(desc="One of: low, medium, high, critical")
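
To make the idea of a signature as an I/O contract concrete, here is a minimal standard-library sketch of how an inline "inputs -> outputs" string could be split into field names and laid out as a prompt skeleton. The names parse_signature, ParsedSignature, and render_prompt are hypothetical helpers for illustration only; this is not how DSPy implements signatures internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedSignature:
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]


def parse_signature(spec: str) -> ParsedSignature:
    """Split an 'a, b -> c' string into input and output field names."""
    if spec.count("->") != 1:
        raise ValueError("expected exactly one '->' separator")
    left, right = spec.split("->")

    def names(side: str) -> tuple[str, ...]:
        # Drop optional type annotations such as 'score: float'.
        return tuple(p.split(":")[0].strip() for p in side.split(",") if p.strip())

    inputs, outputs = names(left), names(right)
    if not inputs or not outputs:
        raise ValueError("a signature needs at least one input and one output")
    return ParsedSignature(inputs, outputs)


def render_prompt(sig: ParsedSignature, instructions: str, **values: str) -> str:
    """Lay out the instruction, the filled input fields, and the first output label."""
    lines = [instructions, ""]
    lines += [f"{name}: {values[name]}" for name in sig.inputs]
    lines.append(f"{sig.outputs[0]}:")
    return "\n".join(lines)


sig = parse_signature("context, question -> answer")
prompt = render_prompt(
    sig,
    "Answer using the context.",
    context="Paris is in France.",
    question="Where is Paris?",
)
print(prompt)
```

The point of the sketch is that the field names alone carry enough structure to generate a prompt layout, which is why renaming a field changes what the model sees.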

Modules — Predict, ChainOfThought, and ReAct

Modules are parameterised LLM calls that take a Signature and implement a strategy for answering it. The core modules cover most production use cases: Predict for direct completion, ChainOfThought for step-by-step reasoning, and ReAct for tool-using agents. Each module's internal prompt is adjusted by the compiler during optimization, so you never edit the prompt string directly.

import dspy

# --- Predict: direct completion ---
# Generates the output fields directly from the input
predict = dspy.Predict(BasicQA)
result = predict(question="What is the capital of France?")
print(result.answer)  # "The capital of France is Paris."

# --- ChainOfThought: adds a reasoning step before the final output ---
# Asks the model to reason step-by-step, which tends to help on multi-step tasks
cot = dspy.ChainOfThought(BasicQA)
result = cot(question="If a train leaves at 9am going 60mph and another at 10am going 80mph, when do they meet?")
# The extra field is named `reasoning` in DSPy 2.5+ (older releases called it `rationale`)
print(result.reasoning)  # step-by-step working
print(result.answer)

# --- Hints: pass a hint as an extra input field to guide reasoning ---
class QAWithHint(dspy.Signature):
    question: str = dspy.InputField()
    hint: str = dspy.InputField(desc="A partial clue toward the answer.")
    answer: str = dspy.OutputField()

cot_hint = dspy.ChainOfThought(QAWithHint)

# --- ReAct: interleave reasoning and tool actions ---
# Define tools as plain Python functions with docstrings
def search_wikipedia(query: str) -> str:
    """Search Wikipedia for factual information. Returns a short passage."""
    import wikipedia
    try:
        return wikipedia.summary(query, sentences=3)
    except Exception:
        return "No results found."

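import ast
import operator

# Arithmetic operators the calculator accepts; anything else is rejected
_SAFE_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def _eval_arithmetic(node):
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _SAFE_OPS:
        return _SAFE_OPS[type(node.op)](_eval_arithmetic(node.left), _eval_arithmetic(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _SAFE_OPS:
        return _SAFE_OPS[type(node.op)](_eval_arithmetic(node.operand))
    raise ValueError("unsupported expression")

def calculator(expression: str) -> str:
    """Evaluate a basic arithmetic expression (+, -, *, /). Returns the numeric result."""
    try:
        # Walk the parsed AST instead of calling eval(), which is not safe even with empty builtins
        return str(_eval_arithmetic(ast.parse(expression, mode="eval").body))
    except Exception as e:
        return f"Error: {e}"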

class ResearchAssistant(dspy.Signature):
    """Answer research questions using search and calculation when needed."""
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="A well-sourced, accurate answer.")

react = dspy.ReAct(ResearchAssistant, tools=[search_wikipedia, calculator], max_iters=5)
result = react(question="What is the GDP per capita of Norway multiplied by 3?")
print(result.answer)
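
The interleaving of reasoning and tool calls described above can be sketched without any LLM. In this toy loop the "policy" is a scripted stand-in for the model, and run_react, scripted_policy, and the two tools are hypothetical names chosen for illustration; the real dspy.ReAct prompt format and stopping rules differ.

```python
from typing import Callable

Tool = Callable[[str], str]


def run_react(policy, tools: dict[str, Tool], question: str, max_iters: int = 5) -> str:
    """Alternate between choosing an action and observing its result."""
    trajectory: list[tuple[str, str, str]] = []  # (tool, argument, observation)
    for _ in range(max_iters):
        tool_name, argument = policy(question, trajectory)
        if tool_name == "finish":
            return argument
        tool = tools.get(tool_name)
        observation = tool(argument) if tool else f"Unknown tool: {tool_name}"
        trajectory.append((tool_name, argument, observation))
    return "No answer within the iteration budget."


def scripted_policy(question, trajectory):
    """Stand-in for the LLM: look up a number, double it, then finish."""
    if not trajectory:
        return "lookup", "widgets_in_stock"
    if len(trajectory) == 1:
        return "double", trajectory[0][2]
    return "finish", trajectory[-1][2]


tools = {
    "lookup": lambda key: {"widgets_in_stock": "21"}.get(key, "not found"),
    "double": lambda value: str(int(value) * 2),
}
answer = run_react(scripted_policy, tools, "How many widgets if stock doubles?")
print(answer)
```

The max_iters bound plays the same role as in the ReAct call above: it caps cost when the model never decides to finish.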

Composing Modules into Programs

Real LLM applications chain multiple operations: retrieve context, summarise it, classify intent, generate a response. DSPy programs are Python classes that extend dspy.Module, declare their sub-modules as instance attributes (so the compiler can discover and optimize them), and implement the logic in forward(). The compiler treats each declared sub-module as an independent optimization target.

import dspy

class SupportTicketRouter(dspy.Module):
    """Classify a support ticket and generate a recommended action."""

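    def __init__(self):
        super().__init__()
        # Declare sub-modules as instance attributes so the compiler finds them
        self.classify = dspy.Predict(IntentClassifier)
        # ActionRecommender is defined below; it only needs to exist when the router is instantiated
        self.generate_action = dspy.ChainOfThought(ActionRecommender)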

    def forward(self, ticket_text: str, customer_tier: str) -> dspy.Prediction:
        # Step 1: classify intent and urgency
        classification = self.classify(message=ticket_text)

        # Step 2: generate recommended action using classification as context
        action = self.generate_action(
            ticket=ticket_text,
            intent=classification.intent,
            urgency=classification.urgency,
            customer_tier=customer_tier,
        )

        return dspy.Prediction(
            intent=classification.intent,
            urgency=classification.urgency,
            recommended_action=action.recommended_action,
            escalate=action.escalate,
        )

class ActionRecommender(dspy.Signature):
    """Recommend a support action based on ticket classification."""
    ticket: str = dspy.InputField()
    intent: str = dspy.InputField()
    urgency: str = dspy.InputField()
    customer_tier: str = dspy.InputField(desc="One of: free, pro, enterprise")
    recommended_action: str = dspy.OutputField(desc="Specific next step for the support agent.")
    escalate: bool = dspy.OutputField(desc="True if this requires manager escalation.")

# Use the program
router = SupportTicketRouter()
result = router(
    ticket_text="My production API is returning 500 errors for all requests since 2pm.",
    customer_tier="enterprise",
)
print(result.intent)            # technical_support
print(result.urgency)           # critical
print(result.recommended_action)
print(result.escalate)          # True
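
Why sub-modules must be instance attributes becomes clearer with a toy version of the discovery step. The sketch below walks an object's attributes and yields every predictor it finds, including nested ones. Predictor, Program, Router, and Pipeline are hypothetical stand-ins; DSPy's own traversal is more involved, so treat this only as an illustration of the idea.

```python
class Predictor:
    """Stand-in for an optimizable LLM call; holds the tunable state."""

    def __init__(self, signature: str):
        self.signature = signature
        self.demos: list[dict] = []


class Program:
    """Minimal base class that finds Predictor attributes, including nested ones."""

    def named_predictors(self, prefix: str = ""):
        for name, value in vars(self).items():
            path = f"{prefix}{name}"
            if isinstance(value, Predictor):
                yield path, value
            elif isinstance(value, Program):
                yield from value.named_predictors(prefix=f"{path}.")


class Router(Program):
    def __init__(self):
        self.classify = Predictor("message -> intent, urgency")
        self.generate_action = Predictor("ticket, intent -> recommended_action")
        self.max_len = 500  # plain attributes are ignored


class Pipeline(Program):
    def __init__(self):
        self.router = Router()
        self.summarize = Predictor("ticket -> summary")


found = [name for name, _ in Pipeline().named_predictors()]
print(found)
```

A predictor created as a local variable inside forward() would never appear in this walk, which is the practical reason to declare it in __init__.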

Evaluation Datasets and Metrics

Optimization requires a labelled dataset and a metric function. The dataset is a list of dspy.Example objects — each containing input fields and optionally a ground-truth label used for metric computation. The same principles that apply to general LLM evaluation — golden datasets, metric design, and regression testing — apply directly to DSPy: a metric is a Python callable that receives a prediction and an example and returns a score between 0 and 1. The optimizer maximises the average metric score across your development set.

import dspy

# Build a dataset of Examples
# with_inputs() declares which fields are inputs (vs. labels)
trainset = [
    dspy.Example(
        question="What year did the Berlin Wall fall?",
        answer="The Berlin Wall fell in 1989.",
    ).with_inputs("question"),
    dspy.Example(
        question="Who wrote Hamlet?",
        answer="Hamlet was written by William Shakespeare.",
    ).with_inputs("question"),
    # ... 50-200 examples recommended for BootstrapFewShot
]

devset = trainset[:20]   # held-out set for metric evaluation
trainset = trainset[20:] # training set for bootstrap

# --- Metric function ---
# Must return a float in [0, 1] or bool
# The optimizer maximises the average over the devset

def answer_exact_match(example: dspy.Example, pred: dspy.Prediction, trace=None) -> bool:
    """Lenient match: True if either answer contains the other (case-insensitive)."""
    expected = example.answer.strip().lower()
    predicted = pred.answer.strip().lower()
    if not expected or not predicted:
        return False  # an empty string is a substring of everything, so reject it explicitly
    # Simple substring check; replace with your domain-specific evaluation
    return expected in predicted or predicted in expected

# For generation tasks — LLM-as-judge metric
judge_sig = dspy.Signature(
    "question, expected_answer, predicted_answer -> correct: bool, reason: str"
)
judge = dspy.ChainOfThought(judge_sig)

def llm_judge_metric(example: dspy.Example, pred: dspy.Prediction, trace=None) -> float:
    judgment = judge(
        question=example.question,
        expected_answer=example.answer,
        predicted_answer=pred.answer,
    )
    return float(judgment.correct)

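# Run evaluation before optimization (baseline)
evaluator = dspy.Evaluate(devset=devset, metric=answer_exact_match, num_threads=4)
baseline_program = dspy.Predict(BasicQA)
baseline_result = evaluator(baseline_program)
# Evaluate reports a percentage on a 0-100 scale; newer releases wrap it in a result object with .score
baseline_score = getattr(baseline_result, "score", baseline_result)
print(f"Baseline accuracy: {baseline_score:.1f}%")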
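
When substring matching is too coarse but an LLM judge is too expensive, a graded metric such as token-level F1 sits in between. The sketch below is one possible implementation using only the standard library; tokenize, token_f1, and f1_metric are names introduced here, and the SimpleNamespace objects merely imitate the .answer attribute of a dspy.Example and dspy.Prediction.

```python
import re
from collections import Counter
from types import SimpleNamespace


def tokenize(text: str) -> list[str]:
    """Lowercase and keep alphanumeric runs only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(expected: str, predicted: str) -> float:
    """Harmonic mean of token precision and recall, in [0, 1]."""
    gold, pred = tokenize(expected), tokenize(predicted)
    if not gold or not pred:
        return float(gold == pred)
    overlap = sum((Counter(gold) & Counter(pred)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def f1_metric(example, pred, trace=None) -> float:
    """Same call shape as the metrics above: (example, prediction, trace)."""
    return token_f1(example.answer, pred.answer)


example = SimpleNamespace(answer="Hamlet was written by William Shakespeare.")
pred = SimpleNamespace(answer="William Shakespeare wrote Hamlet.")
score = f1_metric(example, pred)
print(round(score, 2))
```

Because the score is continuous, the optimizer can distinguish a nearly right answer from a completely wrong one, which a boolean metric cannot.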

Compilers — BootstrapFewShot and MIPROv2

DSPy compilers (optimizers) take a program and a metric, then search for better prompt parameters — instruction text, few-shot demonstrations, or both. BootstrapFewShot is the fastest option: it runs the program on training examples, collects successful traces, and uses them as demonstrations in the prompt. MIPROv2 is the recommended production optimizer — it additionally searches over instruction text using a meta-program that generates and evaluates candidate instructions.

import dspy
from dspy.teleprompt import BootstrapFewShot, MIPROv2

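# NOTE: trainset, devset, and metric must match the program being compiled. The QA
# examples and answer_exact_match above fit a question -> answer program; for
# SupportTicketRouter, build examples with ticket_text and customer_tier inputs and
# write a metric that scores the intent, urgency, and escalate fields.
program = SupportTicketRouter()

# --- BootstrapFewShot ---
# Fast, cheap: no instruction search; LLM calls are limited to running the program on training examples
# Good for: first iteration, tight cost budgets, simple programs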
bootstrap_optimizer = BootstrapFewShot(
    metric=answer_exact_match,
    max_bootstrapped_demos=4,     # max few-shot examples per module
    max_labeled_demos=4,          # max labeled examples from trainset
    max_rounds=1,                 # number of bootstrap iterations
)

compiled_program = bootstrap_optimizer.compile(
    student=program,
    trainset=trainset,
)

# Evaluate after optimization
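result = evaluator(compiled_program)
score = getattr(result, "score", result)  # 0-100 scale; newer releases return an object with .score
print(f"After BootstrapFewShot: {score:.1f}%")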

# --- MIPROv2 ---
# Slower, more thorough — searches instruction space + demonstrations
# Recommended for production — typically 5-15% accuracy gains over bootstrap
mipro_optimizer = MIPROv2(
    metric=answer_exact_match,
    auto="medium",            # "light" | "medium" | "heavy" — controls budget
    num_threads=8,            # parallel candidate evaluation
    verbose=True,             # show optimization progress
)

compiled_program_mipro = mipro_optimizer.compile(
    student=program,
    trainset=trainset,
    valset=devset,
    num_trials=30,            # optimization trials; recent releases reject this unless the optimizer uses auto=None
    max_bootstrapped_demos=3,
    max_labeled_demos=4,
    requires_permission_to_run=False,  # skip interactive confirmation
)

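mipro_result = evaluator(compiled_program_mipro)
mipro_score = getattr(mipro_result, "score", mipro_result)  # 0-100 scale; newer releases return an object with .score
print(f"After MIPROv2: {mipro_score:.1f}%")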

# Save the compiled program — stores optimized prompts as JSON
compiled_program_mipro.save("support_router_optimized.json")

# Load it back
restored = SupportTicketRouter()
restored.load("support_router_optimized.json")
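
The bootstrap step described above (run the program on training examples, keep the traces the metric accepts, reuse them as demonstrations) can be sketched in a few lines. Here toy_program stands in for an LLM call and bootstrap_demos is a hypothetical helper; the real BootstrapFewShot also handles teachers, multiple predictors, and rounds.

```python
from typing import Callable


def bootstrap_demos(
    program: Callable[[str], str],
    trainset: list[dict],
    metric: Callable[[dict, str], bool],
    max_demos: int = 4,
) -> list[dict]:
    """Run the program on training inputs and keep outputs the metric accepts."""
    demos: list[dict] = []
    for example in trainset:
        if len(demos) >= max_demos:
            break
        prediction = program(example["question"])
        if metric(example, prediction):
            demos.append({"question": example["question"], "answer": prediction})
    return demos


def toy_program(question: str) -> str:
    """Stand-in for an LLM call; it knows only two facts."""
    facts = {"capital of France?": "Paris", "2 + 2?": "4"}
    return facts.get(question, "unknown")


def exact(example: dict, prediction: str) -> bool:
    return example["answer"] == prediction


train = [
    {"question": "capital of France?", "answer": "Paris"},
    {"question": "capital of Peru?", "answer": "Lima"},
    {"question": "2 + 2?", "answer": "4"},
]
demos = bootstrap_demos(toy_program, train, exact, max_demos=4)
print(demos)
```

Only examples the program already gets right become demonstrations, which is why a weak baseline or a lenient metric directly limits what bootstrapping can find.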

Note

MIPROv2 makes multiple LLM calls per trial to generate and evaluate instruction candidates. A medium budget with 30 trials on a GPT-4o-mini backend costs approximately $1–5 per compilation run. Heavy budgets or using GPT-4o as the optimizer LM can reach $20–50. Use a cheaper model for the optimizer and a capable model for the student program. MIPROv2 accepts separate prompt_model and task_model arguments for this, so the two LMs can be set independently of the global dspy.configure(lm=...) setting.

Building and Optimizing a RAG Pipeline

DSPy's most common production pattern is an optimized Retrieval-Augmented Generation pipeline. The retriever is wrapped as a DSPy module, the reader is a ChainOfThought module, and the compiler automatically tunes both the query generation and answer synthesis prompts for your specific document corpus and question distribution. The retrieval quality fundamentals (chunking strategy, embedding choice, and index type) still apply: DSPy optimizes the prompts around whatever retriever you plug in, but a poor retriever still produces poor context.

import dspy
from dspy.retrieve.chromadb_rm import ChromadbRM  # or any retriever adapter

# Configure retriever (Chroma example)
retriever = ChromadbRM(
    collection_name="knowledge_base",
    persist_directory="./chroma_db",
    # embedding_function is omitted so the adapter falls back to Chroma's default embedder
    k=5,                      # top-k passages per query
)
dspy.configure(rm=retriever)

# Signature for the reader
class GenerateAnswer(dspy.Signature):
    """Answer the question using only information from the retrieved context."""
    context: list[str] = dspy.InputField(desc="Relevant passages retrieved from the knowledge base.")
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="A precise answer citing specific facts from the context.")
    citations: str = dspy.OutputField(desc="Which passage(s) support the answer (quote key phrases).")

# Multi-hop RAG: retrieve, refine query, retrieve again, answer
class MultiHopRAG(dspy.Module):
    def __init__(self, num_passages: int = 5, hops: int = 2):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        # Query refinement: generate a better search query based on initial context
        self.refine_query = dspy.ChainOfThought("context, question -> search_query")
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
        self.hops = hops

    def forward(self, question: str) -> dspy.Prediction:
        context = []

        for hop in range(self.hops):
            if hop == 0:
                # First hop: use the raw question
                passages = self.retrieve(question).passages
            else:
                # Subsequent hops: refine the search query based on gathered context
                refined = self.refine_query(
                    context=context,
                    question=question,
                )
                passages = self.retrieve(refined.search_query).passages

            context.extend(passages)

        # Deduplicate while preserving order
        seen = set()
        context = [p for p in context if not (p in seen or seen.add(p))]

        result = self.generate_answer(context=context, question=question)
        return dspy.Prediction(answer=result.answer, citations=result.citations, context=context)

# Metric: answer matches ground truth AND cites context
def rag_metric(example, pred, trace=None):
    answer_correct = example.answer.lower() in pred.answer.lower()
    has_citation = bool(pred.citations and len(pred.citations) > 20)
    return float(answer_correct) * 0.8 + float(has_citation) * 0.2

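# Compile the multi-hop RAG
# rag_trainset is assumed to be a list of dspy.Example(question=..., answer=...).with_inputs("question")
from dspy.teleprompt import BootstrapFewShot

rag = MultiHopRAG(num_passages=5, hops=2)
optimizer = BootstrapFewShot(metric=rag_metric, max_bootstrapped_demos=3)
compiled_rag = optimizer.compile(rag, trainset=rag_trainset)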
compiled_rag.save("multihop_rag.json")
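
The control flow of the multi-hop loop is easier to test in isolation with a stub retriever. The sketch below uses a toy keyword-overlap retriever and a hand-written refine function in place of the ChainOfThought query refiner; keyword_retriever, multi_hop, and refine are hypothetical names, and the scoring is deliberately simplistic.

```python
def keyword_retriever(corpus: list[str], query: str, k: int = 2) -> list[str]:
    """Rank passages by how many query words they share (toy scoring)."""
    words = set(query.lower().split())
    scored = [(len(words & set(p.lower().split())), p) for p in corpus]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
    return [p for _, p in ranked[:k]]


def multi_hop(corpus, question, refine, hops: int = 2, k: int = 2) -> list[str]:
    """Retrieve with the raw question first, then with refined queries."""
    context: list[str] = []
    query = question
    for hop in range(hops):
        if hop > 0:
            query = refine(question, context)
        context.extend(keyword_retriever(corpus, query, k))
    # dict.fromkeys removes duplicates while preserving first-seen order
    return list(dict.fromkeys(context))


def refine(question: str, context: list[str]) -> str:
    """Stand-in for the query-refinement LLM call: reuse the top passage as the query."""
    return context[0] if context else question


corpus = [
    "marie curie was born in warsaw",
    "warsaw is the capital of poland",
    "paris is the capital of france",
]
passages = multi_hop(corpus, "where was marie curie born", refine)
print(passages)
```

The second hop surfaces the passage about Warsaw only because the first hop put "warsaw" into the context, which is the behaviour the refined query is meant to produce.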

TypedPredictor — Structured Outputs with Pydantic

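For production systems that pipe LLM outputs into downstream code, untyped string outputs are fragile. DSPy's TypedPredictor validates outputs against Pydantic models at inference time and, on validation failure, retries up to max_retries times, feeding the validation error back to the model as a correction prompt. TypedPredictor belongs to the DSPy 2.4-era API; from 2.5 onward, dspy.Predict and dspy.ChainOfThought accept Pydantic-typed output fields directly, so check which API your installed version provides.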

import dspy
from pydantic import BaseModel, Field
from typing import Literal

class TicketAnalysis(BaseModel):
    intent: Literal["billing", "technical_support", "account_management", "complaint", "general"]
    urgency: Literal["low", "medium", "high", "critical"]
    sentiment_score: float = Field(ge=-1.0, le=1.0, description="Sentiment from -1 (negative) to 1 (positive)")
    key_issues: list[str] = Field(min_length=1, max_length=5)
    requires_human: bool

class TypedAnalyzer(dspy.Signature):
    """Analyze a customer support ticket and return structured JSON."""
    ticket: str = dspy.InputField()
    analysis: TicketAnalysis = dspy.OutputField()

# TypedPredictor wraps any Signature with Pydantic validation + retry
analyzer = dspy.TypedPredictor(TypedAnalyzer, max_retries=3)

result = analyzer(ticket="My invoice is wrong and I've been charged twice!")
analysis: TicketAnalysis = result.analysis
print(analysis.intent)         # "billing"
print(analysis.urgency)        # "high"
print(analysis.sentiment_score) # -0.7
print(analysis.key_issues)     # ["double charge", "incorrect invoice"]
print(analysis.requires_human) # True

# Also works with ChainOfThought for reasoning before structured output
typed_cot = dspy.TypedChainOfThought(TypedAnalyzer, max_retries=3)
result = typed_cot(ticket="System down, 500 errors for all enterprise users")
print(result.analysis.urgency)  # "critical"
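
The retry-with-feedback mechanism can be illustrated without Pydantic or an LLM. In this sketch, validate_ticket is a hand-written validator, fake_model is a stand-in for the model that corrects itself once it sees the error message, and predict_with_retries is a hypothetical helper; it shows the general pattern rather than DSPy's internal retry logic.

```python
import json


def validate_ticket(raw: str) -> dict:
    """Parse JSON and check two fields; raise ValueError with a readable message."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be a JSON object")
    if data.get("urgency") not in {"low", "medium", "high", "critical"}:
        raise ValueError("urgency must be one of: low, medium, high, critical")
    score = data.get("sentiment_score")
    if isinstance(score, bool) or not isinstance(score, (int, float)) or not -1.0 <= score <= 1.0:
        raise ValueError("sentiment_score must be a number between -1 and 1")
    return data


def predict_with_retries(generate, validate, max_retries: int = 3) -> dict:
    """Call generate(feedback) until validate accepts the output or retries run out."""
    feedback = None
    for _ in range(max_retries + 1):
        raw = generate(feedback)
        try:
            return validate(raw)
        except ValueError as exc:
            feedback = f"Previous output was rejected: {exc}. Please correct it."
    raise ValueError(f"no valid output after {max_retries} retries; last feedback: {feedback}")


attempts = []


def fake_model(feedback):
    """Stand-in for the LLM: returns a bad value first, then a corrected one."""
    attempts.append(feedback)
    if feedback is None:
        return '{"urgency": "urgent", "sentiment_score": -0.7}'
    return '{"urgency": "high", "sentiment_score": -0.7}'


result = predict_with_retries(fake_model, validate_ticket)
print(result, len(attempts))
```

Each retry costs one more model call, so max_retries is a latency and cost budget as much as a correctness setting.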

Assertions and Suggestions — Constraining Outputs

Some constraints can't be captured by a Pydantic type, for example ensuring a generated SQL query only selects from allowed tables, or that a summary is under 100 words. DSPy provides two runtime constraint mechanisms: dspy.Assert raises an exception and forces a retry with the constraint violation fed back to the model; dspy.Suggest also feeds the violation back for a retry but degrades gracefully, logging the violation and returning the (possibly violating) output rather than raising. Both belong to the DSPy 2.x assertion API; later releases replace them with modules such as dspy.Refine and dspy.BestOfN, so check your installed version.

import dspy

ALLOWED_TABLES = {"orders", "customers", "products", "inventory"}

class SQLQuerySig(dspy.Signature):
    """Generate a read-only SQL SELECT query for the given request."""
    request: str = dspy.InputField(desc="Natural language data request.")
    allowed_tables: str = dspy.InputField(desc="Comma-separated list of tables available.")
    sql: str = dspy.OutputField(desc="A valid SQL SELECT statement.")

class SafeSQLGenerator(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought(SQLQuerySig)

    def forward(self, request: str) -> dspy.Prediction:
        result = self.generate(
            request=request,
            allowed_tables=", ".join(ALLOWED_TABLES),
        )
        sql = result.sql.strip()

        # Hard constraint — fail and retry if violated
        dspy.Assert(
            sql.strip().lower().startswith("select"),
            "Query must be a SELECT statement — no INSERT, UPDATE, DELETE, or DROP allowed.",
        )

        # Check no disallowed tables are referenced
        sql_lower = sql.lower()
        disallowed = [t for t in ("users_pii", "payment_methods", "audit_log") if t in sql_lower]
        dspy.Assert(
            len(disallowed) == 0,
            f"Query references restricted tables: {disallowed}. Use only: {ALLOWED_TABLES}",
        )

        # Soft constraint — warn but don't block
        dspy.Suggest(
            len(sql) < 2000,
            "Query is very long — consider simplifying or breaking into CTEs.",
        )

        return dspy.Prediction(sql=sql)

# Enable assert-based backtracking (up to 3 retries on assertion failures)
generator = dspy.assert_transform_module(SafeSQLGenerator(), max_backtracks=3)
result = generator(request="Show me top 10 customers by order value this month")
print(result.sql)
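
A prefix check for "select" does not stop a second statement after a semicolon, and a blocklist misses any table not listed in it. The sketch below shows a stricter allowlist-based check whose result could feed the condition of an assertion. check_sql and TABLE_REF are hypothetical names, and the regex is a rough heuristic rather than a SQL parser: it rejects CTEs and does not understand schema-qualified names or comma joins, so a real deployment should pair it with read-only database credentials.

```python
import re

ALLOWED = frozenset({"orders", "customers", "products", "inventory"})
# Matches the identifier that follows FROM or JOIN (heuristic, not a parser).
TABLE_REF = re.compile(r"\b(?:from|join)\s+([a-z_][a-z0-9_]*)", re.IGNORECASE)


def check_sql(sql: str, allowed: frozenset = ALLOWED) -> list[str]:
    """Return a list of violations; an empty list means the query passed."""
    problems = []
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        problems.append("multiple statements are not allowed")
    if not re.match(r"select\b", statement, re.IGNORECASE):
        problems.append("query must start with SELECT")
    tables = {name.lower() for name in TABLE_REF.findall(statement)}
    if not tables:
        problems.append("no table reference found")
    unknown = sorted(tables - allowed)
    if unknown:
        problems.append(f"tables not on the allowlist: {unknown}")
    return problems


ok = check_sql("SELECT name FROM customers JOIN orders ON customers.id = orders.customer_id;")
bad = check_sql("SELECT 1 FROM orders; DROP TABLE orders")
print(ok, bad)
```

Returning the list of problems rather than a boolean gives the retry prompt a specific message to show the model.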

Saving, Loading, and Serving Compiled Programs

A compiled DSPy program serializes its optimized state (few-shot demonstrations, instruction text, and module configuration) to JSON. This JSON file is your versioned prompt artifact: commit it to Git alongside your code, load it in CI for evaluation gating, and deploy it with your application without a compilation step at startup.

import os

import dspy

# Save a compiled program
compiled_program.save("artifacts/support_router_v2.json")

# Load it back — must use the same Program class
program = SupportTicketRouter()
program.load("artifacts/support_router_v2.json")

# Inspect what the optimizer found
for name, module in program.named_predictors():
    print(f"\n=== {name} ===")
    print(f"Instructions: {module.signature.instructions}")
    print(f"Demonstrations: {len(module.demos)}")
    for i, demo in enumerate(module.demos):
        print(f"  Demo {i}: {list(demo.keys())}")

# FastAPI serving — load once at startup, reuse across requests
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel as PydanticBase

app = FastAPI()
_program: SupportTicketRouter | None = None

@app.on_event("startup")
async def load_program():
    global _program
    lm = dspy.LM(model="openai/gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])
    dspy.configure(lm=lm)
    _program = SupportTicketRouter()
    _program.load("artifacts/support_router_v2.json")

class TicketRequest(PydanticBase):
    ticket_text: str
    customer_tier: str = "free"

# Sync handler: FastAPI runs it in a threadpool, so the blocking LLM call does not stall the event loop
@app.post("/classify")
def classify_ticket(req: TicketRequest):
    if _program is None:
        raise HTTPException(503, "Program not loaded")
    result = _program(ticket_text=req.ticket_text, customer_tier=req.customer_tier)
    return {
        "intent": result.intent,
        "urgency": result.urgency,
        "recommended_action": result.recommended_action,
        "escalate": result.escalate,
    }

# CI evaluation gate: run before merging a new compiled artifact
# ci/evaluate_program.py
# SupportTicketRouter, answer_exact_match, and test_set are assumed to be importable from your project

import os
import sys

import dspy

lm = dspy.LM(model="openai/gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])
dspy.configure(lm=lm)

program = SupportTicketRouter()
program.load("artifacts/support_router_v2.json")

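evaluator = dspy.Evaluate(devset=test_set, metric=answer_exact_match, num_threads=4)
result = evaluator(program)
# Evaluate reports a percentage on a 0-100 scale; newer releases wrap it in a result object with .score
score = getattr(result, "score", result)

print(f"Evaluation score: {score:.1f}%")
THRESHOLD = 82.0  # percent, on the same 0-100 scale as the evaluator
if score < THRESHOLD:
    print(f"FAIL: score {score:.1f}% below threshold {THRESHOLD:.1f}%")
    sys.exit(1)
print(f"PASS: score {score:.1f}% meets threshold")
sys.exit(0)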

Production Checklist

1. Version your compiled program JSON artifacts in Git alongside your code. The JSON file is your prompt artifact: it records the optimized instructions and demonstrations that produced a measured accuracy score. Without versioning, you lose the ability to roll back to a known-good prompt state when a model update or distribution shift degrades performance.

2. Separate training, development, and test splits strictly. The optimizer sees trainset, evaluates candidates on devset, and you report final numbers on testset; never use testset during compilation. Contaminating testset with devset examples inflates reported metrics and hides real-world performance gaps.

3. Run MIPROv2 with temperature=0.0 on the student LM for reproducibility. Stochastic sampling during bootstrap means the same trainset produces different demonstrations on each run. Deterministic compilation lets you compare results across runs, identify the true impact of dataset or configuration changes, and debug regressions.

4. Use a cheaper LM for optimization and a capable LM for the final program. BootstrapFewShot and MIPROv2 make hundreds of LLM calls per compilation. Running the optimizer against GPT-4o is 10-20x more expensive than gpt-4o-mini for equivalent optimization quality. Configure them separately: set the optimizer's LM budget on a cost-effective model and test the compiled program on your target production model.

5. Design metrics that proxy real business outcomes, not surface-level string matching. Exact match is only valid when answers are deterministic strings. For generation tasks, use LLM-as-judge metrics with explicit rubrics, or domain-specific metrics (F1 over extracted entities, valid JSON parse rate, SQL execution success). A metric that does not correlate with user satisfaction produces optimized-but-useless programs.

6. Start with 50-200 labeled examples for BootstrapFewShot, 200-500 for MIPROv2. Below 50, the optimizer has insufficient signal to select good demonstrations. Above 2000, the compilation cost grows without proportional accuracy gain; use a stratified sample of your distribution instead of the full dataset.

7. Use TypedPredictor for any output that feeds into downstream code. String outputs fail silently: a field that should be a float returns 'N/A' and your aggregation crashes at 3am. Pydantic validation with max_retries=3 and a correction prompt catches 90%+ of formatting failures without raising exceptions in production.

8. Add dspy.Assert for security-critical constraints, dspy.Suggest for quality constraints. Assert-based backtracking costs additional LLM calls per failed constraint. Use Assert sparingly for constraints that must never be violated (no DML SQL, no PII in responses). Use Suggest for constraints where a soft warning is sufficient and throughput matters.

9. Recompile when the base model changes. Compiled demonstrations are selected for a specific model's failure modes; demonstrations that patch GPT-4o weaknesses do not transfer to Claude or a fine-tuned model. Treat model upgrades as a trigger for recompilation, not just a configuration change.

10. Monitor accuracy drift in production by sampling predictions and scoring them with your metric. Compiled programs degrade as input distributions shift. A weekly evaluation job scoring 200 sampled production requests against ground truth (when available) or an LLM judge gives early warning before accuracy drops below threshold. Automated recompilation on drift detection closes the loop.
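
The drift check in the last item can be reduced to two small functions: draw a reproducible sample of logged requests, then compare the sample's mean score to the baseline recorded at compile time. sample_requests, check_drift, and the 0.05 tolerance are assumptions made for this sketch; the scores are taken to be per-request metric values in [0, 1].

```python
import random
from statistics import mean


def sample_requests(logged: list[dict], n: int, seed: int = 0) -> list[dict]:
    """Draw up to n logged requests; a fixed seed keeps the sample reproducible."""
    if len(logged) <= n:
        return logged
    return random.Random(seed).sample(logged, n)


def check_drift(scores: list[float], baseline: float, tolerance: float = 0.05) -> dict:
    """Flag drift when the mean score falls more than `tolerance` below baseline."""
    current = mean(scores)
    return {
        "current": current,
        "baseline": baseline,
        "drifted": current < baseline - tolerance,
    }


logged = [{"id": i} for i in range(10)]
weekly_sample = sample_requests(logged, 3)
report = check_drift([1.0, 1.0, 0.0, 1.0], baseline=0.82)
print(len(weekly_sample), report)
```

A tolerance band avoids triggering recompilation on ordinary sampling noise; with only a few hundred scored requests per week, small dips below the baseline are expected.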

Spending engineering cycles hand-tuning prompt strings that break silently on model updates, unable to reproduce why accuracy was higher last month, or your LLM pipeline works in the notebook but degrades in production?

We design and implement DSPy-based LLM programs — from Signature and Module architecture design for your specific task, through training and development set construction with domain-appropriate metrics, BootstrapFewShot and MIPROv2 compilation with cost-optimized LM selection for the optimizer versus the student program, multi-hop RAG pipeline compilation with your retriever and corpus, TypedPredictor integration for Pydantic-validated structured outputs with retry backtracking, dspy.Assert constraint layers for security-critical output constraints, compiled JSON artifact versioning in Git with CI evaluation gates scoring against accuracy thresholds before deployment, FastAPI serving configuration loading compiled programs at startup, and production accuracy drift monitoring with automated recompilation triggers. Let’s talk.
