Why Hand-Written Prompts Break at Scale
A carefully crafted prompt that achieves 78% accuracy on your test set today will degrade silently when the underlying model is updated, the input distribution shifts, or a colleague edits the wording for clarity. Prompt engineering has no feedback loop — you iterate manually, there is no reproducibility guarantee, and the optimisation is invisible to version control beyond a string diff. The result is LLM applications that are fragile by default.
DSPy, developed at Stanford NLP, reframes the problem. Instead of writing prompts, you declare the task as a typed Signature: a mapping from named input fields to named output fields with optional descriptions. Instead of prompt strings, you compose Modules that turn Signatures into parameterised LLM calls. And instead of tuning manually, you run a Compiler (optimizer) that searches over prompt instructions and few-shot examples to maximise a metric you define. Traditional enterprise prompt engineering focuses on structured output schemas, tool use, and guardrail design; DSPy adds a systematic search layer on top, turning prompting into an optimization problem with measurable progress.
Declarative
Define what you want (Signature), not how to say it (prompt). DSPy infers the prompt format, instruction text, and few-shot examples automatically during compilation.
Composable
Build complex pipelines by chaining Modules — Predict, ChainOfThought, ReAct, Retrieve — each with its own Signature. Swap components without rewriting downstream logic.
Optimizable
Compilers like BootstrapFewShot and MIPROv2 search for the best prompt instructions and demonstrations for your metric. Recompile when the model or task changes.
Installation and LM Configuration
DSPy is installed from PyPI and requires Python 3.10+. It supports all major LLM providers through a unified dspy.LM interface backed by LiteLLM, so the same code runs against OpenAI, Anthropic, local Ollama models, or any OpenAI-compatible endpoint by changing a single model string.
pip install dspy
# Optional: for local Ollama models
pip install dspy[ollama]

import dspy
import os

# OpenAI
lm = dspy.LM(
    model="openai/gpt-4o-mini",
    api_key=os.environ["OPENAI_API_KEY"],
    max_tokens=2048,
    temperature=0.0,  # deterministic for optimization passes
)

# Anthropic Claude
lm = dspy.LM(
    model="anthropic/claude-sonnet-4-6",
    api_key=os.environ["ANTHROPIC_API_KEY"],
    max_tokens=4096,
)

# Local Ollama (no API key required)
lm = dspy.LM(
    model="ollama/llama3.2",
    api_base="http://localhost:11434",
    max_tokens=2048,
)

# Set globally: all modules use this unless overridden
dspy.configure(lm=lm)

# Scoped override for a different temperature profile:
# modules called inside the block use sampling_lm instead of the global LM
sampling_lm = dspy.LM(
    model="openai/gpt-4o-mini",
    api_key=os.environ["OPENAI_API_KEY"],
    temperature=0.7,
)
with dspy.context(lm=sampling_lm):
    pass  # call your modules here

Note
Set temperature=0.0 during compilation (optimization passes) to make bootstrap example selection deterministic. Once compilation is complete and you have a saved program, restore the temperature appropriate for your production use case. Stochastic temperatures during optimization cause the compiler to make inconsistent choices between iterations.

Signatures: Declarative I/O Specifications
A Signature is the type contract of an LLM operation. It specifies what fields go in, what fields come out, and optionally describes what each field contains. DSPy uses signatures to generate the prompt structure — field names, descriptions, and formatting — without you writing a single prompt word. Signatures can be defined inline as a string (for simple cases) or as a class (for production use with field descriptions).
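To make the "inputs -> outputs" contract concrete, here is a minimal standard-library sketch of how such a string can be split into named fields. It is an illustration only, not DSPy's parser: `ParsedSignature` and `parse_signature` are hypothetical names, and type annotations after a colon are simply dropped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedSignature:
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]


def parse_signature(spec: str) -> ParsedSignature:
    """Split an "inputs -> outputs" string into field names (types after ':' are dropped)."""
    if spec.count("->") != 1:
        raise ValueError(f"expected exactly one '->' in {spec!r}")
    left, right = spec.split("->")

    def names(side: str) -> tuple[str, ...]:
        # Naive comma split: good enough for simple types, not for nested generics
        return tuple(part.split(":")[0].strip() for part in side.split(",") if part.strip())

    parsed = ParsedSignature(inputs=names(left), outputs=names(right))
    if not parsed.inputs or not parsed.outputs:
        raise ValueError("a signature needs at least one input and one output")
    return parsed


sig = parse_signature("context, question -> answer: str, confidence: float")
print(sig.inputs)   # ('context', 'question')
print(sig.outputs)  # ('answer', 'confidence')
```

The field names are what end up as labels in the generated prompt, which is why descriptive names matter even before any descriptions are added.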
import dspy

# Inline string signature: "inputs -> outputs"
# Field names become prompt labels; commas separate multiple fields
quick_qa = dspy.Predict("question -> answer")

# Class-based signature: adds a task docstring and per-field descriptions
class BasicQA(dspy.Signature):
    """Answer questions with short factual responses."""

    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="A concise, factual answer in one sentence.")

# Multi-input, multi-output signature
class SentimentAndSummary(dspy.Signature):
    """Analyze a product review for sentiment and key points."""

    review_text: str = dspy.InputField(desc="Raw product review from the customer.")
    product_category: str = dspy.InputField(desc="Product category (e.g. Electronics, Clothing).")
    sentiment: str = dspy.OutputField(desc="One of: positive, negative, neutral.")
    summary: str = dspy.OutputField(desc="Two-sentence summary of the main points raised.")
    confidence: float = dspy.OutputField(desc="Confidence score from 0.0 to 1.0.")

# Classification with a constrained output
class IntentClassifier(dspy.Signature):
    """Classify customer support message intent."""

    message: str = dspy.InputField()
    intent: str = dspy.OutputField(
        desc="One of: billing, technical_support, account_management, general_inquiry, complaint"
    )
    urgency: str = dspy.OutputField(desc="One of: low, medium, high, critical")

Modules: Predict, ChainOfThought, and ReAct
Modules are parameterised LLM calls that take a Signature and implement a strategy for answering it. The core modules cover most production use cases: Predict for direct completion, ChainOfThought for step-by-step reasoning, and ReAct for tool-using agents. Each module's internal prompt is adjusted by the compiler during optimization; you never edit the prompt string directly.
import dspy

# --- Predict: direct completion ---
# Generates the output fields directly from the input
predict = dspy.Predict(BasicQA)
result = predict(question="What is the capital of France?")
print(result.answer)  # e.g. "The capital of France is Paris."

# --- ChainOfThought: adds a reasoning field before the final output ---
# Asks the model to reason step by step, which tends to help on multi-step tasks
cot = dspy.ChainOfThought(BasicQA)
result = cot(question="If a train leaves at 9am going 60mph and another at 10am going 80mph, when do they meet?")
print(result.reasoning)  # step-by-step working (named `rationale` in older DSPy releases)
print(result.answer)

# --- Guiding reasoning with a hint: declare the hint as an extra input field ---
class QAWithHint(dspy.Signature):
    question: str = dspy.InputField()
    hint: str = dspy.InputField(desc="A partial clue toward the answer.")
    answer: str = dspy.OutputField()

cot_hint = dspy.ChainOfThought(QAWithHint)

# --- ReAct: interleave reasoning and tool actions ---
# Define tools as plain Python functions with docstrings
def search_wikipedia(query: str) -> str:
    """Search Wikipedia for factual information. Returns a short passage."""
    import wikipedia  # third-party package: pip install wikipedia

    try:
        return wikipedia.summary(query, sentences=3)
    except Exception:
        return "No results found."

def calculator(expression: str) -> str:
    """Evaluate an arithmetic expression. Returns the numeric result."""
    try:
        # Empty builtins limit, but do not eliminate, what eval can reach;
        # use a real expression parser for untrusted input
        result = eval(expression, {"__builtins__": {}})
        return str(result)
    except Exception as e:
        return f"Error: {e}"

class ResearchAssistant(dspy.Signature):
    """Answer research questions using search and calculation when needed."""

    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="A well-sourced, accurate answer.")

react = dspy.ReAct(ResearchAssistant, tools=[search_wikipedia, calculator], max_iters=5)
result = react(question="What is the GDP per capita of Norway multiplied by 3?")
print(result.answer)

Composing Modules into Programs
Real LLM applications chain multiple operations: retrieve context, summarise it, classify intent, generate a response. DSPy programs are Python classes that extend dspy.Module, declare their sub-modules as instance attributes (so the compiler can discover and optimize them), and implement the logic in forward(). The compiler treats each declared sub-module as an independent optimization target.
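Why sub-modules must be instance attributes is easier to see in a stripped-down model. The sketch below is a simplified illustration, not DSPy's implementation: `Predictor`, `Program`, and `Router` are hypothetical stand-ins, and discovery is just a walk over the instance's attributes.

```python
class Predictor:
    """Stand-in for a parameterised LLM call; holds the state an optimizer would tune."""

    def __init__(self, signature: str):
        self.signature = signature
        self.demos: list[dict] = []


class Program:
    """Stand-in for a composed program that exposes its predictors by attribute name."""

    def named_predictors(self) -> list[tuple[str, Predictor]]:
        found = []
        for name, value in vars(self).items():
            if isinstance(value, Predictor):
                found.append((name, value))
            elif isinstance(value, Program):
                # Recurse into nested programs and prefix their attribute path
                found.extend((f"{name}.{sub}", p) for sub, p in value.named_predictors())
        return found


class Router(Program):
    def __init__(self):
        self.classify = Predictor("message -> intent, urgency")
        self.recommend = Predictor("ticket, intent, urgency -> recommended_action")
        self.max_len = 500  # plain config, not a predictor, so it is skipped

    def forward(self, message: str) -> str:
        helper = Predictor("message -> summary")  # local variable: invisible to discovery
        return helper.signature


names = [name for name, _ in Router().named_predictors()]
print(names)  # ['classify', 'recommend']
```

A predictor created inside forward() never shows up in the walk, so in this model an optimizer would have nothing to attach demonstrations to.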
import dspy

class SupportTicketRouter(dspy.Module):
    """Classify a support ticket and generate a recommended action."""

    def __init__(self):
        super().__init__()
        # Declare sub-modules as instance attributes so the compiler finds them
        self.classify = dspy.Predict(IntentClassifier)
        self.generate_action = dspy.ChainOfThought(ActionRecommender)

    def forward(self, ticket_text: str, customer_tier: str) -> dspy.Prediction:
        # Step 1: classify intent and urgency
        classification = self.classify(message=ticket_text)
        # Step 2: generate recommended action using classification as context
        action = self.generate_action(
            ticket=ticket_text,
            intent=classification.intent,
            urgency=classification.urgency,
            customer_tier=customer_tier,
        )
        return dspy.Prediction(
            intent=classification.intent,
            urgency=classification.urgency,
            recommended_action=action.recommended_action,
            escalate=action.escalate,
        )

# Referenced in __init__ above; it only needs to exist before the router is instantiated
class ActionRecommender(dspy.Signature):
    """Recommend a support action based on ticket classification."""

    ticket: str = dspy.InputField()
    intent: str = dspy.InputField()
    urgency: str = dspy.InputField()
    customer_tier: str = dspy.InputField(desc="One of: free, pro, enterprise")
    recommended_action: str = dspy.OutputField(desc="Specific next step for the support agent.")
    escalate: bool = dspy.OutputField(desc="True if this requires manager escalation.")

# Use the program
router = SupportTicketRouter()
result = router(
    ticket_text="My production API is returning 500 errors for all requests since 2pm.",
    customer_tier="enterprise",
)
print(result.intent)  # e.g. technical_support
print(result.urgency)  # e.g. critical
print(result.recommended_action)
print(result.escalate)  # e.g. True

Evaluation Datasets and Metrics
Optimization requires a labelled dataset and a metric function. The dataset is a list of dspy.Example objects — each containing input fields and optionally a ground-truth label used for metric computation. The same principles that apply to general LLM evaluation — golden datasets, metric design, and regression testing — apply directly to DSPy: a metric is a Python callable that receives a prediction and an example and returns a score between 0 and 1. The optimizer maximises the average metric score across your development set.
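As a sketch of a metric that is less brittle than exact matching, the function below computes token-overlap F1 using the `(example, pred, trace=None)` calling convention described above. It uses only the standard library; `SimpleNamespace` objects stand in for `dspy.Example` and `dspy.Prediction`, and `token_f1_metric` is a hypothetical name.

```python
from collections import Counter
from types import SimpleNamespace


def token_f1_metric(example, pred, trace=None) -> float:
    """Token-overlap F1 between the gold answer and the predicted answer, in [0, 1]."""
    gold = example.answer.lower().split()
    guess = pred.answer.lower().split()
    if not gold or not guess:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(gold) & Counter(guess)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(guess)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


example = SimpleNamespace(question="Who wrote Hamlet?", answer="William Shakespeare")
exact = SimpleNamespace(answer="William Shakespeare")
partial = SimpleNamespace(answer="Shakespeare wrote it")
wrong = SimpleNamespace(answer="Christopher Marlowe")

print(token_f1_metric(example, exact))    # 1.0
print(token_f1_metric(example, partial))  # roughly 0.4
print(token_f1_metric(example, wrong))    # 0.0
```

Partial credit gives an optimizer a smoother signal than a pass/fail check, at the cost of rewarding answers that share words without being correct.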
import dspy

# Build a dataset of Examples
# with_inputs() declares which fields are inputs (vs. labels)
trainset = [
    dspy.Example(
        question="What year did the Berlin Wall fall?",
        answer="The Berlin Wall fell in 1989.",
    ).with_inputs("question"),
    dspy.Example(
        question="Who wrote Hamlet?",
        answer="Hamlet was written by William Shakespeare.",
    ).with_inputs("question"),
    # ... 50-200 examples recommended for BootstrapFewShot
]
devset = trainset[:20]  # held-out set for metric evaluation
trainset = trainset[20:]  # training set for bootstrap

# --- Metric function ---
# Must return a float in [0, 1] or bool
# The optimizer maximises the average over the devset
def answer_exact_match(example: dspy.Example, pred: dspy.Prediction, trace=None) -> bool:
    """Lenient match: true if either answer string contains the other."""
    expected = example.answer.lower()
    predicted = pred.answer.lower()
    # Simple substring check; replace with your domain-specific evaluation
    return expected in predicted or predicted in expected

# For generation tasks: LLM-as-judge metric
judge_sig = dspy.Signature(
    "question, expected_answer, predicted_answer -> correct: bool, reason: str"
)
judge = dspy.ChainOfThought(judge_sig)

def llm_judge_metric(example: dspy.Example, pred: dspy.Prediction, trace=None) -> float:
    judgment = judge(
        question=example.question,
        expected_answer=example.answer,
        predicted_answer=pred.answer,
    )
    return float(judgment.correct)

# Run evaluation before optimization (baseline)
evaluator = dspy.Evaluate(devset=devset, metric=answer_exact_match, num_threads=4)
baseline_program = dspy.Predict(BasicQA)
# Evaluate reports the average metric as a percentage (0-100);
# float() also unwraps the result object returned by newer releases
baseline_score = float(evaluator(baseline_program))
print(f"Baseline accuracy: {baseline_score:.1f}%")

Compilers: BootstrapFewShot and MIPROv2
DSPy compilers (optimizers) take a program and a metric, then search for better prompt parameters — instruction text, few-shot demonstrations, or both. BootstrapFewShot is the fastest option: it runs the program on training examples, collects successful traces, and uses them as demonstrations in the prompt. MIPROv2 is the recommended production optimizer — it additionally searches over instruction text using a meta-program that generates and evaluates candidate instructions.
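The bootstrap idea described above (run the program on training inputs, keep the traces the metric accepts, reuse them as demonstrations) can be sketched without any LLM. This is a simplified illustration under stated assumptions, not BootstrapFewShot itself: `bootstrap_demos`, `toy_program`, and the dict-based examples are hypothetical stand-ins.

```python
from typing import Callable

Example = dict  # hypothetical stand-in for a labelled example: {"question": ..., "answer": ...}


def bootstrap_demos(
    program: Callable[[str], str],
    trainset: list[Example],
    metric: Callable[[Example, str], bool],
    max_demos: int = 4,
) -> list[Example]:
    """Run the program on training inputs and keep the outputs the metric accepts."""
    demos: list[Example] = []
    for example in trainset:
        if len(demos) >= max_demos:
            break
        prediction = program(example["question"])
        if metric(example, prediction):
            # A passing trace becomes a few-shot demonstration
            demos.append({"question": example["question"], "answer": prediction})
    return demos


def toy_program(question: str) -> str:
    """Lookup table standing in for an LLM call; it gets one answer wrong on purpose."""
    canned = {"2+2?": "4", "Capital of France?": "Paris", "3*3?": "6"}
    return canned.get(question, "unknown")


def exact(example: Example, prediction: str) -> bool:
    return example["answer"] == prediction


train = [
    {"question": "2+2?", "answer": "4"},
    {"question": "3*3?", "answer": "9"},  # toy_program fails here, so no demo is kept
    {"question": "Capital of France?", "answer": "Paris"},
]

demos = bootstrap_demos(toy_program, train, exact, max_demos=2)
print(demos)
```

Only outputs that pass the metric are kept, which is why the quality of the metric bounds the quality of the demonstrations.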
import dspy
from dspy.teleprompt import BootstrapFewShot, MIPROv2

program = SupportTicketRouter()

# The router predicts intent and urgency rather than an answer, so it needs its
# own metric and labelled tickets. ticket_trainset and ticket_devset are assumed
# here: lists of dspy.Example(ticket_text=..., customer_tier=..., intent=..., urgency=...)
# built with .with_inputs("ticket_text", "customer_tier")
def router_metric(example: dspy.Example, pred: dspy.Prediction, trace=None) -> bool:
    return pred.intent == example.intent and pred.urgency == example.urgency

router_evaluator = dspy.Evaluate(devset=ticket_devset, metric=router_metric, num_threads=4)

# --- BootstrapFewShot ---
# Fast and cheap: runs the program on training examples to collect passing traces,
# with no instruction search
# Good for: first iteration, tight cost budgets, simple programs
bootstrap_optimizer = BootstrapFewShot(
    metric=router_metric,
    max_bootstrapped_demos=4,  # max few-shot examples per module
    max_labeled_demos=4,  # max labeled examples from trainset
    max_rounds=1,  # number of bootstrap iterations
)
compiled_program = bootstrap_optimizer.compile(
    student=program,
    trainset=ticket_trainset,
)

# Evaluate after optimization (scores are percentages, 0-100)
score = float(router_evaluator(compiled_program))
print(f"After BootstrapFewShot: {score:.1f}%")

# --- MIPROv2 ---
# Slower, more thorough: searches instruction space + demonstrations
# Recommended for production; typically 5-15% accuracy gains over bootstrap
mipro_optimizer = MIPROv2(
    metric=router_metric,
    auto="medium",  # "light" | "medium" | "heavy": controls the search budget
    num_threads=8,  # parallel candidate evaluation
    verbose=True,  # show optimization progress
)
# With auto set, the trial budget is chosen for you; some DSPy releases raise an
# error if num_trials is passed alongside it, so it is omitted here
compiled_program_mipro = mipro_optimizer.compile(
    student=program,
    trainset=ticket_trainset,
    valset=ticket_devset,
    max_bootstrapped_demos=3,
    max_labeled_demos=4,
    requires_permission_to_run=False,  # skip interactive confirmation
)
mipro_score = float(router_evaluator(compiled_program_mipro))
print(f"After MIPROv2: {mipro_score:.1f}%")

# Save the compiled program: stores optimized prompts as JSON
compiled_program_mipro.save("support_router_optimized.json")

# Load it back
restored = SupportTicketRouter()
restored.load("support_router_optimized.json")

Note
dspy.configure(lm=...) inside the optimizer call if needed.

Building and Optimizing a RAG Pipeline
DSPy's most common production pattern is an optimized Retrieval-Augmented Generation pipeline. The retriever is wrapped as a DSPy module, the reader is a ChainOfThought module, and the compiler tunes both the query generation and answer synthesis prompts for your specific document corpus and question distribution. The retrieval quality fundamentals (chunking strategy, embedding choice, and index type) still apply: DSPy optimizes the prompts around whatever retriever you plug in, but a poor retriever still produces poor context.
import dspy
from dspy.teleprompt import BootstrapFewShot
# Import path used by DSPy 2.x retriever adapters; requires the chromadb package
from dspy.retrieve.chromadb_rm import ChromadbRM  # or any retriever adapter

# Configure retriever (Chroma example)
# embedding_function is left unset so the adapter falls back to its default
retriever = ChromadbRM(
    collection_name="knowledge_base",
    persist_directory="./chroma_db",
    k=5,  # top-k passages per query
)
dspy.configure(rm=retriever)

# Signature for the reader
class GenerateAnswer(dspy.Signature):
    """Answer the question using only information from the retrieved context."""

    context: list[str] = dspy.InputField(desc="Relevant passages retrieved from the knowledge base.")
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="A precise answer citing specific facts from the context.")
    citations: str = dspy.OutputField(desc="Which passage(s) support the answer (quote key phrases).")

# Multi-hop RAG: retrieve, refine query, retrieve again, answer
class MultiHopRAG(dspy.Module):
    def __init__(self, num_passages: int = 5, hops: int = 2):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        # Query refinement: generate a better search query based on initial context
        self.refine_query = dspy.ChainOfThought("context, question -> search_query")
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
        self.hops = hops

    def forward(self, question: str) -> dspy.Prediction:
        context = []
        for hop in range(self.hops):
            if hop == 0:
                # First hop: use the raw question
                passages = self.retrieve(question).passages
            else:
                # Subsequent hops: refine the search query based on gathered context
                refined = self.refine_query(
                    context=context,
                    question=question,
                )
                passages = self.retrieve(refined.search_query).passages
            context.extend(passages)
            # Deduplicate while preserving order
            seen = set()
            context = [p for p in context if not (p in seen or seen.add(p))]
        result = self.generate_answer(context=context, question=question)
        return dspy.Prediction(answer=result.answer, citations=result.citations, context=context)

# Metric: answer matches ground truth AND cites context
def rag_metric(example, pred, trace=None):
    answer_correct = example.answer.lower() in pred.answer.lower()
    has_citation = bool(pred.citations and len(pred.citations) > 20)
    return float(answer_correct) * 0.8 + float(has_citation) * 0.2

# Compile the multi-hop RAG
# rag_trainset is assumed: dspy.Example(question=..., answer=...).with_inputs("question")
rag = MultiHopRAG(num_passages=5, hops=2)
optimizer = BootstrapFewShot(metric=rag_metric, max_bootstrapped_demos=3)
compiled_rag = optimizer.compile(rag, trainset=rag_trainset)
compiled_rag.save("multihop_rag.json")

TypedPredictor: Structured Outputs with Pydantic
For production systems that pipe LLM outputs into downstream code, untyped string outputs are fragile. DSPy's TypedPredictor validates outputs against Pydantic models at inference time and retries on validation failure — up to max_retries — feeding the validation error back to the model as a correction prompt.
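The validate-then-retry loop described above can be sketched with the standard library alone. This is an illustration of the pattern, not TypedPredictor's implementation: `validate`, `predict_with_retries`, and `fake_lm` are hypothetical names, hand-written checks stand in for a Pydantic model, and the "model" is a scripted function that returns two bad replies before a good one.

```python
import json
from typing import Callable

ALLOWED_URGENCY = {"low", "medium", "high", "critical"}


def validate(raw: str) -> dict:
    """Parse raw model text and check the fields; raise ValueError with a readable reason."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError("output must be a JSON object")
    if data.get("urgency") not in ALLOWED_URGENCY:
        raise ValueError(f"urgency must be one of {sorted(ALLOWED_URGENCY)}")
    if not isinstance(data.get("requires_human"), bool):
        raise ValueError("requires_human must be true or false")
    return data


def predict_with_retries(lm: Callable[[str], str], prompt: str, max_retries: int = 3) -> dict:
    """Call lm, validate, and on failure re-ask with the error appended to the prompt."""
    attempt_prompt = prompt
    last_error = None
    for _ in range(max_retries + 1):  # one initial attempt plus max_retries retries
        raw = lm(attempt_prompt)
        try:
            return validate(raw)
        except ValueError as exc:
            last_error = exc
            attempt_prompt = (
                f"{prompt}\n\nPrevious output was rejected: {exc}\nReturn corrected JSON only."
            )
    raise ValueError(f"no valid output after {max_retries} retries: {last_error}")


# Scripted stand-in for an LLM: malformed text, then a bad enum value, then valid JSON
replies = iter([
    "urgency: high",
    '{"urgency": "urgent", "requires_human": true}',
    '{"urgency": "high", "requires_human": true}',
])
seen_prompts: list[str] = []


def fake_lm(prompt: str) -> str:
    seen_prompts.append(prompt)
    return next(replies)


result = predict_with_retries(fake_lm, "Analyze: invoice charged twice")
print(result)             # {'urgency': 'high', 'requires_human': True}
print(len(seen_prompts))  # 3
```

Each retry is an extra LLM call, so the retry budget is a latency and cost decision as much as a reliability one.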
import dspy
from pydantic import BaseModel, Field
from typing import Literal

class TicketAnalysis(BaseModel):
    intent: Literal["billing", "technical_support", "account_management", "complaint", "general"]
    urgency: Literal["low", "medium", "high", "critical"]
    sentiment_score: float = Field(ge=-1.0, le=1.0, description="Sentiment from -1 (negative) to 1 (positive)")
    key_issues: list[str] = Field(min_length=1, max_length=5)
    requires_human: bool

class TypedAnalyzer(dspy.Signature):
    """Analyze a customer support ticket and return structured JSON."""

    ticket: str = dspy.InputField()
    analysis: TicketAnalysis = dspy.OutputField()

# TypedPredictor wraps any Signature with Pydantic validation + retry
analyzer = dspy.TypedPredictor(TypedAnalyzer, max_retries=3)
result = analyzer(ticket="My invoice is wrong and I've been charged twice!")
analysis: TicketAnalysis = result.analysis
print(analysis.intent)  # e.g. "billing"
print(analysis.urgency)  # e.g. "high"
print(analysis.sentiment_score)  # e.g. -0.7
print(analysis.key_issues)  # e.g. ["double charge", "incorrect invoice"]
print(analysis.requires_human)  # e.g. True

# Also works with ChainOfThought for reasoning before structured output
typed_cot = dspy.TypedChainOfThought(TypedAnalyzer, max_retries=3)
result = typed_cot(ticket="System down, 500 errors for all enterprise users")
print(result.analysis.urgency)  # e.g. "critical"

Assertions and Suggestions: Constraining Outputs
Some constraints can't be captured by a Pydantic type — for example, ensuring a generated SQL query only selects from allowed tables, or that a summary is under 100 words. DSPy provides two runtime constraint mechanisms: dspy.Assert raises an exception and forces a retry with the constraint violation fed back to the model; dspy.Suggest does the same but degrades gracefully — it logs the violation and returns the (possibly violating) output rather than raising.
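For the allowed-tables example, the check itself can be an ordinary function whose messages are then handed to a constraint mechanism. The sketch below is a deliberately naive, regex-based allowlist check using only the standard library; `check_sql` and `TABLE_REF` are hypothetical names, and a real deployment would more likely use a SQL parser, since constructs such as `EXTRACT(month FROM col)` would confuse this pattern.

```python
import re

ALLOWED_TABLES = {"orders", "customers", "products", "inventory"}
# Naive pattern: captures the identifier that follows FROM or JOIN
TABLE_REF = re.compile(r"\b(?:from|join)\s+([a-z_][a-z0-9_]*)", re.IGNORECASE)


def check_sql(sql: str) -> list[str]:
    """Return a list of constraint violations; an empty list means the query passes."""
    problems = []
    stripped = sql.strip()
    if not stripped.lower().startswith("select"):
        problems.append("query must start with SELECT")
    if ";" in stripped.rstrip(";"):
        # A semicolon anywhere but the very end suggests stacked statements
        problems.append("multiple statements are not allowed")
    referenced = {name.lower() for name in TABLE_REF.findall(stripped)}
    unknown = sorted(referenced - ALLOWED_TABLES)
    if unknown:
        problems.append(f"tables outside the allowlist: {unknown}")
    return problems


ok = "SELECT c.name, SUM(o.total) FROM orders o JOIN customers c ON c.id = o.customer_id GROUP BY c.name"
print(check_sql(ok))                              # []
print(check_sql("SELECT * FROM users_pii"))       # ["tables outside the allowlist: ['users_pii']"]
print(check_sql("DROP TABLE orders"))             # ['query must start with SELECT']
print(check_sql("SELECT 1; DROP TABLE orders;"))  # ['multiple statements are not allowed']
```

An allowlist fails closed: a table nobody thought to forbid is still rejected, which a short blocklist cannot guarantee.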
import dspy

ALLOWED_TABLES = {"orders", "customers", "products", "inventory"}

class SQLQuerySig(dspy.Signature):
    """Generate a read-only SQL SELECT query for the given request."""

    request: str = dspy.InputField(desc="Natural language data request.")
    allowed_tables: str = dspy.InputField(desc="Comma-separated list of tables available.")
    sql: str = dspy.OutputField(desc="A valid SQL SELECT statement.")

class SafeSQLGenerator(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought(SQLQuerySig)

    def forward(self, request: str) -> dspy.Prediction:
        result = self.generate(
            request=request,
            # sorted() keeps the prompt stable; set iteration order is not guaranteed
            allowed_tables=", ".join(sorted(ALLOWED_TABLES)),
        )
        sql = result.sql.strip()
        # Hard constraint: fail and retry if violated
        dspy.Assert(
            sql.lower().startswith("select"),
            "Query must be a SELECT statement. No INSERT, UPDATE, DELETE, or DROP allowed.",
        )
        # Check that no restricted tables are referenced
        sql_lower = sql.lower()
        disallowed = [t for t in ("users_pii", "payment_methods", "audit_log") if t in sql_lower]
        dspy.Assert(
            len(disallowed) == 0,
            f"Query references restricted tables: {disallowed}. Use only: {sorted(ALLOWED_TABLES)}",
        )
        # Soft constraint: warn but don't block
        dspy.Suggest(
            len(sql) < 2000,
            "Query is very long. Consider simplifying or breaking into CTEs.",
        )
        return dspy.Prediction(sql=sql)

# Enable assert-based backtracking (up to 3 retries on assertion failures)
generator = dspy.assert_transform_module(SafeSQLGenerator(), max_backtracks=3)
result = generator(request="Show me top 10 customers by order value this month")
print(result.sql)

Saving, Loading, and Serving Compiled Programs
A compiled DSPy program serializes its optimized state — few-shot demonstrations, instruction text, and module configuration — to JSON. This JSON artifact is your versioned prompt artifact: commit it to Git alongside your code, load it in CI for evaluation gating, and deploy it with your application without a compilation step at startup.
import os

import dspy

# Save a compiled program
compiled_program.save("artifacts/support_router_v2.json")

# Load it back: must use the same Program class
program = SupportTicketRouter()
program.load("artifacts/support_router_v2.json")

# Inspect what the optimizer found
for name, module in program.named_predictors():
    print(f"\n=== {name} ===")
    print(f"Instructions: {module.signature.instructions}")
    print(f"Demonstrations: {len(module.demos)}")
    for i, demo in enumerate(module.demos):
        print(f"  Demo {i}: {list(demo.keys())}")

# FastAPI serving: load once at startup, reuse across requests
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel as PydanticBase

app = FastAPI()
_program: SupportTicketRouter | None = None

@app.on_event("startup")
async def load_program():
    global _program
    lm = dspy.LM(model="openai/gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])
    dspy.configure(lm=lm)
    _program = SupportTicketRouter()
    _program.load("artifacts/support_router_v2.json")

class TicketRequest(PydanticBase):
    ticket_text: str
    customer_tier: str = "free"

@app.post("/classify")
async def classify_ticket(req: TicketRequest):
    if _program is None:
        raise HTTPException(503, "Program not loaded")
    result = _program(ticket_text=req.ticket_text, customer_tier=req.customer_tier)
    return {
        "intent": result.intent,
        "urgency": result.urgency,
        "recommended_action": result.recommended_action,
        "escalate": result.escalate,
    }

# CI evaluation gate: run before merging a new compiled artifact
# ci/evaluate_program.py
import os
import sys

import dspy

lm = dspy.LM(model="openai/gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])
dspy.configure(lm=lm)
program = SupportTicketRouter()
program.load("artifacts/support_router_v2.json")

# test_set is assumed: held-out dspy.Example tickets labelled with intent and urgency
def router_metric(example, pred, trace=None) -> bool:
    return pred.intent == example.intent and pred.urgency == example.urgency

evaluator = dspy.Evaluate(devset=test_set, metric=router_metric, num_threads=4)
score = float(evaluator(program))  # percentage on a 0-100 scale
print(f"Evaluation score: {score:.1f}%")

THRESHOLD = 82.0  # percent
if score < THRESHOLD:
    print(f"FAIL: score {score:.1f}% below threshold {THRESHOLD:.1f}%")
    sys.exit(1)
print(f"PASS: score {score:.1f}% meets threshold")
sys.exit(0)

Production Checklist
Version your compiled program JSON artifacts in Git alongside your code. The JSON file is your prompt artifact — it records the optimized instructions and demonstrations that produced a measured accuracy score. Without versioning, you lose the ability to roll back to a known-good prompt state when a model update or distribution shift degrades performance.
Separate training, development, and test splits strictly. The optimizer sees trainset, evaluates candidates on devset, and you report final numbers on testset — never use testset during compilation. Contaminating testset with devset examples inflates reported metrics and hides real-world performance gaps.
Run MIPROv2 with temperature=0.0 on the student LM for reproducibility. Stochastic sampling during bootstrap means the same trainset produces different demonstrations on each run. Deterministic compilation lets you compare results across runs, identify the true impact of dataset or configuration changes, and debug regressions.
Use a cheaper LM for optimization and a capable LM for the final program. BootstrapFewShot and MIPROv2 make hundreds of LLM calls per compilation. Running the optimizer against GPT-4o is 10-20x more expensive than gpt-4o-mini for equivalent optimization quality. Configure them separately: set the optimizer's LM budget on a cost-effective model and test the compiled program on your target production model.
Design metrics that proxy real business outcomes, not surface-level string matching. Exact match is only valid when answers are deterministic strings. For generation tasks, use LLM-as-judge metrics with explicit rubrics, or domain-specific metrics (F1 over extracted entities, valid JSON parse rate, SQL execution success). A metric that does not correlate with user satisfaction produces optimized-but-useless programs.
Start with 50-200 labeled examples for BootstrapFewShot, 200-500 for MIPROv2. Below 50, the optimizer has insufficient signal to select good demonstrations. Above 2000, the compilation cost grows without proportional accuracy gain — use a stratified sample of your distribution instead of the full dataset.
Use TypedPredictor for any output that feeds into downstream code. String outputs fail silently — a field that should be a float returns 'N/A' and your aggregation crashes at 3am. Pydantic validation with max_retries=3 and a correction prompt catches 90%+ of formatting failures without raising exceptions in production.
Add dspy.Assert for security-critical constraints, dspy.Suggest for quality constraints. Assert-based backtracking costs additional LLM calls per failed constraint. Use Assert sparingly for constraints that must never be violated (no DML SQL, no PII in responses). Use Suggest for constraints where a soft warning is sufficient and throughput matters.
Recompile when the base model changes. Compiled demonstrations are selected for a specific model's failure modes — demonstrations that patch GPT-4o weaknesses do not transfer to Claude or a fine-tuned model. Treat model upgrades as a trigger for recompilation, not just a configuration change.
Monitor accuracy drift in production by sampling predictions and scoring them with your metric. Compiled programs degrade as input distributions shift. A weekly evaluation job scoring 200 sampled production requests against ground truth (when available) or an LLM judge gives early warning before accuracy drops below threshold. Automated recompilation on drift detection closes the loop.
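The drift-monitoring item in the checklist above can be sketched as a small scheduled job. The code below is a minimal standard-library sketch under stated assumptions: `drift_check` is a hypothetical helper, each record is assumed to already carry a prediction and a reference label, and the scoring callable stands in for whatever metric (ground truth comparison or LLM judge) you use.

```python
import random
from statistics import mean
from typing import Callable, Sequence


def drift_check(
    records: Sequence[dict],
    score: Callable[[dict], float],
    threshold: float,
    sample_size: int = 200,
    seed: int = 0,
) -> tuple[float, bool]:
    """Score a random sample of production records; return (accuracy, drifted)."""
    if not records:
        raise ValueError("no records to evaluate")
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible across reruns
    if len(records) <= sample_size:
        sample = list(records)
    else:
        sample = rng.sample(list(records), sample_size)
    accuracy = mean(score(record) for record in sample)
    return accuracy, accuracy < threshold


def intent_match(record: dict) -> float:
    return 1.0 if record["predicted"] == record["expected"] else 0.0


# Nine correct predictions and one miss: 90% accuracy on this toy log
log = [{"expected": "billing", "predicted": "billing"}] * 9
log.append({"expected": "billing", "predicted": "complaint"})

print(drift_check(log, intent_match, threshold=0.82))  # (0.9, False)
print(drift_check(log, intent_match, threshold=0.95))  # (0.9, True)
```

A True flag would be the trigger for the recompilation step; the threshold and sample size here are placeholders to tune against your own traffic.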