Prompt Engineering for Production Systems
Moving beyond playground prompts to production-ready prompt engineering. Version control, testing strategies, and prompt optimization techniques.
Playground prompts don't survive production. The prompt that works perfectly in the ChatGPT interface often fails spectacularly when deployed at scale. Production prompt engineering is a discipline that combines software engineering rigor with iterative experimentation.
From Playground to Production
The Production Reality
What changes in production:
| Aspect | Playground | Production |
|---|---|---|
| Input variety | You control it | Users surprise you |
| Scale | Single requests | Thousands per minute |
| Cost | Free/cheap | Real budget impact |
| Reliability | "Good enough" | Must be consistent |
| Errors | You see them | Users see them |
Prompt Architecture
Structure prompts for maintainability:
interface PromptConfig {
name: string;
version: string;
systemPrompt: string;
userPromptTemplate: string;
outputSchema?: JSONSchema;
maxTokens: number;
temperature: number;
model: string;
}
const summarizationPrompt: PromptConfig = {
name: "document-summarizer",
version: "2.3.1",
systemPrompt: `You are a document summarizer for a legal tech platform.
Your summaries must be accurate, concise, and maintain legal terminology.
Never make up information not present in the source document.`,
userPromptTemplate: `Summarize the following document in {{maxWords}} words or fewer.
Focus on: {{focusAreas}}
Document:
{{document}}
Output format: {{format}}`,
outputSchema: {
type: "object",
properties: {
summary: { type: "string" },
keyPoints: { type: "array", items: { type: "string" } },
confidence: { type: "number" }
},
required: ["summary", "keyPoints"]
},
maxTokens: 500,
temperature: 0.3,
model: "gpt-4-turbo"
};
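The userPromptTemplate above uses {{placeholder}} variables that are filled in at request time. A minimal rendering helper might look like the following sketch; renderTemplate is a hypothetical name, not part of any SDK:
function renderTemplate(template: string, vars: Record<string, string>): string {
  // Replace each {{name}} placeholder with its value; leave unknown placeholders untouched
  return template.replace(/\{\{(\w+)\}\}/g, (match, key) => (key in vars ? vars[key] : match));
}

const userPrompt = renderTemplate(summarizationPrompt.userPromptTemplate, {
  maxWords: "150",
  focusAreas: "parties, obligations, termination",
  document: "This Agreement ('Agreement') is entered into...",
  format: "JSON"
});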
Prompt Version Control
Prompts are code—treat them accordingly.
File Structure
prompts/
├── summarization/
│ ├── v1.0.0/
│ │ ├── prompt.yaml
│ │ ├── examples.json
│ │ └── test-cases.json
│ ├── v2.0.0/
│ │ └── ...
│ └── current -> v2.0.0
├── classification/
│ └── ...
└── extraction/
└── ...
Prompt Definition
# prompts/summarization/v2.0.0/prompt.yaml
name: document-summarizer
version: 2.0.0
model: gpt-4-turbo
parameters:
temperature: 0.3
max_tokens: 500
system_prompt: |
You are a document summarizer for a legal tech platform.
Your summaries must be:
- Accurate: Only include information from the source
- Concise: Respect the word limit strictly
- Professional: Maintain legal terminology
Never make up information. If uncertain, say so.
user_prompt_template: |
Summarize the following document in {max_words} words or fewer.
Focus areas: {focus_areas}
Document:
{document}
Provide your response as JSON with this structure:
{
"summary": "Your summary here",
"keyPoints": ["Point 1", "Point 2"],
"wordCount": number
}
changelog:
- version: 2.0.0
date: 2024-05-01
changes:
- Added structured JSON output
- Improved accuracy instructions
- Added confidence requirement
- version: 1.0.0
date: 2024-03-15
changes:
- Initial release
Deployment Strategy
// Gradual rollout with feature flags
async function selectPromptVersion(userId: string): Promise<PromptConfig> {
const rollout = await getFeatureFlag('prompt-summarizer-v2');
if (rollout.isEnabled(userId)) {
return loadPrompt('summarization', 'v2.0.0');
}
return loadPrompt('summarization', 'v1.0.0');
}
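The loadPrompt helper used above is assumed rather than shown. A minimal sketch, assuming the prompts/ directory layout from the previous section and the js-yaml package (a production version would cache results and validate the parsed YAML):
import { readFileSync } from 'fs';
import path from 'path';
import { load } from 'js-yaml';

function loadPrompt(name: string, version: string): PromptConfig {
  // Read prompts/<name>/<version>/prompt.yaml and map it onto PromptConfig
  const file = path.join('prompts', name, version, 'prompt.yaml');
  const raw = load(readFileSync(file, 'utf8')) as any;
  return {
    name: raw.name,
    version: raw.version,
    model: raw.model,
    temperature: raw.parameters.temperature,
    maxTokens: raw.parameters.max_tokens,
    systemPrompt: raw.system_prompt,
    userPromptTemplate: raw.user_prompt_template
  };
}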
Testing Strategies
Golden Dataset Testing
Create comprehensive test cases:
{
"testCases": [
{
"id": "legal-001",
"input": {
"document": "This Agreement ('Agreement') is entered into...",
"maxWords": 100,
"focusAreas": ["parties", "obligations", "term"]
},
"expectedOutput": {
"mustContain": ["agreement", "parties", "obligations"],
"mustNotContain": ["personal opinion"],
"maxWordCount": 100,
"minKeyPoints": 3
},
"tags": ["legal", "contract", "core"]
}
]
}
Evaluation Framework
interface EvaluationResult {
testId: string;
passed: boolean;
scores: {
accuracy: number; // Did it capture key information?
relevance: number; // Is the output relevant to the request?
formatting: number; // Does it match expected format?
safety: number; // No harmful or hallucinated content?
};
latencyMs: number;
tokenUsage: {
input: number;
output: number;
cost: number;
};
}
async function evaluatePrompt(
prompt: PromptConfig,
testCases: TestCase[]
): Promise<EvaluationResult[]> {
const results: EvaluationResult[] = [];
for (const testCase of testCases) {
const start = Date.now();
const response = await callLLM(prompt, testCase.input);
const latency = Date.now() - start;
const scores = {
accuracy: evaluateAccuracy(response, testCase.expectedOutput),
relevance: evaluateRelevance(response, testCase.input),
formatting: evaluateFormat(response, prompt.outputSchema),
safety: evaluateSafety(response)
};
results.push({
testId: testCase.id,
passed: Object.values(scores).every(s => s >= 0.8),
scores,
latencyMs: latency,
tokenUsage: response.usage
});
}
return results;
}
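The scoring helpers (evaluateAccuracy, evaluateRelevance, and so on) are left abstract above. As one concrete sketch, an accuracy scorer can check the golden-dataset expectations directly; the names and shapes here are illustrative:
function evaluateAccuracy(
  response: { content: string },
  expected: { mustContain: string[]; mustNotContain: string[]; maxWordCount?: number }
): number {
  const text = response.content.toLowerCase();
  let checks = 0;
  let passed = 0;
  // Required terms must appear in the output
  for (const term of expected.mustContain) {
    checks++;
    if (text.includes(term.toLowerCase())) passed++;
  }
  // Prohibited terms must not appear
  for (const term of expected.mustNotContain) {
    checks++;
    if (!text.includes(term.toLowerCase())) passed++;
  }
  // Word limit, if the test case specifies one
  if (expected.maxWordCount !== undefined) {
    checks++;
    if (text.split(/\s+/).filter(Boolean).length <= expected.maxWordCount) passed++;
  }
  return checks === 0 ? 1 : passed / checks; // score in [0, 1]
}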
A/B Testing
interface ABTestConfig {
name: string;
variants: {
control: PromptConfig;
treatment: PromptConfig;
};
metrics: string[];
sampleSize: number;
confidenceLevel: number;
}
async function runABTest(config: ABTestConfig): Promise<ABTestResult> {
const results = {
control: { samples: [], metrics: {} },
treatment: { samples: [], metrics: {} }
};
// Collect samples
for (let i = 0; i < config.sampleSize; i++) {
const variant = Math.random() < 0.5 ? 'control' : 'treatment';
const prompt = config.variants[variant];
const result = await evaluatePrompt(prompt, [getRandomTestCase()]);
results[variant].samples.push(result);
}
// Calculate statistical significance
return calculateSignificance(results, config.confidenceLevel);
}
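calculateSignificance is where the statistics live. A bare-bones sketch uses Welch's t-statistic on a per-sample metric such as accuracy; a real implementation would lean on a statistics library to turn the statistic into a p-value:
function welchTStatistic(a: number[], b: number[]): number {
  const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
  const variance = (xs: number[]) => {
    const m = mean(xs);
    return xs.reduce((s, x) => s + (x - m) ** 2, 0) / (xs.length - 1);
  };
  // Welch's t-statistic: difference in means over the combined standard error
  const standardError = Math.sqrt(variance(a) / a.length + variance(b) / b.length);
  return (mean(a) - mean(b)) / standardError;
}

// Compare |t| against the critical value for the chosen confidence level (with
// Welch-Satterthwaite degrees of freedom) to decide whether the treatment
// variant is a genuine improvement.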
Prompt Optimization Techniques
Few-Shot Learning
Examples dramatically improve consistency:
system_prompt: |
Extract structured data from customer support messages.
Examples:
Input: "Hi, I'm John Smith and my order #12345 hasn't arrived"
Output: {"name": "John Smith", "orderNumber": "12345", "intent": "order_status"}
Input: "Can I get a refund? Email: jane@example.com"
Output: {"email": "jane@example.com", "intent": "refund_request"}
Input: "Your product is terrible, I want my money back!"
Output: {"sentiment": "negative", "intent": "refund_request"}Chain-of-Thought for Complex Tasks
system_prompt: |
You are analyzing customer support tickets for escalation.
Think through each step:
1. Identify the customer's primary issue
2. Assess sentiment and urgency
3. Check for escalation triggers (legal threats, safety issues, VIP customer)
4. Make a recommendation with reasoning
Format your response as:
<thinking>
[Your step-by-step analysis]
</thinking>
<decision>
{
"escalate": true/false,
"priority": "low|medium|high|critical",
"reason": "Brief explanation"
}
</decision>
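Because the model wraps its reasoning in tags, the application only needs to extract and parse the <decision> block; the <thinking> block can be logged for debugging. A small parsing sketch (parseDecision is a hypothetical helper):
function parseDecision(
  response: string
): { escalate: boolean; priority: string; reason: string } | null {
  // Pull out the JSON between <decision> tags; [\s\S] matches across newlines
  const match = response.match(/<decision>([\s\S]*?)<\/decision>/);
  if (!match) return null;
  try {
    return JSON.parse(match[1].trim());
  } catch {
    return null; // malformed JSON inside the decision block
  }
}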
Output Format Control
Be explicit about output format:
# Strict JSON output
user_prompt: |
Analyze this text and respond with ONLY valid JSON:
{
"sentiment": "positive|negative|neutral",
"confidence": 0.0-1.0,
"topics": ["topic1", "topic2"]
}
Do not include any text outside the JSON object.
Text to analyze: {text}
Temperature Tuning
| Temperature | Use Case |
|---|---|
| 0.0-0.3 | Factual extraction, classification |
| 0.3-0.7 | Summarization, structured generation |
| 0.7-1.0 | Creative writing, brainstorming |
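In practice these ranges can be encoded as per-task defaults so new prompts start from a sensible value. The task names below are illustrative:
const defaultTemperature: Record<string, number> = {
  extraction: 0.0,      // deterministic, factual
  classification: 0.2,
  summarization: 0.4,
  generation: 0.6,
  brainstorming: 0.9    // favor variety over consistency
};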
Production Guardrails
Input Validation
function validateInput(input: string, config: PromptConfig): ValidationResult {
const issues: string[] = [];
// Token limit check
const estimatedTokens = estimateTokens(input);
const maxInputTokens = config.maxContextTokens - config.maxOutputTokens;
if (estimatedTokens > maxInputTokens) {
issues.push(`Input too long: ${estimatedTokens} tokens (max: ${maxInputTokens})`);
}
// Content filtering
if (containsProhibitedContent(input)) {
issues.push('Input contains prohibited content');
}
// PII detection for sensitive prompts
if (config.piiSensitive && containsPII(input)) {
issues.push('Input contains PII - redaction required');
}
return {
valid: issues.length === 0,
issues,
estimatedTokens
};
}
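The estimateTokens helper above is assumed. A rough character-based heuristic is often enough for a pre-flight check; for exact counts, a tokenizer library such as tiktoken can be swapped in:
function estimateTokens(text: string): number {
  // ~4 characters per token is a reasonable approximation for English prose
  return Math.ceil(text.length / 4);
}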
Output Validation
async function validateOutput(
output: string,
schema: JSONSchema,
constraints: OutputConstraints
): Promise<ValidationResult> {
const issues: string[] = [];
// JSON parsing
let parsed: any;
try {
parsed = JSON.parse(output);
} catch {
issues.push('Output is not valid JSON');
return { valid: false, issues };
}
// Schema validation
const schemaErrors = validateSchema(parsed, schema);
issues.push(...schemaErrors);
// Hallucination detection
if (constraints.sourceDocument) {
const hallucinations = detectHallucinations(parsed, constraints.sourceDocument);
if (hallucinations.length > 0) {
issues.push(`Potential hallucinations: ${hallucinations.join(', ')}`);
}
}
// Safety checks
const safetyIssues = checkSafety(output);
issues.push(...safetyIssues);
return { valid: issues.length === 0, issues, parsed };
}
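The validateSchema call above can be backed by an off-the-shelf JSON Schema validator. A sketch using Ajv, one option among several:
import Ajv from 'ajv';

const ajv = new Ajv({ allErrors: true });

function validateSchema(data: unknown, schema: object): string[] {
  // Compile the schema and return human-readable error strings
  // (a production version would cache compiled validators)
  const validate = ajv.compile(schema);
  if (validate(data)) return [];
  return (validate.errors ?? []).map(e => `${e.instancePath || '/'} ${e.message}`);
}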
Fallback Strategy
async function executeWithFallback(
prompt: PromptConfig,
input: UserInput
): Promise<LLMResponse> {
const strategies = [
{ model: 'gpt-4-turbo', retries: 2 },
{ model: 'gpt-3.5-turbo', retries: 2 }, // Cheaper fallback
{ model: 'claude-3-sonnet', retries: 1 } // Different provider
];
for (const strategy of strategies) {
for (let attempt = 0; attempt < strategy.retries; attempt++) {
try {
const response = await callLLM({
...prompt,
model: strategy.model
}, input);
const validation = await validateOutput(response.content, prompt.outputSchema, {});
if (validation.valid) {
return response;
}
// Retry with explicit format reminder
if (attempt < strategy.retries - 1) {
input = addFormatReminder(input);
}
} catch (error) {
if (isRateLimitError(error)) {
await exponentialBackoff(attempt);
} else if (isContentFilterError(error)) {
throw error; // Don't retry content filter errors
}
}
}
}
// All strategies exhausted
throw new Error('Unable to generate valid response');
}
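The exponentialBackoff helper is assumed above; a minimal sketch with jitter and a ceiling:
function exponentialBackoff(attempt: number): Promise<void> {
  // 1s, 2s, 4s, ... capped at 30s, plus a little jitter to avoid thundering herds
  const delayMs = Math.min(1000 * 2 ** attempt, 30_000) + Math.random() * 250;
  return new Promise(resolve => setTimeout(resolve, delayMs));
}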
Cost Management
Token Optimization
function optimizePrompt(prompt: string, maxTokens: number): string {
// Remove redundant whitespace
let optimized = prompt.replace(/\s+/g, ' ').trim();
// Shorten common phrases
const shortenings = {
'In order to': 'To',
'As a result of': 'Due to',
'At this point in time': 'Now'
};
for (const [long, short] of Object.entries(shortenings)) {
optimized = optimized.replace(new RegExp(long, 'gi'), short);
}
return optimized;
}Cost Tracking
interface CostMetrics {
promptName: string;
version: string;
requestCount: number;
totalTokens: {
input: number;
output: number;
};
totalCost: number;
averageCostPerRequest: number;
}
async function trackCost(
prompt: PromptConfig,
usage: TokenUsage
): Promise<void> {
const cost = calculateCost(prompt.model, usage);
await metrics.increment('llm.requests', {
prompt: prompt.name,
version: prompt.version,
model: prompt.model
});
await metrics.gauge('llm.cost', cost, {
prompt: prompt.name,
version: prompt.version
});
}
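The calculateCost helper is also assumed; a sketch keyed by model, using illustrative per-million-token rates rather than live pricing:
const ratesPerMillionTokens: Record<string, { input: number; output: number }> = {
  'gpt-4-turbo': { input: 10, output: 30 },
  'gpt-3.5-turbo': { input: 0.5, output: 1.5 },
  'claude-3-haiku': { input: 0.25, output: 1.25 },
  'claude-3-sonnet': { input: 3, output: 15 }
};

function calculateCost(model: string, usage: { input: number; output: number }): number {
  const rates = ratesPerMillionTokens[model];
  if (!rates) return 0; // unknown model: skip cost attribution rather than guess
  return (usage.input * rates.input + usage.output * rates.output) / 1_000_000;
}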
Model Selection
| Model | Best For | Cost (per 1M tokens) |
|---|---|---|
| GPT-4 Turbo | Complex reasoning | $10-30 |
| GPT-3.5 Turbo | Simple tasks | $0.50-1.50 |
| Claude 3 Haiku | Fast, cheap | $0.25-1.25 |
| Claude 3 Sonnet | Balanced | $3-15 |
Route requests based on complexity:
function selectModel(task: TaskType, complexity: number): string {
if (complexity < 0.3) return 'gpt-3.5-turbo';
if (complexity < 0.7) return 'claude-3-sonnet';
return 'gpt-4-turbo';
}Key Takeaways
- Prompts are code: Version control, code review, and testing apply
- Test comprehensively: Golden datasets, automated evaluation, A/B testing
- Structure for reliability: Explicit output formats, validation, fallbacks
- Optimize iteratively: Few-shot examples, chain-of-thought, temperature tuning
- Plan for failure: Fallback models, retry strategies, graceful degradation
- Track costs: Token usage, model selection, cost per request
- Validate everything: Input sanitization, output validation, hallucination detection
Production prompt engineering is not about crafting the perfect prompt once—it's about building systems that consistently deliver reliable results while managing costs and handling edge cases gracefully.