Skip to main content
Back to Blog
5 May 202416 min read

Prompt Engineering for Production Systems

AI/MLPrompt EngineeringLLMBest Practices

Moving beyond playground prompts to production-ready prompt engineering. Version control, testing strategies, and prompt optimization techniques.


Prompt Engineering for Production Systems

Playground prompts don't survive production. The prompt that works perfectly in the ChatGPT interface often fails spectacularly when deployed at scale. Production prompt engineering is a discipline that combines software engineering rigor with iterative experimentation.

From Playground to Production

The Production Reality

What changes in production:

AspectPlaygroundProduction
Input varietyYou control itUsers surprise you
ScaleSingle requestsThousands per minute
CostFree/cheapReal budget impact
Reliability"Good enough"Must be consistent
ErrorsYou see themUsers see them

Prompt Architecture

Structure prompts for maintainability:

interface PromptConfig { name: string; version: string; systemPrompt: string; userPromptTemplate: string; outputSchema?: JSONSchema; maxTokens: number; temperature: number; model: string; } const summarizationPrompt: PromptConfig = { name: "document-summarizer", version: "2.3.1", systemPrompt: `You are a document summarizer for a legal tech platform. Your summaries must be accurate, concise, and maintain legal terminology. Never make up information not present in the source document.`, userPromptTemplate: `Summarize the following document in {{maxWords}} words or fewer. Focus on: {{focusAreas}} Document: {{document}} Output format: {{format}}`, outputSchema: { type: "object", properties: { summary: { type: "string" }, keyPoints: { type: "array", items: { type: "string" } }, confidence: { type: "number" } }, required: ["summary", "keyPoints"] }, maxTokens: 500, temperature: 0.3, model: "gpt-4-turbo" };

Prompt Version Control

Prompts are code—treat them accordingly.

File Structure

prompts/
├── summarization/
│   ├── v1.0.0/
│   │   ├── prompt.yaml
│   │   ├── examples.json
│   │   └── test-cases.json
│   ├── v2.0.0/
│   │   └── ...
│   └── current -> v2.0.0
├── classification/
│   └── ...
└── extraction/
    └── ...

Prompt Definition

# prompts/summarization/v2.0.0/prompt.yaml name: document-summarizer version: 2.0.0 model: gpt-4-turbo parameters: temperature: 0.3 max_tokens: 500 system_prompt: | You are a document summarizer for a legal tech platform. Your summaries must be: - Accurate: Only include information from the source - Concise: Respect the word limit strictly - Professional: Maintain legal terminology Never make up information. If uncertain, say so. user_prompt_template: | Summarize the following document in {max_words} words or fewer. Focus areas: {focus_areas} Document: {document} Provide your response as JSON with this structure: { "summary": "Your summary here", "keyPoints": ["Point 1", "Point 2"], "wordCount": number } changelog: - version: 2.0.0 date: 2024-05-01 changes: - Added structured JSON output - Improved accuracy instructions - Added confidence requirement - version: 1.0.0 date: 2024-03-15 changes: - Initial release

Deployment Strategy

// Gradual rollout with feature flags async function selectPromptVersion(userId: string): Promise<PromptConfig> { const rollout = await getFeatureFlag('prompt-summarizer-v2'); if (rollout.isEnabled(userId)) { return loadPrompt('summarization', 'v2.0.0'); } return loadPrompt('summarization', 'v1.0.0'); }

Testing Strategies

Golden Dataset Testing

Create comprehensive test cases:

{ "testCases": [ { "id": "legal-001", "input": { "document": "This Agreement ('Agreement') is entered into...", "maxWords": 100, "focusAreas": ["parties", "obligations", "term"] }, "expectedOutput": { "mustContain": ["agreement", "parties", "obligations"], "mustNotContain": ["personal opinion"], "maxWordCount": 100, "minKeyPoints": 3 }, "tags": ["legal", "contract", "core"] } ] }

Evaluation Framework

interface EvaluationResult { testId: string; passed: boolean; scores: { accuracy: number; // Did it capture key information? relevance: number; // Is the output relevant to the request? formatting: number; // Does it match expected format? safety: number; // No harmful or hallucinated content? }; latencyMs: number; tokenUsage: { input: number; output: number; cost: number; }; } async function evaluatePrompt( prompt: PromptConfig, testCases: TestCase[] ): Promise<EvaluationResult[]> { const results: EvaluationResult[] = []; for (const testCase of testCases) { const start = Date.now(); const response = await callLLM(prompt, testCase.input); const latency = Date.now() - start; const scores = { accuracy: evaluateAccuracy(response, testCase.expectedOutput), relevance: evaluateRelevance(response, testCase.input), formatting: evaluateFormat(response, prompt.outputSchema), safety: evaluateSafety(response) }; results.push({ testId: testCase.id, passed: Object.values(scores).every(s => s >= 0.8), scores, latencyMs: latency, tokenUsage: response.usage }); } return results; }

A/B Testing

interface ABTestConfig { name: string; variants: { control: PromptConfig; treatment: PromptConfig; }; metrics: string[]; sampleSize: number; confidenceLevel: number; } async function runABTest(config: ABTestConfig): Promise<ABTestResult> { const results = { control: { samples: [], metrics: {} }, treatment: { samples: [], metrics: {} } }; // Collect samples for (let i = 0; i < config.sampleSize; i++) { const variant = Math.random() < 0.5 ? 'control' : 'treatment'; const prompt = config.variants[variant]; const result = await evaluatePrompt(prompt, [getRandomTestCase()]); results[variant].samples.push(result); } // Calculate statistical significance return calculateSignificance(results, config.confidenceLevel); }

Prompt Optimization Techniques

Few-Shot Learning

Examples dramatically improve consistency:

system_prompt: | Extract structured data from customer support messages. Examples: Input: "Hi, I'm John Smith and my order #12345 hasn't arrived" Output: {"name": "John Smith", "orderNumber": "12345", "intent": "order_status"} Input: "Can I get a refund? Email: jane@example.com" Output: {"email": "jane@example.com", "intent": "refund_request"} Input: "Your product is terrible, I want my money back!" Output: {"sentiment": "negative", "intent": "refund_request"}

Chain-of-Thought for Complex Tasks

system_prompt: | You are analyzing customer support tickets for escalation. Think through each step: 1. Identify the customer's primary issue 2. Assess sentiment and urgency 3. Check for escalation triggers (legal threats, safety issues, VIP customer) 4. Make a recommendation with reasoning Format your response as: <thinking> [Your step-by-step analysis] </thinking> <decision> { "escalate": true/false, "priority": "low|medium|high|critical", "reason": "Brief explanation" } </decision>

Output Format Control

Be explicit about output format:

# Strict JSON output user_prompt: | Analyze this text and respond with ONLY valid JSON: { "sentiment": "positive|negative|neutral", "confidence": 0.0-1.0, "topics": ["topic1", "topic2"] } Do not include any text outside the JSON object. Text to analyze: {text}

Temperature Tuning

TemperatureUse Case
0.0-0.3Factual extraction, classification
0.3-0.7Summarization, structured generation
0.7-1.0Creative writing, brainstorming

Production Guardrails

Input Validation

function validateInput(input: string, config: PromptConfig): ValidationResult { const issues: string[] = []; // Token limit check const estimatedTokens = estimateTokens(input); const maxInputTokens = config.maxContextTokens - config.maxOutputTokens; if (estimatedTokens > maxInputTokens) { issues.push(`Input too long: ${estimatedTokens} tokens (max: ${maxInputTokens})`); } // Content filtering if (containsProhibitedContent(input)) { issues.push('Input contains prohibited content'); } // PII detection for sensitive prompts if (config.piiSensitive && containsPII(input)) { issues.push('Input contains PII - redaction required'); } return { valid: issues.length === 0, issues, estimatedTokens }; }

Output Validation

async function validateOutput( output: string, schema: JSONSchema, constraints: OutputConstraints ): Promise<ValidationResult> { const issues: string[] = []; // JSON parsing let parsed: any; try { parsed = JSON.parse(output); } catch { issues.push('Output is not valid JSON'); return { valid: false, issues }; } // Schema validation const schemaErrors = validateSchema(parsed, schema); issues.push(...schemaErrors); // Hallucination detection if (constraints.sourceDocument) { const hallucinations = detectHallucinations(parsed, constraints.sourceDocument); if (hallucinations.length > 0) { issues.push(`Potential hallucinations: ${hallucinations.join(', ')}`); } } // Safety checks const safetyIssues = checkSafety(output); issues.push(...safetyIssues); return { valid: issues.length === 0, issues, parsed }; }

Fallback Strategy

async function executeWithFallback( prompt: PromptConfig, input: UserInput ): Promise<LLMResponse> { const strategies = [ { model: 'gpt-4-turbo', retries: 2 }, { model: 'gpt-3.5-turbo', retries: 2 }, // Cheaper fallback { model: 'claude-3-sonnet', retries: 1 } // Different provider ]; for (const strategy of strategies) { for (let attempt = 0; attempt < strategy.retries; attempt++) { try { const response = await callLLM({ ...prompt, model: strategy.model }, input); const validation = await validateOutput(response.content, prompt.outputSchema); if (validation.valid) { return response; } // Retry with explicit format reminder if (attempt < strategy.retries - 1) { input = addFormatReminder(input); } } catch (error) { if (isRateLimitError(error)) { await exponentialBackoff(attempt); } else if (isContentFilterError(error)) { throw error; // Don't retry content filter errors } } } } // All strategies exhausted throw new Error('Unable to generate valid response'); }

Cost Management

Token Optimization

function optimizePrompt(prompt: string, maxTokens: number): string { // Remove redundant whitespace let optimized = prompt.replace(/\s+/g, ' ').trim(); // Shorten common phrases const shortenings = { 'In order to': 'To', 'As a result of': 'Due to', 'At this point in time': 'Now' }; for (const [long, short] of Object.entries(shortenings)) { optimized = optimized.replace(new RegExp(long, 'gi'), short); } return optimized; }

Cost Tracking

interface CostMetrics { promptName: string; version: string; requestCount: number; totalTokens: { input: number; output: number; }; totalCost: number; averageCostPerRequest: number; } async function trackCost( prompt: PromptConfig, usage: TokenUsage ): Promise<void> { const cost = calculateCost(prompt.model, usage); await metrics.increment('llm.requests', { prompt: prompt.name, version: prompt.version, model: prompt.model }); await metrics.gauge('llm.cost', cost, { prompt: prompt.name, version: prompt.version }); }

Model Selection

ModelBest ForCost (per 1M tokens)
GPT-4 TurboComplex reasoning$10-30
GPT-3.5 TurboSimple tasks$0.50-1.50
Claude 3 HaikuFast, cheap$0.25-1.25
Claude 3 SonnetBalanced$3-15

Route requests based on complexity:

function selectModel(task: TaskType, complexity: number): string { if (complexity < 0.3) return 'gpt-3.5-turbo'; if (complexity < 0.7) return 'claude-3-sonnet'; return 'gpt-4-turbo'; }

Key Takeaways

  1. Prompts are code: Version control, code review, and testing apply
  2. Test comprehensively: Golden datasets, automated evaluation, A/B testing
  3. Structure for reliability: Explicit output formats, validation, fallbacks
  4. Optimize iteratively: Few-shot examples, chain-of-thought, temperature tuning
  5. Plan for failure: Fallback models, retry strategies, graceful degradation
  6. Track costs: Token usage, model selection, cost per request
  7. Validate everything: Input sanitization, output validation, hallucination detection

Production prompt engineering is not about crafting the perfect prompt once—it's about building systems that consistently deliver reliable results while managing costs and handling edge cases gracefully.

Share this article