Evaluation, deployment & optimization
15% of the exam
Build evals, measure quality, optimize cost and latency, and operate in production.
Evaluate before optimizing
- Representative eval set (real + edge cases) with measurable success criteria.
- Automatic rule-based evals for checkable outputs, plus an LLM judge for open-ended outputs, calibrated against human labels.
- Don't optimize a prompt blindly: measure each change against the eval set.
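The points above can be sketched as a tiny eval harness. This is a minimal illustration, not a reference implementation: `EvalCase`, `passes`, `pass_rate`, `agreement`, and the two stand-in generators `baseline` and `candidate` are hypothetical names, and the generators are plain functions standing in for "model + prompt version" so the example runs without any API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    input_text: str
    must_contain: list[str]  # rule-based success criterion: required terms
    max_words: int           # rule-based success criterion: length cap


def passes(output: str, case: EvalCase) -> bool:
    """Automatic (rule-based) check of one output against its criteria."""
    text = output.lower()
    has_terms = all(term.lower() in text for term in case.must_contain)
    return has_terms and len(output.split()) <= case.max_words


def pass_rate(generate: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of eval cases that a given prompt version passes."""
    results = [passes(generate(case.input_text), case) for case in cases]
    return sum(results) / len(results)


def agreement(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Share of items where the LLM judge matches the human label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


# Hypothetical stand-ins for two prompt versions (no real model is called).
def baseline(text: str) -> str:
    return text  # echoes the input, so long inputs exceed the length cap


def candidate(text: str) -> str:
    return " ".join(text.split()[:5])  # keeps only the first five words


EVAL_SET = [
    EvalCase("Refund approved for order 1234 after a long review by support",
             must_contain=["refund", "1234"], max_words=8),
    EvalCase("Outage resolved",  # edge case: very short input
             must_contain=["outage"], max_words=8),
]
```

Running `pass_rate(baseline, EVAL_SET)` and `pass_rate(candidate, EVAL_SET)` on the same set gives one number per prompt version, which is what makes a change measurable. `agreement` gives a rough calibration signal for an LLM judge; a low value suggests revising the judge prompt before trusting its scores.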
Optimize in production
- Cost/latency levers: a right-sized model, prompt caching, batch processing for non-urgent work, a capped max_tokens, a smaller context.
- Streaming for perceived latency; app-level caching of frequent results.
- Monitor: error rate, p95/p99 latency, cost per request, quality (by sampling production outputs).
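Two of the items above lend themselves to a short sketch: tail-latency percentiles and app-level caching of identical requests. The code below is an illustration under assumptions: `percentile` uses the nearest-rank method (one of several common definitions), and `expensive_model_call` is a hypothetical placeholder for a real model request.

```python
import math
from functools import lru_cache


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def mean(samples: list[float]) -> float:
    return sum(samples) / len(samples)


# Latencies in milliseconds; one slow request hides inside a normal-looking batch.
LATENCIES_MS = [120, 130, 125, 140, 135, 128, 132, 138, 126, 900]


def expensive_model_call(prompt: str) -> str:
    """Hypothetical placeholder for a real (slow, billed) model request."""
    return f"answer to: {prompt}"


@lru_cache(maxsize=1024)
def cached_answer(prompt: str) -> str:
    """App-level cache: identical prompts are served from memory after the first call."""
    return expensive_model_call(prompt)
```

On `LATENCIES_MS`, the median stays at 130 ms while p95 jumps to 900 ms, which is why tail percentiles surface degradation that an average or median can mask. After two calls to `cached_answer` with the same prompt, `cached_answer.cache_info()` reports one miss and one hit, so only the first call pays for the model request. An exact-match cache like this only helps when requests repeat verbatim and the answer is allowed to be reused.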
Practice — 10 questions
- 1. When improving a prompt, what is the first rigorous step?
- 2. How do you evaluate open-ended outputs such as summaries?
- 3. Which metric is most useful for detecting latency degradation?
- 4. A real-time service feels too slow: what do you try first to improve perceived latency without changing output quality?
- 5. Which set of levers reduces cost per request?
- 6. Before replacing a production prompt, what guarantee do you need?
- 7. What is a 'golden dataset' in evaluation?
- 8. How do you cut the cost of frequent identical requests?
- 9. When judging open-ended outputs at scale, how do you make the LLM judge reliable?
- 10. Which signal should you watch to detect quality drift in production?