Evaluation, deployment & optimization

15% of the exam

Build evals, measure quality, optimize cost and latency, and operate in production.

Evaluate before optimizing

Build a representative eval set (real cases plus edge cases) with measurable success criteria.
Use automatic, rule-based evals where possible, plus an LLM judge for open-ended outputs, calibrated against human labels.
Don't optimize a prompt blindly: measure the effect of each change on the eval set.
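
The points above can be sketched as a tiny harness. This is a minimal illustration, not a prescribed implementation: `EvalCase`, `rule_check`, `pass_rate`, and `judge_agreement` are hypothetical names, and the keyword rule stands in for whatever measurable success criterion the task really has. `generate` is assumed to be any function that takes a prompt and returns the model output as a string.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    expected_keywords: list[str]  # hypothetical rule-based success criterion


def rule_check(output: str, case: EvalCase) -> bool:
    """Pass only if every expected keyword appears in the output (case-insensitive)."""
    text = output.lower()
    return all(keyword.lower() in text for keyword in case.expected_keywords)


def pass_rate(generate: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of eval cases whose output passes the rule check."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(rule_check(generate(case.prompt), case) for case in cases)
    return passed / len(cases)


def judge_agreement(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Share of items where the LLM judge's verdict matches the human label."""
    if not human_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)
```

Running `pass_rate` for the current prompt and for each candidate prompt on the same eval set gives a like-for-like comparison of every change. A low `judge_agreement` on a human-labeled sample suggests the judge prompt needs work before its scores are trusted.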

Cost/latency levers: a right-sized model, prompt caching, batch processing, a capped max_tokens, and a smaller context.
Use streaming to reduce perceived latency, and app-level caching for frequently requested results.
Monitor error rate, p95/p99 latency, cost per request, and quality (by sampling outputs).
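
Two of the items above, app-level caching and percentile/cost monitoring, can be sketched in a few lines. This is an illustration under stated assumptions: `cached_generate`, `percentile`, and `cost_per_request` are hypothetical helpers, the in-memory dict stands in for a real cache store, the percentile uses the nearest-rank method, and the per-million-token prices are parameters you would supply yourself.

```python
import hashlib
from typing import Callable

_cache: dict[str, str] = {}  # in-memory stand-in for a real cache store


def cached_generate(prompt: str, generate: Callable[[str], str]) -> str:
    """Return a stored result for an identical prompt; call the model only on a miss."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = generate(prompt)
    return _cache[key]


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not values:
        raise ValueError("no values")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct/100 * n
    return ordered[rank - 1]


def cost_per_request(input_tokens: int, output_tokens: int,
                     input_price_per_mtok: float,
                     output_price_per_mtok: float) -> float:
    """Cost of one request given assumed prices per million tokens."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000
```

Tail percentiles are tracked rather than the mean because a small share of very slow requests barely moves the average while dominating p95 and p99.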


1. When improving a prompt, what is the first rigorous step?
2. How should open-ended outputs (such as summaries) be evaluated?
3. Which metric is most useful for detecting latency degradation?
4. A real-time service is too slow: what should you do first to improve perceived latency without changing quality?
5. Which set of levers reduces cost per request?
6. Before replacing a production prompt, what guarantee do you need?
7. What is an evaluation "golden dataset"?
8. How can you cut the cost of frequent identical requests?
9. When judging open-ended outputs at scale, how do you make the LLM judge reliable?
10. Which signal should you watch to detect quality drift in production?