M

Production · Days 62-70

Evaluation Engineering

Evaluation engineering turns subjective AI behavior into testable product quality using golden datasets, LLM judges, RAG metrics, agent trajectory analysis, synthetic data, online evals, and red teaming.

Advanced 8 subtopics 9 daily blocks

Outcome

Create eval systems that catch regressions, score quality, test RAG faithfulness, measure agent success, and harden behavior before users find the bugs.

Practice builds

Prompt regression test suiteRAG eval dashboardAgent trajectory scorer

What to learn

Offline evals: golden datasets and regression tests
LLM-as-judge: pairwise comparison and rubric-based scoring
RAG metrics: faithfulness, answer relevance, context precision and recall
Agent evals: task success rate, trajectory analysis, tool-call accuracy
Synthetic data generation for evals
Frameworks: Promptfoo, DeepEval, Inspect, Braintrust, OpenAI Evals
Online evals on production traffic
Red teaming and adversarial testing

Daily study plan

Day 62: Build a small golden dataset for one AI feature.
Day 63: Add regression tests for expected structure and refusal boundaries.
Day 64: Create an LLM-as-judge rubric and compare two model outputs.
Day 65: Score RAG faithfulness, relevance, and context precision.
Day 66: Evaluate agent tool-call accuracy and final task success.
Day 67: Generate synthetic eval examples and manually review them.
Day 68: Run Promptfoo or DeepEval in CI.
Day 69: Design online eval sampling for production traffic.
Day 70: Red-team the feature with adversarial inputs.

Resources