Production · Days 62-70

Evaluation Engineering

Evaluation engineering turns subjective AI behavior into testable product quality using golden datasets, LLM judges, RAG metrics, agent trajectory analysis, synthetic data, online evals, and red teaming.

Advanced 8 subtopics 9 daily blocks

Outcome

Create eval systems that catch regressions, score quality, test RAG faithfulness, measure agent success, and harden behavior before users find the bugs.

Practice builds

Prompt regression test suiteRAG eval dashboardAgent trajectory scorer

What to learn

Offline evals: golden datasets and regression tests

LLM-as-judge: pairwise comparison and rubric-based scoring

RAG metrics: faithfulness, answer relevance, context precision and recall

Agent evals: task success rate, trajectory analysis, tool-call accuracy

Synthetic data generation for evals

Frameworks: Promptfoo, DeepEval, Inspect, Braintrust, OpenAI Evals

Online evals on production traffic

Red teaming and adversarial testing

Daily study plan

Day 62: Build a small golden dataset for one AI feature.

Day 63: Add regression tests for expected structure and refusal boundaries.

Day 64: Create an LLM-as-judge rubric and compare two model outputs.

Day 65: Score RAG faithfulness, relevance, and context precision.

Day 66: Evaluate agent tool-call accuracy and final task success.

Day 67: Generate synthetic eval examples and manually review them.

Day 68: Run Promptfoo or DeepEval in CI.

Day 69: Design online eval sampling for production traffic.

Day 70: Red-team the feature with adversarial inputs.

Resources

Docs

Promptfoo documentation

Open resource →

Docs

DeepEval documentation

Open resource →

Tool

Inspect AI

Open resource →