18 Evaluations
BDD specs answer “can the workflow do the right thing?” Evaluations answer the harder question: how often does it do the right thing across real inputs?
This is where agent engineering starts to look like ML engineering: you need datasets, metrics, thresholds, and regression tracking.
18.1 Specs vs Evals (Use Both)
Think of the split like this:
- Specs protect invariants: “never call the dangerous tool without approval,” “never double-send,” “output has required fields.”
- Evals protect quality over a distribution: “is the recap readable,” “does it reliably extract action items,” “does it stay concise across many inputs.”
You want both because they fail differently:
- Specs fail fast and loudly (great for safety policies).
- Evals reveal drift and brittleness (great for quality and reliability).
18.2 The Evaluation Loop
An evaluation is just a loop with structure:
- Define a dataset of realistic inputs (messy notes, edge cases, adversarial-ish phrasing).
- Run the procedure on each case (often multiple times).
- Compute metrics and compare to thresholds.
- Track results over time.
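The four steps above can be sketched as a small harness. This is an illustrative skeleton, not the Tactus implementation: `run_procedure` and `passes_checks` are hypothetical stand-ins for your workflow and your deterministic checks.

```python
def run_procedure(case):
    # Placeholder: invoke your actual workflow on one input here.
    return {"subject": "Recap: " + case["notes"][:20], "action_items": []}

def passes_checks(output):
    # Placeholder deterministic check: the required fields exist.
    return "subject" in output and "action_items" in output

def evaluate(dataset, runs=5, threshold=0.9):
    """Run each case several times, then compare the pass rate to a threshold."""
    results = []
    for case in dataset:
        for _ in range(runs):  # repeated runs expose inconsistency
            results.append(passes_checks(run_procedure(case)))
    pass_rate = sum(results) / len(results)
    return pass_rate, pass_rate >= threshold

dataset = [{"notes": "Alice will send the report by Friday."}]
rate, ok = evaluate(dataset, runs=3)
print(rate, ok)
```

Tracking results over time is then just persisting `rate` per dataset version and alerting when it drops.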
If you’re serious about deploying agent workflows, this becomes non-optional.
18.3 Running Evaluations in Tactus
Tactus supports evaluations via an Evaluations({ ... }) block in a .tac file and the tactus eval command:
tactus eval your-procedure.tac

To measure consistency, run each case multiple times:

tactus eval your-procedure.tac --runs 10

Evaluations are complementary to tactus test --runs 10:

- tactus test --runs 10 repeats BDD scenarios (good for “does it violate invariants?”).
- tactus eval --runs 10 repeats evaluation cases with scoring (good for “how good is it?”).
18.4 Metrics You Can Trust
Start with deterministic, non-model-graded metrics whenever possible:
- schema validity (did it produce structured output?)
- constraint violations (subject length, missing action items when action language is present)
- tool policy violations (send called without approval, wrong recipients, etc.)
- success rate (the fraction of runs that pass all checks)
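A minimal sketch of what such deterministic metrics look like in code. The field names and the tool-call log format are assumptions for illustration, not the actual Tactus schema.

```python
def schema_valid(output):
    """Schema validity: did it produce structured output with the required fields?"""
    return (isinstance(output, dict)
            and isinstance(output.get("subject"), str)
            and isinstance(output.get("action_items"), list))

def tool_policy_violations(tool_calls):
    """Count `send` calls that happened without a prior approval."""
    approved = False
    violations = 0
    for call in tool_calls:  # assumed: an ordered log of tool-call names
        if call == "approve":
            approved = True
        elif call == "send" and not approved:
            violations += 1
    return violations

def success_rate(results):
    """Fraction of runs passing every deterministic check."""
    passed = [r for r in results
              if schema_valid(r["output"])
              and tool_policy_violations(r["tool_calls"]) == 0]
    return len(passed) / len(results) if results else 0.0
```

Because these checks are plain code, they never drift and are cheap to run on every case.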
Model-graded metrics can be useful (readability, tone, faithfulness), but treat them like any other model:
- calibrate against human judgments
- keep prompts stable
- watch for metric drift
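Calibration against human judgments can start very simply. In this sketch, `model_grades` and `human_grades` are hypothetical parallel lists of pass/fail labels for the same outputs; agreement below your bar means the grading prompt needs work before you trust the metric.

```python
def agreement_rate(model_grades, human_grades):
    """Fraction of cases where the grader model matches the human label."""
    matches = sum(m == h for m, h in zip(model_grades, human_grades))
    return matches / len(human_grades)

print(agreement_rate([1, 1, 0, 1], [1, 0, 0, 1]))  # prints 0.75
```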
18.5 How It Connects to the Running Example
For the recap workflow, a small but effective evaluation dataset would include:
- short notes with a single clear action item
- long notes with multiple speakers and ambiguous actions
- notes with “action” language but missing an assignee (should produce “TBD” rather than invent)
- notes with sensitive content (make sure you don’t leak it into tool calls)
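The four categories above can be written down as plain test cases. The structure and field names here are illustrative, not Tactus's eval format:

```python
dataset = [
    # short notes, one clear action item
    {"notes": "Quick sync. Alice: ship the fix by Friday.",
     "expect_action_items": 1},
    # long, multi-speaker, ambiguous actions
    {"notes": "Planning call. Bob: 'we should probably look into caching "
              "at some point'. Carol raised the Q3 roadmap.",
     "expect_action_items": 1},
    # action language with no assignee: expect TBD, not an invented name
    {"notes": "Someone needs to update the runbook.",
     "expect_assignee": "TBD"},
    # sensitive content must not reach tool calls
    {"notes": "Discussed offer details for the candidate (confidential).",
     "expect_no_leak": True},
]
```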
The metrics you care about early are boring, but they prevent incidents:
- does it extract at least one action item when it should?
- does it keep subject/body within constraints?
- does it avoid hallucinating commitments that aren’t in the notes?
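These three checks can be expressed as a single function over one output. The `output` and `notes` shapes are assumptions for illustration; the hallucination guard here is deliberately crude (an assignee must either be “TBD” or literally appear in the notes):

```python
def check_recap(output, notes, should_have_actions,
                max_subject=78, max_body=2000):
    """Return a list of violations for one recap output (empty = pass)."""
    errors = []
    # 1. extracts at least one action item when it should
    if should_have_actions and not output["action_items"]:
        errors.append("missing action item")
    # 2. keeps subject/body within constraints
    if len(output["subject"]) > max_subject or len(output["body"]) > max_body:
        errors.append("length constraint violated")
    # 3. crude hallucination guard: the assignee must come from the notes
    for item in output["action_items"]:
        if item["assignee"] != "TBD" and item["assignee"] not in notes:
            errors.append(f"invented assignee: {item['assignee']}")
    return errors
```

A failing run returns a readable list of violations, which is easy to aggregate into a success rate.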
Once those are stable, you can add higher-level quality metrics (readability, tone, summarization quality).