18  Evaluations

BDD specs answer “can the workflow do the right thing?”

Evaluations answer the harder question:

How often does it do the right thing across real inputs?

This is where agent engineering starts to look like ML engineering: you need datasets, metrics, thresholds, and regression tracking.

18.1 Specs vs Evals (Use Both)

Think of the split like this:

  • Specs protect invariants: “never call the dangerous tool without approval,” “never double-send,” “output has required fields.”
  • Evals protect quality over a distribution: “is the recap readable,” “does it reliably extract action items,” “does it stay concise across many inputs.”

You want both because they fail differently:

  • Specs fail fast and loudly (great for safety policies).
  • Evals reveal drift and brittleness (great for quality and reliability).

18.2 The Evaluation Loop

An evaluation is just a loop with structure:

  1. Define a dataset of realistic inputs (messy notes, edge cases, adversarial-ish phrasing).
  2. Run the procedure on each case (often multiple times).
  3. Compute metrics and compare to thresholds.
  4. Track results over time.

If you’re serious about deploying agent workflows, this becomes non-optional.
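The four steps above can be sketched as a tiny harness. This is a conceptual illustration, not Tactus code: `run_procedure` is a stub standing in for whatever invokes your workflow, and the dataset shape is invented for the example.

```python
# Minimal evaluation loop: dataset -> repeated runs -> metric -> threshold.

def run_procedure(notes: str) -> dict:
    # Stub workflow: pretend it extracts one action item per "TODO" line.
    return {"action_items": [ln for ln in notes.splitlines() if "TODO" in ln]}

DATASET = [
    {"input": "Met with Sam.\nTODO: send recap", "min_items": 1},
    {"input": "General chat, no follow-ups.", "min_items": 0},
]

def evaluate(dataset, runs=3, threshold=0.9):
    passes = total = 0
    for case in dataset:
        for _ in range(runs):  # repeat each case to measure consistency
            output = run_procedure(case["input"])
            passes += len(output["action_items"]) >= case["min_items"]
            total += 1
    score = passes / total
    return score, score >= threshold

score, ok = evaluate(DATASET)
print(f"pass rate: {score:.2f}, meets threshold: {ok}")
```

Tracking `score` per dataset version over time is what turns this from a one-off check into regression tracking.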

18.3 Running Evaluations in Tactus

Tactus supports evaluations via an Evaluations({ ... }) block in a .tac file and the tactus eval command:


tactus eval your-procedure.tac

To measure consistency, run each case multiple times:

tactus eval your-procedure.tac --runs 10

Evaluations complement tactus test --runs:

  • tactus test --runs repeats BDD scenarios (good for “does it violate invariants?”).
  • tactus eval --runs repeats evaluation cases with scoring (good for “how good is it?”).

18.4 Metrics You Can Trust

Start with deterministic, non-model-graded metrics whenever possible:

  • schema validity (did it produce structured output?)
  • constraint violations (subject length, missing action items when action language is present)
  • tool policy violations (send called without approval, wrong recipients, etc.)
  • success rate (the fraction of cases that pass every check)
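Each of these deterministic metrics is just a plain predicate over the output. As an illustration (field names are invented, not the actual Tactus output schema), checks for one recap output might look like:

```python
# Deterministic checks for one recap output: no model grading involved.

def check_recap(output: dict) -> list[str]:
    violations = []
    # Schema validity: required fields present.
    for field in ("subject", "body", "action_items"):
        if field not in output:
            violations.append(f"missing field: {field}")
    # Constraint: subject stays within length budget.
    if len(output.get("subject", "")) > 80:
        violations.append("subject too long")
    # Tool policy: never send without approval.
    if output.get("sent") and not output.get("approved"):
        violations.append("sent without approval")
    return violations

good = {"subject": "Recap", "body": "...", "action_items": [],
        "approved": True, "sent": True}
print(check_recap(good))  # -> []
```

Because these checks are deterministic, a failure is always actionable: it points at a specific constraint, not a vibe.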

Model-graded metrics can be useful (readability, tone, faithfulness), but treat them like any other model:

  • calibrate against human judgments
  • keep prompts stable
  • watch for metric drift

18.5 How It Connects to the Running Example

For the recap workflow, a small but effective evaluation dataset would include:

  • short notes with a single clear action item
  • long notes with multiple speakers and ambiguous actions
  • notes with “action” language but missing an assignee (should produce “TBD” rather than invent one)
  • notes with sensitive content (make sure you don’t leak it into tool calls)
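A dataset like this can be written down as plain data, pairing each messy input with the deterministic expectations you can assert on. The shape below is illustrative, not a Tactus format:

```python
# Starter evaluation dataset for the recap workflow.

RECAP_CASES = [
    {
        "name": "single clear action item",
        "notes": "Short sync. TODO: Alice to send the deck by Friday.",
        "expect": {"min_action_items": 1},
    },
    {
        "name": "action language, no assignee",
        "notes": "We should follow up on pricing at some point.",
        "expect": {"assignee": "TBD"},  # must not invent a name
    },
    {
        "name": "sensitive content",
        "notes": "Offer details: $185k base. Do not share externally.",
        "expect": {"no_leak": ["185k"]},  # must never appear in tool calls
    },
]

for case in RECAP_CASES:
    print(case["name"], "->", sorted(case["expect"]))
```

Keeping expectations machine-checkable from the start means the same cases can feed both the eval harness and regression tracking.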

The metrics you care about early are boring—but they prevent incidents:

  • does it extract at least one action item when it should?
  • does it keep subject/body within constraints?
  • does it avoid hallucinating commitments that aren’t in the notes?

Once those are stable, you can add higher-level quality metrics (readability, tone, summarization quality).