13 Evaluations Reference
Tactus has two “evaluation” concepts:
- Matchers (`contains(...)`, `equals(...)`, `matches(...)`) for lightweight checks (often in BDD).
- Pydantic Evals via `Evaluations({...})` and `tactus eval ...` for dataset-style quality evaluation.
13.1 Matchers
Each matcher expands to a `(type, value)` pair:

contains("error") -- ("contains", "error")
equals("done") -- ("equals", "done")
matches("^OK:") -- ("matches", "^OK:")

13.2 Pydantic Evals: Evaluations({ ... })
Minimal example (deterministic evaluators):
Procedure {
input = {name = field.string{required = true}},
output = {greeting = field.string{required = true}},
function(input)
return {greeting = "Hello, " .. input.name .. "!"}
end
}
Evaluations({
dataset = {
{name = "alice", inputs = {name = "Alice"}, expected_output = {greeting = "Hello, Alice!"}}
},
evaluators = {
field.equals_expected{},
field.contains{},
field.min_length{}
},
runs = 1,
parallel = true
})

Run from the CLI:
tactus eval procedure.tac
tactus eval procedure.tac --runs 10

13.3 Evaluator Types
Evaluators are selected by type (the `field.*{}` helpers expand to evaluator configs).
| Type | Use |
|---|---|
| `contains` | Output (or field) contains a substring |
| `contains_any` | Output contains any of N strings |
| `equals_expected` / `exact_match` | Output equals expected_output |
| `is_instance` | Type check |
| `min_length` / `max_length` | String length constraints |
| `llm_judge` | LLM-as-judge scoring |
| `regex` | Regex match |
| `json_schema` | Validate JSON-like structure |
| `range` | Numeric bounds |
| `tool_called` | Assert tool usage from trace |
| `state_check` | Assert state value from trace |
| `agent_turns` | Assert agent turn counts from trace |
| `max_iterations` | Guardrail for loops |
| `max_cost` / `max_tokens` | Resource constraints |
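The trace-based and guardrail types at the bottom of the table can be mixed freely with the deterministic ones at the top. As a sketch (the `field.regex{}` and `field.tool_called{}` helper names are assumed to mirror the type names, following the `field.contains{}` pattern shown earlier; any parameters these helpers take are not documented here):

```
Evaluations({
    dataset = {
        {name = "alice", inputs = {name = "Alice"}, expected_output = {greeting = "Hello, Alice!"}}
    },
    evaluators = {
        field.equals_expected{},  -- deterministic: output must equal expected_output
        field.regex{},            -- assumed helper for the `regex` type
        field.tool_called{}       -- assumed helper for the trace-based `tool_called` type
    }
})
```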
13.4 Thresholds (Quality Gates)
Evaluations({
dataset = {...},
evaluators = {...},
thresholds = {
min_success_rate = 0.90,
max_cost_per_run = 0.01,
max_duration = 10.0,
max_tokens_per_run = 500
}
})
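Putting the pieces together, a complete eval file might look like the following sketch. It only combines constructs shown in this chapter, and the threshold values are illustrative:

```
Procedure {
    input = {name = field.string{required = true}},
    output = {greeting = field.string{required = true}},
    function(input)
        return {greeting = "Hello, " .. input.name .. "!"}
    end
}

Evaluations({
    dataset = {
        {name = "alice", inputs = {name = "Alice"}, expected_output = {greeting = "Hello, Alice!"}}
    },
    evaluators = {
        field.equals_expected{}
    },
    runs = 10,
    parallel = true,
    thresholds = {
        min_success_rate = 0.90,  -- illustrative gate values
        max_cost_per_run = 0.01,
        max_duration = 10.0
    }
})
```

Run it with `tactus eval procedure.tac`; the thresholds then act as quality gates over the aggregated runs.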