13  Evaluations Reference

Tactus has two “evaluation” concepts: matchers (inline output checks) and Pydantic Evals (dataset-driven Evaluations({...}) blocks).

13.1 Matchers

Each matcher call is shorthand for a (type, value) pair:

contains("error")   -- ("contains", "error")
equals("done")      -- ("equals", "done")
matches("^OK:")     -- ("matches", "^OK:")
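As the comments above suggest, a matcher call simply yields its (type, value) pair, so a literal pair is presumably interchangeable wherever a matcher is accepted (an assumption; the matcher call form is the documented one):

contains("error")       -- documented matcher form
{"contains", "error"}   -- presumed equivalent literal pair (assumption)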

13.2 Pydantic Evals: Evaluations({ ... })

Minimal example (deterministic evaluators):

Procedure {
  input = {name = field.string{required = true}},
  output = {greeting = field.string{required = true}},
  function(input)
    return {greeting = "Hello, " .. input.name .. "!"}
  end
}

Evaluations({
  dataset = {
    {name = "alice", inputs = {name = "Alice"}, expected_output = {greeting = "Hello, Alice!"}}
  },
  evaluators = {
    field.equals_expected{},
    field.contains{},
    field.min_length{}
  },
  runs = 1,
  parallel = true
})

Run from the CLI:

tactus eval procedure.tac
tactus eval procedure.tac --runs 10

13.3 Evaluator Types

Evaluators are selected by type (the field.*{} helpers expand to evaluator configs).
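As an illustration of that expansion, a field.*{} call can be pictured as producing a plain evaluator config table; the key names below (type, value) are assumptions for illustration, not part of this reference:

field.contains{value = "Hello"}
-- presumably expands to a config along the lines of:
-- {type = "contains", value = "Hello"}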

Type                            Use
contains                        Output (or field) contains a substring
contains_any                    Output contains any of N strings
equals_expected / exact_match   Output equals expected_output
is_instance                     Type check
min_length / max_length         String length constraints
llm_judge                       LLM-as-judge scoring
regex                           Regex match
json_schema                     Validate JSON-like structure
range                           Numeric bounds
tool_called                     Assert tool usage from trace
state_check                     Assert state value from trace
agent_turns                     Assert agent turn counts from trace
max_iterations                  Guardrail for loops
max_cost / max_tokens           Resource constraints
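A hedged sketch of how a few of these types might be parameterized inside an evaluators list; every parameter name here (pattern, min, max, name) is an assumption for illustration, not confirmed by this reference:

evaluators = {
  field.regex{pattern = "^OK:"},        -- assumed parameter name
  field.range{min = 0, max = 100},      -- assumed parameter names
  field.tool_called{name = "search"}    -- assumed parameter name
}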

13.4 Thresholds (Quality Gates)

Thresholds set pass/fail limits for the evaluation run as a whole:

Evaluations({
  dataset = {...},
  evaluators = {...},
  thresholds = {
    min_success_rate = 0.90,
    max_cost_per_run = 0.01,
    max_duration = 10.0,
    max_tokens_per_run = 500
  }
})
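
Thresholds make eval runs usable as CI gates. Assuming the CLI signals a threshold failure through a nonzero exit status (an assumption, not stated above), a pipeline step could be as simple as:

tactus eval procedure.tac --runs 10 && echo "quality gates passed"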