2 Transparent Durability
Human-in-the-loop makes workflows pause for real humans. Multi-turn agents can take minutes. Some procedures run overnight. That only works if your procedure can stop for minutes or days and then resume correctly.
That’s what transparent durability is for. You write normal imperative code, and the runtime checkpoints every agent turn, tool call, and human interaction so your workflow never loses its place.
Real-world agents can’t always run to completion. They wait for human approval. They survive crashes. They run for days. State must persist across interruptions.
This chapter explains how it works.
2.1 The Core Idea: Every Operation Is Checkpointed
When you write:
```lua
local max_turns = 10
local turn_count = 0
done.reset()
while not done.called() and turn_count < max_turns do
    turn_count = turn_count + 1
    greeter()
end
```

Each agent call (for example, greeter()) creates a checkpoint. The checkpoint includes:
- The complete conversation history (stored in a DSPy-compatible `History` format)
- All tool call results
- The current procedure state
- Where in the code execution should resume
If the process crashes after turn 5, the runtime doesn’t re-execute turns 1-5. It replays their checkpointed results and resumes at turn 6.
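To make the replay idea concrete, here is a minimal sketch in plain Python rather than Tactus. It assumes a journal that stores results by call position; the names `Journal`, `durable`, and `agent_turn` are hypothetical and are not part of the Tactus API. The real runtime may work quite differently.

```python
class Journal:
    """Hypothetical replay journal: results are stored by call position."""

    def __init__(self, entries=None):
        self.entries = list(entries or [])  # results saved by earlier runs
        self.cursor = 0                     # position of the next durable call

    def durable(self, operation):
        """Return the saved result if one exists; otherwise run and record."""
        if self.cursor < len(self.entries):
            result = self.entries[self.cursor]  # replay: no re-execution
        else:
            result = operation()                # first time: actually run
            self.entries.append(result)         # checkpoint the result
        self.cursor += 1
        return result


executed = []  # records which turns really ran, for illustration only


def agent_turn(n):
    executed.append(n)
    return f"reply {n}"


def procedure(journal, max_turns, crash_after=None):
    replies = []
    for turn in range(1, max_turns + 1):
        replies.append(journal.durable(lambda: agent_turn(turn)))
        if crash_after == turn:
            raise RuntimeError("simulated crash")
    return replies


# First run crashes after turn 5; the five results are already journaled.
first = Journal()
try:
    procedure(first, max_turns=7, crash_after=5)
except RuntimeError:
    pass

# Second run starts from the top, but turns 1-5 are replayed, not re-run.
executed.clear()
second = Journal(first.entries)
replies = procedure(second, max_turns=7)
print(executed)  # [6, 7]
```

The procedure itself contains no resume logic. It always starts from the first line, and the journal decides whether each call runs or replays. This only works if the code reaches its durable calls in the same order on every run.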
2.2 Seeing It in Action
The key insight: replay is fast and free. Checkpointed operations don’t re-execute; they restore their previous results, so replaying them repeats no model calls, tool calls, or human requests. Only operations after the last checkpoint actually run.
You can see this clearly even in a pure deterministic loop (excerpted from code/chapter-02/10-feature-state.tac):
```lua
Log.info("Starting state management example")

-- Initialize state
state.items_processed = 0

-- Process items and track count
for i = 1, 5 do
    State.increment("items_processed")
    Log.info("Processing item", {number = i})
end

-- Retrieve final state
local final_count = state.items_processed
Log.info("Completed processing", {total = final_count})
```

If the process restarts mid-loop, the runtime replays the completed work up to the last checkpoint and resumes from the next operation, without repeating the already-completed steps.
2.3 What Gets Checkpointed?
Tactus automatically checkpoints these operations:
| Operation | What’s Saved |
|---|---|
| `greeter()` | Conversation history, response, tool calls |
| `Human.approve()` | The approval request and human’s response |
| `Human.input()` | The input request and human’s response |
| `Procedure.run()` | Sub-procedure inputs, outputs, and state |
| `Model("sentiment")({text = "..."})` | Model inputs and outputs |
State changes (state.* assignment, State.increment) and log entries are also tracked, ensuring consistent replay.
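As a rough illustration of what one saved entry per operation could look like, the Python sketch below models a checkpoint record and round-trips it through JSON. The `CheckpointRecord` class and its field names are assumptions made for this example; they do not describe the actual Tactus storage format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class CheckpointRecord:
    """Hypothetical record shape, not the real Tactus schema."""
    position: int  # index of the operation in execution order
    kind: str      # e.g. "agent", "human_approve", "procedure", "model"
    request: dict  # what was asked (inputs, approval prompt, ...)
    result: dict   # what came back (response, tool calls, outputs, ...)


def dump(records):
    """Serialize records to a JSON string for storage."""
    return json.dumps([asdict(r) for r in records])


def load(text):
    """Rebuild records from a stored JSON string."""
    return [CheckpointRecord(**item) for item in json.loads(text)]


records = [
    CheckpointRecord(0, "agent", {"name": "greeter"},
                     {"response": "Hello!", "tool_calls": []}),
    CheckpointRecord(1, "human_approve", {"message": "Send it?"},
                     {"approved": True}),
]
restored = load(dump(records))
print(restored == records)  # True
```

Storing both the request and the result for each operation is what lets a replay hand back the earlier answer without running the operation again.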
2.4 Human-in-the-Loop: Where Durability Shines
The most powerful application of transparent durability is human-in-the-loop workflows. A procedure can run, do some work, and then suspend at Human.approve() (or Human.input()) until a person responds—minutes, hours, or days later.
When Human.approve() is called:
- The runtime checkpoints the current state
- Execution suspends (the process can exit)
- The HITL request is delivered to the human (via web UI, Slack, email, etc.)
- Hours pass…
- The human responds
- The runtime resumes from the checkpoint
- `approved` (the value returned by Human.approve()) receives the human’s response
- Execution continues
The process doesn’t need to stay running. That means you don’t burn compute resources “waiting on a human,” and you don’t have to keep the same runtime environment alive until the end. The checkpoint persists. When the human finally responds, a new process can pick up exactly where the old one left off.
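The suspend-and-resume cycle described above can be sketched in plain Python with a file standing in for the checkpoint store. Everything here is hypothetical: `human_approve`, the `Suspended` exception, and the JSON file layout are illustration only and are not how Tactus implements `Human.approve()`.

```python
import json
import os
import tempfile


class Suspended(Exception):
    """Raised to end the current process while a human request is pending."""


def human_approve(store_path, message):
    """Hypothetical stand-in for a durable approval call."""
    state = {}
    if os.path.exists(store_path):
        with open(store_path) as f:
            state = json.load(f)
    if "response" in state:
        return state["response"]      # resume: the human has answered
    with open(store_path, "w") as f:  # checkpoint the pending request
        json.dump({"pending": message}, f)
    raise Suspended(message)          # the process may now exit


def procedure(store_path):
    approved = human_approve(store_path, "Deploy to production?")
    return "deployed" if approved else "cancelled"


store = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

# Process 1: runs until the approval request, then suspends.
try:
    procedure(store)
    outcome_1 = "finished"
except Suspended:
    outcome_1 = "suspended"

# Later, some delivery channel records the human's answer.
with open(store, "w") as f:
    json.dump({"pending": "Deploy to production?", "response": True}, f)

# Process 2: a fresh run of the same code picks up the stored response.
outcome_2 = procedure(store)
print(outcome_1, outcome_2)  # suspended deployed
```

Nothing is held in memory between the two runs. The only link between them is the stored checkpoint, which is why the second run could happen in a different process or on a different machine.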
2.5 Example: Durable State
This runnable example (copied directly from the main Tactus repo) is simple on purpose: it demonstrates that state updates, logs, and control flow are part of durable execution.
```lua
-- State Management Example
-- Demonstrates setting, getting, and incrementing state values
-- No agents are needed for this example.

-- Procedure with outputs defined inline
Procedure {
    output = {
        success = field.boolean{required = true, description = "Whether the workflow completed successfully"},
        message = field.string{required = true, description = "Status message"},
        count = field.number{required = true, description = "Final count of processed items"},
    },
    function(input)
        Log.info("Starting state management example")

        -- Initialize state
        state.items_processed = 0

        -- Process items and track count
        for i = 1, 5 do
            State.increment("items_processed")
            Log.info("Processing item", {number = i})
        end

        -- Retrieve final state
        local final_count = state.items_processed
        Log.info("Completed processing", {total = final_count})

        return {
            success = true,
            message = "State management example completed successfully",
            count = final_count
        }
    end
}

-- BDD Specifications
Specifications([[
Feature: State Management
  Demonstrate state operations in Tactus workflows

  Scenario: State operations work correctly
    Given the procedure has started
    When the procedure runs
    Then the procedure should complete successfully
    And the state items_processed should be 5
    And the output success should be True
    And the output count should be 5
]])
```

Run it in mock mode (no API key needed):

```bash
tactus test code/chapter-02/10-feature-state.tac --mock
```

2.6 The Durability Guarantee
You’ll sometimes hear Tactus summarized as “AI agents that never lose their place.” Here’s what that means in practice:
- Crashes: Execution resumes from the last checkpoint
- Timeouts: Partial work is preserved, completion continues later
- Human delays: Execution suspends and resumes when humans respond
- Process restarts: New processes pick up where old ones left off
Your agent code is just code. The complexity of durability is handled by the runtime.
Durability is one dimension of reliability. Another is predictability—knowing your agent will follow the workflow you designed. That’s why Tactus uses imperative orchestration code: you write the control flow, and the runtime guarantees it executes as written. We’ll see more of this in the orchestration patterns throughout the book.
2.7 How This Differs from Manual Checkpointing
You might think: “I could build this myself with a database and some serialization.”
You could. But consider what you’d need:
- Checkpoint schema: Define what to save at each step
- Serialization: Convert complex objects (conversation history, tool results) to storable format
- Storage layer: Database, file system, or cloud storage
- Resume detection: Know when starting fresh vs. resuming
- Replay logic: Skip completed work, restore state
- Consistency: Ensure checkpoints are atomic, handle partial failures
Tactus does all of this automatically. You focus on your agent’s behavior; the runtime handles persistence.
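For a sense of the manual work involved, here is a minimal hand-rolled version in Python covering only three items from the list: storage, resume detection, and atomic writes. The file layout and the function names (`save_checkpoint`, `load_checkpoint`, `run`) are invented for this sketch. A production version would also need a schema, richer serialization, and replay of non-trivial state.

```python
import json
import os
import tempfile


def save_checkpoint(path, data):
    """Write atomically: a crash mid-write never leaves a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(data, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path):
    """Resume detection: no file means a fresh start."""
    if not os.path.exists(path):
        return {"next_step": 0, "results": []}
    with open(path) as f:
        return json.load(f)


def run(path, steps, fail_at=None):
    checkpoint = load_checkpoint(path)
    for index in range(checkpoint["next_step"], len(steps)):
        if fail_at == index:
            raise RuntimeError("simulated crash")
        checkpoint["results"].append(steps[index]())
        checkpoint["next_step"] = index + 1
        save_checkpoint(path, checkpoint)  # persist after every step
    return checkpoint["results"]


path = os.path.join(tempfile.mkdtemp(), "manual.json")
steps = [lambda: "a", lambda: "b", lambda: "c"]

try:
    run(path, steps, fail_at=2)  # completes steps 0 and 1, then crashes
except RuntimeError:
    pass

results = run(path, steps)       # resumes at step 2
print(results)  # ['a', 'b', 'c']
```

Even this small version forces the workflow into an explicit list of steps with a saved index. That restructuring is the cost that transparent durability is meant to remove.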
2.8 Looking Ahead
You now understand what transparent durability does and why it matters. In the next chapter, we’ll explore another key design principle: everything as code—keeping prompts, tools, orchestration, and tests together as a readable artifact.