2 Transparent Durability

Human-in-the-loop workflows pause for real people. Multi-turn agents can take minutes. Some procedures run overnight, and any of them can crash partway through. None of that works unless a procedure can stop for minutes or days and then resume correctly, with its state intact.

That’s what transparent durability is for. You write normal imperative code, and the runtime checkpoints every agent turn, tool call, and human interaction so your workflow never loses its place.

This chapter explains how it works.

2.1 The Core Idea: Every Operation Is Checkpointed

When you write:

local max_turns = 10
local turn_count = 0

done.reset()

while not done.called() and turn_count < max_turns do
    turn_count = turn_count + 1
    greeter()
end

Each agent call (for example, greeter()) creates a checkpoint. The checkpoint includes:

The complete conversation history (stored in a DSPy-compatible History format)
All tool call results
The current procedure state
Where in the code execution should resume

If the process crashes after turn 5, the runtime doesn’t re-execute turns 1-5. It replays their checkpointed results and resumes at turn 6.
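To make the replay idea concrete, here is a minimal sketch in plain Lua of how a runtime could journal results and skip completed work. It is not the Tactus implementation: `new_runtime`, `checkpointed`, and the in-memory `journal` table are hypothetical names, and a real runtime would persist the journal outside the process.

```lua
-- Hypothetical replay journal. None of these names are Tactus APIs.
local function new_runtime(journal)
  local rt = { journal = journal or {}, position = 0, executed = 0 }

  -- Run `fn` as a checkpointed operation.
  function rt.checkpointed(fn)
    rt.position = rt.position + 1
    local entry = rt.journal[rt.position]
    if entry then
      return entry.result            -- replay: restore, do not re-execute
    end
    local result = fn()              -- first visit: actually execute
    rt.executed = rt.executed + 1
    rt.journal[rt.position] = { result = result }
    return result
  end

  return rt
end

-- The procedure: ten turns. `crash_after` simulates a process crash.
local function procedure(rt, crash_after)
  local replies = {}
  for turn = 1, 10 do
    replies[turn] = rt.checkpointed(function()
      return "reply " .. turn        -- stand-in for an agent call
    end)
    if crash_after and turn == crash_after then
      error("simulated crash")
    end
  end
  return replies
end

-- First process: crashes right after turn 5 is checkpointed.
local first = new_runtime()
local ok = pcall(procedure, first, 5)

-- Second process: a fresh runtime that shares only the journal.
local second = new_runtime(first.journal)
local replies = procedure(second)

-- ok is false; each process executed 5 turns; all 10 replies are present.
print(ok, first.executed, second.executed, #replies)
```

The second process runs the same code from the top, but turns 1-5 return their journaled results immediately and only turns 6-10 do real work.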

2.2 Seeing It in Action

The key insight: replay is fast and cheap. Checkpointed operations don’t re-execute, so there are no repeated agent turns, tool calls, or human prompts; they simply restore their previous results. Only operations after the last checkpoint actually run.

You can see this clearly even in a pure deterministic loop (excerpted from code/chapter-02/10-feature-state.tac):

Log.info("Starting state management example")

-- Initialize state
state.items_processed = 0

-- Process items and track count
for i = 1, 5 do
  State.increment("items_processed")
  Log.info("Processing item", {number = i})
end

-- Retrieve final state
local final_count = state.items_processed
Log.info("Completed processing", {total = final_count})

If the process restarts mid-loop, the runtime replays the completed work up to the last checkpoint and resumes from the next operation—without repeating the already-completed steps.

2.3 What Gets Checkpointed?

Tactus automatically checkpoints these operations:

| Operation | What’s Saved |
|---|---|
| `greeter()` | Conversation history, response, tool calls |
| `Human.approve()` | The approval request and human’s response |
| `Human.input()` | The input request and human’s response |
| `Procedure.run()` | Sub-procedure inputs, outputs, and state |
| `Model("sentiment")({text = "..."})` | Model inputs and outputs |

State changes (state.* assignment, State.increment) and log entries are also tracked, ensuring consistent replay.
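One way a runtime could track plain `state.*` assignments is a Lua proxy table whose metatable records every write. The sketch below illustrates that general technique only. `new_tracked_state` and `increment` are hypothetical helpers, and nothing here is taken from the Tactus source.

```lua
-- Hypothetical sketch of tracked state. Not the Tactus implementation.
local function new_tracked_state(journal)
  local values = {}                  -- real storage, hidden behind the proxy
  return setmetatable({}, {
    __index = function(_, key)
      return values[key]
    end,
    -- The proxy itself stays empty, so every assignment lands here.
    __newindex = function(_, key, value)
      journal[#journal + 1] = { key = key, value = value }
      values[key] = value
    end,
  })
end

-- Stand-in for an increment helper, built on ordinary reads and writes.
local function increment(state, key, amount)
  state[key] = (state[key] or 0) + (amount or 1)
  return state[key]
end

local journal = {}
local state = new_tracked_state(journal)

state.items_processed = 0
for _ = 1, 5 do
  increment(state, "items_processed")
end

-- A "new process" can rebuild the same state purely from the journal.
local restored = {}
for _, change in ipairs(journal) do
  restored[change.key] = change.value
end

-- Final value 5, six journaled writes (one init, five increments), restored value 5.
print(state.items_processed, #journal, restored.items_processed)
```

Because every write goes through `__newindex`, the calling code keeps using ordinary assignment syntax while the journal captures enough to reproduce the state after a restart.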

2.4 Human-in-the-Loop: Where Durability Shines

The most powerful application of transparent durability is human-in-the-loop workflows. A procedure can run, do some work, and then suspend at Human.approve() (or Human.input()) until a person responds—minutes, hours, or days later.

When Human.approve() is called:

The runtime checkpoints the current state
Execution suspends (the process can exit)
The HITL request is delivered to the human (via web UI, Slack, email, etc.)
Hours pass…
The human responds
The runtime resumes from the checkpoint
Human.approve() returns the human’s response to the calling code (for example, into a local variable named approved)
Execution continues

The process doesn’t need to stay running. That means you don’t burn compute resources “waiting on a human,” and you don’t have to keep the same runtime environment alive until the end. The checkpoint persists. When the human finally responds, a new process can pick up exactly where the old one left off.
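The suspend-and-resume sequence above can be sketched in plain Lua. This is an illustration under stated assumptions, not how Tactus is built: `new_run`, `approve`, the `SUSPEND` sentinel, and the `outbox` table (a stand-in for delivery via web UI, Slack, or email) are all hypothetical.

```lua
-- Hypothetical suspend/resume sketch. None of these names are Tactus APIs.
local SUSPEND = {}                   -- unique sentinel raised to suspend a run

local function new_run(journal, outbox)
  local run = { position = 0 }

  function run.approve(message)
    run.position = run.position + 1
    local entry = journal[run.position]
    if entry == nil then
      -- First visit: checkpoint the pending request, deliver it, suspend.
      journal[run.position] = { request = message, answered = false }
      outbox[#outbox + 1] = message
      error(SUSPEND)
    elseif not entry.answered then
      error(SUSPEND)                 -- still waiting; do not deliver twice
    end
    return entry.response            -- answered: replay the recorded response
  end

  return run
end

local function procedure(run)
  local approved = run.approve("Deploy to production?")
  if approved then
    return "deployed"
  end
  return "cancelled"
end

-- Each call models a separate OS process that shares only the journal.
local function start_process(journal, outbox)
  local ok, result = pcall(procedure, new_run(journal, outbox))
  if ok then return "done", result end
  if result == SUSPEND then return "suspended" end
  error(result)                      -- a real failure: re-raise it
end

local journal, outbox = {}, {}

local status1 = start_process(journal, outbox)   -- suspends, process can exit

-- Hours later the human responds; the answer is stored with the request.
journal[1].response = true
journal[1].answered = true

local status2, result2 = start_process(journal, outbox)

-- First process suspended, second finished with "deployed", one request sent.
print(status1, status2, result2, #outbox)
```

Nothing stays in memory between the two `start_process` calls except the journal, which is why a completely new process can finish the work.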

2.5 Example: Durable State

This runnable example is the full file behind the excerpt in Section 2.2 (copied directly from the main Tactus repo). It is simple on purpose: it demonstrates that state updates, logs, and control flow are part of durable execution.

-- State Management Example
-- Demonstrates setting, getting, and incrementing state values

-- No agents are needed for this example.

-- Procedure with outputs defined inline
Procedure {
    output = {
        success = field.boolean{required = true, description = "Whether the workflow completed successfully"},
        message = field.string{required = true, description = "Status message"},
        count = field.number{required = true, description = "Final count of processed items"},
    },
    function(input)
        Log.info("Starting state management example")

        -- Initialize state
        state.items_processed = 0

        -- Process items and track count
        for i = 1, 5 do
            State.increment("items_processed")
            Log.info("Processing item", {number = i})
        end

        -- Retrieve final state
        local final_count = state.items_processed
        Log.info("Completed processing", {total = final_count})

        return {
            success = true,
            message = "State management example completed successfully",
            count = final_count
        }
    end
}

-- BDD Specifications
Specifications([[
Feature: State Management
  Demonstrate state operations in Tactus workflows

  Scenario: State operations work correctly
    Given the procedure has started
    When the procedure runs
    Then the procedure should complete successfully
    And the state items_processed should be 5
    And the output success should be True
    And the output count should be 5
]])

Run it in mock mode (no API key needed):

tactus test code/chapter-02/10-feature-state.tac --mock

2.6 The Durability Guarantee

You’ll sometimes hear Tactus summarized as “AI agents that never lose their place.” Here’s what that means in practice:

Crashes: Execution resumes from the last checkpoint
Timeouts: Partial work is preserved, and the procedure completes later
Human delays: Execution suspends and resumes when humans respond
Process restarts: New processes pick up where old ones left off

Your agent code is just code. The complexity of durability is handled by the runtime.

Durability is one dimension of reliability. Another is predictability—knowing your agent will follow the workflow you designed. That’s why Tactus uses imperative orchestration code: you write the control flow, and the runtime guarantees it executes as written. We’ll see more of this in the orchestration patterns throughout the book.

2.7 How This Differs from Manual Checkpointing

You might think: “I could build this myself with a database and some serialization.”

You could. But consider what you’d need:

Checkpoint schema: Define what to save at each step
Serialization: Convert complex objects (conversation history, tool results) to storable format
Storage layer: Database, file system, or cloud storage
Resume detection: Know when starting fresh vs. resuming
Replay logic: Skip completed work, restore state
Consistency: Ensure checkpoints are atomic, handle partial failures
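For a sense of scale, here is a hand-rolled version of that list for the simplest possible case: a five-item loop with one counter. It is a plain-Lua sketch with hypothetical names (`save`, `load_checkpoint`, `run`), a temp file as the storage layer, and a two-integer checkpoint schema. A real agent workflow would also have to serialize conversation history and tool results.

```lua
-- Hand-rolled checkpointing for one trivial loop. All names are illustrative.
local path = os.tmpname()            -- storage layer: a single local file
os.remove(path)                      -- some platforms create the file; start clean

-- Checkpoint schema and serialization: two integers on one line.
local function save(checkpoint)
  local tmp = path .. ".tmp"
  local f = assert(io.open(tmp, "w"))
  f:write(string.format("%d %d\n", checkpoint.next_item, checkpoint.count))
  f:close()
  -- Consistency: write a temp file, then rename it over the old checkpoint.
  -- (Replacing an existing file this way is atomic on POSIX; Windows differs.)
  assert(os.rename(tmp, path))
end

-- Resume detection: no file means a fresh start.
local function load_checkpoint()
  local f = io.open(path, "r")
  if not f then
    return { next_item = 1, count = 0 }
  end
  local line = f:read()              -- default format reads one line
  f:close()
  local next_item, count = line:match("^(%d+) (%d+)$")
  return { next_item = tonumber(next_item), count = tonumber(count) }
end

-- Replay logic: start the loop at the first item not yet completed.
local function run(crash_at)
  local cp = load_checkpoint()
  for item = cp.next_item, 5 do
    if item == crash_at then
      error("simulated crash")
    end
    cp.count = cp.count + 1          -- the actual "work"
    cp.next_item = item + 1
    save(cp)
  end
  return cp.count
end

local ok = pcall(run, 4)             -- items 1-3 complete, crash before item 4
local total = run()                  -- new run resumes at item 4
os.remove(path)

-- ok is false and total is 5: no item was processed twice.
print(ok, total)
```

Even this toy version has to thread checkpoint logic through the loop itself, which is exactly the coupling a durable runtime removes.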

Tactus does all of this automatically. You focus on your agent’s behavior; the runtime handles persistence.

2.8 Looking Ahead

You now understand what transparent durability does and why it matters. In the next chapter, we’ll explore another key design principle: everything as code—keeping prompts, tools, orchestration, and tests together as a readable artifact.