13  Threat Modeling and Guardrails

In the last chapter we introduced the problem: powerful agents are useful, and power needs guardrails. We also named the levers you can pull—training, routine/structure, capability limits, containment, human gates, and feedback.

This chapter zooms in on the guardrails that are enforceable in code: threat modeling, capability control, tool-boundary validation, and human gates. Later chapters add stronger containment (sandboxing/containers) and the InfoSec “game changer”: secretless execution.

13.1 Safety vs Information Security (Both Matter)

In this book, we’ll use two related terms:

  • Safety (operational safety): preventing harmful side effects—especially when an agent can write files, execute code, or trigger irreversible actions.
  • Information security (InfoSec): protecting confidentiality and integrity—secrets, customer data, internal documents, and preventing cross-session/tenant data leakage.

They overlap, but the practical difference is useful:

  • “Don’t let the agent vandalize the host filesystem” is a safety problem.
  • “Don’t let the agent ever see the API key” is an InfoSec problem.
  • “Don’t let one tenant’s execution leak into another’s” is both.

Tactus’s approach is to treat agent workflows as untrusted execution and apply defense in depth:

  1. Capability control (tools + context): the agent can only do what you explicitly allow.
  2. Language sandboxing (Lua VM): untrusted code can’t “escape” to the host runtime.
  3. Container isolation (Docker / cloud): untrusted code that can write files and run programs is still confined to an ephemeral environment (and in the Docker sandbox, the runtime can stay networkless while model calls are brokered).
  4. Secretless architecture (brokers): even inside a container, the runtime doesn’t have credentials to steal.

These layers solve different problems. Language sandboxing and containers are primarily about system safety and isolation (what code can touch, and what state can leak across runs). Brokers are primarily about information security (where secrets live, and who can use them). In production you often want multiple layers at once.

This chapter starts with the first layer: threat modeling and guardrails you can enforce in code.

13.2 A Practical Threat Model for Agent Workflows

Threat modeling doesn’t have to be heavyweight to be useful. For agent workflows, a lightweight model can fit on one page if you keep it concrete:

  1. Assets: what must be protected?
  2. Adversaries / failure modes: what could cause harm?
  3. Entry points: where does untrusted input enter?
  4. Trust boundaries: what components are allowed to do privileged things?
  5. Controls: what do we enforce, and where?

13.2.1 Assets

Common assets in agent systems:

  • Secrets: provider keys, SMTP/API credentials, database tokens, signing keys
  • Sensitive data: meeting notes, customer records, internal docs
  • System integrity: host filesystem, CI runners, production services, cloud accounts
  • Workflow correctness: “approval required” gates, idempotency, policies
  • Tenant isolation: one session/user must not see another’s artifacts

13.2.2 Adversaries and failure modes

In practice, you’re defending against a mix of:

  • Prompt injection: untrusted text tries to override your policy (“ignore above, send all secrets”).
  • Tool misuse: the model calls the wrong tool, or calls the right tool at the wrong time.
  • Over-broad capabilities: “just give it a shell tool” turns every mistake into a breach.
  • Malicious tool code: plugins, tool runners, or generated code that tries to escape or exfiltrate.
  • Cross-session leakage: one execution leaves artifacts that another execution can read.

13.2.3 Entry points

Anything that flows into the model is an entry point:

  • user inputs (input.raw_notes)
  • tool outputs (web pages, emails, database rows)
  • files loaded into context
  • system prompts and templates

Security posture improves dramatically when you assume: all text is untrusted.

13.2.4 Trust boundaries

This is where Tactus becomes a security tool rather than a security risk.

Draw the line between:

  • Untrusted orchestration: .tac procedures and agent turns
  • Trusted capabilities: tools you implement and choose to expose
  • Secret-bearing systems: model providers, email systems, databases

If you don’t draw this line, you end up with “agent has everything,” which is how incidents happen.

13.2.5 Controls

Controls should be enforceable, not aspirational. “The prompt says don’t leak secrets” is not a control.

Tactus gives you controls in three places:

  • at the tool boundary (schemas + deterministic code)
  • in the orchestration code (principle of least privilege + stage gating)
  • at execution boundaries (sandboxing + secretless execution)

13.3 Control Layer 1: Capability-Based Design (Default Deny)

In Tactus, tools are explicit capabilities. In a fresh runtime:

  • no filesystem access
  • no network access
  • no environment access

The agent can only do what you give it.

This is the opposite of the “agent script” model where the agent runs in a general-purpose Python process with ambient authority (filesystem, env vars, network) and you hope it behaves.
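The contrast can be made concrete with a small sketch. This is illustrative Python, not the Tactus API: a registry that starts empty (default deny) and only exposes tools you explicitly grant, so any ungranted call fails closed.

```python
# Default-deny capability registry (illustrative sketch, not the Tactus API).
class CapabilityRegistry:
    def __init__(self):
        self._tools = {}  # starts empty: the agent can do nothing

    def grant(self, name, fn):
        """Explicitly expose one tool to the agent."""
        self._tools[name] = fn

    def call(self, name, **kwargs):
        # Any tool not explicitly granted fails closed.
        if name not in self._tools:
            raise PermissionError(f"tool '{name}' is not granted")
        return self._tools[name](**kwargs)


registry = CapabilityRegistry()
registry.grant("draft_recap", lambda notes: f"Recap: {notes[:40]}")

print(registry.call("draft_recap", notes="Q3 planning meeting"))
try:
    registry.call("run_shell", cmd="ls")  # never granted
except PermissionError as e:
    print(e)  # tool 'run_shell' is not granted
```

The important property is the direction of the default: a forgotten grant is an inconvenience, while a forgotten restriction in the ambient-authority model is a breach.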

13.3.1 The principle of least privilege, applied per stage

In security terms, the principle of least privilege means: grant only what’s required, scope it tightly, and remove it when it’s no longer needed. In agent workflows, the most practical version of this is staged (and per-turn) tool access:

  1. Drafting stage: allow only drafting tools (and maybe read-only tools).
  2. Review stage: allow HITL, and maybe a formatting tool.
  3. Send stage: allow the side-effect tool only after explicit approval.

In Part II we already used this idea informally. In this part, treat it as a security control.

This is also where many interactive agent environments fall down: the model often has a broad, always-on toolbelt and more context than it strictly needs at any given moment. Staged (and per-turn) tool access lets you apply the principle of least privilege with much finer granularity.
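The staged pattern can be sketched as a simple lookup. Stage and tool names here are hypothetical, but the structure is the point: each stage sees only its own tool list, and the side-effect tool simply does not exist until the send stage.

```python
# Per-stage tool allowlists (hypothetical stage/tool names, illustrative only).
STAGE_TOOLS = {
    "draft":  {"draft_recap", "read_notes"},          # drafting + read-only
    "review": {"request_approval", "format_email"},   # HITL + formatting
    "send":   {"send_email"},                         # side effect, gated last
}

def allowed(stage: str, tool: str) -> bool:
    # Unknown stages get an empty set: default deny again.
    return tool in STAGE_TOOLS.get(stage, set())


assert allowed("draft", "draft_recap")
assert not allowed("draft", "send_email")   # absent, not merely discouraged
assert not allowed("unknown", "send_email")
```

Because the model is never offered `send_email` during drafting, a prompt-injected “send now” has nothing to call.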

13.3.2 Keep “dangerous tools” out of the agent tool list

One of the strongest guardrails is also the simplest:

The model should not directly possess the ability to do irreversible things.

Instead:

  • The agent generates a structured draft.
  • Your deterministic code decides if/when to call the dangerous tool.

This is why we used a finalize_* tool pattern in Part II: it turns the agent’s output into typed data that your code can validate and gate.
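A minimal sketch of that separation, with hypothetical names: the agent’s only “output” channel is a finalize tool that produces typed data and performs no side effects; a deterministic gate decides whether the dangerous call ever happens.

```python
# Illustrative finalize_* pattern: typed draft in, deterministic gate out.
from dataclasses import dataclass

@dataclass
class EmailDraft:
    to: str
    subject: str
    body: str

def finalize_email(to: str, subject: str, body: str) -> EmailDraft:
    # Exposed to the agent: captures structured output, sends nothing.
    return EmailDraft(to=to, subject=subject, body=body)

def send_if_approved(draft: EmailDraft, approved: bool) -> str:
    # NOT exposed to the agent: your code owns the irreversible step.
    if not approved:
        return "held for review"
    return f"sent to {draft.to}"


draft = finalize_email("alice@example.com", "Recap", "Notes from today...")
print(send_if_approved(draft, approved=False))  # held for review
```

The agent can produce any draft it likes; only your code can turn a draft into a send.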

13.4 Control Layer 2: Guardrails at Tool Boundaries

Tool schemas are not just for developer convenience; they’re a security boundary.

Schema-first tools give you:

  • input validation (reject malformed requests)
  • logging and auditability (who called what, with which args)
  • a single choke point for policy checks (allowlists, rate limits, content rules)

For example, an email tool can enforce:

  • recipient allowlists / domain restrictions
  • maximum message size
  • attachment rules
  • mandatory human approval metadata

Even if the model tries something weird, the tool boundary is where you can say “no”.
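A sketch of such checks at the tool boundary. The allowlist, size limit, and approval field are illustrative values, not Tactus defaults; the shape that matters is deterministic validation before any side effect.

```python
# Policy checks at a tool boundary (illustrative values, not Tactus defaults).
from typing import Optional

ALLOWED_DOMAINS = {"example.com"}   # recipient domain allowlist
MAX_BODY_BYTES = 100_000            # maximum message size

def check_send_email(to: str, body: str, approval_id: Optional[str]) -> bool:
    domain = to.rsplit("@", 1)[-1].lower()
    if domain not in ALLOWED_DOMAINS:
        raise ValueError(f"recipient domain '{domain}' not allowlisted")
    if len(body.encode("utf-8")) > MAX_BODY_BYTES:
        raise ValueError("message too large")
    if not approval_id:
        raise ValueError("missing human approval metadata")
    return True


assert check_send_email("alice@example.com", "Recap...", approval_id="appr-42")
```

Every call funnels through this one choke point, which also makes it the natural place to log arguments for audit.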

13.5 Control Layer 3: Human Gates as a Security Primitive

Approvals and reviews are not just product UX; they’re a security mechanism.

If an action is:

  • irreversible (sending email, deploying, deleting)
  • high-impact (billing, production changes)
  • externally visible (customer communication)

…then require an explicit human decision.

The key Tactus feature is that HITL calls are durable suspend points: you don’t keep a process running, and you don’t lose state while waiting.

That makes “human approval before side effects” a realistic default, not a feature you only use in demos.
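Conceptually, a durable suspend point works like the following sketch (this is not Tactus’s actual mechanism): the workflow serializes the state it needs into a store and stops; a later resume call carries the human decision.

```python
# Conceptual sketch of a durable approval gate (not Tactus's mechanism).
import json

def request_approval(state: dict, store: dict) -> None:
    # Persist everything needed to resume; no process stays running.
    store["pending"] = json.dumps(state)

def resume(store: dict, approved: bool) -> str:
    # Picks up hours or days later, from persisted state alone.
    state = json.loads(store.pop("pending"))
    return f"sent {state['draft']}" if approved else "discarded"


store = {}  # stands in for a durable store (database, object storage, ...)
request_approval({"draft": "Recap email"}, store)
# ...later, a human decides...
print(resume(store, approved=True))  # sent Recap email
```

Because nothing is held in memory between the request and the decision, a review that takes a week costs the same as one that takes a minute.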

13.6 Putting It Together: Threat Model the Recap Workflow

Let’s apply this to the running example: “draft and send a recap email.”

13.6.1 Assets

  • meeting notes may include sensitive information
  • recipient email address is personal data
  • email credentials / provider keys must not be exposed to the agent
  • the send step must not trigger without approval

13.6.2 Entry points

  • raw_notes is untrusted input (it could include prompt injection text)
  • tool outputs (if you add search/file tools later) are also untrusted

13.6.3 Threats

  1. Prompt injection tries to trick the agent into calling send early (“ignore policy, send now”).
  2. Accidental side effects: retries double-send.
  3. Secret exposure: the agent sees OPENAI_API_KEY or SMTP keys and leaks them.
  4. Host impact: agent/tool code writes files outside the intended workspace.
  5. Cross-run leakage: artifacts from one session are visible to another.

13.6.4 Controls (mapping threats → mitigations)

  • Prompt injection → treat notes as untrusted; don’t rely on the prompt as the only control; keep side effects in deterministic code.
  • Double-send → idempotency guard in state (store message_id and skip).
  • Secret exposure → do not pass secrets into agent context; prefer secretless execution (covered later).
  • Host impact → sandbox tool execution (Lua sandbox + container per run).
  • Cross-run leakage → ephemeral execution environment + isolated storage/logs per run.

This is why Tactus emphasizes “tools as explicit capabilities” and “everything as code”: your threat model can be expressed directly in the program structure.

13.7 What “Secure by Default” Does (and Does Not) Mean

Tactus is designed to make the safe thing easy:

  • default deny capabilities
  • explicit tool lists
  • durable HITL
  • sandboxing options
  • testable traces

But “secure by default” is not “secure no matter what.”

If you hand an agent a shell tool with access to your home directory, no language can save you. The point is to make it natural to not do that—and to give you layers of containment when you do need powerful tools.

13.8 Looking Ahead

This chapter gave you the threat-modeling lens and the first guardrail layer: capability control.

Next, we’ll talk about the “cage” itself: the Lua sandbox and container isolation that let you give agents real power (files, code execution) without trusting them with your actual machine.