Text annotate

Text annotate is a reusable utility for attaching structured attributes (labels, phases, roles) to spans of text without re-emitting the document.

If you ask a model to “return all the verbs with their types” as a JSON list, you pay for the output tokens of every word, and you risk the model hallucinating words that aren’t there.

Text annotate uses the virtual file pattern to solve this. Biblicus gives the model a virtual file and asks it to insert XML tags with attributes in-place (e.g., <span label="verb">...</span>). The model returns a small edit script (str_replace only), and Biblicus applies it and parses the result into structured, attributed spans. You get rich metadata without the cost or risk of text regeneration.

How text annotate works

Biblicus loads the full text into memory.
The model receives the text and returns an edit script with str_replace operations.
Biblicus applies the operations and validates that only span tags were inserted.
The marked-up string is parsed into ordered spans with attributes.

The model never re-emits the full text. It only inserts tags in-place.
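
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Biblicus implementation: apply_edit_script and strip_span_tags are hypothetical helper names, and the edit script is modeled as a list of (old_str, new_str) pairs.

```python
import re

# Hypothetical pattern: matches an opening <span ...> tag or a closing </span> tag.
SPAN_TAG = re.compile(r"</?span\b[^>]*>")


def strip_span_tags(marked_up: str) -> str:
    """Remove span tags so the result can be compared with the original text."""
    return SPAN_TAG.sub("", marked_up)


def apply_edit_script(text: str, operations: list[tuple[str, str]]) -> str:
    """Apply str_replace operations in order, then check that only tags were added."""
    current = text
    for old_str, new_str in operations:
        # Each old_str has to identify exactly one location in the current text.
        if current.count(old_str) != 1:
            raise ValueError(f"old_str must match exactly once: {old_str!r}")
        current = current.replace(old_str, new_str, 1)
    # If removing the span tags does not give back the input, text was altered.
    if strip_span_tags(current) != text:
        raise ValueError("edit script changed the original text")
    return current


marked_up = apply_edit_script(
    "We run fast.",
    [("run", '<span label="verb">run</span>')],
)
print(marked_up)  # We <span label="verb">run</span> fast.
```

Comparing the tag-stripped result with the input is one simple way to confirm that the model inserted tags and did nothing else.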

Mechanism example

Biblicus supplies an internal protocol that defines the edit rules, lists the allowed attributes, and embeds the current text:

Internal protocol (excerpt):

You are a virtual file editor. Use the available tools to edit the text.
Interpret the word "return" in the user's request as: wrap the returned text with
<span ATTRIBUTE="VALUE">...</span> in-place in the current text.
Each span must include exactly one attribute. Allowed attributes: label, phase, role.
Current text:
---
We run fast.
---

Then provide a short user prompt describing what to return:

User prompt:

Return all the verbs.

The input text is the same content embedded in the internal protocol:

Input text:

We run fast.

The model edits the virtual file by inserting tags in-place:

Marked-up text:

We <span label="verb">run</span> fast.

Biblicus returns structured data parsed from the markup:

Structured data (result):

{
  "marked_up_text": "We <span label=\"verb\">run</span> fast.",
  "spans": [
    {
      "index": 1,
      "start_char": 3,
      "end_char": 6,
      "text": "run",
      "attributes": {"label": "verb"}
    }
  ],
  "warnings": []
}
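
To show how the offsets in the result relate to the original text, here is a minimal parsing sketch. It assumes non-nested spans with exactly one double-quoted attribute each; parse_spans is a hypothetical name, and the real parser may work differently. The offsets are computed against the text with tags removed, which matches start_char 3 and end_char 6 for "run" in the result above.

```python
import re

# Assumption: spans are not nested and carry one double-quoted attribute.
SPAN = re.compile(r'<span (\w+)="([^"]*)">(.*?)</span>', re.DOTALL)


def parse_spans(marked_up: str) -> list[dict]:
    """Return ordered spans with offsets relative to the tag-free text."""
    spans = []
    removed = 0  # tag characters seen before the current span
    for index, match in enumerate(SPAN.finditer(marked_up), start=1):
        name, value, inner = match.groups()
        start_char = match.start() - removed
        spans.append(
            {
                "index": index,
                "start_char": start_char,
                "end_char": start_char + len(inner),
                "text": inner,
                "attributes": {name: value},
            }
        )
        # Everything in the match except the inner text is markup.
        removed += len(match.group(0)) - len(inner)
    return spans


spans = parse_spans('We <span label="verb">run</span> fast.')
print(spans[0]["start_char"], spans[0]["end_char"])  # 3 6
```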

Data model

Text annotate uses Pydantic models for strict validation:

TextAnnotateRequest: input text + LLM config + prompt template + allowed attributes.
TextAnnotateResult: marked-up text and attributed spans.

Internal protocol templates (advanced overrides) must include {text}. Prompt templates must not include {text} and should only describe what to return. The internal protocol template can interpolate the allowed attributes list via Jinja2.
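
The placeholder rules can be illustrated with a small standalone check. This is only a sketch: check_templates is a hypothetical helper, the check is a plain substring test, and the validation that Biblicus performs in its Pydantic models may differ.

```python
TEXT_PLACEHOLDER = "{text}"


def check_templates(internal_protocol_template: str, prompt_template: str) -> list[str]:
    """Return a list of problems; an empty list means both templates pass."""
    problems = []
    # The internal protocol is where the current text gets embedded.
    if TEXT_PLACEHOLDER not in internal_protocol_template:
        problems.append("internal protocol template must include {text}")
    # The user prompt should only describe what to return.
    if TEXT_PLACEHOLDER in prompt_template:
        problems.append("prompt template must not include {text}")
    return problems


ok = check_templates("Current text:\n---\n{text}\n---", "Return all the verbs.")
bad = check_templates("You are a virtual file editor.", "Return the verbs in {text}.")
print(ok)  # []
print(len(bad))  # 2
```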

Most callers only supply the user prompt and text. Override system_prompt only when you need to customize the edit protocol.

Output contract

Text annotate is tool-driven. The model must use tool calls instead of returning JSON in the assistant message.

Tool call arguments:

str_replace(old_str="Hello", new_str="<span label=\"greeting\">Hello</span>")
done()

Rules:

Use the str_replace tool only.
Each old_str must match exactly once.
Each new_str must be the same text with span tags inserted.
Only <span ...> and </span> tags are allowed.
Each span must include exactly one attribute.
Attributes must be on the allow list.
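
The rules above can be checked mechanically for each operation. The following is a minimal sketch, assuming double-quoted attribute values; check_operation is a hypothetical name, not a Biblicus function.

```python
import re

SPAN_TAG = re.compile(r"</?span\b[^>]*>")   # any opening or closing span tag
OPEN_TAG = re.compile(r"<span\b([^>]*)>")   # captures the attribute section
ATTRIBUTE = re.compile(r'(\w+)="[^"]*"')    # captures each attribute name


def check_operation(current_text, old_str, new_str, allowed_attributes):
    """Return rule violations for one str_replace call (empty list if none)."""
    errors = []
    if current_text.count(old_str) != 1:
        errors.append("old_str must match exactly once")
    # Removing span tags from both sides must leave identical text; any
    # other tag or any changed wording makes the two sides differ.
    if SPAN_TAG.sub("", new_str) != SPAN_TAG.sub("", old_str):
        errors.append("new_str must be the same text with only span tags inserted")
    for attribute_section in OPEN_TAG.findall(new_str):
        names = ATTRIBUTE.findall(attribute_section)
        if len(names) != 1:
            errors.append("each span must include exactly one attribute")
        elif names[0] not in allowed_attributes:
            errors.append(f"attribute not on the allow list: {names[0]}")
    return errors


allowed = ["label", "phase", "role"]
good = check_operation(
    "Hello world", "Hello", '<span label="greeting">Hello</span>', allowed
)
print(good)  # []
```

Stripping tags from both old_str and new_str keeps the check valid when a later operation touches text that already contains a span from an earlier one.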

Long-span handling: the system prompt instructs the model to insert <span ...> and </span> in separate str_replace calls for long passages (single-call insertion is allowed for short spans). This is covered by unit tests in tests/test_text_utility_tool_calls.py.
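
As an illustration of the long-span rule, the sketch below opens and closes one span in two separate calls, so the model only quotes a few words at each end of the passage. The str_replace function here is a hypothetical stand-in for the tool, and the sample text is invented for the example.

```python
def str_replace(text: str, old_str: str, new_str: str) -> str:
    """Stand-in for the str_replace tool: old_str must match exactly once."""
    if text.count(old_str) != 1:
        raise ValueError(f"old_str must match exactly once: {old_str!r}")
    return text.replace(old_str, new_str, 1)


text = "Hello, thanks for calling. How can I help you today? I will check your account."

# First call: insert only the opening tag before the first words of the passage.
text = str_replace(text, "Hello, thanks", '<span phase="greeting">Hello, thanks')

# Second call: insert only the closing tag after the last words of the passage.
text = str_replace(text, "help you today?", "help you today?</span>")

print(text)
```

In this sketch the markup is unbalanced between the two calls, so tag balance can only be checked once the whole edit script has been applied.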

Example: Python API

from biblicus.ai.models import AiProvider, LlmClientConfig
from biblicus.text import TextAnnotateRequest, apply_text_annotate

request = TextAnnotateRequest(
    text="We run fast.",
    client=LlmClientConfig(provider=AiProvider.OPENAI, model="gpt-4o-mini"),
    prompt_template="Return all the verbs.",
    allowed_attributes=["label", "phase", "role"],
)
result = apply_text_annotate(request)
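
Once a result is available, downstream code typically groups spans by attribute. The sketch below works on a plain dict with the same shape as the structured data shown earlier; how you obtain that dict from the result object (for example, via Pydantic's model_dump() in Pydantic v2) is an assumption, and group_by_attribute is a hypothetical helper.

```python
from collections import defaultdict

# Same shape as the "Structured data (result)" example.
result_data = {
    "marked_up_text": 'We <span label="verb">run</span> fast.',
    "spans": [
        {
            "index": 1,
            "start_char": 3,
            "end_char": 6,
            "text": "run",
            "attributes": {"label": "verb"},
        }
    ],
    "warnings": [],
}


def group_by_attribute(data: dict) -> dict:
    """Map each (attribute name, value) pair to the span texts that carry it."""
    grouped = defaultdict(list)
    for span in data["spans"]:
        for name, value in span["attributes"].items():
            grouped[(name, value)].append(span["text"])
    return dict(grouped)


grouped = group_by_attribute(result_data)
print(grouped)  # {('label', 'verb'): ['run']}
```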

Concept: Text Annotate FAQ

What problem does this solve?

Some ETL pipelines need labeled spans, not just extracted spans. For example, you may want phases of a conversation (greeting, verification, resolution) or roles (agent vs customer). Existing extract/slice utilities can return spans, but they cannot attach structured attributes in a consistent way. Text annotate provides a standardized way to attach attributes while preserving the original text.

How is it different from text extract?

Extract returns spans only (no attributes).
Annotate returns spans with attributes (when labels are required).

Using annotate only when labels are needed keeps prompts simple for small models.

Why not just use JSON output?

The virtual file editor pattern avoids re-emitting the full document, which reduces token cost and improves reliability on long texts. Because the edits may only insert tags, Biblicus can also verify deterministically that the original text is unchanged.

Is this a replacement for NER or classification models?

No. It is a text utility for structured annotation within a single document. It’s designed for ETL pipelines that need deterministic output and traceability, not generalized model training.