# Text annotate

Text annotate is a reusable utility for attaching structured attributes (labels, phases, roles) to spans of text without re-emitting the document.

If you ask a model to "return all the verbs with their types" as a JSON list, you pay for the output tokens of every word, and you risk the model hallucinating words that aren't there.

Text annotate uses the **virtual file pattern** to solve this. Biblicus gives the model a virtual file and asks it to insert XML tags with attributes in-place (e.g., `<span label="verb">...</span>`). The model returns a small edit script (`str_replace` only), and Biblicus applies it and parses the result into structured, attributed spans.

You get rich metadata without the cost or risk of text regeneration.

## How text annotate works

1) Biblicus loads the full text into memory.
2) The model receives the text and returns an **edit script** with `str_replace` operations.
3) Biblicus applies the operations and validates that only span tags were inserted.
4) The marked-up string is parsed into ordered **spans with attributes**.

The model never re-emits the full text. It only inserts tags in-place.

### Mechanism example

Biblicus supplies an internal protocol that defines the edit protocol and the allowed attributes, and embeds the current text:

**Internal protocol (excerpt):**

```
You are a virtual file editor. Use the available tools to edit the text.
Interpret the word "return" in the user's request as: wrap the returned text with
<span ATTRIBUTE="VALUE">...</span> in-place in the current text.
Each span must include exactly one attribute. Allowed attributes: label, phase, role.
Current text:
---
We run fast.
---
```

Then provide a short user prompt describing what to return:

**User prompt:**

```
Return all the verbs.
```

The input text is the same content embedded in the internal protocol:

**Input text:**

```
We run fast.
```

The model edits the virtual file by inserting tags in-place:

**Marked-up text:**

```
We <span label="verb">run</span> fast.
```

Biblicus returns structured data parsed from the markup:

**Structured data (result):**

```
{
  "marked_up_text": "We <span label=\"verb\">run</span> fast.",
  "spans": [
    {
      "index": 1,
      "start_char": 3,
      "end_char": 6,
      "text": "run",
      "attributes": {"label": "verb"}
    }
  ],
  "warnings": []
}
```

## Data model

Text annotate uses Pydantic models for strict validation:

- `TextAnnotateRequest`: input text + LLM config + prompt template + allowed attributes.
- `TextAnnotateResult`: marked-up text and attributed spans.

Internal protocol templates (advanced overrides) must include `{text}`. Prompt templates must not include `{text}` and should only describe what to return. The internal protocol template can interpolate the allowed attributes list via Jinja2.

Most callers only supply the user prompt and text. Override `system_prompt` only when you need to customize the edit protocol.

## Output contract

Text annotate is tool-driven. The model must use tool calls instead of returning JSON in the assistant message.

Tool call arguments:

```
str_replace(old_str="Hello", new_str="<span label=\"...\">Hello</span>")
done()
```

Rules:

- Use the `str_replace` tool only.
- Each `old_str` must match exactly once.
- Each `new_str` must be the same text with span tags inserted.
- Only `<span ...>` and `</span>` tags are allowed.
- Each span must include exactly one attribute.
- Attributes must be on the allow list.

Long-span handling: the system prompt instructs the model to insert `<span ...>` and `</span>` in separate `str_replace` calls for long passages (single-call insertion is allowed for short spans). This is covered by unit tests in `tests/test_text_utility_tool_calls.py`.
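The rules above can be sketched as a small validator that applies one `str_replace` operation at a time. This is only an illustration of the contract, not Biblicus's implementation: the names `apply_str_replace`, `strip_span_tags`, and the regular expressions are hypothetical, and the sketch assumes attribute values are double-quoted and contain no angle brackets.

```python
import re

# Hypothetical patterns (assumptions, not Biblicus internals):
# a span open tag carrying exactly one double-quoted attribute, or a close tag.
SPAN_TAG = re.compile(r'<span\s+[A-Za-z_]+="[^"<>]*">|</span>')
ATTR_NAME = re.compile(r'<span\s+([A-Za-z_]+)="')


def strip_span_tags(text: str) -> str:
    """Remove well-formed span tags so the underlying text can be compared."""
    return SPAN_TAG.sub("", text)


def apply_str_replace(current: str, old_str: str, new_str: str, allowed: set) -> str:
    """Apply one edit after checking it against the output-contract rules."""
    occurrences = current.count(old_str) if old_str else 0
    if occurrences != 1:
        raise ValueError(f"old_str must match exactly once, found {occurrences}")
    # Ignoring span tags, the text must be unchanged. A tag with zero or
    # several attributes is not stripped by SPAN_TAG, so it fails here too.
    if strip_span_tags(new_str) != strip_span_tags(old_str):
        raise ValueError("new_str must be old_str with only span tags inserted")
    for name in ATTR_NAME.findall(new_str):
        if name not in allowed:
            raise ValueError(f"attribute {name!r} is not on the allow list")
    return current.replace(old_str, new_str, 1)


allowed_attributes = {"label", "phase", "role"}
marked = apply_str_replace(
    "We run fast.", "run", '<span label="verb">run</span>', allowed_attributes
)
print(marked)  # We <span label="verb">run</span> fast.
```

Comparing the tag-stripped strings on both sides, rather than comparing against `old_str` directly, lets a later edit target text that already contains a tag from an earlier call, which is what the separate open and close insertions for long spans require.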
## Example: Python API

```python
from biblicus.ai.models import AiProvider, LlmClientConfig
from biblicus.text import TextAnnotateRequest, apply_text_annotate

# Build the request: input text, LLM config, user prompt, and the attribute allow list.
request = TextAnnotateRequest(
    text="We run fast.",
    client=LlmClientConfig(provider=AiProvider.OPENAI, model="gpt-4o-mini"),
    prompt_template="Return all the verbs.",
    allowed_attributes=["label", "phase", "role"],
)

# Returns a TextAnnotateResult with the marked-up text and attributed spans.
result = apply_text_annotate(request)
```

## Concept: Text Annotate FAQ

### What problem does this solve?

Some ETL pipelines need **labeled spans**, not just extracted spans. For example, you may want phases of a conversation (greeting, verification, resolution) or roles (agent vs customer). Existing extract/slice utilities can return spans, but they cannot attach structured attributes in a consistent way. Text annotate provides a standardized way to attach attributes while preserving the original text.

### How is it different from text extract?

- **Extract** returns spans only (no attributes).
- **Annotate** returns spans with attributes (when labels are required).

Using annotate only when labels are needed keeps prompts simple for small models.

### Why not just use JSON output?

The virtual file editor pattern avoids re-emitting the full document, which reduces token cost and improves reliability on long texts. It also allows deterministic validation of the original text.

### Is this a replacement for NER or classification models?

No. It is a **text utility** for structured annotation within a single document. It’s designed for ETL pipelines that need deterministic output and traceability, not for generalized model training.
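To make the traceability point concrete, here is a minimal sketch of how marked-up text could be parsed into spans whose offsets refer to the original, untagged text. It is an assumption-laden illustration and not Biblicus's parser: `parse_spans` and `SPAN_PATTERN` are hypothetical names, spans are assumed not to nest, and the output keys simply mirror the structured result shown in the mechanism example.

```python
import re

# Hypothetical pattern: one attribute per span, no nested spans (assumptions).
SPAN_PATTERN = re.compile(r'<span\s+([A-Za-z_]+)="([^"<>]*)">(.*?)</span>', re.DOTALL)
CLOSE_TAG = "</span>"


def parse_spans(marked_up_text: str) -> list:
    """Return spans with character offsets into the tag-free text."""
    spans = []
    tag_chars_seen = 0  # tag characters preceding the current match
    for index, match in enumerate(SPAN_PATTERN.finditer(marked_up_text), start=1):
        name, value, inner = match.groups()
        start_char = match.start() - tag_chars_seen
        spans.append(
            {
                "index": index,
                "start_char": start_char,
                "end_char": start_char + len(inner),
                "text": inner,
                "attributes": {name: value},
            }
        )
        open_tag_length = match.start(3) - match.start()
        tag_chars_seen += open_tag_length + len(CLOSE_TAG)
    return spans


original = "We run fast."
marked_up = 'We <span label="verb">run</span> fast.'
spans = parse_spans(marked_up)

# Every span can be traced back to the exact slice of the original text.
for span in spans:
    assert original[span["start_char"]:span["end_char"]] == span["text"]
print(spans[0]["start_char"], spans[0]["end_char"])  # 3 6
```

Because the offsets index into the original string, a downstream step can check each span against the source text without trusting the model's output.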