# Text annotate
Text annotate is a reusable utility for attaching structured attributes (labels, phases, roles) to spans of text without re-emitting the document.
If you ask a model to "return all the verbs with their types" as a JSON list, you pay for the output tokens of every word, and you risk the model hallucinating words that aren't there.
Text annotate uses the **virtual file pattern** to solve this. Biblicus gives the model a virtual file and asks it to insert XML tags with attributes in-place (e.g., `<span label="verb">...</span>`). The model returns a small edit script (`str_replace` only), and Biblicus applies it and parses the result into structured, attributed spans. You get rich metadata without the cost or risk of text regeneration.
## How text annotate works
1) Biblicus loads the full text into memory.
2) The model receives the text and returns an **edit script** with str_replace operations.
3) Biblicus applies the operations and validates that only span tags were inserted.
4) The marked-up string is parsed into ordered **spans with attributes**.
The model never re-emits the full text. It only inserts tags in-place.
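The steps above can be sketched in plain Python. This is a minimal illustration with an assumed helper name (`apply_edit_script`), not the actual Biblicus internals:

```python
# Hypothetical sketch of the edit-script loop: apply str_replace
# operations one at a time, requiring each old_str to match exactly once.
def apply_edit_script(text: str, operations: list[tuple[str, str]]) -> str:
    for old_str, new_str in operations:
        count = text.count(old_str)
        if count != 1:
            raise ValueError(f"old_str must match exactly once, found {count}")
        text = text.replace(old_str, new_str)
    return text

marked_up = apply_edit_script(
    "We run fast.",
    [("run", '<span label="verb">run</span>')],
)
# marked_up == 'We <span label="verb">run</span> fast.'
```

Because each `old_str` must match exactly once, the edit is unambiguous and the original text is never regenerated by the model.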
### Mechanism example
Biblicus supplies an internal protocol that specifies the edit rules, lists the allowed attributes, and embeds the current text:
**Internal protocol (excerpt):**
```
You are a virtual file editor. Use the available tools to edit the text.
Interpret the word "return" in the user's request as: wrap the returned text with
<span attribute="value">...</span> in-place in the current text.
Each span must include exactly one attribute. Allowed attributes: label, phase, role.
Current text:
---
We run fast.
---
```
Then provide a short user prompt describing what to return:
**User prompt:**
```
Return all the verbs.
```
The input text is the same content embedded in the internal protocol:
**Input text:**
```
We run fast.
```
The model edits the virtual file by inserting tags in-place:
**Marked-up text:**
```
We <span label="verb">run</span> fast.
```
Biblicus returns structured data parsed from the markup:
**Structured data (result):**
```
{
  "marked_up_text": "We <span label=\"verb\">run</span> fast.",
  "spans": [
    {
      "index": 1,
      "start_char": 3,
      "end_char": 6,
      "text": "run",
      "attributes": {"label": "verb"}
    }
  ],
  "warnings": []
}
```
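The parse step (marked-up string to attributed spans) can be sketched as follows. `parse_spans` is a hypothetical helper, not the Biblicus implementation; the point is that `start_char`/`end_char` index into the original, untagged text:

```python
import re

# Matches <span attr="value">inner</span> with a single attribute.
SPAN_RE = re.compile(r'<span (\w+)="([^"]*)">(.*?)</span>', re.DOTALL)

def parse_spans(marked_up_text: str) -> list[dict]:
    # Track how many tag characters precede each span so the offsets
    # index into the original (untagged) text.
    spans = []
    removed = 0  # characters of tag markup seen so far
    for index, match in enumerate(SPAN_RE.finditer(marked_up_text), start=1):
        inner = match.group(3)
        start = match.start() - removed
        spans.append({
            "index": index,
            "start_char": start,
            "end_char": start + len(inner),
            "text": inner,
            "attributes": {match.group(1): match.group(2)},
        })
        removed += len(match.group(0)) - len(inner)
    return spans

spans = parse_spans('We <span label="verb">run</span> fast.')
# spans[0]["text"] == "run", spans[0]["start_char"] == 3
```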
## Data model
Text annotate uses Pydantic models for strict validation:
- `TextAnnotateRequest`: input text + LLM config + prompt template + allowed attributes.
- `TextAnnotateResult`: marked-up text and attributed spans.
Internal protocol templates (advanced overrides) must include `{text}`. Prompt templates must not include `{text}` and
should only describe what to return. The internal protocol template can interpolate the allowed attributes list via
Jinja2.
Most callers only supply the user prompt and text. Override `system_prompt` only when you need to customize the edit
protocol.
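The `{text}` placeholder rules above can be expressed as a small validator. This is a sketch with an assumed helper name, not part of the public API:

```python
# Hypothetical check of the template contract: internal protocol
# templates must contain the {text} placeholder, prompt templates must not.
def validate_templates(internal_protocol_template: str, prompt_template: str) -> None:
    if "{text}" not in internal_protocol_template:
        raise ValueError("internal protocol template must include {text}")
    if "{text}" in prompt_template:
        raise ValueError("prompt template must not include {text}")

validate_templates(
    "You are a virtual file editor.\nCurrent text:\n---\n{text}\n---",
    "Return all the verbs.",
)  # passes silently
```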
## Output contract
Text annotate is tool-driven. The model must use tool calls instead of returning JSON in the assistant message.
Tool call arguments:
```
str_replace(old_str="Hello", new_str="<span label=\"greeting\">Hello</span>")
done()
```
Rules:
- Use the str_replace tool only.
- Each old_str must match exactly once.
- Each new_str must be the same text with span tags inserted.
- Only `<span attribute="value">` and `</span>` tags are allowed.
- Each span must include exactly one attribute.
- Attributes must be on the allow list.
Long-span handling: the system prompt instructs the model to insert `<span attribute="value">` and `</span>` in separate `str_replace` calls for long passages (single-call insertion is allowed for short spans). This is covered by unit tests in `tests/test_text_utility_tool_calls.py`.
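The validation step in the rules above (step 3 of the mechanism) can be sketched as a tag-stripping check: removing all span tags from `new_str` must recover `old_str` exactly. The helper name and regex are assumptions, not the Biblicus implementation:

```python
import re

# Matches an opening <span attr="value"> tag (one attribute) or a closing </span>.
TAG_RE = re.compile(r'</?span(?: \w+="[^"]*")?>')

def only_inserts_span_tags(old_str: str, new_str: str) -> bool:
    # Stripping the tags from new_str must yield old_str exactly;
    # anything else means the edit altered the underlying text.
    return TAG_RE.sub("", new_str) == old_str

only_inserts_span_tags("Hello", '<span label="greeting">Hello</span>')  # True
only_inserts_span_tags("Hello", "Hi")  # False: the text itself changed
only_inserts_span_tags("We run", '<span phase="greeting">We run')  # True: open tag only
```

The last case shows why the long-span rule works: an edit that inserts only an opening tag (with the matching close inserted by a later call) still passes, because the untagged text is unchanged.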
## Example: Python API
```python
from biblicus.ai.models import AiProvider, LlmClientConfig
from biblicus.text import TextAnnotateRequest, apply_text_annotate

request = TextAnnotateRequest(
    text="We run fast.",
    client=LlmClientConfig(provider=AiProvider.OPENAI, model="gpt-4o-mini"),
    prompt_template="Return all the verbs.",
    allowed_attributes=["label", "phase", "role"],
)
result = apply_text_annotate(request)
```
## Concept: Text Annotate FAQ
### What problem does this solve?
Some ETL pipelines need **labeled spans**, not just extracted spans. For example, you may want phases of a conversation (greeting, verification, resolution) or roles (agent vs customer). Existing extract/slice utilities can return spans, but they cannot attach structured attributes in a consistent way. Text annotate provides a standardized way to attach attributes while preserving the original text.
### How is it different from text extract?
- **Extract** returns spans only (no attributes).
- **Annotate** returns spans with attributes (when labels are required).
This keeps prompts simple for small models by using annotate only when needed.
### Why not just use JSON output?
The virtual file editor pattern avoids re-emitting the full document, which reduces token cost and improves reliability on long texts. It also allows Biblicus to validate deterministically that the inserted tags are the only change to the original text.
### Is this a replacement for NER or classification models?
No. It is a **text utility** for structured annotation within a single document. It’s designed for ETL pipelines that need deterministic output and traceability, not generalized model training.