# Text extract

Text extract is a reusable utility for extracting spans from long texts with a language model without requiring the model to re-emit every token.

If you ask a model to "extract all the quotes" and return them as a list, you pay for every output token, and you risk the model hallucinating or paraphrasing the quotes. Text extract solves this by using the **virtual file pattern**. Biblicus asks the model to insert XML tags into an in-memory copy of the text. The model returns a small edit script (`str_replace` only), and Biblicus applies it and parses the result into spans. The model points to the text it wants to extract by wrapping it, without ever repeating the content.

## How text extract works

1) Biblicus loads the full text into memory.
2) The model receives the text and returns an **edit script** with `str_replace` operations.
3) Biblicus applies the operations and validates that only tags were inserted.
4) The marked-up string is parsed into ordered **spans**.

The model never re-emits the full text, which lowers cost and reduces timeouts on long documents.

### Mechanism example

Biblicus supplies the internal edit protocol and embeds the current text. This excerpt shows the protocol the model sees:

**Internal protocol (excerpt):**

```
You are a virtual file editor.
Use the available tools to edit the text.
Interpret the word "return" in the user's request as:
wrap the returned text with <span index="N">...</span> in-place in the current text.

Current text:
---
We run fast.
---
```

Then provide a short user prompt describing what to return:

**User prompt:**

```
Return all the verbs.
```

The input text is the same content embedded in the internal protocol:

**Input text:**

```
We run fast.
```

The model edits the virtual file by inserting tags in-place:

**Marked-up text:**

```
We <span index="1">run</span> fast.
```

Biblicus returns structured data parsed from the markup:

**Structured data (result):**

```
{
  "marked_up_text": "We <span index=\"1\">run</span> fast.",
  "spans": [
    {"index": 1, "start_char": 3, "end_char": 6, "text": "run"}
  ],
  "warnings": []
}
```

## Data model

Text extract uses Pydantic models for strict validation:

- `TextExtractRequest`: input text + LLM config + prompt template.
- `TextExtractResult`: marked-up text and extracted spans.

If you override the internal protocol, your override must include the `{text}` placeholder. User prompt templates, by contrast, support `{text_length}` placeholders (plus `{error}` for retry hints); they must not include `{text}` and should only describe what to return.

The structured output contains **spans only**. Any interstitial text remains in the marked-up string.

Text extract expects tags to land on word boundaries. Prompts should instruct the model to avoid inserting tags inside words so spans stay aligned to human-readable text.

Most callers supply only the user prompt and text. The internal protocol is built in; override it only when you need to customize the mechanics.

## Output contract

Text extract is tool-driven. The model must use tool calls instead of returning JSON in the assistant message.

Tool call arguments:

```
str_replace(old_str="Hello world", new_str="<span index=\"1\">Hello world</span>")
done()
```

Rules:

- Use the `str_replace` tool for all edits; call `done` when finished.
- Each `old_str` must match exactly once.
- Each `new_str` must be the same text with span tags inserted.
- Only `<span index="N">` and `</span>` may be inserted.
- No modification of the original text.

## Example: Python API

```
from biblicus.ai.models import AiProvider, LlmClientConfig
from biblicus.text import TextExtractRequest, apply_text_extract

request = TextExtractRequest(
    text="Hello world",
    client=LlmClientConfig(provider=AiProvider.OPENAI, model="gpt-4o-mini"),
    prompt_template="Return the entire text.",
)
result = apply_text_extract(request)
```

Example snippet:

**Internal protocol (excerpt):**

```
You are a virtual file editor.
Use the available tools to edit the text.
Interpret the word "return" in the user's request as:
wrap the returned text with <span index="N">...</span> in-place in the current text.

Current text:
---
Hello world.
---
```

**User prompt:**

```
Return the entire text.
```

**Input text:**

```
Hello world.
```

**Marked-up text:**

```
<span index="1">Hello world.</span>
```

**Structured data (result):**

```
{
  "marked_up_text": "<span index=\"1\">Hello world.</span>",
  "spans": [
    {"index": 1, "start_char": 0, "end_char": 12, "text": "Hello world."}
  ],
  "warnings": []
}
```

## Example: Verb markup task

```
prompt_template = """
Return the verbs.

Include auxiliary verbs and main verbs.
Preserve all whitespace and punctuation.
""".strip()

request = TextExtractRequest(
    text="I can try to get help, but I promise nothing.",
    client=LlmClientConfig(provider=AiProvider.OPENAI, model="gpt-4o-mini"),
    prompt_template=prompt_template,
)
result = apply_text_extract(request)
```

Example snippet:

**Internal protocol (excerpt):**

```
You are a virtual file editor.
Use the available tools to edit the text.
Interpret the word "return" in the user's request as:
wrap the returned text with <span index="N">...</span> in-place in the current text.

Current text:
---
I can try to get help, but I promise nothing.
---
```

**User prompt:**

```
Return the verbs.
```

**Input text:**

```
I can try to get help, but I promise nothing.
```

**Marked-up text:**

```
I <span index="1">can</span> <span index="2">try</span> to <span index="3">get</span> help, but I <span index="4">promise</span> nothing.
```

**Structured data (result):**

```
{
  "marked_up_text": "I <span index=\"1\">can</span> <span index=\"2\">try</span> to <span index=\"3\">get</span> help, but I <span index=\"4\">promise</span> nothing.",
  "spans": [
    {"index": 1, "start_char": 2, "end_char": 5, "text": "can"},
    {"index": 2, "start_char": 6, "end_char": 9, "text": "try"},
    {"index": 3, "start_char": 13, "end_char": 16, "text": "get"},
    {"index": 4, "start_char": 29, "end_char": 36, "text": "promise"}
  ],
  "warnings": []
}
```

## Integration examples

These examples mirror the integration tests and show the minimal user prompts that drive the tool loop.

### Extract paragraphs

Input text:

```
Para one. || Para two. || Para three.
```

User prompt:

```
Return each paragraph.
```

Expected behavior: the model inserts `<span index="N">...</span>` around each paragraph.

Example snippet:

**Internal protocol (excerpt):**

```
You are a virtual file editor.
Use the available tools to edit the text.
Interpret the word "return" in the user's request as:
wrap the returned text with <span index="N">...</span> in-place in the current text.

Current text:
---
Para one. || Para two. || Para three.
---
```

**User prompt:**

```
Return each paragraph.
```

**Input text:**

```
Para one. || Para two. || Para three.
```

**Marked-up text:**

```
<span index="1">Para one.</span> || <span index="2">Para two.</span> || <span index="3">Para three.</span>
```

**Structured data (result):**

```
{
  "marked_up_text": "<span index=\"1\">Para one.</span> || <span index=\"2\">Para two.</span> || <span index=\"3\">Para three.</span>",
  "spans": [
    {"index": 1, "start_char": 0, "end_char": 9, "text": "Para one."},
    {"index": 2, "start_char": 13, "end_char": 22, "text": "Para two."},
    {"index": 3, "start_char": 26, "end_char": 37, "text": "Para three."}
  ],
  "warnings": []
}
```

### Extract first sentences per paragraph

Input text:

```
First one. Second one. || Alpha first. Alpha second.
```

User prompt:

```
Return the first sentence from each paragraph.
```

Expected behavior: the model wraps the first sentence of each paragraph in spans.
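The edit-script contract that drives these examples can be simulated locally. The sketch below is illustrative only, not the Biblicus implementation: `apply_edit_script` is a hypothetical helper, and the `<span index="N">...</span>` tag form is assumed from the span data model. It enforces the two rules the output contract states: each `old_str` must match exactly once, and each `new_str` may differ from it only by inserted span tags.

```
import re

# Matches opening tags like <span index="1"> and the closing </span> tag.
# The exact tag form is an assumption for illustration.
TAG = re.compile(r'</span>|<span index="\d+">')

def apply_edit_script(text, ops):
    """Apply (old_str, new_str) operations under the extract contract:
    each old_str matches exactly once, and stripping span tags from
    new_str must give back old_str unchanged."""
    for old, new in ops:
        if text.count(old) != 1:
            raise ValueError(f"old_str must match exactly once: {old!r}")
        if TAG.sub("", new) != old:
            raise ValueError("new_str may only insert span tags")
        text = text.replace(old, new)
    return text

marked_up = apply_edit_script(
    "First one. Second one. || Alpha first. Alpha second.",
    [
        ("First one.", '<span index="1">First one.</span>'),
        ("Alpha first.", '<span index="2">Alpha first.</span>'),
    ],
)
```

Because tags are stripped before comparison, an edit that paraphrases or drops any character of the original text is rejected, which is what keeps spans verbatim.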
Example snippet:

**Internal protocol (excerpt):**

```
You are a virtual file editor.
Use the available tools to edit the text.
Interpret the word "return" in the user's request as:
wrap the returned text with <span index="N">...</span> in-place in the current text.

Current text:
---
First one. Second one. || Alpha first. Alpha second.
---
```

**User prompt:**

```
Return the first sentence from each paragraph.
```

**Input text:**

```
First one. Second one. || Alpha first. Alpha second.
```

**Marked-up text:**

```
<span index="1">First one.</span> Second one. || <span index="2">Alpha first.</span> Alpha second.
```

**Structured data (result):**

```
{
  "marked_up_text": "<span index=\"1\">First one.</span> Second one. || <span index=\"2\">Alpha first.</span> Alpha second.",
  "spans": [
    {"index": 1, "start_char": 0, "end_char": 10, "text": "First one."},
    {"index": 2, "start_char": 27, "end_char": 39, "text": "Alpha first."}
  ],
  "warnings": []
}
```

### Extract money quotes

Input text:

```
She said "PAYMENT_QUOTE_001: I will pay $20 today." Then she left.
```

User prompt:

```
Return the quoted payment statement exactly as written, including the quotation marks.
```

Expected behavior: the quoted statement is wrapped in a span and preserved verbatim.

Example snippet:

**Internal protocol (excerpt):**

```
You are a virtual file editor.
Use the available tools to edit the text.
Interpret the word "return" in the user's request as:
wrap the returned text with <span index="N">...</span> in-place in the current text.

Current text:
---
She said "PAYMENT_QUOTE_001: I will pay $20 today." Then she left.
---
```

**User prompt:**

```
Return the quoted payment statement exactly as written, including the quotation marks.
```

**Input text:**

```
She said "PAYMENT_QUOTE_001: I will pay $20 today." Then she left.
```

**Marked-up text:**

```
She said <span index="1">"PAYMENT_QUOTE_001: I will pay $20 today."</span> Then she left.
```

**Structured data (result):**

```
{
  "marked_up_text": "She said <span index=\"1\">\"PAYMENT_QUOTE_001: I will pay $20 today.\"</span> Then she left.",
  "spans": [
    {"index": 1, "start_char": 9, "end_char": 52, "text": "\"PAYMENT_QUOTE_001: I will pay $20 today.\""}
  ],
  "warnings": []
}
```

### Extract verbs

Input text:

```
We run fast. They agree.
```

User prompt:

```
Return all the verbs.
```

Expected behavior: each verb is wrapped in a span without splitting words.

Example snippet:

**Internal protocol (excerpt):**

```
You are a virtual file editor.
Use the available tools to edit the text.
Interpret the word "return" in the user's request as:
wrap the returned text with <span index="N">...</span> in-place in the current text.

Current text:
---
We run fast. They agree.
---
```

**User prompt:**

```
Return all the verbs.
```

**Input text:**

```
We run fast. They agree.
```

**Marked-up text:**

```
We <span index="1">run</span> fast. They <span index="2">agree</span>.
```

**Structured data (result):**

```
{
  "marked_up_text": "We <span index=\"1\">run</span> fast. They <span index=\"2\">agree</span>.",
  "spans": [
    {"index": 1, "start_char": 3, "end_char": 6, "text": "run"},
    {"index": 2, "start_char": 23, "end_char": 28, "text": "agree"}
  ],
  "warnings": []
}
```

### Extract grouped speaker statements

Input text:

```
Agent: Hello.
Agent: I can help.

Customer: I need support.
Customer: Thanks.
```

User prompt:

```
Return things that the agent said grouped together, and things the customer said in separate groups.
```

Expected behavior: agent text is grouped in one span and customer text in another.

Example snippet:

**Internal protocol (excerpt):**

```
You are a virtual file editor.
Use the available tools to edit the text.
Interpret the word "return" in the user's request as:
wrap the returned text with <span index="N">...</span> in-place in the current text.

Current text:
---
Agent: Hello.
Agent: I can help.

Customer: I need support.
Customer: Thanks.
---
```

**User prompt:**

```
Return things that the agent said grouped together, and things the customer said in separate groups.
```

**Input text:**

```
Agent: Hello.
Agent: I can help.

Customer: I need support.
Customer: Thanks.
```

**Marked-up text:**

```
<span index="1">Agent: Hello.
Agent: I can help.</span>

<span index="2">Customer: I need support.
Customer: Thanks.</span>
```

**Structured data (result):**

```
{
  "marked_up_text": "<span index=\"1\">Agent: Hello.\nAgent: I can help.</span>\n\n<span index=\"2\">Customer: I need support.\nCustomer: Thanks.</span>",
  "spans": [
    {"index": 1, "start_char": 0, "end_char": 33, "text": "Agent: Hello.\nAgent: I can help."},
    {"index": 2, "start_char": 34, "end_char": 78, "text": "Customer: I need support.\nCustomer: Thanks."}
  ],
  "warnings": []
}
```

## Example: Markov analysis segmentation

Use the `span_markup` segmentation method in Markov configurations.

Example snippet:

**Internal protocol (excerpt):**

```
You are a virtual file editor.
Use the available tools to edit the text.
Interpret the word "return" in the user's request as:
wrap the returned text with <span index="N">...</span> in-place in the current text.

Current text:
---
Greeting.

Verification.
Resolution.
---
```

**User prompt:**

```
Return the segments that represent contiguous phases in the text.
```

**Input text:**

```
Greeting.

Verification.
Resolution.
```

**Marked-up text:**

```
<span index="1">Greeting.</span>

<span index="2">Verification.</span>
<span index="3">Resolution.</span>
```

**Structured data (result):**

```
{
  "marked_up_text": "<span index=\"1\">Greeting.</span>\n\n<span index=\"2\">Verification.</span>\n<span index=\"3\">Resolution.</span>",
  "spans": [
    {"index": 1, "start_char": 0, "end_char": 10, "text": "Greeting."},
    {"index": 2, "start_char": 11, "end_char": 24, "text": "Verification."},
    {"index": 3, "start_char": 25, "end_char": 37, "text": "Resolution."}
  ],
  "warnings": []
}
```

Configuration example (text extract provider-backed):

```
schema_version: 1
segmentation:
  method: span_markup
  span_markup:
    client:
      provider: openai
      model: gpt-4o-mini
      api_key: null
      response_format: json_object
    prompt_template: |
      Return the segments that represent contiguous phases in the text.

      Rules:
      - Preserve original order.
      - Do not add labels, summaries, or commentary.
      - Prefer natural boundaries like greeting/opening, identity verification,
        reason for call, clarification, resolution steps, handoff/escalation, closing.
      - Use speaker turn changes as possible boundaries, but keep multi-turn
        exchanges together if they form a single phase.
      - Avoid extremely short fragments; merge tiny leftovers into a neighboring span.
model:
  family: gaussian
  n_states: 4
observations:
  encoder: tfidf
```

## Validation rules

Biblicus rejects:

- Non-JSON responses.
- Insertions that are not span tags.
- Nested or unbalanced tags.
- Any modification to the original text.

## Testing

Text extract supports two modes of testing:

- **Mocked unit tests** using a fake OpenAI client.
- **Integration tests** that call the live model and apply real edits.

Unit tests also assert the long-span behavior: the system prompt instructs the model to insert `<span index="N">` and `</span>` in separate `str_replace` calls for long passages. See `tests/test_text_extract_tool_calls.py`.

See `features/text_extract.feature` and `features/integration_text_extract.feature`.
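The final step of the pipeline, parsing the marked-up string into ordered spans with character offsets, can be sketched as below. This is an illustrative reimplementation, not the Biblicus parser: the function name `parse_spans` and the `<span index="N">...</span>` tag form are assumptions. It shows why offsets in the results refer to the tag-free text: positions are counted only as plain characters are consumed.

```
import re

SPAN_OPEN = re.compile(r'<span index="(\d+)">')
SPAN_CLOSE = "</span>"

def parse_spans(marked_up):
    """Strip span tags from marked-up text, returning (plain_text, spans).

    Offsets are computed against the plain text with all tags removed.
    Rejects nested, unbalanced, or unclosed tags.
    """
    spans = []
    plain = []        # characters of the tag-free text seen so far
    pos = 0           # current offset in the tag-free text
    i = 0             # current offset in the marked-up text
    open_start = None
    open_index = None
    while i < len(marked_up):
        m = SPAN_OPEN.match(marked_up, i)
        if m:
            if open_start is not None:
                raise ValueError("nested span tags")
            open_start, open_index = pos, int(m.group(1))
            i = m.end()
            continue
        if marked_up.startswith(SPAN_CLOSE, i):
            if open_start is None:
                raise ValueError("unbalanced close tag")
            spans.append({
                "index": open_index,
                "start_char": open_start,
                "end_char": pos,
                "text": "".join(plain)[open_start:pos],
            })
            open_start = None
            i += len(SPAN_CLOSE)
            continue
        plain.append(marked_up[i])
        pos += 1
        i += 1
    if open_start is not None:
        raise ValueError("unclosed span tag")
    return "".join(plain), spans
```

Run against the mechanism example, `parse_spans('We <span index="1">run</span> fast.')` recovers the original text and the `run` span at offsets 3 to 6, matching the structured data shown above.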