Biblicus

Tutorials

  • Use Cases
    • Notes to Context Pack
      • Run it
      • What you should see
      • How it works
    • Folder Search With Extraction
      • Run it
      • What you should see
      • How it works
    • Mark Sensitive Text for Redaction
      • Run it (mock mode)
      • Run it (real model)
      • What you should see
    • Sequence Graph With Markov Analysis
      • Run it
      • What you should see
      • How to interpret the output
    • How to run the tutorials

Concepts

  • Text Extraction Pipeline
    • What extraction produces
      • Output structure
    • Reproducibility checklist
    • Available Extractors
      • Text & Document Processing
      • Optical Character Recognition
      • Vision-Language Models
      • Speech-to-Text
      • Pipeline Utilities
    • How selection chooses text
    • Pipeline extractor
    • Complementary versus competing extractors
    • Example: extract from a corpus
    • Example: selection within a pipeline
    • Example: PDF with OCR fallback
    • Example: VLM for complex documents
    • Inspecting and deleting extraction snapshots
    • Common pitfalls
    • Use extracted text in retrieval
    • Evaluate extraction quality
    • What extraction is not
  • Retrieval
    • Retrieval concepts
    • How retrieval snapshots work
    • A minimal run you can execute
    • Backends
    • Choosing a backend
    • Evaluation
    • Evidence inspection workflow
    • Saving evidence for later analysis
    • Labs and demos
    • Reproducibility checklist
    • Why the separation matters
    • Retrieval quality
  • Corpus analysis
    • How analysis snapshots work
    • Analysis snapshot artifacts
    • Inspecting output
    • Comparing analysis snapshots
    • Pluggable analysis backends
    • Choosing an analysis backend
    • Configuration files
    • Topic modeling
    • Markov analysis
    • Profiling analysis
      • Minimal profiling run
  • Graph extraction
    • Where graph extraction sits
    • Why use graph extraction
    • Core concepts
    • Deterministic identifiers
    • Baseline graph extractors
    • Graph storage model
    • Graph extractor interface
    • Running graph extraction
    • Example configurations
    • Simple entities extractor
    • NER entities extractor
    • Dependency relations extractor
    • Querying a logical graph
    • Graph-aware retrieval
    • Local Neo4j setup
    • Repeatable integration script
    • Narrative demo script
    • Narrative demo script (all extractors)
    • Reproducibility checklist
    • Next steps

Core Building Blocks

  • Corpus design
    • What exists today
    • Core vocabulary for this document
    • Day to day corpus workflows
    • Decision points with options and recommendations
    • Locked decisions
      • Decision 1: corpus ignore rules
      • Decision 2: large item ingestion and streaming
      • Decision 3: content aware filename and media type detection
      • Decision 4: folder tree import semantics
      • Decision 5: website crawl scope and safety
      • Decision 6: editorial workflow and reversible pruning
      • Decision 6A: derived artifact storage is partitioned by plugin type
      • Decision 6B: extraction is a separate plugin stage from retrieval
    • Lifecycle hooks and where plugins can attach
      • Hook points to consider
      • Decision 7: hook protocol design
      • Decision 8: how hook execution is recorded
    • Outcomes and remaining questions
      • Hook contexts implemented in version zero
      • Hook log schema implemented in version zero
      • Remaining design questions
    • First behavior driven development slices implemented in version zero
    • Reproducibility checklist
    • Common pitfalls
  • Knowledge base
    • What it does
    • Minimal use
    • Default behavior
    • Output structure
    • Overrides
    • How it relates to lower-level control
    • Reproducibility checklist
    • Common pitfalls
  • Adding a Retrieval Backend
    • Backend contract
    • Run artifacts
    • Implementation checklist
    • Design notes
    • Reproducibility checklist
    • Common pitfalls
    • Examples
  • Retrieval Backends
    • Scan Backend
      • Overview
      • Installation
      • When to Use
        • Good Use Cases
        • Not Recommended For
      • Configuration
        • Config Schema
        • Configuration Options
      • Usage
        • Command Line
        • Python API
        • With Extraction Runs
      • How It Works
        • Query Processing
        • Scoring Algorithm
        • Snippet Extraction
      • Performance
        • Build Time
        • Query Time
        • Memory Usage
        • Disk Usage
      • Examples
        • Quick Development Search
        • Baseline Comparison
        • Ad-hoc Exploration
      • Limitations
        • Scalability
        • Ranking Quality
        • Query Features
      • When to Upgrade
      • Error Handling
        • Missing Extraction Run
        • Non-Text Items
      • Statistics
      • Related Backends
      • See Also
    • SQLite Full-Text Search Backend
      • Overview
      • Installation
        • Requirements
        • Verify FTS5 Support
      • When to Use
        • Good Use Cases
        • Not Recommended For
      • Configuration
        • Config Schema
        • Configuration Options
        • Chunking Strategy
      • Usage
        • Command Line
        • Python API
        • With Extraction Runs
      • How It Works
        • Index Building
        • Query Processing
        • BM25 Ranking
      • Performance
        • Build Time
        • Query Time
        • Memory Usage
        • Disk Usage
      • Examples
        • Production Deployment
        • Tuned for Large Documents
        • Multi-Format Corpus
        • Query with Context
      • Advanced Configuration
        • SQLite Query Syntax
        • Rebuilding Indexes
      • Limitations
        • Query Features
        • Scalability
        • Ranking
      • When to Upgrade
      • Error Handling
        • FTS5 Not Available
        • Missing Extraction Run
        • Invalid Configuration
      • Statistics
      • Index Artifacts
      • Related Backends
      • See Also
    • TF Vector backend
      • When to use it
      • Backend ID
      • How it works
      • Configuration
      • Build a run
      • Query a run
      • What it is not
    • Embedding index (in-memory)
      • Backend ID
      • What it builds
      • Chunking
      • Dependencies
    • Embedding index (file-backed)
      • Backend ID
      • What it builds
      • Chunking
      • Dependencies
    • Available Backends
      • scan
      • sqlite-full-text-search
      • tf-vector
      • embedding-index-inmemory
      • embedding-index-file
    • Quick Start
      • Installation
      • Basic Usage
        • Command Line
        • Python API
    • Choosing a Backend
    • Reproducibility checklist
    • Common pitfalls
    • Performance Comparison
      • Scan Backend
      • SQLite Full-Text Search Backend
    • Common Patterns
      • Development Workflow
      • Production Deployment
      • Baseline Comparison
      • Using Extracted Text
    • Backend Configuration
      • Common Configuration Options
      • Backend-Specific Options
        • Scan Backend
        • SQLite Full-Text Search Backend
    • Architecture
      • Backend Interface
      • Evidence Model
    • Implementing Custom Backends
    • See Also
  • Context packs
    • Minimal policy
      • Output structure
      • Before and after example
    • Policy surfaces
      • Ordering
      • Metadata inclusion
      • Character budgets
    • Command-line interface
    • Reproducibility checklist
    • What context pack building does
    • Common pitfalls
    • Token budgets
  • Context Engine
    • Why Context Engine?
    • Core Concepts
    • Basic Usage
    • Retriever Packs
    • Expansion and Pagination
    • Compaction Strategies
    • FAQ
      • What does “elastic” mean?
      • How is pagination used?
      • Does this replace Context packs?

Extraction and Ingestion

  • Extraction evaluation
    • What extraction evaluation measures
    • Dataset format
    • Run extraction evaluation from the CLI
    • Run extraction evaluation from Python
    • Output location
    • Reading the output
    • Working demo
    • Extraction evaluation lab
      • Lab walkthrough
    • Interpretation tips
    • Common pitfalls
  • Text Extractors
    • Text & Document Processing
      • Pass-Through Text Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Usage
        • Examples
        • Behavior Details
        • Performance
        • Error Handling
        • Use Cases
        • Related Extractors
        • See Also
      • Metadata Text Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Usage
        • Examples
        • Output Format
        • Behavior Details
        • Performance
        • Error Handling
        • Use Cases
        • Best Practices
        • Related Extractors
        • See Also
      • PDF Text Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Usage
        • Examples
        • Behavior Details
        • Performance
        • Error Handling
        • Use Cases
        • When to Use PDF Text vs Alternatives
        • Best Practices
        • Related Extractors
        • See Also
      • MarkItDown Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Usage
        • Examples
        • Output Format
        • Performance
        • Error Handling
        • Use Cases
        • When to Use MarkItDown vs Alternatives
        • Best Practices
        • Related Extractors
        • See Also
      • Unstructured Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Usage
        • Examples
        • Output Format
        • Performance
        • Error Handling
        • Use Cases
        • When to Use Unstructured vs Alternatives
        • Best Practices
        • Comparison with Other Extractors
        • Related Extractors
        • See Also
      • Overview
      • Available Extractors
        • pass-through-text
        • metadata-text
        • pdf-text
        • markitdown
        • unstructured
      • Choosing an Extractor
      • Common Patterns
        • Fallback Chain
        • Metadata + Content
      • See Also
    • Optical Character Recognition (OCR)
      • RapidOCR Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Usage
        • Examples
        • Confidence Scores
        • Performance
        • Error Handling
        • Use Cases
        • When to Use RapidOCR vs Alternatives
        • Best Practices
        • Image Quality Tips
        • Related Extractors
        • See Also
      • PaddleOCR-VL Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Usage
        • Examples
        • Inference Backends
        • Language Support
        • Performance
        • Error Handling
        • Use Cases
        • When to Use PaddleOCR-VL vs Alternatives
        • Configuration via User Config
        • Best Practices
        • Related Extractors
        • See Also
      • Biblicus Document Understanding Benchmark
        • Why a Multi-Category Benchmark?
        • Quick Start
        • Document Categories
        • Metrics
        • Scoring Strategy
        • Running Benchmarks
        • Understanding Results
        • Pipeline Recommendations
        • Adding Custom Pipelines
        • Dataset Downloads
        • Licensing
        • See Also
      • OCR Pipeline Benchmarking Guide
        • Table of Contents
        • Overview
        • Quick Start
        • Benchmark Dataset
        • Available Pipelines
        • Understanding Metrics
        • Running Benchmarks
        • Custom Pipelines
        • Results Analysis
        • Dependencies
        • Troubleshooting
        • See Also
      • Biblicus Benchmark Results
        • Executive Summary
        • Forms Category (FUNSD) - Latest Results
        • Receipts Category (SROIE) - Previous Results
        • Pipeline Comparison Summary
        • Recommendations by Use Case
        • Metric Interpretation Guide
        • Benchmark Reproducibility
        • Understanding Pipeline Trade-offs
        • Known Issues and Limitations
        • Future Work
        • See Also
      • Heron Layout Detection Implementation
        • Summary
        • What is Heron?
        • Implementation
        • Complete Workflow Status
        • Benchmark Results
        • Key Findings
        • When to Use Heron
        • Architecture
        • Usage
        • Dependencies
        • Comparison with Original Research
        • Future Work
        • References
        • Conclusion
      • Layout-Aware OCR Implementation Results
        • Implementation
        • Benchmark Results
        • Key Findings
        • Analysis
        • Comparison with PaddleOCR Direct
        • Implementation Status
        • Files Modified/Created
        • Conclusion
      • Overview
      • Available Extractors
        • ocr-rapidocr
        • ocr-paddleocr-vl
      • OCR vs VLM Document Understanding
        • When to Use OCR
        • When to Use VLM
      • Choosing an Extractor
      • Common Patterns
        • Fallback Chain
        • Multi-Strategy Selection
        • Document Type Routing
      • Performance Considerations
        • RapidOCR
        • PaddleOCR VL
        • VLM Alternatives
      • See Also
    • Vision-Language Models (VLM) for Document Understanding
      • SmolDocling-256M Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Usage
        • Examples
        • Performance
        • Error Handling
        • Related Extractors
        • See Also
      • Granite Docling-258M Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Usage
        • Examples
        • Performance
        • Error Handling
        • Use Cases
        • Related Extractors
        • See Also
      • Overview
      • Available Extractors
        • docling-smol
        • docling-granite
      • VLM vs Traditional OCR
        • Use VLM When:
        • Use Traditional OCR When:
      • Choosing a VLM Extractor
      • Performance Comparison
        • SmolDocling-256M
        • Granite Docling-258M
      • Backend Options
        • MLX Backend (Apple Silicon)
        • Transformers Backend (Cross-Platform)
      • Output Formats
        • Markdown (Default)
        • HTML
        • Plain Text
      • Common Patterns
        • Fallback to OCR
        • Speed vs Accuracy Trade-off
        • Backend Selection
      • Installation Guide
        • Apple Silicon (Recommended)
        • Other Platforms
        • Both Extractors
      • Supported Document Types
      • See Also
    • Speech-to-Text (STT)
      • OpenAI Whisper Speech-to-Text Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Usage
        • Examples
        • API Configuration
        • Language Support
        • Performance
        • Error Handling
        • Hallucination Suppression
        • Prompt Guidance
        • Use Cases
        • When to Use OpenAI vs Deepgram
        • Best Practices
        • Related Extractors
        • See Also
      • Deepgram Speech-to-Text Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Usage
        • Examples
        • API Configuration
        • Language Support
        • Smart Formatting
        • Speaker Diarization
        • Structured Metadata
        • Performance
        • Error Handling
        • Use Cases
        • When to Use Deepgram vs OpenAI
        • Best Practices
        • Advanced Features
        • Related Extractors
        • See Also
      • Aldea Speech-to-Text Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Usage
        • Authentication
        • Response and Metadata
        • Error Handling
        • See Also
      • Deepgram Transform Extractor
        • Overview
        • Installation
        • Configuration
        • Usage
        • Notes
      • Overview
      • Available Extractors
        • stt-openai
        • stt-deepgram
        • stt-aldea
        • deepgram-transform
      • Choosing an Extractor
      • Performance Comparison
        • OpenAI Whisper
        • Deepgram Nova-3
      • Common Patterns
        • Fallback Chain
        • Language-Specific Routing
        • Speaker Diarization
      • Authentication
        • Environment Variables
        • Configuration File
      • Supported Audio Formats
      • See Also
    • Pipeline Utilities
      • Select Text Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Selection Rules
        • Usage
        • Examples
        • Behavior Details
        • When to Use Select-Text
        • Best Practices
        • Use Cases
        • Comparison with Other Selectors
        • Related Extractors
        • See Also
      • Select Longest Text Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Selection Rules
        • Usage
        • Examples
        • Behavior Details
        • When to Use Select-Longest-Text
        • Best Practices
        • Use Cases
        • Comparison with Other Selectors
        • Performance Considerations
        • Related Extractors
        • See Also
      • Select Override Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Selection Rules
        • Usage
        • Examples
        • Behavior Details
        • When to Use Select-Override
        • Best Practices
        • Use Cases
        • Comparison with Other Selectors
        • Related Extractors
        • See Also
      • Select Smart Override Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Selection Rules
        • Usage
        • Examples
        • Behavior Details
        • When to Use Select-Smart-Override
        • Best Practices
        • Use Cases
        • Tuning Guidelines
        • Comparison with Other Selectors
        • Related Extractors
        • See Also
      • Pipeline Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Usage
        • Pipeline Patterns
        • Examples
        • Behavior Details
        • Performance Considerations
        • Best Practices
        • Common Pipeline Recipes
        • Limitations
        • Related Extractors
        • See Also
      • Overview
      • Available Extractors
        • Selection Extractors
        • Override Extractors
        • Composition Extractor
      • Common Patterns
        • PDF Fallback Strategy
        • Media Type Routing
        • Maximum Coverage
        • Metadata + Content
        • Selective Override
      • Decision Tree
        • Which Utility Extractor to Use?
      • Performance Considerations
        • select-text (Short-Circuit)
        • select-longest-text (Parallel)
        • select-override (Conditional)
        • select-smart-override (Pattern-Based)
        • pipeline (Sequential)
      • See Also
    • Extractor Categories
      • Text & Document Processing
      • Optical Character Recognition (OCR)
      • Vision-Language Models (VLM)
      • Speech-to-Text (STT)
      • Pipeline Utilities
    • Quick Start
      • Installation
      • Basic Usage
        • Command Line
        • Python API
    • Choosing an Extractor
      • For PDF Documents
      • For Office Documents
      • For Images
      • For Audio/Video
      • For Multiple Strategies
    • See Also

Retrieval and Evaluation

  • Retrieval quality upgrades
    • Goals
    • Available upgrades
      • 1) Tuned lexical baseline
      • 2) Reranking stage
      • 3) Hybrid retrieval
    • Evaluation guidance
    • Interpreting evidence signals
    • Budget awareness
    • Evidence tracing checklist
    • Non-goals
    • Summary
  • Retrieval evaluation
    • Dataset format
    • Metrics primer
    • Running an evaluation
    • End-to-end evaluation example
    • Authoring a dataset from a corpus
    • Retrieval evaluation lab
    • Output
    • Reading per-query diagnostics
    • What to record for comparisons
    • Common pitfalls
    • Python usage
    • Design notes
  • Embedding Retrieval
    • Why embedding retrieval?
      • What problem does this solve?
    • Concepts
    • A local, textbook embedding index
    • Build and query
    • Evidence and provenance

Analysis and Modeling

  • Topic modeling
    • What topic modeling does
    • About BERTopic
    • Pipeline stages
    • Run topic modeling from the CLI
    • Output structure
      • Reading a topic record
    • Configuration reference
      • Text source
      • LLM extraction
      • Entity removal
      • Lexical processing
      • BERTopic configuration
      • LLM fine-tuning
    • Vectorizer configuration
    • Repeatable integration script
      • Example: raise topic count
      • Example: disable lexical processing and restrict inputs
      • Example: keep lexical processing but preserve punctuation
    • Tuning workflow
    • Interpreting results
    • Common pitfalls
  • Markov analysis
    • Observation encoder configurability
    • Topic-driven observations
    • What Markov analysis does
    • Text extract segmentation
    • Run Markov analysis from the CLI
      • Cascading configurations and CLI overrides
    • LLM observation cache
    • Output location and artifacts
    • Working demo

Tools

  • Utilities
    • Current utility families
    • Design stance
  • Text utilities
    • The old way: “Just return JSON”
    • The Virtual File Pattern
    • A set of useful patterns
    • The virtual file editor pattern
    • Efficiently Handling Long Spans
      • Mechanism example
    • Prompting Paradigm
    • Validation and confirmation
    • Human-facing utilities
    • Reliable coordination: system prompts and feedback
      • Built-in safeguards
      • Feedback examples (retry stories)
      • Why the safeguards matter
    • Where to go next
  • Entity removal
    • When to use it
    • Installation
    • Configuration
      • Notes
    • Output visibility
  • Text extract
    • How text extract works
      • Mechanism example
    • Data model
    • Output contract
    • Example: Python API
    • Example: Verb markup task
    • Integration examples
      • Extract paragraphs
      • Extract first sentences per paragraph
      • Extract money quotes
      • Extract verbs
      • Extract grouped speaker statements
    • Example: Markov analysis segmentation
    • Validation rules
    • Testing
  • Text slice
    • How text slice works
      • Mechanism example
    • Data model
    • Output contract
    • Example: Python API
    • Integration examples
      • Slice sentences
      • Slice by speaker grouping
    • Validation rules
    • Testing
  • Text annotate
    • How text annotate works
      • Mechanism example
    • Data model
    • Output contract
    • Example: Python API
    • Concept: Text Annotate FAQ
      • What problem does this solve?
      • How is it different from text extract?
      • Why not just use JSON output?
      • Is this a replacement for NER or classification models?
  • Text redact
    • How text redact works
      • Mechanism example
      • Redaction types example
    • Data model
    • Output contract
    • Example: Python API
  • Text link
    • How text link works
      • Mechanism example
    • Data model
    • Output contract
    • Example: Python API

Operations and Demos

  • Demos
    • Working examples you can run now
      • Install for local development
      • Create a corpus and ingest a few items
      • Show an item
      • Edit raw files and reindex
      • Crawl a website prefix
      • Build an extraction snapshot
      • Graph extraction demo
      • Graph extraction integration run
      • Graph extraction baselines
      • Graph extractor narrative demos
      • Graph extractor narrative demos (all extractors)
      • Topic modeling integration run
      • Extraction evaluation demo run
      • Extraction evaluation lab run
      • Retrieval evaluation lab run
      • Profiling analysis demo
      • Select extracted text within a pipeline
      • Portable Document Format extraction and retrieval
      • MarkItDown extraction demo (Python 3.10+)
      • Mixed modality integration corpus
      • Image samples (for optical character recognition experiments)
      • Optional: Unstructured as a last-resort extractor
      • Optional: Speech to text for audio items
      • Build and query the minimal backend
      • Build and query the practical backend
      • Run the test suite and view coverage
    • Documentation map
  • User configuration
    • Where it looks
    • File format
    • Reproducibility notes
    • Example: OpenAI speech to text
    • Example: Deepgram speech to text
    • Example: Aldea speech to text
    • Source profiles (remote collections and corpora)
    • Example: Neo4j graph extraction
    • Common pitfalls

Reference

  • Feature index
    • Corpus
    • Import and ignore rules
    • Streaming ingest
    • Lifecycle hooks
    • User configuration files
    • Text extraction stage
    • Extraction evaluation
    • Graph extraction stage
    • Retrieval backends
    • Evaluation
    • Context packs
    • Context engine
    • Knowledge base
    • Text utilities
    • Text extract
    • Text slice
    • Text annotate
    • Text redact
    • Text link
    • Testing, coverage, and documentation build
    • Integration corpora
  • Roadmap
    • Principles
    • Completed foundations
      • Retrieval evaluation and datasets
      • Retrieval quality upgrades
      • Context pack policy surfaces
      • Extraction evaluation harness
      • Corpus analysis tools
      • Sequence analysis (Markov analysis)
      • Text utilities
    • Next: Tactus integration
    • Later: alternate backends and hosting modes
    • Deferred: corpus and extraction work
      • In-memory corpus for ephemeral workflows
  • Biblicus Architecture
    • Core Concepts
    • Design Principles
    • The Python Developer Mental Model
    • Evidence Lifecycle
    • Relationship to Agent Frameworks
    • Where to go next
  • Application Programming Interface Reference
    • Core
      • Corpus
        • Corpus.analysis_dir
        • Corpus.analysis_run_dir()
        • Corpus.analysis_runs_dir
        • Corpus.catalog_generated_at()
        • Corpus.catalog_path
        • Corpus.create_crawl_id()
        • Corpus.delete_extraction_snapshot()
        • Corpus.extracted_dir
        • Corpus.extraction_snapshot_dir()
        • Corpus.extraction_snapshots_dir
        • Corpus.find()
        • Corpus.get_item()
        • Corpus.graph_dir
        • Corpus.graph_snapshot_dir()
        • Corpus.graph_snapshots_dir
        • Corpus.has_items()
        • Corpus.import_tree()
        • Corpus.ingest_crawled_payload()
        • Corpus.ingest_item()
        • Corpus.ingest_item_stream()
        • Corpus.ingest_note()
        • Corpus.ingest_source()
        • Corpus.init()
        • Corpus.latest_extraction_snapshot_reference()
        • Corpus.latest_snapshot_id
        • Corpus.list_extraction_snapshots()
        • Corpus.list_items()
        • Corpus.load_catalog()
        • Corpus.load_extraction_snapshot_manifest()
        • Corpus.load_snapshot()
        • Corpus.name
        • Corpus.open()
        • Corpus.pull_source()
        • Corpus.purge()
        • Corpus.read_extracted_text()
        • Corpus.reindex()
        • Corpus.retrieval_dir
        • Corpus.snapshots_dir
        • Corpus.uri
        • Corpus.write_snapshot()
      • KnowledgeBase
        • KnowledgeBase.context_pack()
        • KnowledgeBase.corpus
        • KnowledgeBase.defaults
        • KnowledgeBase.from_folder()
        • KnowledgeBase.query()
        • KnowledgeBase.retriever_id
        • KnowledgeBase.snapshot
      • KnowledgeBaseDefaults
        • KnowledgeBaseDefaults.configuration_name
        • KnowledgeBaseDefaults.model_config
        • KnowledgeBaseDefaults.query_budget
        • KnowledgeBaseDefaults.retriever_id
        • KnowledgeBaseDefaults.tags
Text link

Text link is a reusable utility for linking repeated mentions of the same entity (coreference resolution) without re-emitting the text.

If you ask a model to “return a list of all entity mentions and their canonical IDs,” you face the same hallucination and cost issues as other extraction tasks.

Text link uses the virtual file pattern to handle this in-place. Biblicus asks the model to wrap mentions in XML tags with id/ref attributes (e.g., <span id="link_1">...</span> and <span ref="link_1">...</span>). The model returns a small edit script; Biblicus applies it and parses the marked-up text into a graph of connected spans. This lets you resolve entities and structure relationships without regenerating the content.

How text link works

  1. Biblicus loads the full text into memory.

  2. The model receives the text and returns an edit script with str_replace operations.

  3. Biblicus applies the operations and validates id/ref rules.

  4. The marked-up string is parsed into ordered linked spans.
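Step 4 can be illustrated with a small parser sketch (assumed logic, not the library's implementation): it walks the marked-up string, records each span's attribute, and computes character offsets relative to the original, tag-free text.

```python
import re

# Matches one marked-up span; assumes spans are not nested.
TAG = re.compile(r'<span (id|ref)="([^"]+)">(.*?)</span>')

def parse_spans(marked_up: str) -> list:
    """Parse marked-up text into ordered spans with plain-text offsets."""
    spans = []
    markup_seen = 0  # characters of tag markup encountered so far
    for index, match in enumerate(TAG.finditer(marked_up), start=1):
        attr, value, text = match.groups()
        start = match.start() - markup_seen  # offset in the original text
        spans.append({
            "index": index,
            "start_char": start,
            "end_char": start + len(text),
            "text": text,
            "attributes": {attr: value},
        })
        markup_seen += len(match.group(0)) - len(text)
    return spans

marked = (
    '<span id="link_1">Acme launched a product</span>. '
    'Later, <span ref="link_1">Acme reported results</span>.'
)
spans = parse_spans(marked)
```

Run on the marked-up example from this page, this produces spans in the same shape as the structured data shown below.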

Mechanism example

Biblicus supplies an internal protocol prompt that defines the edit rules and embeds the current text:

Internal protocol (excerpt):

You are a virtual file editor. Use the available tools to edit the text.
Interpret the word "return" in the user's request as: wrap the returned text with
<span ATTRIBUTE="VALUE">...</span> in-place in the current text.
Each span must include exactly one attribute: id for first mentions and ref for repeats.
Id values must start with "link_".
Current text:
---
Acme launched a product. Later, Acme reported results.
---

Then provide a short user prompt describing what to return:

User prompt:

Link repeated mentions of the same company to the first mention.

The input text is the same content embedded in the internal protocol:

Input text:

Acme launched a product. Later, Acme reported results.

The model edits the virtual file by inserting tags in-place:

Marked-up text:

<span id="link_1">Acme launched a product</span>. Later, <span ref="link_1">Acme reported results</span>.

Biblicus returns structured data parsed from the markup:

Structured data (result):

{
  "marked_up_text": "<span id=\"link_1\">Acme launched a product</span>. Later, <span ref=\"link_1\">Acme reported results</span>.",
  "spans": [
    {
      "index": 1,
      "start_char": 0,
      "end_char": 25,
      "text": "Acme launched a product",
      "attributes": {"id": "link_1"}
    },
    {
      "index": 2,
      "start_char": 33,
      "end_char": 53,
      "text": "Acme reported results",
      "attributes": {"ref": "link_1"}
    }
  ],
  "warnings": []
}

Data model

Text link uses Pydantic models for strict validation:

  • TextLinkRequest: input text + LLM config + prompt template + id prefix.

  • TextLinkResult: marked-up text and linked spans.

Internal protocol templates (advanced overrides) must include {text}. Prompt templates must not include {text} and should only describe what to return. The internal protocol template can interpolate the id prefix via Jinja2.

Most callers only supply the user prompt and text. Override system_prompt only when you need to customize the edit protocol.
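Inferred from the bullets above and the structured data in the mechanism example, the result has roughly this shape. This is a sketch only: the real library uses Pydantic models, and any field names beyond marked_up_text, spans, and warnings are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class LinkedSpan:
    # One marked-up span, in document order.
    index: int
    start_char: int
    end_char: int
    text: str
    attributes: Dict[str, str]  # {"id": ...} for first mentions, {"ref": ...} for repeats

@dataclass
class TextLinkResultSketch:
    # Mirrors the structured data shown in the mechanism example.
    marked_up_text: str
    spans: List[LinkedSpan]
    warnings: List[str] = field(default_factory=list)
```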

Output contract

Text link is tool-driven. The model must use tool calls instead of returning JSON in the assistant message.

Tool call arguments:

str_replace(old_str="Acme launched a product", new_str="<span id=\"link_1\">Acme</span> launched a product")
str_replace(old_str="Acme reported results", new_str="<span ref=\"link_1\">Acme</span> reported results")
done()
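These calls can be replayed against the virtual file with a sketch like the following. The str_replace function here is a stand-in for the tool the model invokes, not the library's internals; it enforces the exactly-once match rule described below.

```python
def str_replace(text, old_str, new_str):
    """Apply one edit, enforcing the exactly-once match rule."""
    count = text.count(old_str)
    if count != 1:
        raise ValueError(f"old_str must match exactly once, found {count} matches")
    return text.replace(old_str, new_str, 1)

text = "Acme launched a product. Later, Acme reported results."
# "Acme" alone occurs twice, so each old_str includes enough surrounding
# words to pin down a unique match.
text = str_replace(text, "Acme launched a product",
                   '<span id="link_1">Acme</span> launched a product')
text = str_replace(text, "Acme reported results",
                   '<span ref="link_1">Acme</span> reported results')
```

Note how the exactly-once rule forces the model to include disambiguating context in old_str whenever the mention text alone is ambiguous.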

Rules:

  • Use the str_replace tool only.

  • Each old_str must match exactly once.

  • Each new_str must be the same text with span tags inserted.

  • Use id for first mentions and ref for repeats.

  • Id values must start with the configured prefix.

  • Id/ref spans must wrap the same repeated text (avoid wrapping extra words).

Long-span handling: the system prompt instructs the model to insert <span> and </span> in separate str_replace calls for long passages (single-call insertion is allowed for short spans). This is covered by unit tests in tests/test_text_utility_tool_calls.py.
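The id/ref rules above can be checked mechanically. A sketch of such a validator (assumed logic, not the library's actual validation) collects ids in document order and flags prefix violations, duplicates, and dangling refs:

```python
import re

# Matches the opening tag of each marked-up span.
TAG = re.compile(r'<span (id|ref)="([^"]+)">')

def validate_links(marked_up, id_prefix="link_"):
    """Return warnings for spans that break the id/ref rules."""
    warnings = []
    seen_ids = set()
    for attr, value in TAG.findall(marked_up):
        if attr == "id":
            if not value.startswith(id_prefix):
                warnings.append(f"id {value!r} does not start with {id_prefix!r}")
            if value in seen_ids:
                warnings.append(f"duplicate id {value!r}")
            seen_ids.add(value)
        elif value not in seen_ids:
            warnings.append(f"ref {value!r} has no preceding id")
    return warnings
```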

Example: Python API

from biblicus.ai.models import AiProvider, LlmClientConfig
from biblicus.text import TextLinkRequest, apply_text_link

request = TextLinkRequest(
    text="Acme launched a product. Later, Acme reported results.",
    client=LlmClientConfig(provider=AiProvider.OPENAI, model="gpt-4o-mini"),
    prompt_template="Link repeated mentions of the same company to the first mention.",
    id_prefix="link_",
)
result = apply_text_link(request)
print(result.marked_up_text)