Biblicus

Tutorials

  • Use Cases
    • Notes to Context Pack
      • Run it
      • What you should see
      • How it works
    • Folder Search With Extraction
      • Run it
      • What you should see
      • How it works
    • Mark Sensitive Text for Redaction
      • Run it (mock mode)
      • Run it (real model)
      • What you should see
    • Sequence Graph With Markov Analysis
      • Run it
      • What you should see
      • How to interpret the output
    • How to run the tutorials

Concepts

  • Text Extraction Pipeline
    • What extraction produces
      • Output structure
    • Reproducibility checklist
    • Available Extractors
      • Text & Document Processing
      • Optical Character Recognition
      • Vision-Language Models
      • Speech-to-Text
      • Pipeline Utilities
    • How selection chooses text
    • Pipeline extractor
    • Complementary versus competing extractors
    • Example: extract from a corpus
    • Example: selection within a pipeline
    • Example: PDF with OCR fallback
    • Example: VLM for complex documents
    • Inspecting and deleting extraction snapshots
    • Common pitfalls
    • Use extracted text in retrieval
    • Evaluate extraction quality
    • What extraction is not
  • Retrieval
    • Retrieval concepts
    • How retrieval snapshots work
    • A minimal run you can execute
    • Backends
    • Choosing a backend
    • Evaluation
    • Evidence inspection workflow
    • Saving evidence for later analysis
    • Labs and demos
    • Reproducibility checklist
    • Why the separation matters
    • Retrieval quality
  • Corpus analysis
    • How analysis snapshots work
    • Analysis snapshot artifacts
    • Inspecting output
    • Comparing analysis snapshots
    • Pluggable analysis backends
    • Choosing an analysis backend
    • Configuration files
    • Topic modeling
    • Markov analysis
    • Profiling analysis
      • Minimal profiling run
  • Graph extraction
    • Where graph extraction sits
    • Why use graph extraction
    • Core concepts
    • Deterministic identifiers
    • Baseline graph extractors
    • Graph storage model
    • Graph extractor interface
    • Running graph extraction
    • Example configurations
    • Simple entities extractor
    • NER entities extractor
    • Dependency relations extractor
    • Querying a logical graph
    • Graph-aware retrieval
    • Local Neo4j setup
    • Repeatable integration script
    • Narrative demo script
    • Narrative demo script (all extractors)
    • Reproducibility checklist
    • Next steps

Core Building Blocks

  • Corpus design
    • What exists today
    • Core vocabulary for this document
    • Day to day corpus workflows
    • Decision points with options and recommendations
    • Locked decisions
      • Decision 1: corpus ignore rules
      • Decision 2: large item ingestion and streaming
      • Decision 3: content aware filename and media type detection
      • Decision 4: folder tree import semantics
      • Decision 5: website crawl scope and safety
      • Decision 6: editorial workflow and reversible pruning
      • Decision 6A: derived artifact storage is partitioned by plugin type
      • Decision 6B: extraction is a separate plugin stage from retrieval
    • Lifecycle hooks and where plugins can attach
      • Hook points to consider
      • Decision 7: hook protocol design
      • Decision 8: how hook execution is recorded
    • Outcomes and remaining questions
      • Hook contexts implemented in version zero
      • Hook log schema implemented in version zero
      • Remaining design questions
    • First behavior driven development slices implemented in version zero
    • Reproducibility checklist
    • Common pitfalls
  • Knowledge base
    • What it does
    • Minimal use
    • Default behavior
    • Output structure
    • Overrides
    • How it relates to lower‑level control
    • Reproducibility checklist
    • Common pitfalls
  • Adding a Retrieval Backend
    • Backend contract
    • Run artifacts
    • Implementation checklist
    • Design notes
    • Reproducibility checklist
    • Common pitfalls
    • Examples
  • Retrieval Backends
    • Scan Backend
      • Overview
      • Installation
      • When to Use
        • Good Use Cases
        • Not Recommended For
      • Configuration
        • Config Schema
        • Configuration Options
      • Usage
        • Command Line
        • Python API
        • With Extraction Runs
      • How It Works
        • Query Processing
        • Scoring Algorithm
        • Snippet Extraction
      • Performance
        • Build Time
        • Query Time
        • Memory Usage
        • Disk Usage
      • Examples
        • Quick Development Search
        • Baseline Comparison
        • Ad-hoc Exploration
      • Limitations
        • Scalability
        • Ranking Quality
        • Query Features
      • When to Upgrade
      • Error Handling
        • Missing Extraction Run
        • Non-Text Items
      • Statistics
      • Related Backends
      • See Also
    • SQLite Full-Text Search Backend
      • Overview
      • Installation
        • Requirements
        • Verify FTS5 Support
      • When to Use
        • Good Use Cases
        • Not Recommended For
      • Configuration
        • Config Schema
        • Configuration Options
        • Chunking Strategy
      • Usage
        • Command Line
        • Python API
        • With Extraction Runs
      • How It Works
        • Index Building
        • Query Processing
        • BM25 Ranking
      • Performance
        • Build Time
        • Query Time
        • Memory Usage
        • Disk Usage
      • Examples
        • Production Deployment
        • Tuned for Large Documents
        • Multi-Format Corpus
        • Query with Context
      • Advanced Configuration
        • SQLite Query Syntax
        • Rebuilding Indexes
      • Limitations
        • Query Features
        • Scalability
        • Ranking
      • When to Upgrade
      • Error Handling
        • FTS5 Not Available
        • Missing Extraction Run
        • Invalid Configuration
      • Statistics
      • Index Artifacts
      • Related Backends
      • See Also
    • TF Vector backend
      • When to use it
      • Backend ID
      • How it works
      • Configuration
      • Build a run
      • Query a run
      • What it is not
    • Embedding index (in-memory)
      • Backend ID
      • What it builds
      • Chunking
      • Dependencies
    • Embedding index (file-backed)
      • Backend ID
      • What it builds
      • Chunking
      • Dependencies
    • Available Backends
      • scan
      • sqlite-full-text-search
      • tf-vector
      • embedding-index-inmemory
      • embedding-index-file
    • Quick Start
      • Installation
      • Basic Usage
        • Command Line
        • Python API
    • Choosing a Backend
    • Reproducibility checklist
    • Common pitfalls
    • Performance Comparison
      • Scan Backend
      • SQLite Full-Text Search Backend
    • Common Patterns
      • Development Workflow
      • Production Deployment
      • Baseline Comparison
      • Using Extracted Text
    • Backend Configuration
      • Common Configuration Options
      • Backend-Specific Options
        • Scan Backend
        • SQLite Full-Text Search Backend
    • Architecture
      • Backend Interface
      • Evidence Model
    • Implementing Custom Backends
    • See Also
  • Context packs
    • Minimal policy
      • Output structure
      • Before and after example
    • Policy surfaces
      • Ordering
      • Metadata inclusion
      • Character budgets
    • Command-line interface
    • Reproducibility checklist
    • What context pack building does
    • Common pitfalls
    • Token budgets
  • Context Engine
    • Why Context Engine?
    • Core Concepts
    • Basic Usage
    • Retriever Packs
    • Expansion and Pagination
    • Compaction Strategies
    • FAQ
      • What does “elastic” mean?
      • How is pagination used?
      • Does this replace Context packs?

Extraction and Ingestion

  • Text Extraction Pipeline
    • What extraction produces
      • Output structure
    • Reproducibility checklist
    • Available Extractors
      • Text & Document Processing
      • Optical Character Recognition
      • Vision-Language Models
      • Speech-to-Text
      • Pipeline Utilities
    • How selection chooses text
    • Pipeline extractor
    • Complementary versus competing extractors
    • Example: extract from a corpus
    • Example: selection within a pipeline
    • Example: PDF with OCR fallback
    • Example: VLM for complex documents
    • Inspecting and deleting extraction snapshots
    • Common pitfalls
    • Use extracted text in retrieval
    • Evaluate extraction quality
    • What extraction is not
  • Extraction evaluation
    • What extraction evaluation measures
    • Dataset format
    • Run extraction evaluation from the CLI
    • Run extraction evaluation from Python
    • Output location
    • Reading the output
    • Working demo
    • Extraction evaluation lab
      • Lab walkthrough
    • Interpretation tips
    • Common pitfalls
  • Text Extractors
    • Text & Document Processing
      • Pass-Through Text Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Usage
        • Examples
        • Behavior Details
        • Performance
        • Error Handling
        • Use Cases
        • Related Extractors
        • See Also
      • Metadata Text Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Usage
        • Examples
        • Output Format
        • Behavior Details
        • Performance
        • Error Handling
        • Use Cases
        • Best Practices
        • Related Extractors
        • See Also
      • PDF Text Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Usage
        • Examples
        • Behavior Details
        • Performance
        • Error Handling
        • Use Cases
        • When to Use PDF Text vs Alternatives
        • Best Practices
        • Related Extractors
        • See Also
      • MarkItDown Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Usage
        • Examples
        • Output Format
        • Performance
        • Error Handling
        • Use Cases
        • When to Use MarkItDown vs Alternatives
        • Best Practices
        • Related Extractors
        • See Also
      • Unstructured Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Usage
        • Examples
        • Output Format
        • Performance
        • Error Handling
        • Use Cases
        • When to Use Unstructured vs Alternatives
        • Best Practices
        • Comparison with Other Extractors
        • Related Extractors
        • See Also
      • Overview
      • Available Extractors
        • pass-through-text
        • metadata-text
        • pdf-text
        • markitdown
        • unstructured
      • Choosing an Extractor
      • Common Patterns
        • Fallback Chain
        • Metadata + Content
      • See Also
    • Optical Character Recognition (OCR)
      • RapidOCR Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Usage
        • Examples
        • Confidence Scores
        • Performance
        • Error Handling
        • Use Cases
        • When to Use RapidOCR vs Alternatives
        • Best Practices
        • Image Quality Tips
        • Related Extractors
        • See Also
      • PaddleOCR-VL Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Usage
        • Examples
        • Inference Backends
        • Language Support
        • Performance
        • Error Handling
        • Use Cases
        • When to Use PaddleOCR-VL vs Alternatives
        • Configuration via User Config
        • Best Practices
        • Related Extractors
        • See Also
      • Biblicus Document Understanding Benchmark
        • Why a Multi-Category Benchmark?
        • Quick Start
        • Document Categories
        • Metrics
        • Scoring Strategy
        • Running Benchmarks
        • Understanding Results
        • Pipeline Recommendations
        • Adding Custom Pipelines
        • Dataset Downloads
        • Licensing
        • See Also
      • OCR Pipeline Benchmarking Guide
        • Table of Contents
        • Overview
        • Quick Start
        • Benchmark Dataset
        • Available Pipelines
        • Understanding Metrics
        • Running Benchmarks
        • Custom Pipelines
        • Results Analysis
        • Dependencies
        • Troubleshooting
        • See Also
      • Biblicus Benchmark Results
        • Executive Summary
        • Forms Category (FUNSD) - Latest Results
        • Receipts Category (SROIE) - Previous Results
        • Pipeline Comparison Summary
        • Recommendations by Use Case
        • Metric Interpretation Guide
        • Benchmark Reproducibility
        • Understanding Pipeline Trade-offs
        • Known Issues and Limitations
        • Future Work
        • See Also
      • Heron Layout Detection Implementation - COMPLETE
        • Summary
        • What is Heron?
        • Implementation
        • Workflow-based Complete Workflow Status
        • Benchmark Results
        • Key Findings
        • When to Use Heron
        • Architecture
        • Usage
        • Dependencies
        • Comparison with Original Research
        • Future Work
        • References
        • Conclusion
      • Layout-Aware OCR Implementation Results
        • Implementation
        • Benchmark Results
        • Key Findings
        • Analysis
        • Comparison with PaddleOCR Direct
        • Implementation Status
        • Files Modified/Created
        • Conclusion
      • Overview
      • Available Extractors
        • ocr-rapidocr
        • ocr-paddleocr-vl
      • OCR vs VLM Document Understanding
        • When to Use OCR
        • When to Use VLM
      • Choosing an Extractor
      • Common Patterns
        • Fallback Chain
        • Multi-Strategy Selection
        • Document Type Routing
      • Performance Considerations
        • RapidOCR
        • PaddleOCR VL
        • VLM Alternatives
      • See Also
    • Vision-Language Models (VLM) for Document Understanding
      • SmolDocling-256M Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Usage
        • Examples
        • Performance
        • Error Handling
        • Related Extractors
        • See Also
      • Granite Docling-258M Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Usage
        • Examples
        • Performance
        • Error Handling
        • Use Cases
        • Related Extractors
        • See Also
      • Overview
      • Available Extractors
        • docling-smol
        • docling-granite
      • VLM vs Traditional OCR
        • Use VLM When:
        • Use Traditional OCR When:
      • Choosing a VLM Extractor
      • Performance Comparison
        • SmolDocling-256M
        • Granite Docling-258M
      • Backend Options
        • MLX Backend (Apple Silicon)
        • Transformers Backend (Cross-Platform)
      • Output Formats
        • Markdown (Default)
        • HTML
        • Plain Text
      • Common Patterns
        • Fallback to OCR
        • Speed vs Accuracy Trade-off
        • Backend Selection
      • Installation Guide
        • Apple Silicon (Recommended)
        • Other Platforms
        • Both Extractors
      • Supported Document Types
      • See Also
    • Speech-to-Text (STT)
      • OpenAI Whisper Speech-to-Text Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Usage
        • Examples
        • API Configuration
        • Language Support
        • Performance
        • Error Handling
        • Hallucination Suppression
        • Prompt Guidance
        • Use Cases
        • When to Use OpenAI vs Deepgram
        • Best Practices
        • Related Extractors
        • See Also
      • Deepgram Speech-to-Text Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Usage
        • Examples
        • API Configuration
        • Language Support
        • Smart Formatting
        • Speaker Diarization
        • Structured Metadata
        • Performance
        • Error Handling
        • Use Cases
        • When to Use Deepgram vs OpenAI
        • Best Practices
        • Advanced Features
        • Related Extractors
        • See Also
      • Aldea Speech-to-Text Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Usage
        • Authentication
        • Response and Metadata
        • Error Handling
        • See Also
      • Deepgram Transform Extractor
        • Overview
        • Installation
        • Configuration
        • Usage
        • Notes
      • Overview
      • Available Extractors
        • stt-openai
        • stt-deepgram
        • stt-aldea
        • deepgram-transform
      • Choosing an Extractor
      • Performance Comparison
        • OpenAI Whisper
        • Deepgram Nova-3
      • Common Patterns
        • Fallback Chain
        • Language-Specific Routing
        • Speaker Diarization
      • Authentication
        • Environment Variables
        • Configuration File
      • Supported Audio Formats
      • See Also
    • Pipeline Utilities
      • Select Text Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Selection Rules
        • Usage
        • Examples
        • Behavior Details
        • When to Use Select-Text
        • Best Practices
        • Use Cases
        • Comparison with Other Selectors
        • Related Extractors
        • See Also
      • Select Longest Text Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Selection Rules
        • Usage
        • Examples
        • Behavior Details
        • When to Use Select-Longest-Text
        • Best Practices
        • Use Cases
        • Comparison with Other Selectors
        • Performance Considerations
        • Related Extractors
        • See Also
      • Select Override Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Selection Rules
        • Usage
        • Examples
        • Behavior Details
        • When to Use Select-Override
        • Best Practices
        • Use Cases
        • Comparison with Other Selectors
        • Related Extractors
        • See Also
      • Select Smart Override Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Selection Rules
        • Usage
        • Examples
        • Behavior Details
        • When to Use Select-Smart-Override
        • Best Practices
        • Use Cases
        • Tuning Guidelines
        • Comparison with Other Selectors
        • Related Extractors
        • See Also
      • Pipeline Extractor
        • Overview
        • Installation
        • Supported Media Types
        • Configuration
        • Usage
        • Pipeline Patterns
        • Examples
        • Behavior Details
        • Performance Considerations
        • Best Practices
        • Common Pipeline Recipes
        • Limitations
        • Related Extractors
        • See Also
      • Overview
      • Available Extractors
        • Selection Extractors
        • Override Extractors
        • Composition Extractor
      • Common Patterns
        • PDF Fallback Strategy
        • Media Type Routing
        • Maximum Coverage
        • Metadata + Content
        • Selective Override
      • Decision Tree
        • Which Utility Extractor to Use?
      • Performance Considerations
        • select-text (Short-Circuit)
        • select-longest-text (Parallel)
        • select-override (Conditional)
        • select-smart-override (Pattern-Based)
        • pipeline (Sequential)
      • See Also
    • Extractor Categories
      • Text & Document Processing
      • Optical Character Recognition (OCR)
      • Vision-Language Models (VLM)
      • Speech-to-Text (STT)
      • Pipeline Utilities
    • Quick Start
      • Installation
      • Basic Usage
        • Command Line
        • Python API
    • Choosing an Extractor
      • For PDF Documents
      • For Office Documents
      • For Images
      • For Audio/Video
      • For Multiple Strategies
    • See Also

Retrieval and Evaluation

  • Retrieval
    • Retrieval concepts
    • How retrieval snapshots work
    • A minimal run you can execute
    • Backends
    • Choosing a backend
    • Evaluation
    • Evidence inspection workflow
    • Saving evidence for later analysis
    • Labs and demos
    • Reproducibility checklist
    • Why the separation matters
    • Retrieval quality
  • Retrieval quality upgrades
    • Goals
    • Available upgrades
      • 1) Tuned lexical baseline
      • 2) Reranking stage
      • 3) Hybrid retrieval
    • Evaluation guidance
    • Interpreting evidence signals
    • Budget awareness
    • Evidence tracing checklist
    • Non-goals
    • Summary
  • Retrieval evaluation
    • Dataset format
    • Metrics primer
    • Running an evaluation
    • End-to-end evaluation example
    • Authoring a dataset from a corpus
    • Retrieval evaluation lab
    • Output
    • Reading per-query diagnostics
    • What to record for comparisons
    • Common pitfalls
    • Python usage
    • Design notes
  • Embedding Retrieval
    • Why embedding retrieval?
      • What problem does this solve?
    • Concepts
    • A local, textbook embedding index
    • Build and query
    • Evidence and provenance

Analysis and Modeling

  • Topic modeling
    • What topic modeling does
    • About BERTopic
    • Pipeline stages
    • Run topic modeling from the CLI
    • Output structure
      • Reading a topic record
    • Configuration reference
      • Text source
      • LLM extraction
      • Entity removal
      • Lexical processing
      • BERTopic configuration
      • LLM fine-tuning
    • Vectorizer configuration
    • Repeatable integration script
      • Example: raise topic count
      • Example: disable lexical processing and restrict inputs
      • Example: keep lexical processing but preserve punctuation
    • Tuning workflow
    • Interpreting results
    • Common pitfalls
  • Markov analysis
    • Observation encoder configurability
    • Topic-driven observations
    • What Markov analysis does
    • Text extract segmentation
    • Run Markov analysis from the CLI
      • Cascading configurations and CLI overrides
    • LLM observation cache
    • Output location and artifacts
    • Working demo

Tools

  • Utilities
    • Current utility families
    • Design stance
  • Text utilities
    • The old way: “Just return JSON”
    • The Virtual File Pattern
    • A set of useful patterns
    • The virtual file editor pattern
    • Efficiently Handling Long Spans
      • Mechanism example
    • Prompting Paradigm
    • Validation and confirmation
    • Human-facing utilities
    • Reliable coordination: system prompts and feedback
      • Built-in safeguards
      • Feedback examples (retry stories)
      • Why the safeguards matter
    • Where to go next
  • Entity removal
    • When to use it
    • Installation
    • Configuration
      • Notes
    • Output visibility
  • Text extract
    • How text extract works
      • Mechanism example
    • Data model
    • Output contract
    • Example: Python API
    • Example: Verb markup task
    • Integration examples
      • Extract paragraphs
      • Extract first sentences per paragraph
      • Extract money quotes
      • Extract verbs
      • Extract grouped speaker statements
    • Example: Markov analysis segmentation
    • Validation rules
    • Testing
  • Text slice
    • How text slice works
      • Mechanism example
    • Data model
    • Output contract
    • Example: Python API
    • Integration examples
      • Slice sentences
      • Slice by speaker grouping
    • Validation rules
    • Testing
  • Text annotate
    • How text annotate works
      • Mechanism example
    • Data model
    • Output contract
    • Example: Python API
    • Concept: Text Annotate FAQ
      • What problem does this solve?
      • How is it different from text extract?
      • Why not just use JSON output?
      • Is this a replacement for NER or classification models?
  • Text redact
    • How text redact works
      • Mechanism example
      • Redaction types example
    • Data model
    • Output contract
    • Example: Python API
  • Text link
    • How text link works
      • Mechanism example
    • Data model
    • Output contract
    • Example: Python API

Operations and Demos

  • Demos
    • Working examples you can run now
      • Install for local development
      • Create a corpus and ingest a few items
      • Show an item
      • Edit raw files and reindex
      • Crawl a website prefix
      • Build an extraction snapshot
      • Graph extraction demo
      • Graph extraction integration run
      • Graph extraction baselines
      • Graph extractor narrative demos
      • Graph extractor narrative demos (all extractors)
      • Topic modeling integration run
      • Extraction evaluation demo run
      • Extraction evaluation lab run
      • Retrieval evaluation lab run
      • Profiling analysis demo
      • Select extracted text within a pipeline
      • Portable Document Format extraction and retrieval
      • MarkItDown extraction demo (Python 3.10+)
      • Mixed modality integration corpus
      • Image samples (for optical character recognition experiments)
      • Optional: Unstructured as a last-resort extractor
      • Optional: Speech to text for audio items
      • Build and query the minimal backend
      • Build and query the practical backend
      • Run the test suite and view coverage
    • Documentation map
  • User configuration
    • Where it looks
    • File format
    • Reproducibility notes
    • Example: OpenAI speech to text
    • Example: Deepgram speech to text
    • Example: Aldea speech to text
    • Source profiles (remote collections and corpora)
    • Example: Neo4j graph extraction
    • Common pitfalls

Reference

  • Feature index
    • Corpus
    • Import and ignore rules
    • Streaming ingest
    • Lifecycle hooks
    • User configuration files
    • Text extraction stage
    • Extraction evaluation
    • Graph extraction stage
    • Retrieval backends
    • Evaluation
    • Context packs
    • Context engine
    • Knowledge base
    • Text utilities
    • Text extract
    • Text slice
    • Text annotate
    • Text redact
    • Text link
    • Testing, coverage, and documentation build
    • Integration corpora
  • Roadmap
    • Principles
    • Completed foundations
      • Retrieval evaluation and datasets
      • Retrieval quality upgrades
      • Context pack policy surfaces
      • Extraction evaluation harness
      • Corpus analysis tools
      • Sequence analysis (Markov analysis)
      • Text utilities
    • Next: Tactus integration
    • Later: alternate backends and hosting modes
    • Deferred: corpus and extraction work
      • In-memory corpus for ephemeral workflows
  • Biblicus Architecture
    • Core Concepts
    • Design Principles
    • The Python Developer Mental Model
    • Evidence Lifecycle
    • Relationship to Agent Frameworks
    • Where to go next
  • Application Programming Interface Reference
    • Core
      • Corpus
        • Corpus.analysis_dir
        • Corpus.analysis_run_dir()
        • Corpus.analysis_runs_dir
        • Corpus.catalog_generated_at()
        • Corpus.catalog_path
        • Corpus.create_crawl_id()
        • Corpus.delete_extraction_snapshot()
        • Corpus.extracted_dir
        • Corpus.extraction_snapshot_dir()
        • Corpus.extraction_snapshots_dir
        • Corpus.find()
        • Corpus.get_item()
        • Corpus.graph_dir
        • Corpus.graph_snapshot_dir()
        • Corpus.graph_snapshots_dir
        • Corpus.has_items()
        • Corpus.import_tree()
        • Corpus.ingest_crawled_payload()
        • Corpus.ingest_item()
        • Corpus.ingest_item_stream()
        • Corpus.ingest_note()
        • Corpus.ingest_source()
        • Corpus.init()
        • Corpus.latest_extraction_snapshot_reference()
        • Corpus.latest_snapshot_id
        • Corpus.list_extraction_snapshots()
        • Corpus.list_items()
        • Corpus.load_catalog()
        • Corpus.load_extraction_snapshot_manifest()
        • Corpus.load_snapshot()
        • Corpus.name
        • Corpus.open()
        • Corpus.pull_source()
        • Corpus.purge()
        • Corpus.read_extracted_text()
        • Corpus.reindex()
        • Corpus.retrieval_dir
        • Corpus.snapshots_dir
        • Corpus.uri
        • Corpus.write_snapshot()
      • KnowledgeBase
        • KnowledgeBase.context_pack()
        • KnowledgeBase.corpus
        • KnowledgeBase.defaults
        • KnowledgeBase.from_folder()
        • KnowledgeBase.query()
        • KnowledgeBase.retriever_id
        • KnowledgeBase.snapshot
      • KnowledgeBaseDefaults
        • KnowledgeBaseDefaults.configuration_name
        • KnowledgeBaseDefaults.model_config
        • KnowledgeBaseDefaults.query_budget
        • KnowledgeBaseDefaults.retriever_id
        • KnowledgeBaseDefaults.tags
      • CatalogItem
        • CatalogItem.bytes
        • CatalogItem.created_at
        • CatalogItem.id
        • CatalogItem.media_type
        • CatalogItem.metadata
        • CatalogItem.model_config
        • CatalogItem.relpath
        • CatalogItem.sha256
        • CatalogItem.source_uri
        • CatalogItem.tags
        • CatalogItem.title
      • CollectionMembership
        • CollectionMembership.collection_name
        • CollectionMembership.corpus_name
        • CollectionMembership.model_config
      • ConfigurationManifest
        • ConfigurationManifest.configuration
        • ConfigurationManifest.configuration_id
        • ConfigurationManifest.created_at
        • ConfigurationManifest.description
        • ConfigurationManifest.model_config
        • ConfigurationManifest.name
        • ConfigurationManifest.retriever_id
      • CorpusCatalog
        • CorpusCatalog.corpus_uri
        • CorpusCatalog.generated_at
        • CorpusCatalog.items
        • CorpusCatalog.latest_run_id
        • CorpusCatalog.latest_snapshot_id
        • CorpusCatalog.model_config
        • CorpusCatalog.order
        • CorpusCatalog.raw_dir
        • CorpusCatalog.schema_version
      • CorpusConfig
        • CorpusConfig.collection
        • CorpusConfig.corpus_uri
        • CorpusConfig.created_at
        • CorpusConfig.hooks
        • CorpusConfig.model_config
        • CorpusConfig.notes
        • CorpusConfig.raw_dir
        • CorpusConfig.schema_version
        • CorpusConfig.source
      • Evidence
        • Evidence.configuration_id
        • Evidence.content_ref
        • Evidence.hash
        • Evidence.item_id
        • Evidence.media_type
        • Evidence.metadata
        • Evidence.model_config
        • Evidence.rank
        • Evidence.score
        • Evidence.snapshot_id
        • Evidence.source_uri
        • Evidence.span_end
        • Evidence.span_start
        • Evidence.stage
        • Evidence.stage_scores
        • Evidence.text
      • ExtractedText
        • ExtractedText.confidence
        • ExtractedText.metadata
        • ExtractedText.model_config
        • ExtractedText.producer_extractor_id
        • ExtractedText.source_stage_index
        • ExtractedText.text
      • ExtractionSnapshotListEntry
        • ExtractionSnapshotListEntry.catalog_generated_at
        • ExtractionSnapshotListEntry.configuration_id
        • ExtractionSnapshotListEntry.configuration_name
        • ExtractionSnapshotListEntry.created_at
        • ExtractionSnapshotListEntry.extractor_id
        • ExtractionSnapshotListEntry.model_config
        • ExtractionSnapshotListEntry.snapshot_id
        • ExtractionSnapshotListEntry.stats
      • ExtractionSnapshotReference
        • ExtractionSnapshotReference.as_string()
        • ExtractionSnapshotReference.extractor_id
        • ExtractionSnapshotReference.model_config
        • ExtractionSnapshotReference.snapshot_id
      • ExtractionStageOutput
        • ExtractionStageOutput.confidence
        • ExtractionStageOutput.error_message
        • ExtractionStageOutput.error_type
        • ExtractionStageOutput.extractor_id
        • ExtractionStageOutput.metadata
        • ExtractionStageOutput.model_config
        • ExtractionStageOutput.producer_extractor_id
        • ExtractionStageOutput.source_stage_index
        • ExtractionStageOutput.stage_index
        • ExtractionStageOutput.status
        • ExtractionStageOutput.text
        • ExtractionStageOutput.text_characters
      • IngestResult
        • IngestResult.item_id
        • IngestResult.model_config
        • IngestResult.relpath
        • IngestResult.sha256
      • PipelineAnalysisConfig
        • PipelineAnalysisConfig.configuration
        • PipelineAnalysisConfig.kind
        • PipelineAnalysisConfig.model_config
      • PipelineCorpusSelector
        • PipelineCorpusSelector.collection
        • PipelineCorpusSelector.model_config
        • PipelineCorpusSelector.path
        • PipelineCorpusSelector.selector
      • PipelineExtractionConfig
        • PipelineExtractionConfig.model_config
        • PipelineExtractionConfig.recipe
      • PipelineMirrorConfig
        • PipelineMirrorConfig.collection
        • PipelineMirrorConfig.model_config
      • PipelineRecipeConfig
        • PipelineRecipeConfig.analysis
        • PipelineRecipeConfig.corpus
        • PipelineRecipeConfig.extraction
        • PipelineRecipeConfig.mirror
        • PipelineRecipeConfig.model_config
        • PipelineRecipeConfig.retrieval
      • PipelineRetrievalConfig
        • PipelineRetrievalConfig.configuration
        • PipelineRetrievalConfig.model_config
        • PipelineRetrievalConfig.retriever
      • QueryBudget
        • QueryBudget.max_items_per_source
        • QueryBudget.max_total_items
        • QueryBudget.maximum_total_characters
        • QueryBudget.model_config
        • QueryBudget.offset
      • RemoteCollectionPullResult
        • RemoteCollectionPullResult.archived
        • RemoteCollectionPullResult.created
        • RemoteCollectionPullResult.discovered
        • RemoteCollectionPullResult.errored
        • RemoteCollectionPullResult.mirrored
        • RemoteCollectionPullResult.model_config
      • RemoteCorpusCollectionConfig
        • RemoteCorpusCollectionConfig.auto_create
        • RemoteCorpusCollectionConfig.collection_name
        • RemoteCorpusCollectionConfig.corpus_root
        • RemoteCorpusCollectionConfig.created_at
        • RemoteCorpusCollectionConfig.deletion_policy
        • RemoteCorpusCollectionConfig.discovery
        • RemoteCorpusCollectionConfig.model_config
        • RemoteCorpusCollectionConfig.schema_version
        • RemoteCorpusCollectionConfig.source
      • RemoteCorpusCollectionDiscovery
        • RemoteCorpusCollectionDiscovery.depth
        • RemoteCorpusCollectionDiscovery.include_root_files
        • RemoteCorpusCollectionDiscovery.mode
        • RemoteCorpusCollectionDiscovery.model_config
      • RemoteCorpusSourceConfig
        • RemoteCorpusSourceConfig.bucket
        • RemoteCorpusSourceConfig.container
        • RemoteCorpusSourceConfig.kind
        • RemoteCorpusSourceConfig.model_config
        • RemoteCorpusSourceConfig.name
        • RemoteCorpusSourceConfig.prefix
        • RemoteCorpusSourceConfig.profile
      • RemoteSourceItem
        • RemoteSourceItem.content_type
        • RemoteSourceItem.etag
        • RemoteSourceItem.key
        • RemoteSourceItem.last_modified
        • RemoteSourceItem.model_config
        • RemoteSourceItem.size
        • RemoteSourceItem.source_uri
      • RemoteSourcePullResult
        • RemoteSourcePullResult.downloaded
        • RemoteSourcePullResult.errored
        • RemoteSourcePullResult.listed
        • RemoteSourcePullResult.model_config
        • RemoteSourcePullResult.pruned
        • RemoteSourcePullResult.skipped
        • RemoteSourcePullResult.updated
      • RetrievalResult
        • RetrievalResult.budget
        • RetrievalResult.configuration_id
        • RetrievalResult.evidence
        • RetrievalResult.generated_at
        • RetrievalResult.model_config
        • RetrievalResult.query_text
        • RetrievalResult.retriever_id
        • RetrievalResult.snapshot_id
        • RetrievalResult.stats
      • RetrievalSnapshot
        • RetrievalSnapshot.catalog_generated_at
        • RetrievalSnapshot.configuration
        • RetrievalSnapshot.corpus_uri
        • RetrievalSnapshot.created_at
        • RetrievalSnapshot.model_config
        • RetrievalSnapshot.snapshot_artifacts
        • RetrievalSnapshot.snapshot_id
        • RetrievalSnapshot.stats
      • parse_extraction_snapshot_reference()
      • apply_budget()
      • create_configuration_manifest()
      • create_snapshot_manifest()
      • hash_text()
      • CharacterBudget
        • CharacterBudget.max_characters
        • CharacterBudget.model_config
      • ContextPack
        • ContextPack.blocks
        • ContextPack.evidence_count
        • ContextPack.model_config
        • ContextPack.text
      • ContextPackBlock
        • ContextPackBlock.evidence_item_id
        • ContextPackBlock.metadata
        • ContextPackBlock.model_config
        • ContextPackBlock.text
      • ContextPackPolicy
        • ContextPackPolicy.include_metadata
        • ContextPackPolicy.join_with
        • ContextPackPolicy.metadata_fields
        • ContextPackPolicy.model_config
        • ContextPackPolicy.ordering
      • TokenBudget
        • TokenBudget.max_tokens
        • TokenBudget.model_config
      • TokenCounter
        • TokenCounter.model_config
        • TokenCounter.tokenizer_id
      • build_context_pack()
      • count_tokens()
      • fit_context_pack_to_character_budget()
      • fit_context_pack_to_token_budget()
      • BenchmarkConfig
        • BenchmarkConfig.aggregate_weights
        • BenchmarkConfig.benchmark_name
        • BenchmarkConfig.categories
        • BenchmarkConfig.load()
        • BenchmarkConfig.output_dir
        • BenchmarkConfig.pipelines
      • BenchmarkReport
        • BenchmarkReport.avg_bigram_overlap
        • BenchmarkReport.avg_f1
        • BenchmarkReport.avg_lcs_ratio
        • BenchmarkReport.avg_precision
        • BenchmarkReport.avg_recall
        • BenchmarkReport.avg_sequence_accuracy
        • BenchmarkReport.avg_trigram_overlap
        • BenchmarkReport.avg_word_error_rate
        • BenchmarkReport.corpus_path
        • BenchmarkReport.evaluation_timestamp
        • BenchmarkReport.max_f1
        • BenchmarkReport.median_f1
        • BenchmarkReport.median_lcs_ratio
        • BenchmarkReport.median_precision
        • BenchmarkReport.median_recall
        • BenchmarkReport.median_sequence_accuracy
        • BenchmarkReport.median_word_error_rate
        • BenchmarkReport.min_f1
        • BenchmarkReport.per_document_results
        • BenchmarkReport.pipeline_configuration
        • BenchmarkReport.print_summary()
        • BenchmarkReport.processing_time_seconds
        • BenchmarkReport.to_csv()
        • BenchmarkReport.to_json()
        • BenchmarkReport.total_documents
      • BenchmarkResult
        • BenchmarkResult.aggregate
        • BenchmarkResult.benchmark_name
        • BenchmarkResult.benchmark_version
        • BenchmarkResult.categories
        • BenchmarkResult.print_summary()
        • BenchmarkResult.recommendations
        • BenchmarkResult.timestamp
        • BenchmarkResult.to_json()
        • BenchmarkResult.to_markdown()
        • BenchmarkResult.total_documents
        • BenchmarkResult.total_processing_time_seconds
      • BenchmarkRunner
        • BenchmarkRunner.run_all()
        • BenchmarkRunner.run_category()
      • CategoryConfig
        • CategoryConfig.corpus_path
        • CategoryConfig.dataset
        • CategoryConfig.ground_truth_subdir
        • CategoryConfig.name
        • CategoryConfig.pipelines
        • CategoryConfig.primary_metric
        • CategoryConfig.subset_size
        • CategoryConfig.tags
      • CategoryResult
        • CategoryResult.best_pipeline
        • CategoryResult.best_score
        • CategoryResult.category_name
        • CategoryResult.dataset
        • CategoryResult.documents_evaluated
        • CategoryResult.pipelines
        • CategoryResult.primary_metric
        • CategoryResult.primary_score
        • CategoryResult.processing_time_seconds
      • OCRBenchmark
        • OCRBenchmark.evaluate_extraction()
      • OCREvaluationResult
        • OCREvaluationResult.bigram_overlap
        • OCREvaluationResult.character_accuracy
        • OCREvaluationResult.document_id
        • OCREvaluationResult.extracted_text
        • OCREvaluationResult.f1_score
        • OCREvaluationResult.false_negatives
        • OCREvaluationResult.false_positives
        • OCREvaluationResult.ground_truth_text
        • OCREvaluationResult.image_path
        • OCREvaluationResult.lcs_ratio
        • OCREvaluationResult.normalized_edit_distance
        • OCREvaluationResult.precision
        • OCREvaluationResult.print_summary()
        • OCREvaluationResult.recall
        • OCREvaluationResult.sequence_accuracy
        • OCREvaluationResult.to_dict()
        • OCREvaluationResult.trigram_overlap
        • OCREvaluationResult.true_positives
        • OCREvaluationResult.word_count_gt
        • OCREvaluationResult.word_count_ocr
        • OCREvaluationResult.word_error_rate
      • calculate_character_accuracy()
      • calculate_ngram_overlap()
      • calculate_word_metrics()
      • calculate_word_order_metrics()
      • evaluate_snapshot()
      • load_dataset()
    • Extraction
      • ExtractionConfigurationManifest
        • ExtractionConfigurationManifest.configuration
        • ExtractionConfigurationManifest.configuration_id
        • ExtractionConfigurationManifest.created_at
        • ExtractionConfigurationManifest.extractor_id
        • ExtractionConfigurationManifest.model_config
        • ExtractionConfigurationManifest.name
      • ExtractionItemResult
        • ExtractionItemResult.error_message
        • ExtractionItemResult.error_type
        • ExtractionItemResult.final_metadata_relpath
        • ExtractionItemResult.final_producer_extractor_id
        • ExtractionItemResult.final_source_stage_index
        • ExtractionItemResult.final_stage_extractor_id
        • ExtractionItemResult.final_stage_index
        • ExtractionItemResult.final_text_relpath
        • ExtractionItemResult.item_id
        • ExtractionItemResult.model_config
        • ExtractionItemResult.stage_results
        • ExtractionItemResult.status
      • ExtractionSnapshotManifest
        • ExtractionSnapshotManifest.catalog_generated_at
        • ExtractionSnapshotManifest.configuration
        • ExtractionSnapshotManifest.corpus_uri
        • ExtractionSnapshotManifest.created_at
        • ExtractionSnapshotManifest.items
        • ExtractionSnapshotManifest.model_config
        • ExtractionSnapshotManifest.snapshot_id
        • ExtractionSnapshotManifest.stats
      • ExtractionStageResult
        • ExtractionStageResult.confidence
        • ExtractionStageResult.error_message
        • ExtractionStageResult.error_type
        • ExtractionStageResult.extractor_id
        • ExtractionStageResult.metadata_relpath
        • ExtractionStageResult.model_config
        • ExtractionStageResult.producer_extractor_id
        • ExtractionStageResult.source_stage_index
        • ExtractionStageResult.stage_index
        • ExtractionStageResult.status
        • ExtractionStageResult.text_characters
        • ExtractionStageResult.text_relpath
      • build_extraction_snapshot()
      • create_extraction_configuration_manifest()
      • create_extraction_snapshot_manifest()
      • load_or_build_extraction_snapshot()
      • write_extracted_metadata_artifact()
      • write_extracted_text_artifact()
      • write_extraction_latest_pointer()
      • write_extraction_snapshot_manifest()
      • write_pipeline_stage_metadata_artifact()
      • write_pipeline_stage_text_artifact()
      • get_extractor()
    • Graph
      • build_graph_snapshot()
      • create_graph_configuration_manifest()
      • create_graph_id()
      • create_graph_snapshot_manifest()
      • latest_graph_snapshot_reference()
      • list_graph_snapshots()
      • load_graph_snapshot_manifest()
      • resolve_graph_snapshot_reference()
      • write_graph_latest_pointer()
      • write_graph_snapshot_manifest()
      • GraphConfigurationManifest
        • GraphConfigurationManifest.configuration
        • GraphConfigurationManifest.configuration_id
        • GraphConfigurationManifest.created_at
        • GraphConfigurationManifest.extractor_id
        • GraphConfigurationManifest.model_config
        • GraphConfigurationManifest.name
      • GraphEdge
        • GraphEdge.dst
        • GraphEdge.edge_id
        • GraphEdge.edge_type
        • GraphEdge.model_config
        • GraphEdge.properties
        • GraphEdge.src
        • GraphEdge.weight
      • GraphExtractionItemSummary
        • GraphExtractionItemSummary.edge_count
        • GraphExtractionItemSummary.error_message
        • GraphExtractionItemSummary.item_id
        • GraphExtractionItemSummary.model_config
        • GraphExtractionItemSummary.node_count
        • GraphExtractionItemSummary.status
      • GraphExtractionResult
        • GraphExtractionResult.edges
        • GraphExtractionResult.item_id
        • GraphExtractionResult.metadata
        • GraphExtractionResult.model_config
        • GraphExtractionResult.nodes
      • GraphNode
        • GraphNode.label
        • GraphNode.model_config
        • GraphNode.node_id
        • GraphNode.node_type
        • GraphNode.properties
      • GraphSchemaModel
        • GraphSchemaModel.model_config
        • GraphSchemaModel.schema_version
      • GraphSnapshotListEntry
        • GraphSnapshotListEntry.catalog_generated_at
        • GraphSnapshotListEntry.configuration_id
        • GraphSnapshotListEntry.configuration_name
        • GraphSnapshotListEntry.created_at
        • GraphSnapshotListEntry.extractor_id
        • GraphSnapshotListEntry.graph_id
        • GraphSnapshotListEntry.model_config
        • GraphSnapshotListEntry.snapshot_id
        • GraphSnapshotListEntry.stats
      • GraphSnapshotManifest
        • GraphSnapshotManifest.catalog_generated_at
        • GraphSnapshotManifest.configuration
        • GraphSnapshotManifest.corpus_uri
        • GraphSnapshotManifest.created_at
        • GraphSnapshotManifest.extraction_snapshot
        • GraphSnapshotManifest.graph_id
        • GraphSnapshotManifest.model_config
        • GraphSnapshotManifest.snapshot_id
        • GraphSnapshotManifest.stats
      • GraphSnapshotReference
        • GraphSnapshotReference.as_string()
        • GraphSnapshotReference.extractor_id
        • GraphSnapshotReference.model_config
        • GraphSnapshotReference.snapshot_id
      • parse_graph_snapshot_reference()
      • Neo4jSettings
        • Neo4jSettings.auto_start
        • Neo4jSettings.bolt_port
        • Neo4jSettings.container_name
        • Neo4jSettings.database
        • Neo4jSettings.docker_image
        • Neo4jSettings.http_port
        • Neo4jSettings.password
        • Neo4jSettings.uri
        • Neo4jSettings.username
      • create_neo4j_driver()
      • ensure_neo4j_running()
      • resolve_neo4j_settings()
      • write_graph_records()
      • available_graph_extractors()
      • get_graph_extractor()
    • Backends