# Retrieval Backends

Biblicus provides pluggable retrieval backends that implement different search and ranking strategies. Each backend defines how evidence is retrieved from your corpus.

```{toctree}
:maxdepth: 1
:caption: Retrieval Backends

scan
sqlite-full-text-search
tf-vector
embedding-index-inmemory
embedding-index-file
```

## Available Backends

### [scan](scan.md)

Naive full-scan backend that searches all text items at query time without pre-built indexes.

- **Backend ID**: `scan`
- **Installation**: Included by default
- **Best for**: Small corpora, development, baseline comparisons
- **Index**: None (scans at query time)
- **Speed**: Slow for large corpora

### [sqlite-full-text-search](sqlite-full-text-search.md)

Production-ready full-text search using SQLite FTS5 with BM25 ranking.

- **Backend ID**: `sqlite-full-text-search`
- **Installation**: Included by default (requires SQLite with FTS5 support)
- **Best for**: Medium to large corpora, production use
- **Index**: SQLite database with FTS5 virtual tables
- **Speed**: Fast with persistent index

### [tf-vector](tf-vector.md)

Deterministic term-frequency vector retrieval (vector space model baseline) with cosine similarity.

- **Backend ID**: `tf-vector`
- **Installation**: Included by default
- **Best for**: Semantic-style baselines without embeddings
- **Index**: None (scans and scores at query time)
- **Speed**: Moderate for small corpora

### [embedding-index-inmemory](embedding-index-inmemory.md)

Embedding-based retrieval with an in-memory exact cosine similarity index.

- **Backend ID**: `embedding-index-inmemory`
- **Installation**: Requires `numpy` and an embedding provider configuration
- **Best for**: Textbook demos and small corpora
- **Index**: In-memory embedding matrix
- **Speed**: Fast for small corpora; bounded by safety caps

### [embedding-index-file](embedding-index-file.md)

Embedding-based retrieval with a file-backed exact cosine similarity index.
- **Backend ID**: `embedding-index-file`
- **Installation**: Requires `numpy` and an embedding provider configuration
- **Best for**: Larger corpora without running an external vector database
- **Index**: Memory-mapped embedding matrix + id mapping under the corpus
- **Speed**: Exact scan; bounded memory via batching

## Quick Start

### Installation

All backends ship with the base Biblicus installation; the embedding backends additionally require `numpy` and an embedding provider configuration:

```bash
pip install biblicus
```

### Basic Usage

#### Command Line

```bash
# Initialize corpus
biblicus init my-corpus

# Ingest documents
biblicus ingest my-corpus document.pdf

# Extract text
biblicus extract my-corpus --extractor pdf-text

# Build retrieval snapshot with a backend
biblicus build my-corpus --backend sqlite-full-text-search

# Query the run
biblicus query my-corpus --query "search terms"
```

See `docs/retrieval.md` for a step-by-step retrieval walkthrough.

#### Python API

```python
from biblicus import Corpus
from biblicus.backends import get_backend

# Load corpus
corpus = Corpus.from_directory("my-corpus")

# Get backend
backend = get_backend("sqlite-full-text-search")

# Build run
run = backend.build_run(
    corpus,
    configuration_name="My search index",
    config={}
)

# Query
result = backend.query(
    corpus,
    run=run,
    query_text="search terms",
    budget={"max_total_items": 10}
)
```

See `docs/retrieval-evaluation.md` for evaluation workflows and dataset formats.
## Choosing a Backend

| Use Case | Recommended Backend | Notes |
|----------|---------------------|-------|
| Development & testing | [scan](scan.md) | No index to build, immediate results |
| Small corpora (<1000 items) | [scan](scan.md) | Fast enough without indexing overhead |
| Production applications | [sqlite-full-text-search](sqlite-full-text-search.md) | Fast queries with BM25 ranking |
| Large corpora (>10,000 items) | [sqlite-full-text-search](sqlite-full-text-search.md) | Essential for performance |
| Baseline comparisons | [scan](scan.md) | Simple reference implementation |
| Term-frequency vector baseline | [tf-vector](tf-vector.md) | Deterministic cosine similarity |
| Embedding retrieval (in-memory) | [embedding-index-inmemory](embedding-index-inmemory.md) | Exact cosine similarity |
| Embedding retrieval (file-backed) | [embedding-index-file](embedding-index-file.md) | Exact cosine similarity, memory-mapped |

## Reproducibility checklist

- Record the extraction snapshot reference used for the backend build.
- Keep backend configurations in source control.
- Reuse the same `QueryBudget` when comparing backends.

## Common pitfalls

- Comparing runs built from different extraction outputs.
- Forgetting to persist the snapshot identifier for later evaluation.
- Using different budget settings and expecting metrics to be comparable.
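The checklist and pitfalls above can be checked mechanically before metrics are compared. The following is a minimal sketch, not part of Biblicus: `RunManifest`, `comparability_problems`, and `to_json` are hypothetical names, and the manifest fields are an assumed record format chosen to mirror the checklist (snapshot identifier, extraction snapshot reference, configuration, budget).

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class RunManifest:
    """Hypothetical record of the inputs that make two runs comparable."""

    backend_id: str
    snapshot_id: str           # identifier to persist for later evaluation
    extraction_snapshot: str   # "extractor_id:snapshot_id" reference
    config: dict = field(default_factory=dict)
    budget: dict = field(default_factory=dict)


def comparability_problems(a: RunManifest, b: RunManifest) -> list[str]:
    """Return the reasons why metrics from the two runs should not be compared."""
    problems = []
    if a.extraction_snapshot != b.extraction_snapshot:
        problems.append("runs were built from different extraction snapshots")
    if a.budget != b.budget:
        problems.append("runs were queried with different budgets")
    return problems


def to_json(manifest: RunManifest) -> str:
    """Serialize with sorted keys so diffs in source control stay stable."""
    return json.dumps(asdict(manifest), indent=2, sort_keys=True)


# Placeholder identifiers; real values come from the build and extract steps.
scan = RunManifest("scan", "RUN_A", "pdf-text:SNAP_1",
                   budget={"max_total_items": 10})
fts = RunManifest("sqlite-full-text-search", "RUN_B", "pdf-text:SNAP_1",
                  config={"chunk_size": 800}, budget={"max_total_items": 10})
print(comparability_problems(scan, fts))
print(to_json(fts))
```

Committing the JSON output next to the evaluation results keeps the configuration under source control and makes a mismatched extraction snapshot or budget visible in review.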
## Performance Comparison

### Scan Backend

- **Build time**: None (no index)
- **Query time**: O(n) - linear scan of all items
- **Memory**: Low (no index storage)
- **Disk**: None (no artifacts)

**Example**: 1000-item corpus, 5-10 second query time

### SQLite Full-Text Search Backend

- **Build time**: O(n) - one-time indexing
- **Query time**: O(log n) - indexed search
- **Memory**: Moderate (SQLite index)
- **Disk**: ~1-5 MB per 1000 text items

**Example**: 10,000-item corpus, <100ms query time after indexing

## Common Patterns

### Development Workflow

Use the scan backend during development for immediate feedback:

```bash
biblicus build my-corpus --backend scan
biblicus query my-corpus --query "test"
```

### Production Deployment

Build a persistent index with sqlite-full-text-search:

```bash
biblicus build my-corpus --backend sqlite-full-text-search \
  --config chunk_size=800 \
  --config chunk_overlap=200
```

### Baseline Comparison

Compare backends using the same corpus:

```bash
# Build with both backends
biblicus build my-corpus --backend scan --configuration scan-baseline
biblicus build my-corpus --backend sqlite-full-text-search --configuration fts-index

# Query both
biblicus query my-corpus --run scan:RUN_ID --query "test"
biblicus query my-corpus --run sqlite-full-text-search:RUN_ID --query "test"
```

### Using Extracted Text

All backends support extraction snapshots for non-text content:

```bash
# Extract text from PDFs
biblicus extract my-corpus --extractor pdf-text

# Build backend with extraction snapshot
biblicus build my-corpus --backend sqlite-full-text-search \
  --config extraction_snapshot=pdf-text:EXTRACTION_SNAPSHOT_ID
```

## Backend Configuration

### Common Configuration Options

All backends support these configuration options:

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `snippet_characters` | int | 400 | Maximum characters in evidence snippets |
| `extraction_snapshot` | str | None | Extraction run reference (extractor_id:snapshot_id) |

### Backend-Specific Options

#### Scan Backend

No additional configuration options.

#### SQLite Full-Text Search Backend

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `chunk_size` | int | 800 | Maximum characters per chunk |
| `chunk_overlap` | int | 200 | Overlap characters between chunks |

## Architecture

### Backend Interface

All backends implement the `RetrievalBackend` interface:

```python
class RetrievalBackend:
    backend_id: str

    def build_run(self, corpus, *, configuration_name, config) -> RetrievalRun:
        """Build a retrieval snapshot (may create artifacts)."""

    def query(self, corpus, *, run, query_text, budget) -> RetrievalResult:
        """Query the run and return evidence."""
```

### Evidence Model

All backends return structured `Evidence` objects:

```python
from typing import Optional


class Evidence:
    item_id: str                 # Corpus item identifier
    source_uri: Optional[str]    # Original source URI
    media_type: str              # MIME type
    score: float                 # Relevance score
    rank: int                    # Result rank
    text: str                    # Evidence snippet
    span_start: Optional[int]    # Span start offset
    span_end: Optional[int]      # Span end offset
    stage: str                   # Processing stage
    configuration_id: str        # Configuration identifier
    snapshot_id: str             # Run identifier
    hash: str                    # Content hash
```

## Implementing Custom Backends

To implement a custom backend:

1. Subclass `RetrievalBackend`
2. Implement `build_run()` and `query()` methods
3. Register in `biblicus.backends.available_backends`
4. Add BDD specifications with 100% coverage

See [backends.md](../backends.md) for implementation details.

## See Also

- [scan backend](scan.md) - Naive full-scan backend
- [sqlite-full-text-search backend](sqlite-full-text-search.md) - SQLite FTS5 backend
- [backends.md](../backends.md) - Backend implementation guide
- [extraction.md](../extraction.md) - Text extraction pipeline
- [Extractor Reference](../extractors/index.md) - Text extraction plugins
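## Appendix: Custom Backend Sketch

The steps under Implementing Custom Backends can be illustrated with a toy backend that follows the `build_run()`/`query()` shape and scores items by term-frequency cosine similarity. This is a rough sketch under stated assumptions, not Biblicus code: `ToyRun`, `ToyEvidence`, and `ToyTermFrequencyBackend` are hypothetical stand-ins, the corpus is a plain `dict` of item id to text, and the registration step is omitted because the registry's structure is not described on this page. A real backend would subclass `RetrievalBackend` and return the library's own run and result types.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class ToyRun:
    """Stand-in for a retrieval run: holds per-item term-frequency vectors."""
    configuration_name: str
    vectors: dict  # item_id -> Counter of term frequencies


@dataclass
class ToyEvidence:
    """Reduced stand-in for the Evidence model (only a few fields)."""
    item_id: str
    score: float
    rank: int
    text: str


def _terms(text: str) -> Counter:
    """Lowercase, split on non-alphanumerics, and count term frequencies."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyTermFrequencyBackend:
    """Follows the build_run/query shape; `corpus` is a plain dict here."""

    backend_id = "toy-term-frequency"  # hypothetical identifier

    def build_run(self, corpus, *, configuration_name, config):
        vectors = {item_id: _terms(text) for item_id, text in corpus.items()}
        return ToyRun(configuration_name, vectors)

    def query(self, corpus, *, run, query_text, budget):
        query_vector = _terms(query_text)
        scored = [(_cosine(query_vector, vector), item_id)
                  for item_id, vector in run.vectors.items()]
        # Highest score first; ties broken by item_id for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        limit = budget.get("max_total_items", len(scored))
        return [
            ToyEvidence(item_id, score, rank, corpus[item_id][:400])
            for rank, (score, item_id) in enumerate(scored[:limit], start=1)
            if score > 0.0  # drop items that share no terms with the query
        ]


corpus = {
    "a": "SQLite full-text search with BM25",
    "b": "Embedding index in memory",
    "c": "Full scan of text items",
}
backend = ToyTermFrequencyBackend()
run = backend.build_run(corpus, configuration_name="toy", config={})
for evidence in backend.query(corpus, run=run, query_text="full text search",
                              budget={"max_total_items": 2}):
    print(evidence.rank, evidence.item_id, round(evidence.score, 3))
```

Sorting on `(-score, item_id)` keeps the ranking deterministic when scores tie, which matters when the same query is replayed during evaluation.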