# Retrieval Backends
Biblicus provides pluggable retrieval backends that implement different search and ranking strategies. Each backend defines how evidence is retrieved from your corpus.
## Available Backends
### scan

Naive full-scan backend that searches all text items at query time without pre-built indexes.

- **Backend ID:** `scan`
- **Installation:** Included by default
- **Best for:** Small corpora, development, baseline comparisons
- **Index:** None (scans at query time)
- **Speed:** Slow for large corpora
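The scan strategy amounts to a linear pass over every text item at query time. A minimal sketch of the idea in plain Python (the item structure and scoring here are illustrative, not the Biblicus implementation):

```python
def scan_query(items, query_text, max_total_items=10):
    """Linear scan: score each (item_id, text) pair by query-term occurrences."""
    terms = query_text.lower().split()
    scored = []
    for item_id, text in items:
        lowered = text.lower()
        score = sum(lowered.count(term) for term in terms)
        if score > 0:
            scored.append((score, item_id))
    # Highest score first; ties broken by item id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item_id for _, item_id in scored[:max_total_items]]
```

Because there is no index, every query pays the full O(n) cost, which is why this backend suits small corpora and baselines only.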
### sqlite-full-text-search

Production-ready full-text search using SQLite FTS5 with BM25 ranking.

- **Backend ID:** `sqlite-full-text-search`
- **Installation:** Included by default (requires SQLite with FTS5 support)
- **Best for:** Medium to large corpora, production use
- **Index:** SQLite database with FTS5 virtual tables
- **Speed:** Fast with persistent index
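Python's standard `sqlite3` module can exercise the same FTS5 machinery this backend builds on. A minimal sketch, assuming your SQLite build was compiled with FTS5 (the table and column names are illustrative, not the backend's actual schema):

```python
import sqlite3

connection = sqlite3.connect(":memory:")
# FTS5 virtual table; bm25() is the ranking function the backend relies on.
connection.execute("CREATE VIRTUAL TABLE chunks USING fts5(item_id, body)")
connection.executemany(
    "INSERT INTO chunks (item_id, body) VALUES (?, ?)",
    [
        ("a", "full-text search with bm25 ranking"),
        ("b", "an unrelated document about parsing"),
    ],
)
# bm25() returns lower values for better matches, so sort ascending.
rows = connection.execute(
    "SELECT item_id, bm25(chunks) FROM chunks WHERE chunks MATCH ? "
    "ORDER BY bm25(chunks)",
    ("bm25",),
).fetchall()
best_item_id = rows[0][0]  # the best-matching item
```

The persistent index is what makes query time sublinear: the MATCH clause walks the FTS5 inverted index instead of scanning every row.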
### tf-vector

Deterministic term-frequency vector retrieval (a vector space model baseline) with cosine similarity.

- **Backend ID:** `tf-vector`
- **Installation:** Included by default
- **Best for:** Semantic-style baselines without embeddings
- **Index:** None (scans and scores at query time)
- **Speed:** Moderate for small corpora
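The vector space model scores each item by the cosine of the angle between its term-frequency vector and the query's. A minimal sketch of the scoring (illustrative, not the backend's exact tokenization or weighting):

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def tf_vector_query(items, query_text, max_total_items=10):
    """Rank (item_id, text) pairs by cosine similarity to the query."""
    query_vector = Counter(query_text.lower().split())
    scored = sorted(
        ((cosine_similarity(Counter(text.lower().split()), query_vector), item_id)
         for item_id, text in items),
        key=lambda pair: (-pair[0], pair[1]),
    )
    return [item_id for score, item_id in scored[:max_total_items] if score > 0]
```

Because the scoring uses only raw term frequencies, the ranking is fully deterministic and needs no embedding provider.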
### embedding-index-inmemory

Embedding-based retrieval with an in-memory exact cosine similarity index.

- **Backend ID:** `embedding-index-inmemory`
- **Installation:** Requires `numpy` and an embedding provider configuration
- **Best for:** Textbook demos and small corpora
- **Index:** In-memory embedding matrix
- **Speed:** Fast for small corpora; bounded by safety caps
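Exact cosine search over an in-memory index reduces to a brute-force dot product against every stored unit-normalized vector. A minimal pure-Python sketch of the structure (the real backend uses `numpy` and an embedding provider; this class is hypothetical):

```python
import math

class InMemoryCosineIndex:
    """Exact cosine-similarity index over unit-normalized vectors."""

    def __init__(self):
        self._ids = []
        self._vectors = []

    @staticmethod
    def _normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / norm for x in vector]

    def add(self, item_id, vector):
        self._ids.append(item_id)
        self._vectors.append(self._normalize(vector))

    def query(self, vector, top_k=5):
        q = self._normalize(vector)
        scores = [sum(a * b for a, b in zip(q, v)) for v in self._vectors]
        order = sorted(range(len(scores)), key=lambda i: -scores[i])
        return [(self._ids[i], scores[i]) for i in order[:top_k]]
```

Since every query touches every vector, the safety caps mentioned above bound how large the matrix is allowed to grow.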
### embedding-index-file

Embedding-based retrieval with a file-backed exact cosine similarity index.

- **Backend ID:** `embedding-index-file`
- **Installation:** Requires `numpy` and an embedding provider configuration
- **Best for:** Larger corpora without running an external vector database
- **Index:** Memory-mapped embedding matrix plus an id mapping stored under the corpus
- **Speed:** Exact scan; bounded memory via batching
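A memory-mapped matrix lets the exact scan run without loading all embeddings into RAM at once. A minimal sketch of the idea, assuming `numpy` is installed (file layout and function names here are illustrative, not the backend's actual on-disk format):

```python
import os
import tempfile
import numpy as np

# Build a unit-normalized embedding matrix and persist it to disk.
dimension = 4
vectors = np.random.default_rng(0).standard_normal((100, dimension)).astype("float32")
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

path = os.path.join(tempfile.mkdtemp(), "embeddings.npy")
np.save(path, vectors)

# Reopen memory-mapped and read-only: pages load lazily on access.
matrix = np.load(path, mmap_mode="r")

def file_backed_query(query_vector, top_k=5, batch_size=32):
    """Exact cosine scan in fixed-size batches to bound resident memory."""
    q = query_vector / np.linalg.norm(query_vector)
    scores = np.empty(matrix.shape[0], dtype="float32")
    for start in range(0, matrix.shape[0], batch_size):
        batch = np.asarray(matrix[start:start + batch_size])
        scores[start:start + batch_size] = batch @ q
    return np.argsort(-scores)[:top_k]
```

The batching loop is what "bounded memory via batching" refers to: only one `batch_size`-row slice is materialized at a time, regardless of corpus size.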
## Quick Start

### Installation

All backends are included with the base Biblicus installation:

```shell
pip install biblicus
```
### Basic Usage

#### Command Line

```shell
# Initialize corpus
biblicus init my-corpus

# Ingest documents
biblicus ingest my-corpus document.pdf

# Extract text
biblicus extract my-corpus --extractor pdf-text

# Build retrieval snapshot with a backend
biblicus build my-corpus --backend sqlite-full-text-search

# Query the run
biblicus query my-corpus --query "search terms"
```

See docs/retrieval.md for a step-by-step retrieval walkthrough.
#### Python API

```python
from biblicus import Corpus
from biblicus.backends import get_backend

# Load corpus
corpus = Corpus.from_directory("my-corpus")

# Get backend
backend = get_backend("sqlite-full-text-search")

# Build run
run = backend.build_run(
    corpus,
    configuration_name="My search index",
    config={},
)

# Query
result = backend.query(
    corpus,
    run=run,
    query_text="search terms",
    budget={"max_total_items": 10},
)
```

See docs/retrieval-evaluation.md for evaluation workflows and dataset formats.
Choosing a Backend
Use Case |
Recommended Backend |
Notes |
|---|---|---|
Development & testing |
No index to build, immediate results |
|
Small corpora (<1000 items) |
Fast enough without indexing overhead |
|
Production applications |
Fast queries with BM25 ranking |
|
Large corpora (>10,000 items) |
Essential for performance |
|
Baseline comparisons |
Simple reference implementation |
|
Term-frequency vector baseline |
Deterministic cosine similarity |
|
Embedding retrieval (in-memory) |
Exact cosine similarity |
|
Embedding retrieval (file-backed) |
Exact cosine similarity, memory-mapped |
## Reproducibility checklist

- Record the extraction snapshot reference used for the backend build.
- Keep backend configurations in source control.
- Reuse the same `QueryBudget` when comparing backends.
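Holding the budget fixed matters because the budget caps how much evidence each backend may return. A minimal sketch of applying such a cap uniformly (the dict shape mirrors the `max_total_items` key used elsewhere on this page; the helper itself is hypothetical, not a Biblicus API):

```python
def apply_budget(ranked_evidence, budget):
    """Truncate a ranked evidence list to the budget's item cap."""
    max_total_items = budget.get("max_total_items")
    if max_total_items is None:
        return list(ranked_evidence)
    return list(ranked_evidence)[:max_total_items]
```

If two backends are compared under different caps, recall-style metrics are not comparable, which is why the checklist insists on reusing one budget.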
## Common pitfalls

- Comparing runs built from different extraction outputs.
- Forgetting to persist the snapshot identifier for later evaluation.
- Using different budget settings and expecting metrics to be comparable.
## Performance Comparison

### Scan Backend

- Build time: None (no index)
- Query time: O(n) - linear scan of all items
- Memory: Low (no index storage)
- Disk: None (no artifacts)
- Example: 1000-item corpus, 5-10 second query time

### SQLite Full-Text Search Backend

- Build time: O(n) - one-time indexing
- Query time: O(log n) - indexed search
- Memory: Moderate (SQLite index)
- Disk: ~1-5 MB per 1000 text items
- Example: 10,000-item corpus, <100 ms query time after indexing
## Common Patterns

### Development Workflow

Use the scan backend during development for immediate feedback:

```shell
biblicus build my-corpus --backend scan
biblicus query my-corpus --query "test"
```

### Production Deployment

Build a persistent index with sqlite-full-text-search:

```shell
biblicus build my-corpus --backend sqlite-full-text-search \
  --config chunk_size=800 \
  --config chunk_overlap=200
```
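The `chunk_size`/`chunk_overlap` settings describe a sliding character window over the extracted text. A minimal sketch of such a chunker (illustrative, not the backend's exact splitting logic):

```python
def chunk_text(text, chunk_size=800, chunk_overlap=200):
    """Split text into fixed-size character chunks with overlapping edges."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps phrases that straddle a chunk boundary searchable from at least one chunk, at the cost of indexing some characters twice.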
### Baseline Comparison

Compare backends using the same corpus:

```shell
# Build with both backends
biblicus build my-corpus --backend scan --configuration scan-baseline
biblicus build my-corpus --backend sqlite-full-text-search --configuration fts-index

# Query both
biblicus query my-corpus --run scan:RUN_ID --query "test"
biblicus query my-corpus --run sqlite-full-text-search:RUN_ID --query "test"
```
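Once both runs return results, a simple overlap measure over the top-k result ids gives a quick sanity check on how much the backends agree. A minimal sketch (hypothetical helper, not part of Biblicus):

```python
def top_k_overlap(run_a_ids, run_b_ids, k=10):
    """Fraction of top-k result ids shared by two retrieval runs."""
    a, b = set(run_a_ids[:k]), set(run_b_ids[:k])
    if not a and not b:
        return 1.0
    return len(a & b) / max(len(a), len(b))
```

Low overlap is not necessarily bad, but it signals that the two backends are ranking very different evidence and that a labeled evaluation set is worth building.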
### Using Extracted Text

All backends support extraction snapshots for non-text content:

```shell
# Extract text from PDFs
biblicus extract my-corpus --extractor pdf-text

# Build backend with extraction snapshot
biblicus build my-corpus --backend sqlite-full-text-search \
  --config extraction_snapshot=pdf-text:EXTRACTION_SNAPSHOT_ID
```
## Backend Configuration

### Common Configuration Options

All backends support these configuration options:

| Option | Type | Default | Description |
|---|---|---|---|
| | int | 400 | Maximum characters in evidence snippets |
| `extraction_snapshot` | str | None | Extraction run reference (extractor_id:snapshot_id) |
### Backend-Specific Options

#### Scan Backend

No additional configuration options.

#### SQLite Full-Text Search Backend

| Option | Type | Default | Description |
|---|---|---|---|
| `chunk_size` | int | 800 | Maximum characters per chunk |
| `chunk_overlap` | int | 200 | Overlap characters between chunks |
## Architecture

### Backend Interface

All backends implement the RetrievalBackend interface:

```python
class RetrievalBackend:
    backend_id: str

    def build_run(self, corpus, *, configuration_name, config) -> RetrievalRun:
        """Build a retrieval snapshot (may create artifacts)."""

    def query(self, corpus, *, run, query_text, budget) -> RetrievalResult:
        """Query the run and return evidence."""
```
### Evidence Model

All backends return structured Evidence objects:

```python
class Evidence:
    item_id: str                # Corpus item identifier
    source_uri: Optional[str]   # Original source URI
    media_type: str             # MIME type
    score: float                # Relevance score
    rank: int                   # Result rank
    text: str                   # Evidence snippet
    span_start: Optional[int]   # Span start offset
    span_end: Optional[int]     # Span end offset
    stage: str                  # Processing stage
    configuration_id: str       # Configuration identifier
    snapshot_id: str            # Run identifier
    hash: str                   # Content hash
```
## Implementing Custom Backends

To implement a custom backend:

1. Subclass `RetrievalBackend`.
2. Implement the `build_run()` and `query()` methods.
3. Register the backend in `biblicus.backends.available_backends`.
4. Add BDD specifications with 100% coverage.
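A toy backend following those steps might look like this sketch. The placeholder `RetrievalRun`/`RetrievalResult` dataclasses below stand in for Biblicus's actual models, and the substring-matching logic is purely illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class RetrievalRun:
    backend_id: str
    configuration_name: str

@dataclass
class RetrievalResult:
    evidence: list = field(default_factory=list)

class SubstringMatchBackend:
    """Toy backend: returns items whose text contains the query verbatim."""

    backend_id = "substring-match"

    def __init__(self, items):
        self._items = items  # list of (item_id, text) pairs

    def build_run(self, corpus, *, configuration_name, config):
        # A real backend would build and persist index artifacts here.
        return RetrievalRun(self.backend_id, configuration_name)

    def query(self, corpus, *, run, query_text, budget):
        cap = budget.get("max_total_items", 10)
        hits = [item_id for item_id, text in self._items
                if query_text.lower() in text.lower()]
        return RetrievalResult(evidence=hits[:cap])
```

The key contract is that `query()` honors the budget and returns evidence in rank order; everything else (indexing, artifacts, scoring) is up to the backend.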
See backends.md for implementation details.
## See Also

- scan backend - Naive full-scan backend
- sqlite-full-text-search backend - SQLite FTS5 backend
- backends.md - Backend implementation guide
- extraction.md - Text extraction pipeline
- Extractor Reference - Text extraction plugins