Retrieval Backends

Biblicus provides pluggable retrieval backends that implement different search and ranking strategies. Each backend defines how evidence is retrieved from your corpus.

Available Backends

scan

Naive full-scan backend that searches all text items at query time without pre-built indexes.

  • Backend ID: scan

  • Installation: Included by default

  • Best for: Small corpora, development, baseline comparisons

  • Index: None (scans at query time)

  • Speed: Slow for large corpora
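As a rough illustration of what "scans at query time" means, the sketch below scores every item on each query by counting term occurrences. It is not the Biblicus implementation: the function name `scan_corpus`, the dict-based corpus, and the scoring rule are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def scan_corpus(items: Dict[str, str], query: str, limit: int = 10) -> List[Tuple[str, int]]:
    """Score every item by counting query-term occurrences; no index is built."""
    terms = query.lower().split()
    scored = []
    for item_id, text in items.items():
        lowered = text.lower()
        score = sum(lowered.count(term) for term in terms)
        if score > 0:
            scored.append((item_id, score))
    # Highest score first; item_id breaks ties so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:limit]


# Hypothetical three-item corpus, keyed by item identifier.
items = {
    "a": "Retrieval backends rank evidence.",
    "b": "Evidence is retrieved from the corpus; evidence matters.",
    "c": "Unrelated text.",
}
results = scan_corpus(items, "evidence corpus")
```

Every query touches every item, which is why cost grows linearly with corpus size.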

tf-vector

Deterministic term-frequency vector retrieval (vector space model baseline) with cosine similarity.

  • Backend ID: tf-vector

  • Installation: Included by default

  • Best for: Semantic-style baselines without embeddings

  • Index: None (scans and scores at query time)

  • Speed: Moderate for small corpora
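A minimal sketch of term-frequency vectors compared by cosine similarity follows. The tokenizer, the helper names (`tokenize`, `cosine`, `tf_vector_rank`), and the dict-based corpus are assumptions for illustration; the actual tf-vector backend may tokenize and weight terms differently.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def tf_vector_rank(items: Dict[str, str], query: str) -> List[Tuple[str, float]]:
    """Score every item against the query at query time; no index is kept."""
    query_vector = Counter(tokenize(query))
    scored = [
        (item_id, cosine(query_vector, Counter(tokenize(text))))
        for item_id, text in items.items()
    ]
    scored = [pair for pair in scored if pair[1] > 0.0]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


items = {"x": "apple apple banana", "y": "banana cherry", "z": "durian"}
ranked = tf_vector_rank(items, "apple banana")
```

Because the scores depend only on term counts, the same corpus and query always produce the same ranking, which is what makes this kind of backend useful as a baseline.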

embedding-index-inmemory

Embedding-based retrieval with an in-memory exact cosine similarity index.

  • Backend ID: embedding-index-inmemory

  • Installation: Requires numpy and an embedding provider configuration

  • Best for: Textbook demos and small corpora

  • Index: In-memory embedding matrix

  • Speed: Fast for small corpora; bounded by safety caps

embedding-index-file

Embedding-based retrieval with a file-backed exact cosine similarity index.

  • Backend ID: embedding-index-file

  • Installation: Requires numpy and an embedding provider configuration

  • Best for: Larger corpora without running an external vector database

  • Index: Memory-mapped embedding matrix + id mapping under the corpus

  • Speed: Exact scan; bounded memory via batching
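The idea of an exact scan over a memory-mapped matrix with bounded memory can be sketched with `numpy.memmap`. This is an illustration only: the file name, the float32 layout, the batch size, and the function `batched_cosine_top_k` are assumptions, not the backend's actual on-disk format or API. The returned row indices would be translated to item identifiers through a separate id mapping.

```python
import os
import tempfile

import numpy as np


def batched_cosine_top_k(matrix, query, k, batch_size=1024):
    """Exact cosine top-k over a (possibly memory-mapped) matrix, one batch at a time."""
    query = np.asarray(query, dtype=np.float32)
    query_norm = np.linalg.norm(query)
    if query_norm == 0.0:
        raise ValueError("query vector must be non-zero")
    query = query / query_norm
    best = []  # (score, row_index) pairs, never longer than k + batch_size
    for start in range(0, matrix.shape[0], batch_size):
        batch = np.asarray(matrix[start:start + batch_size], dtype=np.float32)
        norms = np.linalg.norm(batch, axis=1)
        norms[norms == 0.0] = 1.0  # all-zero rows score 0 instead of dividing by zero
        scores = (batch @ query) / norms
        best.extend((float(score), start + i) for i, score in enumerate(scores))
        # Keep only the k best so memory use does not grow with corpus size.
        best.sort(key=lambda pair: (-pair[0], pair[1]))
        del best[k:]
    return [(index, score) for score, index in best]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embeddings.f32")  # hypothetical file name
    vectors = np.array([[1, 0], [0, 1], [1, 1], [-1, 0]], dtype=np.float32)
    vectors.tofile(path)
    mapped = np.memmap(path, dtype=np.float32, mode="r", shape=vectors.shape)
    top = batched_cosine_top_k(mapped, [1.0, 0.0], k=2, batch_size=2)
    del mapped  # release the mapping before the directory is removed
```

Only one batch of rows is resident at a time, so the scan stays exact while peak memory is governed by `batch_size` rather than by the number of embeddings.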

Quick Start

Installation

All backends ship with the base Biblicus installation; the embedding backends additionally need numpy and an embedding provider configuration, as noted above:

pip install biblicus

Basic Usage

Command Line

# Initialize corpus
biblicus init my-corpus

# Ingest documents
biblicus ingest my-corpus document.pdf

# Extract text
biblicus extract my-corpus --extractor pdf-text

# Build retrieval snapshot with a backend
biblicus build my-corpus --backend sqlite-full-text-search

# Query the run
biblicus query my-corpus --query "search terms"

See docs/retrieval.md for a step-by-step retrieval walkthrough.

Python API

from biblicus import Corpus
from biblicus.backends import get_backend

# Load corpus
corpus = Corpus.from_directory("my-corpus")

# Get backend
backend = get_backend("sqlite-full-text-search")

# Build run
run = backend.build_run(
    corpus,
    configuration_name="My search index",
    config={}
)

# Query
result = backend.query(
    corpus,
    run=run,
    query_text="search terms",
    budget={"max_total_items": 10}
)

See docs/retrieval-evaluation.md for evaluation workflows and dataset formats.

Choosing a Backend

Use Case                            Recommended Backend        Notes
Development & testing               scan                       No index to build, immediate results
Small corpora (<1000 items)         scan                       Fast enough without indexing overhead
Production applications             sqlite-full-text-search    Fast queries with BM25 ranking
Large corpora (>10,000 items)       sqlite-full-text-search    Essential for performance
Baseline comparisons                scan                       Simple reference implementation
Term-frequency vector baseline      tf-vector                  Deterministic cosine similarity
Embedding retrieval (in-memory)     embedding-index-inmemory   Exact cosine similarity
Embedding retrieval (file-backed)   embedding-index-file       Exact cosine similarity, memory-mapped

Reproducibility checklist

  • Record the extraction snapshot reference used for the backend build.

  • Keep backend configurations in source control.

  • Reuse the same QueryBudget when comparing backends.

Common pitfalls

  • Comparing runs built from different extraction outputs.

  • Forgetting to persist the snapshot identifier for later evaluation.

  • Using different budget settings and expecting metrics to be comparable.
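One way to act on the checklist and pitfalls above is to write a small record next to each build and check that two records are comparable before comparing metrics. The `RunRecord` class and its field names are hypothetical helpers for this sketch, not part of Biblicus; the placeholder `EXTRACTION_SNAPSHOT_ID` is left unfilled on purpose.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical bookkeeping record for one backend build."""

    backend_id: str
    extraction_snapshot: str  # extractor_id:snapshot_id used for the build
    config: Dict[str, Any] = field(default_factory=dict)
    budget: Dict[str, Any] = field(default_factory=dict)

    def to_json(self) -> str:
        # Sorted keys keep diffs stable when the file lives in source control.
        return json.dumps(asdict(self), sort_keys=True, indent=2)


def comparable(first: RunRecord, second: RunRecord) -> bool:
    """Two runs are comparable only with the same extraction output and budget."""
    return (
        first.extraction_snapshot == second.extraction_snapshot
        and first.budget == second.budget
    )


shared_budget = {"max_total_items": 10}
scan_record = RunRecord("scan", "pdf-text:EXTRACTION_SNAPSHOT_ID", {}, shared_budget)
fts_record = RunRecord(
    "sqlite-full-text-search",
    "pdf-text:EXTRACTION_SNAPSHOT_ID",
    {"chunk_size": 800, "chunk_overlap": 200},
    shared_budget,
)
ok_to_compare = comparable(scan_record, fts_record)
```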

Performance Comparison

Scan Backend

  • Build time: None (no index)

  • Query time: O(n) - linear scan of all items

  • Memory: Low (no index storage)

  • Disk: None (no artifacts)

Example: 1000-item corpus, 5-10 second query time

SQLite Full-Text Search Backend

  • Build time: O(n) - one-time indexing

  • Query time: O(log n) - indexed search

  • Memory: Moderate (SQLite index)

  • Disk: ~1-5 MB per 1000 text items

Example: 10,000-item corpus, <100ms query time after indexing
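To make the build-once, query-fast trade-off concrete, here is a sketch of a BM25-ranked full-text index using Python's standard `sqlite3` module. It assumes the SQLite build includes the FTS5 extension, and the table and column names are invented for the example; it is not a description of how the sqlite-full-text-search backend stores its data.

```python
import sqlite3

connection = sqlite3.connect(":memory:")

# One-time indexing step: item_id is stored but not tokenized.
connection.execute("CREATE VIRTUAL TABLE chunks USING fts5(item_id UNINDEXED, content)")
connection.executemany(
    "INSERT INTO chunks (item_id, content) VALUES (?, ?)",
    [
        ("a", "retrieval retrieval backends"),
        ("b", "a long note that mentions retrieval once among many other unrelated words"),
        ("c", "nothing relevant here"),
        ("d", "ingest documents into the corpus"),
        ("e", "extract text from files"),
    ],
)

# FTS5's bm25() returns lower (more negative) values for better matches,
# so ascending order puts the best match first.
rows = connection.execute(
    "SELECT item_id, bm25(chunks) FROM chunks "
    "WHERE chunks MATCH ? ORDER BY bm25(chunks) LIMIT 5",
    ("retrieval",),
).fetchall()
connection.close()
```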

Common Patterns

Development Workflow

Use the scan backend during development for immediate feedback:

biblicus build my-corpus --backend scan
biblicus query my-corpus --query "test"

Production Deployment

Build a persistent index with sqlite-full-text-search:

biblicus build my-corpus --backend sqlite-full-text-search \
  --config chunk_size=800 \
  --config chunk_overlap=200

Baseline Comparison

Compare backends using the same corpus:

# Build with both backends
biblicus build my-corpus --backend scan --configuration scan-baseline
biblicus build my-corpus --backend sqlite-full-text-search --configuration fts-index

# Query both
biblicus query my-corpus --run scan:RUN_ID --query "test"
biblicus query my-corpus --run sqlite-full-text-search:RUN_ID --query "test"

Using Extracted Text

All backends support extraction snapshots for non-text content:

# Extract text from PDFs
biblicus extract my-corpus --extractor pdf-text

# Build backend with extraction snapshot
biblicus build my-corpus --backend sqlite-full-text-search \
  --config extraction_snapshot=pdf-text:EXTRACTION_SNAPSHOT_ID

Backend Configuration

Common Configuration Options

All backends support these configuration options:

Option                Type   Default   Description
snippet_characters    int    400       Maximum characters in evidence snippets
extraction_snapshot   str    None      Extraction run reference (extractor_id:snapshot_id)
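The two common options can be illustrated with a short sketch: splitting an `extractor_id:snapshot_id` reference and capping a snippet at `snippet_characters`. The helper names `parse_extraction_snapshot` and `make_snippet` are hypothetical, and simple truncation is an assumption about how a snippet limit might be applied.

```python
from typing import NamedTuple


class ExtractionReference(NamedTuple):
    extractor_id: str
    snapshot_id: str


def parse_extraction_snapshot(reference: str) -> ExtractionReference:
    """Split 'extractor_id:snapshot_id' at the first colon and validate both parts."""
    extractor_id, separator, snapshot_id = reference.partition(":")
    if not separator or not extractor_id or not snapshot_id:
        raise ValueError(f"expected extractor_id:snapshot_id, got {reference!r}")
    return ExtractionReference(extractor_id, snapshot_id)


def make_snippet(text: str, snippet_characters: int = 400) -> str:
    """Return at most snippet_characters characters from the start of the text."""
    return text[:snippet_characters]


reference = parse_extraction_snapshot("pdf-text:abc123")  # abc123 is a made-up id
snippet = make_snippet("x" * 1000)
```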

Backend-Specific Options

Scan Backend

No additional configuration options.

SQLite Full-Text Search Backend

Option          Type   Default   Description
chunk_size      int    800       Maximum characters per chunk
chunk_overlap   int    200       Overlap characters between chunks
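How `chunk_size` and `chunk_overlap` interact is easiest to see in code. The sketch below assumes fixed-width character windows that advance by `chunk_size - chunk_overlap`; the function `chunk_text` is illustrative, and the backend's real chunker may split on different boundaries.

```python
from typing import List, Tuple


def chunk_text(text: str, chunk_size: int = 800, chunk_overlap: int = 200) -> List[Tuple[int, int, str]]:
    """Split text into overlapping character windows as (start, end, chunk) tuples."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        end = min(start + chunk_size, len(text))
        chunks.append((start, end, text[start:end]))
        if end == len(text):
            break  # the last window already reached the end of the text
    return chunks


spans = [(start, end) for start, end, _ in chunk_text("x" * 2000)]
```

With the defaults, a 2000-character text yields three chunks, and each adjacent pair shares 200 characters, so a phrase that straddles a boundary still appears whole in one chunk.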

Architecture

Backend Interface

All backends implement the RetrievalBackend interface:

class RetrievalBackend:
    backend_id: str

    def build_run(self, corpus, *, configuration_name, config) -> RetrievalRun:
        """Build a retrieval snapshot (may create artifacts)."""

    def query(self, corpus, *, run, query_text, budget) -> RetrievalResult:
        """Query the run and return evidence."""

Evidence Model

All backends return structured Evidence objects:

class Evidence:
    item_id: str                  # Corpus item identifier
    source_uri: Optional[str]     # Original source URI
    media_type: str               # MIME type
    score: float                  # Relevance score
    rank: int                     # Result rank
    text: str                     # Evidence snippet
    span_start: Optional[int]     # Span start offset
    span_end: Optional[int]       # Span end offset
    stage: str                    # Processing stage
    configuration_id: str         # Configuration identifier
    snapshot_id: str              # Run identifier
    hash: str                     # Content hash

Implementing Custom Backends

To implement a custom backend:

  1. Subclass RetrievalBackend

  2. Implement build_run() and query() methods

  3. Register in biblicus.backends.available_backends

  4. Add BDD specifications with 100% coverage

See backends.md for implementation details.
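The first two steps can be sketched as a toy backend. To keep the example runnable on its own, it defines simplified stand-ins for `RetrievalBackend`, `RetrievalRun`, `RetrievalResult`, and `Evidence`; the real classes come from Biblicus and carry more fields. The backend id `keyword-count`, the dict-based corpus, and the `evidence` attribute on the result are assumptions. Registration and BDD specifications are omitted because their exact form is not shown here.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


# Simplified stand-ins so the sketch is self-contained (not the Biblicus classes).
@dataclass
class RetrievalRun:
    snapshot_id: str
    configuration_name: str
    config: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Evidence:
    item_id: str
    score: float
    rank: int
    text: str


@dataclass
class RetrievalResult:
    evidence: List[Evidence]


class RetrievalBackend:
    backend_id: str

    def build_run(self, corpus, *, configuration_name, config) -> RetrievalRun:
        raise NotImplementedError

    def query(self, corpus, *, run, query_text, budget) -> RetrievalResult:
        raise NotImplementedError


class KeywordCountBackend(RetrievalBackend):
    backend_id = "keyword-count"  # hypothetical identifier

    def build_run(self, corpus, *, configuration_name, config) -> RetrievalRun:
        # Nothing to index; just remember the configuration used.
        return RetrievalRun("demo-snapshot", configuration_name, dict(config))

    def query(self, corpus, *, run, query_text, budget) -> RetrievalResult:
        needle = query_text.lower()
        limit = run.config.get("snippet_characters", 400)
        hits = [(text.lower().count(needle), item_id, text) for item_id, text in corpus.items()]
        hits = [hit for hit in hits if hit[0] > 0]
        hits.sort(key=lambda hit: (-hit[0], hit[1]))
        hits = hits[: budget.get("max_total_items", 10)]
        return RetrievalResult(
            [
                Evidence(item_id, float(count), rank, text[:limit])
                for rank, (count, item_id, text) in enumerate(hits, start=1)
            ]
        )


corpus = {"a": "alpha beta alpha", "b": "beta", "c": "alpha"}  # stand-in corpus
backend = KeywordCountBackend()
run = backend.build_run(corpus, configuration_name="demo", config={})
result = backend.query(corpus, run=run, query_text="alpha", budget={"max_total_items": 1})
```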

See Also