Adding a Retrieval Backend

Backends are pluggable engines that implement a small, stable interface. The goal is to make new retrieval ideas easy to test without reshaping the corpus.

For user documentation on available backends, see the Backend Reference.

Backend contract

Backends implement two operations:

  • Build run: create a RetrievalRun manifest (and optional artifacts).

  • Query: return structured Evidence objects under a QueryBudget.
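
The two operations can be sketched as a typing Protocol. This is only an illustration, not the Biblicus interface itself: the dataclasses are stdlib stand-ins for RetrievalRun, Evidence, and QueryBudget, the max_total_items field is a hypothetical budget limit, and SubstringBackend is an invented toy backend over a plain dict corpus.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Protocol


@dataclass(frozen=True)
class QueryBudget:
    max_total_items: int  # hypothetical limit; the real budget may differ


@dataclass(frozen=True)
class RetrievalRun:
    backend_id: str
    snapshot_id: str


@dataclass(frozen=True)
class Evidence:
    item_id: str
    score: float
    rank: int
    text: Optional[str] = None


class RetrievalBackend(Protocol):
    """Stand-in for the backend contract: one build operation, one query operation."""

    def build_run(self, corpus: Dict[str, str], configuration_name: str, config: dict) -> RetrievalRun: ...

    def query(
        self, corpus: Dict[str, str], run: RetrievalRun, query_text: str, budget: QueryBudget
    ) -> List[Evidence]: ...


class SubstringBackend:
    """Toy backend: scores each item by how often the query occurs in it."""

    def build_run(self, corpus, configuration_name, config):
        # Nothing to index, so the run is only a manifest-like record.
        return RetrievalRun(backend_id="substring", snapshot_id="snapshot-0")

    def query(self, corpus, run, query_text, budget):
        needle = query_text.lower()
        scored = [(text.lower().count(needle), item_id, text) for item_id, text in corpus.items()]
        # Highest count first; ties broken by item_id for a stable order.
        hits = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
        # Respect the budget by truncating after ranking.
        return [
            Evidence(item_id=item_id, score=float(count), rank=rank, text=text)
            for rank, (count, item_id, text) in enumerate(hits[: budget.max_total_items], start=1)
        ]


corpus = {"a": "red fox red", "b": "blue", "c": "Red"}
backend: RetrievalBackend = SubstringBackend()
run = backend.build_run(corpus, "default", {})
results = backend.query(corpus, run, "red", QueryBudget(max_total_items=1))
```

Because the budget is applied after ranking, a smaller budget returns a prefix of the larger result list, which keeps comparisons across budgets easy to reason about.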

Run artifacts

Backends store artifacts and manifests under:

retrieval/<backend_id>/<snapshot_id>/
  manifest.json
  <backend artifacts>

The manifest is the reproducible contract. Artifacts are backend-specific and listed in artifact_paths.
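
As a rough sketch of that layout, the helper below writes artifacts and a manifest.json into a run directory. The function name write_run, the manifest keys other than artifact_paths, and the choice to record paths relative to the corpus root are all assumptions made for illustration; the real manifest schema may differ.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def write_run(corpus_root: Path, backend_id: str, snapshot_id: str, artifacts: Dict[str, bytes]) -> Path:
    """Write artifacts and a manifest under retrieval/<backend_id>/<snapshot_id>/."""
    run_dir = corpus_root / "retrieval" / backend_id / snapshot_id
    # exist_ok=False: refuse to overwrite an existing run directory.
    run_dir.mkdir(parents=True, exist_ok=False)

    artifact_paths: List[str] = []
    for name, payload in sorted(artifacts.items()):
        artifact = run_dir / name
        artifact.write_bytes(payload)
        # Assumption: paths are recorded relative to the corpus root.
        artifact_paths.append(artifact.relative_to(corpus_root).as_posix())

    manifest = {
        "backend_id": backend_id,
        "snapshot_id": snapshot_id,
        "artifact_paths": artifact_paths,
    }
    manifest_path = run_dir / "manifest.json"
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True), encoding="utf-8")
    return manifest_path


# Usage against a throwaway directory standing in for a corpus root.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    manifest_path = write_run(root, "scan", "snapshot-001", {"index.bin": b"\x00\x01"})
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    relative_manifest = manifest_path.relative_to(root).as_posix()
```

Creating the run directory with exist_ok=False makes a second write to the same snapshot fail loudly instead of silently replacing artifacts.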

Implementation checklist

  1. Define a Pydantic model for your backend configuration.

  2. Implement RetrievalBackend:

    • build_run(corpus, configuration_name, config)

    • query(corpus, run, query_text, budget)

  3. Emit Evidence with required fields:

    • item_id, source_uri, media_type, score, rank, stage, configuration_id, snapshot_id

    • text or content_ref

  4. Register the backend in biblicus.backends.available_backends.

  5. Write behavior-driven development specifications before the implementation, then make them pass with 100% coverage.
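
Step 3 can be illustrated with a small stand-in for Evidence that carries the required fields and rejects records with neither text nor content_ref. The sketch uses a stdlib dataclass so it runs without dependencies; the real Evidence type lives in Biblicus and may validate differently. The rank_hits helper, the dict shape of a hit, and ranks starting at 1 are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Evidence:
    item_id: str
    source_uri: str
    media_type: str
    score: float
    rank: int
    stage: str
    configuration_id: str
    snapshot_id: str
    text: Optional[str] = None
    content_ref: Optional[str] = None

    def __post_init__(self) -> None:
        # At least one of text or content_ref must be present.
        if self.text is None and self.content_ref is None:
            raise ValueError(f"evidence for {self.item_id!r} needs text or content_ref")


def rank_hits(hits: List[Dict], *, stage: str, configuration_id: str, snapshot_id: str) -> List[Evidence]:
    """Turn raw scored hits (hypothetical dicts) into ranked Evidence records."""
    ordered = sorted(hits, key=lambda hit: -hit["score"])
    return [
        Evidence(
            item_id=hit["item_id"],
            source_uri=hit["source_uri"],
            media_type=hit["media_type"],
            score=hit["score"],
            rank=rank,  # assumption: ranks start at 1
            stage=stage,
            configuration_id=configuration_id,
            snapshot_id=snapshot_id,
            text=hit.get("text"),
            content_ref=hit.get("content_ref"),
        )
        for rank, hit in enumerate(ordered, start=1)
    ]


hits = [
    {"item_id": "a", "source_uri": "file:///a.txt", "media_type": "text/plain", "score": 0.2, "text": "alpha"},
    {"item_id": "b", "source_uri": "file:///b.txt", "media_type": "text/plain", "score": 0.9, "text": "beta"},
]
evidence = rank_hits(hits, stage="scan", configuration_id="config-1", snapshot_id="snapshot-001")
```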

Design notes

  • Treat runs as immutable manifests with reproducible parameters.

  • If your backend needs artifacts, store them under retrieval/ and record paths in artifact_paths.

  • Keep text extraction in explicit pipeline stages, not in backend ingestion. See docs/extraction.md for how extraction snapshots are built and referenced from backend configs.

Reproducibility checklist

  • Record the extraction snapshot reference used to build the backend.

  • Keep the backend configuration in source control.

  • Reuse the same QueryBudget when comparing backends.

Common pitfalls

  • Returning evidence without text or content_ref.

  • Mutating artifacts after a run is created (breaks reproducibility).

  • Comparing runs built from different extraction outputs.
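
One way to guard against the last pitfall is a check that runs before any comparison. The sketch below is hypothetical throughout: RunSummary, its field names, and check_comparable are invented for illustration and do not correspond to a known Biblicus API. It also checks the budget, in line with the reproducibility checklist.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RunSummary:
    backend_id: str
    extraction_snapshot: str  # hypothetical: reference to the extraction output used
    max_total_items: int      # hypothetical: the budget limit used when querying


def check_comparable(a: RunSummary, b: RunSummary) -> List[str]:
    """Return a list of reasons the two runs should not be compared (empty if none)."""
    problems: List[str] = []
    if a.extraction_snapshot != b.extraction_snapshot:
        problems.append(
            f"different extraction outputs: {a.extraction_snapshot!r} vs {b.extraction_snapshot!r}"
        )
    if a.max_total_items != b.max_total_items:
        problems.append(f"different budgets: {a.max_total_items} vs {b.max_total_items}")
    return problems


scan = RunSummary("scan", "extract-1", 5)
tf_vector = RunSummary("tf-vector", "extract-2", 5)
problems = check_comparable(scan, tf_vector)
```

Returning a list of reasons rather than raising on the first mismatch lets a report show every difference between two runs at once.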

Examples

See:

  • biblicus.retrievers.scan.ScanRetriever (minimal baseline)

  • biblicus.retrievers.sqlite_full_text_search.SqliteFullTextSearchRetriever (practical local backend)

  • biblicus.retrievers.tf_vector.TfVectorRetriever (term-frequency vector baseline; tf-vector)