TF Vector backend

The TF Vector backend implements a deterministic vector space model baseline using term-frequency vectors and cosine similarity. It builds no persistent index and scores items at query time. This makes it useful as a lightweight “vector-style” baseline without dense embeddings or external services.

When to use it

You want a minimal baseline to compare against lexical search.
You want deterministic, inspectable similarity scoring.
You are teaching retrieval concepts and want a small, runnable backend.

Backend ID

tf-vector

How it works

Tokenize the query and each item into lowercase word tokens.
Build term-frequency vectors.
Compute cosine similarity between the query vector and each item vector.
Return evidence ranked by similarity score.
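
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the backend's actual code: the function names (tokenize, term_frequencies, cosine_similarity, rank), the \w+ token pattern, and the choice to drop zero-score items are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase word tokens. The \w+ pattern is an assumption for this sketch.
    return re.findall(r"\w+", text.lower())


def term_frequencies(text: str) -> Counter:
    # Sparse term-frequency vector: token -> raw count.
    return Counter(tokenize(text))


def cosine_similarity(left: Counter, right: Counter) -> float:
    # Dot product over the terms of the left vector; Counter returns 0 for missing terms.
    dot = sum(count * right[term] for term, count in left.items())
    left_norm = math.sqrt(sum(count * count for count in left.values()))
    right_norm = math.sqrt(sum(count * count for count in right.values()))
    if left_norm == 0.0 or right_norm == 0.0:
        # An empty vector has no direction, so treat it as no match.
        return 0.0
    return dot / (left_norm * right_norm)


def rank(query: str, items: dict[str, str]) -> list[tuple[str, float]]:
    # Score every item at query time; nothing is indexed ahead of time.
    query_vector = term_frequencies(query)
    scored = [
        (item_id, cosine_similarity(query_vector, term_frequencies(text)))
        for item_id, text in items.items()
    ]
    # Dropping zero-score items is an assumption of this sketch.
    matches = [(item_id, score) for item_id, score in scored if score > 0.0]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    corpus = {
        "a": "semantic match",
        "b": "semantic search",
        "c": "unrelated text",
    }
    for item_id, score in rank("semantic match", corpus):
        print(item_id, round(score, 3))
```

Because the vectors hold raw counts with no inverse document frequency weighting, common words contribute as much as rare ones. That keeps the scores easy to inspect by hand, which fits the baseline role described above.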

Configuration

The backend accepts these configuration fields:

snippet_characters: maximum characters to include in evidence snippets.
extraction_snapshot: optional extraction snapshot reference (extractor_id:snapshot_id).

Example configuration:

snippet_characters: 320
extraction_snapshot: pipeline:RUN_ID

Build a run

python -m biblicus build --corpus corpora/example --backend tf-vector --config extraction_snapshot=pipeline:RUN_ID

Because scoring happens at query time, the build creates no artifacts beyond the snapshot manifest.

Query a run

python -m biblicus query --corpus corpora/example --run tf-vector:RUN_ID --query "semantic match"

Each evidence result carries a stage value of tf-vector and the cosine similarity score for that match.

What it is not

This backend does not compute dense embeddings.
It does not use approximate nearest neighbor indexing.
It does not depend on external services.