Retrieval evaluation
Biblicus evaluates retrieval snapshots against deterministic datasets so quality comparisons are repeatable across backends and corpora. Evaluations keep the evidence-first model intact by reporting per-query evidence alongside summary metrics.
Dataset format
Retrieval datasets are stored as JavaScript Object Notation files with a strict schema:
{
"schema_version": 1,
"name": "example-dataset",
"description": "Small hand-labeled dataset for smoke tests.",
"queries": [
{
"query_id": "q-001",
"query_text": "alpha",
"expected_item_id": "item-id-123",
"kind": "gold"
}
]
}
Each query includes either an expected_item_id or an expected_source_uri. The kind field records whether the
query is hand-labeled (gold) or synthetic.
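For example, a synthetic query keyed by source URI rather than item identifier could look like the following (the URI value is illustrative, and "synthetic" is assumed here to be the kind value for generated queries):
{
  "query_id": "q-002",
  "query_text": "beta",
  "expected_source_uri": "file:///tmp/eval-beta.txt",
  "kind": "synthetic"
}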
Metrics primer
Retrieval evaluation reports a small set of textbook metrics:
Hit rate: the fraction of queries that retrieved the expected item at any rank.
Precision-at-k: hit rate normalized by the evidence budget (max_total_items).
Mean reciprocal rank: the average of 1 / rank for the first matching item per query.
These metrics are deterministic for the same corpus, run, dataset, and budget.
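As an illustration of how these metrics relate, here is a minimal sketch (not the Biblicus implementation) that computes them from the rank of the first matching item per query; it assumes queries with no matching item contribute zero to the mean reciprocal rank.
# Minimal illustration of the evaluation metrics; not the Biblicus implementation.
# ranks holds the 1-based rank of the first matching item for each query,
# or None when the expected item was not retrieved at all.
ranks = [1, 3, None]
max_total_items = 5  # evidence budget used to normalize precision-at-k
hit_rate = sum(1 for rank in ranks if rank is not None) / len(ranks)
precision_at_max_total_items = hit_rate / max_total_items
# Assumption: misses contribute zero to the average reciprocal rank.
mean_reciprocal_rank = sum(1.0 / rank for rank in ranks if rank is not None) / len(ranks)
print(hit_rate, precision_at_max_total_items, mean_reciprocal_rank)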
Running an evaluation
Use the command-line interface to evaluate a retrieval snapshot against a dataset:
biblicus eval --corpus corpora/example --run <snapshot_id> --dataset datasets/retrieval.json \
--max-total-items 5 --maximum-total-characters 2000 --max-items-per-source 5
If --run is omitted, the latest retrieval snapshot is used. Evaluations are deterministic for the same corpus, run, dataset, and budget.
End-to-end evaluation example
This example builds a tiny corpus, creates a retrieval snapshot, and evaluates it against a minimal dataset:
rm -rf corpora/retrieval_eval_demo
python -m biblicus init corpora/retrieval_eval_demo
printf "alpha apple\n" > /tmp/eval-alpha.txt
printf "beta banana\n" > /tmp/eval-beta.txt
python -m biblicus ingest --corpus corpora/retrieval_eval_demo /tmp/eval-alpha.txt
python -m biblicus ingest --corpus corpora/retrieval_eval_demo /tmp/eval-beta.txt
python -m biblicus extract build --corpus corpora/retrieval_eval_demo --stage pass-through-text
python -m biblicus build --corpus corpora/retrieval_eval_demo --backend sqlite-full-text-search
cat > /tmp/retrieval_eval_dataset.json <<'JSON'
{
"schema_version": 1,
"name": "retrieval-eval-demo",
"description": "Minimal dataset for evaluation walkthroughs.",
"queries": [
{
"query_id": "q1",
"query_text": "apple",
"expected_item_id": "ITEM_ID_FOR_ALPHA",
"kind": "gold"
}
]
}
JSON
Replace ITEM_ID_FOR_ALPHA with the item identifier from biblicus list, then run:
python -m biblicus eval --corpus corpora/retrieval_eval_demo --dataset /tmp/retrieval_eval_dataset.json \
--max-total-items 3 --maximum-total-characters 2000 --max-items-per-source 5
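If you prefer to script the substitution instead of editing the file by hand, here is a minimal sketch (the identifier itself still comes from biblicus list; the variable name is illustrative):
# Paste the item identifier reported by `biblicus list` for /tmp/eval-alpha.txt.
ALPHA_ITEM_ID="paste-item-id-here"
python -c "import pathlib; p = pathlib.Path('/tmp/retrieval_eval_dataset.json'); p.write_text(p.read_text().replace('ITEM_ID_FOR_ALPHA', '$ALPHA_ITEM_ID'))"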
Retrieval evaluation lab
The retrieval evaluation lab ships with bundled files and labels so you can run a deterministic end-to-end evaluation without external dependencies.
python scripts/retrieval_evaluation_lab.py --corpus corpora/retrieval_eval_lab --force
The script prints a summary that includes the generated dataset path, the retrieval snapshot identifier, and the evaluation output path.
Output
The evaluation output includes:
Dataset metadata (name, description, query count).
Run metadata (backend ID, run ID, evaluation timestamp).
Metrics (hit rate, precision-at-k, mean reciprocal rank).
System diagnostics (latency percentiles and index size).
The output is JavaScript Object Notation suitable for downstream reporting.
Example snippet:
{
"dataset": {
"name": "retrieval-eval-demo",
"description": "Minimal dataset for evaluation walkthroughs.",
"queries": 1
},
"backend_id": "sqlite-full-text-search",
"snapshot_id": "RUN_ID",
"evaluated_at": "2024-01-01T00:00:00Z",
"metrics": {
"hit_rate": 1.0,
"precision_at_max_total_items": 0.3333333333333333,
"mean_reciprocal_rank": 1.0
},
"system": {
"average_latency_milliseconds": 1.2,
"percentile_95_latency_milliseconds": 2.4,
"index_bytes": 2048.0
}
}
The metrics section is the primary signal for retriever quality. The system section helps compare performance and
storage costs across backends.
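When comparing backends, the same fields can be read from each evaluation output. A minimal sketch, assuming two hypothetical output files eval-sqlite.json and eval-other.json produced from the same corpus, dataset, and budget:
import json
from pathlib import Path
# Hypothetical evaluation output paths; substitute the files your runs produced.
reports = {
    "sqlite-full-text-search": Path("eval-sqlite.json"),
    "other-backend": Path("eval-other.json"),
}
for label, path in reports.items():
    report = json.loads(path.read_text())
    metrics = report["metrics"]
    system = report["system"]
    print(
        label,
        metrics["hit_rate"],
        metrics["mean_reciprocal_rank"],
        system["percentile_95_latency_milliseconds"],
        system["index_bytes"],
    )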
Reading per-query diagnostics
When a query misses, inspect the evidence list for that query and compare it to your expectations:
Is the expected item present but ranked too low?
Is the expected item missing because extraction did not produce text?
Did the budget truncate the relevant evidence?
Use this inspection to decide whether to adjust extraction, retrieval configuration, or evaluation datasets.
What to record for comparisons
When you compare retrieval snapshots, capture the same inputs every time:
Corpus path (and whether the catalog has been reindexed).
Extraction snapshot identifier used by the retrieval snapshot.
Retrieval backend identifier and snapshot identifier.
Evaluation dataset path and schema version.
Evidence budget values.
This metadata allows you to rerun the evaluation and explain differences between results.
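One way to make reruns explainable is to write that metadata next to the evaluation output. A minimal sketch with placeholder values; the file name and field names are illustrative rather than a Biblicus format:
import json
from pathlib import Path
# Illustrative record of evaluation inputs; every value below is a placeholder.
evaluation_inputs = {
    "corpus_path": "corpora/retrieval_eval_demo",
    "catalog_reindexed": False,
    "extraction_snapshot_id": "EXTRACTION_SNAPSHOT_ID",
    "retrieval_backend_id": "sqlite-full-text-search",
    "retrieval_snapshot_id": "RUN_ID",
    "dataset_path": "/tmp/retrieval_eval_dataset.json",
    "dataset_schema_version": 1,
    "budget": {
        "max_total_items": 3,
        "maximum_total_characters": 2000,
        "max_items_per_source": 5,
    },
}
Path("evaluation-inputs.json").write_text(json.dumps(evaluation_inputs, indent=2))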
Common pitfalls
Evaluating against a dataset built for a different corpus or extraction snapshot.
Changing budgets between runs and expecting metrics to be comparable.
Using stale item identifiers after reindexing or re-ingesting content.
Python usage
from pathlib import Path

from biblicus.corpus import Corpus
from biblicus.evaluation import evaluate_run, load_dataset
from biblicus.models import QueryBudget

# Open the corpus and load the retrieval snapshot to evaluate.
corpus = Corpus.open("corpora/example")
run = corpus.load_run("<snapshot_id>")

# Load the dataset and set the evidence budget applied to each query.
dataset = load_dataset(Path("datasets/retrieval.json"))
budget = QueryBudget(max_total_items=5, maximum_total_characters=2000, max_items_per_source=5)

# Evaluate the snapshot against the dataset and print the full report as JSON.
result = evaluate_run(corpus=corpus, run=run, dataset=dataset, budget=budget)
print(result.model_dump_json(indent=2))
Design notes
Evaluation is reproducible by construction: the snapshot manifest, dataset, and budget fully determine the results.
The evaluation workflow expects retrieval stages to remain explicit in the snapshot artifacts.
Reports are portable, so comparisons across backends and corpora are straightforward.