Extraction evaluation
Biblicus provides an extraction evaluation harness that measures how well an extractor configuration turns raw items into text. It is designed to be deterministic, auditable, and useful for selecting a default extraction pipeline.
What extraction evaluation measures
Extraction evaluation reports:
Coverage of extracted text (present, empty, missing)
Accuracy against labeled ground truth text
Processable fraction for each extractor configuration
Optional system metrics such as latency and external cost
The output is structured JSON so you can version it, compare it across runs, and use it in reports.
Dataset format
Extraction evaluation datasets are JSON with a versioned schema. Each entry maps a corpus item to its expected extracted text.
Example:
{
"schema_version": 1,
"name": "Extraction baseline",
"description": "Short labeled texts for extraction accuracy",
"items": [
{
"item_id": "3a2c3f0b-...",
"expected_text": "Hello world",
"kind": "gold"
},
{
"source_uri": "file:///corpora/demo/report.pdf",
"expected_text": "Quarterly results",
"kind": "gold"
}
]
}
Fields:
schema_version: dataset schema version, currently 1
name: dataset name
description: optional description
items: list of labeled items, each identified by either item_id or source_uri
expected_text: expected extracted text for the item
kind: label kind, for example gold or synthetic
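Biblicus validates datasets when it loads them; as an illustration of the expected shape only, a minimal stdlib check might look like the following (the validate_extraction_dataset helper is hypothetical, not part of the Biblicus API):

```python
import json
from pathlib import Path

def validate_extraction_dataset(path):
    """Check the basic shape of an extraction evaluation dataset file."""
    data = json.loads(Path(path).read_text())
    if data.get("schema_version") != 1:
        raise ValueError("unsupported schema_version")
    for entry in data["items"]:
        # Every entry must point at a corpus item one way or the other.
        if "item_id" not in entry and "source_uri" not in entry:
            raise ValueError("entry needs item_id or source_uri")
        if "expected_text" not in entry:
            raise ValueError("entry needs expected_text")
    return data
```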
Run extraction evaluation from the CLI
biblicus extract evaluate --corpus corpora/example \
--run pipeline:EXTRACTION_SNAPSHOT_ID \
--dataset datasets/extraction.json
If you omit --run, Biblicus uses the latest extraction snapshot and emits a reproducibility warning.
Run extraction evaluation from Python
from pathlib import Path
from biblicus.corpus import Corpus
from biblicus.extraction_evaluation import evaluate_extraction_snapshot, load_extraction_dataset
from biblicus.models import ExtractionRunReference
corpus = Corpus.open(Path("corpora/example"))
run = corpus.load_extraction_snapshot("pipeline", "RUN_ID")
dataset = load_extraction_dataset(Path("datasets/extraction.json"))
result = evaluate_extraction_snapshot(corpus=corpus, run=run, dataset=dataset)
print(result.model_dump())
Output location
Extraction evaluation artifacts are stored under:
analysis/evaluation/extraction/<snapshot_id>/output.json
Reading the output
Evaluation outputs include metrics and dataset/run metadata. A shortened example:
{
"dataset": {
"name": "Extraction baseline",
"description": "Short labeled texts for extraction accuracy",
"items": 2
},
"snapshot_id": "pipeline:RUN_ID",
"metrics": {
"coverage_present": 2.0,
"coverage_empty": 0.0,
"coverage_missing": 0.0,
"processable_fraction": 1.0,
"average_similarity": 1.0
}
}
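Because the artifact is plain JSON, it is easy to post-process. As a sketch (assuming the coverage_* values are item counts, as in the example above; the summarize_extraction_eval helper is ours, not a Biblicus function), you could reduce a report to a few headline numbers:

```python
import json
from pathlib import Path

def summarize_extraction_eval(output_path):
    """Reduce an extraction evaluation artifact to headline metrics."""
    report = json.loads(Path(output_path).read_text())
    m = report["metrics"]
    total = m["coverage_present"] + m["coverage_empty"] + m["coverage_missing"]
    return {
        "snapshot_id": report["snapshot_id"],
        # Fraction of labeled items that produced any text at all.
        "present_fraction": m["coverage_present"] / total if total else 0.0,
        "average_similarity": m["average_similarity"],
    }
```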
Use the coverage_* metrics to see how much of the corpus produced text, and average_similarity to compare extraction quality across configurations.
Working demo
A runnable demo is provided in scripts/extraction_evaluation_demo.py. It downloads AG News, runs extraction, builds a
dataset from the ingested items, and evaluates the extraction snapshot:
python scripts/extraction_evaluation_demo.py --corpus corpora/ag_news_extraction_eval --force
Extraction evaluation lab
For a fast, fully local walkthrough, use the bundled lab. It ingests a tiny set of files, runs extraction, generates a dataset, and evaluates the run in seconds.
python scripts/extraction_evaluation_lab.py --corpus corpora/extraction_eval_lab --force
The lab uses the bundled files under datasets/extraction_lab/items and writes the generated dataset to
datasets/extraction_lab_output.json by default. The command output includes the evaluation artifact path so you can
inspect the metrics immediately.
Lab walkthrough
Run the lab:
python scripts/extraction_evaluation_lab.py --corpus corpora/extraction_eval_lab --force
Inspect the generated dataset:
cat datasets/extraction_lab_output.json
The dataset is small and deterministic. Each entry maps a corpus item to the expected extracted text.
Inspect the evaluation output:
cat corpora/extraction_eval_lab/analysis/evaluation/extraction/<snapshot_id>/output.json
The output includes:
Coverage counts for present, empty, and missing extracted text.
Processable fraction for the extractor configuration.
Average similarity between expected and extracted text.
Compare metrics to raw items:
The lab includes a Markdown note, a plain text file, and a blank Markdown note. The blank note yields empty extracted text, which should be reflected in the coverage metrics. Because the expected text matches the extracted text for the non-empty items, the similarity score is 1.0 for those items.
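The similarity metric itself is computed inside Biblicus; as a rough stand-in to build intuition, difflib's SequenceMatcher shows why identical strings score 1.0 and an empty extraction scores 0.0:

```python
from difflib import SequenceMatcher

def text_similarity(expected, extracted):
    """Ratio in [0, 1]; 1.0 means the strings match exactly."""
    return SequenceMatcher(None, expected, extracted).ratio()

text_similarity("Hello world", "Hello world")  # exact match -> 1.0
text_similarity("Quarterly results", "")       # empty extraction -> 0.0
```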
Interpretation tips
Use coverage metrics to detect extractors that skip or fail on specific media types.
Use accuracy metrics to compare competing extractors on labeled samples.
Track processable fraction before optimizing quality so you know what fraction of the corpus is actually evaluated.
Common pitfalls
Evaluating a run with a dataset built from a different corpus.
Forgetting to record the extraction snapshot reference in experiment logs.
Comparing runs with different label sets or dataset sizes.
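The last pitfall can be caught mechanically before you compare numbers. A sketch, assuming the artifact layout shown above (check_comparable is a hypothetical helper, not part of Biblicus):

```python
import json
from pathlib import Path

def check_comparable(output_a, output_b):
    """Return reasons why two evaluation artifacts should not be compared."""
    a = json.loads(Path(output_a).read_text())
    b = json.loads(Path(output_b).read_text())
    problems = []
    if a["dataset"]["name"] != b["dataset"]["name"]:
        problems.append("different datasets")
    if a["dataset"]["items"] != b["dataset"]["items"]:
        problems.append("different dataset sizes")
    return problems  # empty list means the comparison is fair
```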