Benchmarking Quickstart Guide
Get started with benchmarking Biblicus extraction pipelines in under 10 minutes.
Prerequisites
Biblicus installed: pip install -e .
Python 3.9 or higher
Optional dependencies based on pipelines you want to test:
pip install "biblicus[paddleocr]" for PaddleOCR
pip install "biblicus[docling]" for Docling VLMs
pip install "biblicus[unstructured]" for Unstructured.io
Quick Start (5 minutes)
Step 1: Download Benchmark Dataset
python scripts/download_funsd_samples.py
This downloads 20 FUNSD form images into corpora/funsd_benchmark/ with ground truth text files.
What is FUNSD?
Form Understanding in Noisy Scanned Documents
199 annotated scanned forms
Real-world documents with noise and handwriting
Industry-standard OCR benchmark
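After the download finishes, a short script can sanity-check the corpus layout (paths follow the description above; this is a convenience sketch, not part of Biblicus):

```python
from pathlib import Path

corpus_dir = Path("corpora/funsd_benchmark")
gt_dir = corpus_dir / "metadata" / "funsd_ground_truth"

images = sorted(corpus_dir.glob("*.png"))
truths = sorted(gt_dir.glob("*.txt")) if gt_dir.exists() else []
print(f"Images: {len(images)}, ground truth files: {len(truths)}")

# Every image should have a matching ground truth file
missing = [p.stem for p in images if not (gt_dir / f"{p.stem}.txt").exists()]
if missing:
    print(f"Missing ground truth for: {missing}")
```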
Step 2: Run Quick Benchmark
Run a quick benchmark on a subset of pipelines:
python scripts/benchmark_all_pipelines.py \
--corpus-path corpora/funsd_benchmark \
--config configs/benchmark/quick.yaml \
--output results/quick_benchmark.json
What this does:
Tests 8+ extraction pipelines
Evaluates on 20 documents (~5-10 minutes)
Generates comprehensive comparison report
Outputs results to results/quick_benchmark.json
Step 3: View Results
Console summary:
cat results/quick_benchmark.json | jq '.pipelines[] | {name, f1: .metrics.set_based.avg_f1, recall: .metrics.set_based.avg_recall}'
Full report:
cat results/quick_benchmark.json | jq '.'
Example output:
{
"name": "paddleocr",
"f1": 0.787,
"recall": 0.782
}
{
"name": "docling-smol",
"f1": 0.728,
"recall": 0.675
}
Interpretation:
PaddleOCR has the highest F1 (0.787) → Best overall accuracy
Recall shows how much of the ground-truth text each pipeline recovers
See Metrics Reference for detailed explanations.
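The same ranking can be produced in Python. The sketch below uses inline data matching the example output above; with real results, load results/quick_benchmark.json instead, as shown in the comment:

```python
import json

# Illustrative data mirroring the benchmark result structure
results = {
    "pipelines": [
        {"name": "paddleocr",
         "metrics": {"set_based": {"avg_f1": 0.787, "avg_recall": 0.782}}},
        {"name": "docling-smol",
         "metrics": {"set_based": {"avg_f1": 0.728, "avg_recall": 0.675}}},
    ]
}
# With real results:
# with open("results/quick_benchmark.json") as f:
#     results = json.load(f)

# Rank pipelines by average F1, best first
ranked = sorted(results["pipelines"],
                key=lambda p: p["metrics"]["set_based"]["avg_f1"],
                reverse=True)
for p in ranked:
    m = p["metrics"]["set_based"]
    print(f"{p['name']:<15} F1={m['avg_f1']:.3f} recall={m['avg_recall']:.3f}")
```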
Standard Benchmark (30-60 minutes)
For more thorough validation before release:
python scripts/benchmark_all_pipelines.py \
--corpus-path corpora/funsd_benchmark \
--config configs/benchmark/standard.yaml \
--output results/standard_benchmark.json
Differences from quick:
Tests on 50 forms (vs 20)
More iterations for stable results
Takes 30-60 minutes
Recommended before major releases
Full Benchmark (2-4 hours)
For comprehensive evaluation:
python scripts/benchmark_all_pipelines.py \
--corpus-path corpora/funsd_benchmark \
--config configs/benchmark/full.yaml \
--output results/full_benchmark.json
Differences:
All 199 FUNSD documents
Exhaustive evaluation
Takes 2-4 hours
For research and paper publication
Benchmark Specific Pipelines
Single Pipeline Benchmark
Test just one pipeline:
python scripts/evaluate_ocr_pipeline.py \
--corpus corpora/funsd_benchmark \
--config configs/ocr-paddleocr.yaml \
--output results/paddleocr_results.json
Compare Two Pipelines
Direct comparison between two approaches:
python scripts/benchmark_heron_vs_paddleocr.py \
--corpus corpora/funsd_benchmark \
--output results/heron_vs_paddleocr.json
This script specifically compares:
Heron + Tesseract: Maximum recall, more noise
PaddleOCR: Best F1, balanced accuracy
Quick Layout-Aware Validation
Test layout-aware pipeline quickly:
python scripts/quick_benchmark_layout_aware.py
Validates the layout-aware Tesseract pipeline on a small subset.
Understanding Results
Key Metrics to Check
For Forms (FUNSD):
F1 Score - Overall accuracy (target: ≥0.75)
Recall - Completeness (target: ≥0.70)
WER - Reading order quality (target: ≤0.60)
For Receipts:
F1 Score - Entity extraction accuracy (target: ≥0.80)
Precision - Clean output (target: ≥0.75)
For Academic Papers:
LCS Ratio - Reading order preservation (target: ≥0.75)
Bigram Overlap - Column mixing detection (target: ≥0.60)
See Metrics Reference for detailed explanations.
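The set-based scores are ordinary token-set precision, recall, and F1. A minimal sketch of the formulas (an illustration only, not Biblicus's internal implementation, which may tokenize differently):

```python
def set_based_scores(predicted: str, truth: str) -> dict:
    """Token-set precision, recall, and F1 between two texts."""
    pred_tokens = set(predicted.lower().split())
    true_tokens = set(truth.lower().split())
    if not pred_tokens or not true_tokens:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    overlap = len(pred_tokens & true_tokens)
    precision = overlap / len(pred_tokens)  # how much output is correct
    recall = overlap / len(true_tokens)     # how much truth was found
    f1 = (2 * precision * recall / (precision + recall)) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

scores = set_based_scores("invoice total 42.00 date", "invoice total 42.00")
print(scores)
```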
Result Structure
{
"benchmark_timestamp": "2026-02-13T12:00:00Z",
"corpus_path": "corpora/funsd_benchmark",
"total_pipelines": 8,
"successful_pipelines": 8,
"failed_pipelines": 0,
"pipelines": [
{
"name": "paddleocr",
"snapshot_id": "abc123...",
"success": true,
"metrics": {
"set_based": {
"avg_f1": 0.787,
"avg_precision": 0.792,
"avg_recall": 0.782
},
"order_aware": {
"avg_wer": 0.533,
"avg_sequence_accuracy": 0.031,
"avg_lcs_ratio": 0.621
},
"ngram": {
"avg_bigram_overlap": 0.521,
"avg_trigram_overlap": 0.412
}
},
"total_documents": 20,
"processing_time": 145.3
}
],
"best_performers": {
"best_f1": "paddleocr",
"best_sequence_accuracy": "docling-smol",
"lowest_wer": "paddleocr",
"best_bigram": "heron-tesseract"
}
}
Exporting Results
To CSV for spreadsheet analysis:
import csv
import json

# Load benchmark results
with open("results/quick_benchmark.json") as f:
    results = json.load(f)

# Write one summary row per pipeline
with open("results/summary.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "avg_f1", "avg_precision", "avg_recall"])
    for pipeline in results["pipelines"]:
        set_based = pipeline["metrics"]["set_based"]
        writer.writerow([
            pipeline["name"],
            set_based["avg_f1"],
            set_based["avg_precision"],
            set_based["avg_recall"],
        ])
To JSON for programmatic access:
Results are already in JSON format. Use jq for filtering:
# Get F1 scores only
jq '.pipelines[] | {name, f1: .metrics.set_based.avg_f1}' results/quick_benchmark.json
# Filter by F1 > 0.70
jq '.pipelines[] | select(.metrics.set_based.avg_f1 > 0.70)' results/quick_benchmark.json
# Best performer
jq '.best_performers' results/quick_benchmark.json
Customizing Benchmarks
Create Custom Benchmark Config
Create configs/benchmark/my-benchmark.yaml:
# Benchmark configuration
mode: custom
categories:
- name: forms
corpus_path: corpora/funsd_benchmark
ground_truth_dir: metadata/funsd_ground_truth
primary_metric: f1
document_count: 30 # Subset of documents
pipelines:
- configs/baseline-ocr.yaml
- configs/ocr-paddleocr.yaml
- configs/my-custom-pipeline.yaml
Run with:
python scripts/benchmark_all_pipelines.py \
--config configs/benchmark/my-benchmark.yaml \
--output results/my_benchmark.json
Add Custom Pipeline
Create pipeline config in configs/my-custom-pipeline.yaml:
extractor_id: pipeline
config:
stages:
- extractor_id: heron-layout
config:
model_variant: "101"
confidence_threshold: 0.7
- extractor_id: ocr-paddleocr-vl
config:
use_layout_metadata: true
lang: en
Test pipeline manually:
from pathlib import Path
from biblicus import Corpus
from biblicus.extraction import build_extraction_snapshot
from biblicus.evaluation.ocr_benchmark import OCRBenchmark
import yaml
# Load config
with open("configs/my-custom-pipeline.yaml") as f:
config = yaml.safe_load(f)
# Build extraction
corpus = Corpus(Path("corpora/funsd_benchmark"))
snapshot = build_extraction_snapshot(
corpus,
extractor_id=config["extractor_id"],
configuration_name="my-custom-pipeline",
configuration=config["config"]
)
# Evaluate
benchmark = OCRBenchmark(corpus)
report = benchmark.evaluate_extraction(
snapshot_reference=snapshot.snapshot_id,
pipeline_config=config
)
# Print results
report.print_summary()
Add to benchmark suite:
Edit scripts/benchmark_all_pipelines.py:
PIPELINE_CONFIGS = [
"configs/baseline-ocr.yaml",
"configs/ocr-paddleocr.yaml",
# ... existing configs ...
"configs/my-custom-pipeline.yaml", # Add yours
]
Benchmarking Custom Datasets
Prepare Your Dataset
Create corpus directory:
mkdir -p my_corpus/metadata/ground_truth
Add documents:
# Copy your images/PDFs
cp /path/to/documents/* my_corpus/
Create ground truth files:
For each document, create a text file with the expected extracted text:
# my_corpus/metadata/ground_truth/<document-id>.txt
echo "Expected text from document" > my_corpus/metadata/ground_truth/doc1.txt
Ingest into corpus:
from pathlib import Path
from biblicus import Corpus
corpus = Corpus.init(Path("my_corpus"))
for doc_path in Path("my_corpus").glob("*.png"):
if not doc_path.stem.startswith("."):
corpus.ingest_file(doc_path, tags=["benchmark"])
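Before running the benchmark, it helps to confirm that every document has a matching ground truth file. check_ground_truth below is a hypothetical helper, not part of the Biblicus API:

```python
from pathlib import Path

def check_ground_truth(corpus_dir: Path, gt_dir: Path, ext: str = ".png") -> list:
    """Return document stems that lack a ground truth text file."""
    docs = [p for p in corpus_dir.glob(f"*{ext}") if not p.stem.startswith(".")]
    return [p.stem for p in docs if not (gt_dir / f"{p.stem}.txt").exists()]

missing = check_ground_truth(Path("my_corpus"),
                             Path("my_corpus/metadata/ground_truth"))
if missing:
    print(f"{len(missing)} document(s) missing ground truth: {missing[:5]}")
else:
    print("All documents have ground truth files.")
```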
Run Benchmark
python scripts/benchmark_all_pipelines.py \
--corpus-path my_corpus \
--ground-truth-dir my_corpus/metadata/ground_truth \
--config configs/benchmark/quick.yaml \
--output results/my_corpus_benchmark.json
Troubleshooting
Common Issues
Error: “FUNSD dataset not found”
# Solution: Download the dataset first
python scripts/download_funsd_samples.py
Error: “Pipeline failed: paddleocr not installed”
# Solution: Install the required extra
pip install "biblicus[paddleocr]"
Error: “Ground truth file not found”
# Solution: Ensure ground truth files match document IDs
ls corpora/funsd_benchmark/metadata/funsd_ground_truth/
Slow benchmark performance:
# Solution: Use quick config for development
python scripts/benchmark_all_pipelines.py \
--config configs/benchmark/quick.yaml # Not standard or full
Debugging Failed Pipelines
Check the detailed error:
import json
with open("results/quick_benchmark.json") as f:
results = json.load(f)
for pipeline in results["pipelines"]:
if not pipeline["success"]:
print(f"Failed: {pipeline['name']}")
print(f"Error: {pipeline.get('error', 'Unknown error')}")
Memory Issues
If benchmarking runs out of memory:
Reduce batch size in benchmark config
Use quick config with fewer documents
Test one pipeline at a time
# Instead of all pipelines:
python scripts/evaluate_ocr_pipeline.py \
--corpus corpora/funsd_benchmark \
--config configs/ocr-paddleocr.yaml
Next Steps
Now that you’ve run your first benchmark:
Understand metrics - Learn what each metric means
Explore pipelines - See all available pipelines and their trade-offs
View current results - Compare your results to latest benchmarks
Deep dive: OCR Benchmarking - Comprehensive guide to OCR evaluation
Deep dive: Multi-Category Framework - Architecture and design
Benchmark Modes Summary
| Mode | Duration | Documents | Use Case | Config |
|---|---|---|---|---|
| Quick | 5-10 min | 20 forms | Development iteration | configs/benchmark/quick.yaml |
| Standard | 30-60 min | 50 forms | Release validation | configs/benchmark/standard.yaml |
| Full | 2-4 hours | All (199 forms) | Research/publication | configs/benchmark/full.yaml |
CLI Reference
Benchmark All Pipelines
python scripts/benchmark_all_pipelines.py \
--corpus-path CORPUS_PATH \
--config CONFIG_FILE \
--output OUTPUT_FILE
Evaluate Single Pipeline
python scripts/evaluate_ocr_pipeline.py \
--corpus CORPUS_PATH \
--config PIPELINE_CONFIG \
--output OUTPUT_FILE
Compare Two Pipelines
python scripts/benchmark_heron_vs_paddleocr.py \
--corpus CORPUS_PATH \
--output OUTPUT_FILE
Quick Layout-Aware Test
python scripts/quick_benchmark_layout_aware.py
Resources
Benchmarking Overview - Platform introduction
Pipeline Catalog - All available pipelines
Metrics Reference - Understanding evaluation metrics
Current Results - Latest benchmark findings
Benchmark configs: configs/benchmark/
Scripts: scripts/benchmark_*.py
Results: results/