Benchmarking Quickstart Guide

Get started with benchmarking Biblicus extraction pipelines in under 10 minutes.

Prerequisites

  • Biblicus installed: pip install -e .

  • Python 3.9 or higher

  • Optional dependencies based on pipelines you want to test:

    • pip install "biblicus[paddleocr]" for PaddleOCR

    • pip install "biblicus[docling]" for Docling VLMs

    • pip install "biblicus[unstructured]" for Unstructured.io

Quick Start (5-10 minutes)

Step 1: Download Benchmark Dataset

python scripts/download_funsd_samples.py

This downloads 20 FUNSD form images into corpora/funsd_benchmark/ with ground truth text files.
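
To confirm the download before running a benchmark, you can count the ground truth files. This is a minimal sketch; the ground truth directory path matches the one shown in the Troubleshooting section below:

from pathlib import Path

# Ground truth files live under the corpus metadata directory
gt_dir = Path("corpora/funsd_benchmark/metadata/funsd_ground_truth")
print(f"{len(list(gt_dir.glob('*.txt')))} ground truth files found")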

What is FUNSD?

  • Form Understanding in Noisy Scanned Documents

  • 199 annotated scanned forms

  • Real-world documents with noise and handwriting

  • Industry-standard OCR benchmark

Step 2: Run Quick Benchmark

Run a quick benchmark on a subset of pipelines:

python scripts/benchmark_all_pipelines.py \
  --corpus-path corpora/funsd_benchmark \
  --config configs/benchmark/quick.yaml \
  --output results/quick_benchmark.json

What this does:

  • Tests 8+ extraction pipelines

  • Evaluates on 20 documents (~5-10 minutes)

  • Generates a comprehensive comparison report

  • Outputs results to results/quick_benchmark.json

Step 3: View Results

Console summary:

jq '.pipelines[] | {name, f1: .metrics.set_based.avg_f1, recall: .metrics.set_based.avg_recall}' results/quick_benchmark.json

Full report:

jq '.' results/quick_benchmark.json

Example output:

{
  "name": "paddleocr",
  "f1": 0.787,
  "recall": 0.782
}
{
  "name": "docling-smol",
  "f1": 0.728,
  "recall": 0.675
}

Interpretation:

  • PaddleOCR has the highest F1 (0.787) → Best overall accuracy

  • Recall shows how much of the ground-truth text each pipeline recovers

See Metrics Reference for detailed explanations.
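
To rank every pipeline by F1 without reading the raw JSON, you can sort the pipelines list directly. This is a minimal sketch against the result structure documented under Understanding Results below:

import json

# Load the benchmark output and sort pipelines by F1, best first
with open("results/quick_benchmark.json") as f:
    results = json.load(f)

ranked = sorted(
    results["pipelines"],
    key=lambda p: p["metrics"]["set_based"]["avg_f1"],
    reverse=True,
)
for p in ranked:
    print(f"{p['name']}: F1={p['metrics']['set_based']['avg_f1']:.3f}")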


Standard Benchmark (30-60 minutes)

For more thorough validation before release:

python scripts/benchmark_all_pipelines.py \
  --corpus-path corpora/funsd_benchmark \
  --config configs/benchmark/standard.yaml \
  --output results/standard_benchmark.json

Differences from quick:

  • Tests on 50 forms (vs 20)

  • More iterations for stable results

  • Takes 30-60 minutes

  • Recommended before major releases


Full Benchmark (2-4 hours)

For comprehensive evaluation:

python scripts/benchmark_all_pipelines.py \
  --corpus-path corpora/funsd_benchmark \
  --config configs/benchmark/full.yaml \
  --output results/full_benchmark.json

Differences:

  • All 199 FUNSD documents

  • Exhaustive evaluation

  • Takes 2-4 hours

  • For research and paper publication


Benchmarking Specific Pipelines

Single Pipeline Benchmark

Test just one pipeline:

python scripts/evaluate_ocr_pipeline.py \
  --corpus corpora/funsd_benchmark \
  --config configs/ocr-paddleocr.yaml \
  --output results/paddleocr_results.json

Compare Two Pipelines

Direct comparison between two approaches:

python scripts/benchmark_heron_vs_paddleocr.py \
  --corpus corpora/funsd_benchmark \
  --output results/heron_vs_paddleocr.json

This script specifically compares:

  • Heron + Tesseract: Maximum recall, more noise

  • PaddleOCR: Best F1, balanced accuracy

Quick Layout-Aware Validation

Test layout-aware pipeline quickly:

python scripts/quick_benchmark_layout_aware.py

Validates the layout-aware Tesseract pipeline on a small subset.


Understanding Results

Key Metrics to Check

For Forms (FUNSD):

  1. F1 Score - Overall accuracy (target: ≥0.75)

  2. Recall - Completeness (target: ≥0.70)

  3. WER - Reading order quality (target: ≤0.60)

For Receipts:

  1. F1 Score - Entity extraction accuracy (target: ≥0.80)

  2. Precision - Clean output (target: ≥0.75)

For Academic Papers:

  1. LCS Ratio - Reading order preservation (target: ≥0.75)

  2. Bigram Overlap - Column mixing detection (target: ≥0.60)

See Metrics Reference for detailed explanations.

Result Structure

{
  "benchmark_timestamp": "2026-02-13T12:00:00Z",
  "corpus_path": "corpora/funsd_benchmark",
  "total_pipelines": 8,
  "successful_pipelines": 8,
  "failed_pipelines": 0,
  "pipelines": [
    {
      "name": "paddleocr",
      "snapshot_id": "abc123...",
      "success": true,
      "metrics": {
        "set_based": {
          "avg_f1": 0.787,
          "avg_precision": 0.792,
          "avg_recall": 0.782
        },
        "order_aware": {
          "avg_wer": 0.533,
          "avg_sequence_accuracy": 0.031,
          "avg_lcs_ratio": 0.621
        },
        "ngram": {
          "avg_bigram_overlap": 0.521,
          "avg_trigram_overlap": 0.412
        }
      },
      "total_documents": 20,
      "processing_time": 145.3
    }
  ],
  "best_performers": {
    "best_f1": "paddleocr",
    "best_sequence_accuracy": "docling-smol",
    "lowest_wer": "paddleocr",
    "best_bigram": "heron-tesseract"
  }
}
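
F1 is the harmonic mean of precision and recall, so reported scores can be sanity-checked by hand. Using the avg_precision and avg_recall values from the sample above:

# F1 as the harmonic mean of precision and recall
precision, recall = 0.792, 0.782
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.787, matching avg_f1 above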

Exporting Results

To CSV for spreadsheet analysis:

import csv
import json

# Load benchmark results
with open("results/quick_benchmark.json") as f:
    results = json.load(f)

# Write one row per pipeline with the headline set-based metrics
with open("results/quick_benchmark.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "f1", "precision", "recall"])
    for pipeline in results["pipelines"]:
        set_based = pipeline["metrics"]["set_based"]
        writer.writerow([
            pipeline["name"],
            set_based["avg_f1"],
            set_based["avg_precision"],
            set_based["avg_recall"],
        ])

To JSON for programmatic access:

Results are already in JSON format. Use jq for filtering:

# Get F1 scores only
jq '.pipelines[] | {name, f1: .metrics.set_based.avg_f1}' results/quick_benchmark.json

# Filter by F1 > 0.70
jq '.pipelines[] | select(.metrics.set_based.avg_f1 > 0.70)' results/quick_benchmark.json

# Best performer
jq '.best_performers' results/quick_benchmark.json

Customizing Benchmarks

Create Custom Benchmark Config

Create configs/benchmark/my-benchmark.yaml:

# Benchmark configuration
mode: custom
categories:
  - name: forms
    corpus_path: corpora/funsd_benchmark
    ground_truth_dir: metadata/funsd_ground_truth
    primary_metric: f1
    document_count: 30  # Subset of documents

pipelines:
  - configs/baseline-ocr.yaml
  - configs/ocr-paddleocr.yaml
  - configs/my-custom-pipeline.yaml

Run with:

python scripts/benchmark_all_pipelines.py \
  --config configs/benchmark/my-benchmark.yaml \
  --output results/my_benchmark.json

Add Custom Pipeline

  1. Create pipeline config in configs/my-custom-pipeline.yaml:

extractor_id: pipeline
config:
  stages:
    - extractor_id: heron-layout
      config:
        model_variant: "101"
        confidence_threshold: 0.7
    - extractor_id: ocr-paddleocr-vl
      config:
        use_layout_metadata: true
        lang: en

  2. Test pipeline manually:

from pathlib import Path
from biblicus import Corpus
from biblicus.extraction import build_extraction_snapshot
from biblicus.evaluation.ocr_benchmark import OCRBenchmark
import yaml

# Load config
with open("configs/my-custom-pipeline.yaml") as f:
    config = yaml.safe_load(f)

# Build extraction
corpus = Corpus(Path("corpora/funsd_benchmark"))
snapshot = build_extraction_snapshot(
    corpus,
    extractor_id=config["extractor_id"],
    configuration_name="my-custom-pipeline",
    configuration=config["config"]
)

# Evaluate
benchmark = OCRBenchmark(corpus)
report = benchmark.evaluate_extraction(
    snapshot_reference=snapshot.snapshot_id,
    pipeline_config=config
)

# Print results
report.print_summary()

  3. Add to benchmark suite:

Edit scripts/benchmark_all_pipelines.py:

PIPELINE_CONFIGS = [
    "configs/baseline-ocr.yaml",
    "configs/ocr-paddleocr.yaml",
    # ... existing configs ...
    "configs/my-custom-pipeline.yaml",  # Add yours
]

Benchmarking Custom Datasets

Prepare Your Dataset

  1. Create corpus directory:

mkdir -p my_corpus/metadata/ground_truth

  2. Add documents:

# Copy your images/PDFs
cp /path/to/documents/* my_corpus/

  3. Create ground truth files:

For each document, create a text file with the expected extracted text:

# my_corpus/metadata/ground_truth/<document-id>.txt
echo "Expected text from document" > my_corpus/metadata/ground_truth/doc1.txt

  4. Ingest into corpus:

from pathlib import Path
from biblicus import Corpus

corpus = Corpus.init(Path("my_corpus"))

for doc_path in Path("my_corpus").glob("*.png"):
    if not doc_path.stem.startswith("."):
        corpus.ingest_file(doc_path, tags=["benchmark"])
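
Before running the benchmark, it helps to confirm that every document has a matching ground truth file. This is a minimal sketch, assuming the <document-id>.txt naming convention and .png documents from the steps above:

from pathlib import Path

corpus_dir = Path("my_corpus")
gt_dir = corpus_dir / "metadata" / "ground_truth"

# Report any document without a matching <document-id>.txt file
for doc_path in corpus_dir.glob("*.png"):
    gt_path = gt_dir / f"{doc_path.stem}.txt"
    if not gt_path.exists():
        print(f"Missing ground truth for {doc_path.name}")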

Run Benchmark

python scripts/benchmark_all_pipelines.py \
  --corpus-path my_corpus \
  --ground-truth-dir my_corpus/metadata/ground_truth \
  --config configs/benchmark/quick.yaml \
  --output results/my_corpus_benchmark.json

Troubleshooting

Common Issues

Error: “FUNSD dataset not found”

# Solution: Download the dataset first
python scripts/download_funsd_samples.py

Error: “Pipeline failed: paddleocr not installed”

# Solution: Install the required extra
pip install "biblicus[paddleocr]"

Error: “Ground truth file not found”

# Solution: Ensure ground truth files match document IDs
ls corpora/funsd_benchmark/metadata/funsd_ground_truth/

Slow benchmark performance:

# Solution: Use quick config for development
python scripts/benchmark_all_pipelines.py \
  --config configs/benchmark/quick.yaml  # Not standard or full

Debugging Failed Pipelines

Check the detailed error:

import json

with open("results/quick_benchmark.json") as f:
    results = json.load(f)

for pipeline in results["pipelines"]:
    if not pipeline["success"]:
        print(f"Failed: {pipeline['name']}")
        print(f"Error: {pipeline.get('error', 'Unknown error')}")

Memory Issues

If benchmarking runs out of memory:

  1. Reduce batch size in benchmark config

  2. Use quick config with fewer documents

  3. Test one pipeline at a time

# Instead of all pipelines:
python scripts/evaluate_ocr_pipeline.py \
  --corpus corpora/funsd_benchmark \
  --config configs/ocr-paddleocr.yaml

Next Steps

Now that you’ve run your first benchmark:

  1. Understand metrics - Learn what each metric means

  2. Explore pipelines - See all available pipelines and their trade-offs

  3. View current results - Compare your results to latest benchmarks

  4. Deep dive: OCR Benchmarking - Comprehensive guide to OCR evaluation

  5. Deep dive: Multi-Category Framework - Architecture and design


Benchmark Modes Summary

Mode       Duration    Documents         Use Case                Config
Quick      5-10 min    20 forms          Development iteration   configs/benchmark/quick.yaml
Standard   30-60 min   50 forms          Release validation      configs/benchmark/standard.yaml
Full       2-4 hours   All (199 forms)   Research/publication    configs/benchmark/full.yaml


CLI Reference

Benchmark All Pipelines

python scripts/benchmark_all_pipelines.py \
  --corpus-path CORPUS_PATH \
  --config CONFIG_FILE \
  --output OUTPUT_FILE

Evaluate Single Pipeline

python scripts/evaluate_ocr_pipeline.py \
  --corpus CORPUS_PATH \
  --config PIPELINE_CONFIG \
  --output OUTPUT_FILE

Compare Two Pipelines

python scripts/benchmark_heron_vs_paddleocr.py \
  --corpus CORPUS_PATH \
  --output OUTPUT_FILE

Quick Layout-Aware Test

python scripts/quick_benchmark_layout_aware.py

Resources