OCR Pipeline Benchmarking Guide
Practical how-to guide for benchmarking OCR pipelines in Biblicus using labeled ground truth data.
New to benchmarking? Start with the Benchmarking Overview for a platform introduction, or jump to the Quickstart Guide for step-by-step instructions.
Table of Contents
Reference Documentation:
Pipeline Catalog - All available pipelines with performance data
Metrics Reference - Detailed metric explanations
Current Results - Latest benchmark findings
Overview
This guide provides practical instructions for running OCR benchmarks. It covers:
Setting up benchmark datasets
Running evaluation scripts
Creating custom pipelines
Analyzing results
Troubleshooting common issues
For architectural details and the multi-category framework, see Document Understanding Benchmark Framework.
Key Features:
Multiple evaluation metrics (F1, recall, precision, WER, sequence accuracy)
Support for any extraction pipeline
Per-document and aggregate results
JSON/CSV export for analysis
Comparison between pipeline configurations
Quick Start
# 1. Download the FUNSD benchmark dataset
python scripts/download_funsd_samples.py
# 2. Run benchmark on all built-in pipelines
python scripts/benchmark_all_pipelines.py
# 3. View results
cat results/final_benchmark.json | jq '.pipelines[] | {name, f1: .metrics.set_based.avg_f1}'
For detailed step-by-step instructions, see the Quickstart Guide.
Benchmark Dataset
FUNSD Dataset
What is FUNSD?
Form Understanding in Noisy Scanned Documents
199 annotated scanned form images
31,485 words with ground truth OCR text
Real-world noisy scanned documents (not born-digital)
Download:
python scripts/download_funsd_samples.py
This will:
Download the FUNSD dataset from the official source
Extract 20 test forms into corpora/funsd_benchmark/
Create ground truth files in metadata/funsd_ground_truth/
Tag documents with ["funsd", "scanned", "ground-truth"]
Dataset structure:
corpora/funsd_benchmark/
├── metadata/
│ ├── config.json
│ ├── catalog.json
│ └── funsd_ground_truth/
│ ├── <document-id>.txt # Ground truth OCR text
│ └── ...
├── <document-id>--82092117.png # Scanned form image
└── ...
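Given this layout, the ground truth files can be read into memory with a short sketch. This helper is illustrative, not part of the Biblicus API; it assumes plain UTF-8 text files named by document id, as shown above.

```python
# Illustrative helper (not part of the Biblicus API): load ground truth
# text files from the layout above into a dict keyed by document id.
from pathlib import Path

def load_ground_truth(gt_dir):
    """Map <document-id> -> ground truth text for every .txt file."""
    return {
        path.stem: path.read_text(encoding="utf-8")
        for path in sorted(Path(gt_dir).glob("*.txt"))
    }
```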
Using Your Own Dataset
To benchmark on custom documents:
Prepare ground truth files:
corpus_dir/metadata/ground_truth/
├── <document-id>.txt
└── ...
Ingest documents:
from biblicus import Corpus
from pathlib import Path

corpus = Corpus(Path("my_corpus"))
corpus.ingest_file("document.png", tags=["benchmark"])
Run evaluation:
from biblicus.evaluation.ocr_benchmark import OCRBenchmark

benchmark = OCRBenchmark(corpus)
report = benchmark.evaluate_extraction(
    snapshot_reference="<snapshot-id>",
    ground_truth_dir=corpus.root / "metadata" / "ground_truth"
)
Available Pipelines
Biblicus includes 8+ pre-configured extraction pipelines. For complete details on each pipeline, including:
Performance metrics and trade-offs
Configuration examples
When to use each pipeline
Installation requirements
See the Pipeline Catalog.
Quick summary:
PaddleOCR (F1: 0.787) - Best overall accuracy
Docling-Smol (F1: 0.728) - Tables & formulas
Heron + Tesseract (Recall: 0.810) - Maximum extraction
Baseline Tesseract, RapidOCR, Unstructured, and more
Understanding Metrics
Biblicus uses three categories of metrics:
Set-based: F1, Precision, Recall (word-finding accuracy)
Order-aware: WER, LCS Ratio (reading order quality)
N-gram: Bigram, Trigram overlap (local ordering)
For detailed explanations, interpretations, and use case recommendations, see the Metrics Reference.
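To make the three metric families concrete, here is a pure-Python sketch of one metric from each. It is a simplified illustration, not the Biblicus implementation: the guide notes that WER is computed with the editdistance package, but a word-level Levenshtein distance is inlined here so the snippet is self-contained.

```python
# Simplified sketches of the three metric families (illustrative only).
from collections import Counter

def set_based_scores(gt_words, ocr_words):
    # Set-based: how many ground truth words were found, regardless of order.
    matched = sum((Counter(gt_words) & Counter(ocr_words)).values())
    precision = matched / max(len(ocr_words), 1)
    recall = matched / max(len(gt_words), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def word_error_rate(gt_words, ocr_words):
    # Order-aware: word-level Levenshtein distance over the ground truth length.
    prev = list(range(len(ocr_words) + 1))
    for i, gt_word in enumerate(gt_words, 1):
        row = [i]
        for j, ocr_word in enumerate(ocr_words, 1):
            row.append(min(prev[j] + 1,                           # deletion
                           row[j - 1] + 1,                        # insertion
                           prev[j - 1] + (gt_word != ocr_word)))  # substitution
        prev = row
    return prev[-1] / max(len(gt_words), 1)

def bigram_overlap(gt_words, ocr_words):
    # N-gram: fraction of ground truth bigrams preserved in the extraction.
    gt_bigrams = Counter(zip(gt_words, gt_words[1:]))
    ocr_bigrams = Counter(zip(ocr_words, ocr_words[1:]))
    return sum((gt_bigrams & ocr_bigrams).values()) / max(sum(gt_bigrams.values()), 1)
```

Note how one substituted word ("fox" vs. "fax") lowers all three families at once, while a shuffled word order hurts only the order-aware and n-gram scores.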
Running Benchmarks
Benchmark All Pipelines
python scripts/benchmark_all_pipelines.py
What it does:
Loads all pipeline configs from configs/
Builds extraction snapshots for each pipeline
Evaluates against FUNSD ground truth
Generates comprehensive comparison report
Output:
results/final_benchmark.json - Full results with all metrics
Console output with summary table
Example output:
========================================
FINAL BENCHMARK RESULTS
========================================
| Rank | Pipeline | F1 | Recall | WER | Seq Acc |
|------|-------------------|-------|--------|-------|---------|
| 1 | paddleocr | 0.787 | 0.782 | 0.533 | 0.031 |
| 2 | docling-smol | 0.728 | 0.675 | 0.645 | 0.021 |
| 3 | unstructured | 0.649 | 0.626 | 0.598 | 0.014 |
| 4 | baseline-ocr | 0.607 | 0.599 | 0.628 | 0.013 |
Benchmark Single Pipeline
# Using config file
python scripts/evaluate_ocr_pipeline.py \
--corpus corpora/funsd_benchmark \
--config configs/heron-tesseract.yaml \
--output results/heron_tesseract.json
# Or specify inline
python -c "
from pathlib import Path
from biblicus import Corpus
from biblicus.evaluation.ocr_benchmark import OCRBenchmark
from biblicus.extraction import build_extraction_snapshot
corpus = Corpus(Path('corpora/funsd_benchmark').resolve())
# Build snapshot
config = {
    'extractor_id': 'pipeline',
    'config': {
        'stages': [
            {'extractor_id': 'heron-layout', 'config': {'model_variant': '101'}},
            {'extractor_id': 'ocr-tesseract', 'config': {'use_layout_metadata': True}}
        ]
    }
}
snapshot = build_extraction_snapshot(
    corpus,
    extractor_id='pipeline',
    configuration_name='heron-tesseract',
    configuration=config['config']
)
# Evaluate
benchmark = OCRBenchmark(corpus)
report = benchmark.evaluate_extraction(
    snapshot_reference=snapshot.snapshot_id,
    pipeline_config=config
)
report.print_summary()
report.to_json(Path('results/heron_tesseract.json'))
"
Compare Two Pipelines
python scripts/compare_pipelines.py \
--baseline configs/baseline-ocr.yaml \
--experimental configs/heron-tesseract.yaml \
--corpus corpora/funsd_benchmark \
--output results/comparison.json
Custom Pipelines
Create Custom Pipeline Config
Create YAML config:
# configs/my-custom-pipeline.yaml
extractor_id: pipeline
config:
  stages:
    # Step 1: Layout detection (optional)
    - extractor_id: heron-layout
      config:
        model_variant: "101"
        confidence_threshold: 0.7
    # Step 2: OCR
    - extractor_id: ocr-tesseract
      config:
        use_layout_metadata: true
        min_confidence: 0.5
        lang: eng
        psm: 3
    # Step 3: Post-processing (optional)
    - extractor_id: clean-gibberish  # If implemented
      config:
        strictness: medium
Add to benchmark script:
# Edit scripts/benchmark_all_pipelines.py
PIPELINE_CONFIGS = [
    # ... existing configs ...
    "configs/my-custom-pipeline.yaml",
]
Run benchmark:
python scripts/benchmark_all_pipelines.py
Benchmark Custom Extractor
If you’ve created a new extractor:
Register in __init__.py:
# src/biblicus/extractors/__init__.py
from .my_extractor import MyExtractor

extractors: Dict[str, TextExtractor] = {
    # ... existing extractors ...
    MyExtractor.extractor_id: MyExtractor(),
}
Create config:
# configs/my-extractor.yaml
extractor_id: my-extractor
config:
  param1: value1
  param2: value2
Benchmark:
python scripts/evaluate_ocr_pipeline.py \
--corpus corpora/funsd_benchmark \
--config configs/my-extractor.yaml \
--output results/my_extractor.json
Results Analysis
JSON Output Structure
{
  "evaluation_timestamp": "2026-02-03T23:00:00Z",
  "corpus_path": "/path/to/corpus",
  "pipeline_configuration": { ... },
  "total_documents": 20,
  "aggregate_metrics": {
    "avg_precision": 0.625,
    "avg_recall": 0.599,
    "avg_f1": 0.607,
    "median_f1": 0.655,
    "avg_word_error_rate": 0.628,
    "avg_sequence_accuracy": 0.013,
    "avg_bigram_overlap": 0.350
  },
  "per_document_results": [
    {
      "document_id": "abc123...",
      "image_path": "abc123...png",
      "ground_truth_word_count": 134,
      "extracted_word_count": 135,
      "metrics": {
        "precision": 0.615,
        "recall": 0.619,
        "f1_score": 0.617,
        "word_error_rate": 3.056,
        "sequence_accuracy": 0.043
      }
    }
  ]
}
CSV Output
Per-document results are exported to CSV for spreadsheet analysis:
document_id,image_path,gt_word_count,ocr_word_count,precision,recall,f1_score,wer
abc123...,abc123.png,134,135,0.615,0.619,0.617,3.056
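The per-document CSV can also be scanned programmatically. This sketch uses the stdlib csv module to flag documents whose F1 falls below a threshold; the column names follow the header shown above, and the function name is illustrative, not part of Biblicus.

```python
# Sketch: scan the exported per-document CSV for low-scoring documents.
# Column names follow the CSV header shown in this guide.
import csv

def low_f1_documents(csv_path, threshold=0.5):
    """Return (document_id, f1_score) pairs below the threshold."""
    with open(csv_path, newline="") as f:
        return [
            (row["document_id"], float(row["f1_score"]))
            for row in csv.DictReader(f)
            if float(row["f1_score"]) < threshold
        ]
```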
Analyzing Results
Find best pipeline:
cat results/final_benchmark.json | jq '.pipelines | sort_by(-.metrics.set_based.avg_f1) | .[0] | {name, f1: .metrics.set_based.avg_f1}'
Compare reading order quality:
cat results/final_benchmark.json | jq '.pipelines | sort_by(-.metrics.order_aware.avg_sequence_accuracy) | .[] | {name, seq_acc: .metrics.order_aware.avg_sequence_accuracy}'
Find documents with low accuracy:
cat results/my_pipeline.json | jq '.per_document_results[] | select(.metrics.f1_score < 0.5) | {doc: .document_id[:16], f1: .metrics.f1_score}'
Export to CSV for Excel:
import json
import pandas as pd

with open('results/final_benchmark.json') as f:
    data = json.load(f)

# Create DataFrame
rows = []
for pipeline in data['pipelines']:
    rows.append({
        'name': pipeline['name'],
        'f1': pipeline['metrics']['set_based']['avg_f1'],
        'recall': pipeline['metrics']['set_based']['avg_recall'],
        'precision': pipeline['metrics']['set_based']['avg_precision'],
        'wer': pipeline['metrics']['order_aware']['avg_word_error_rate'],
        'seq_acc': pipeline['metrics']['order_aware']['avg_sequence_accuracy'],
    })

df = pd.DataFrame(rows)
df.to_csv('results/summary.csv', index=False)
Dependencies
Installing OCR Dependencies
Different pipelines require different dependencies:
Tesseract:
# macOS
brew install tesseract
# Ubuntu/Debian
sudo apt-get install tesseract-ocr
# Python
pip install pytesseract
PaddleOCR:
pip install "paddleocr" "paddlex[ocr]"
Heron (IBM Research):
pip install "transformers>=4.40.0" "torch>=2.0.0"
Docling:
pip install docling
RapidOCR:
pip install rapidocr-onnxruntime
Unstructured:
pip install "unstructured[image]"
Evaluation dependencies:
pip install editdistance # For WER calculation
Checking Dependencies
# Test Tesseract
tesseract --version
# Test PaddleOCR
python -c "from paddleocr import PPStructureV3; print('PaddleOCR OK')"
# Test Heron
python -c "from transformers import RTDetrV2ForObjectDetection; print('Heron OK')"
# Test Docling
python -c "from docling.document_converter import DocumentConverter; print('Docling OK')"
Troubleshooting
Common Issues
Issue: “Ground truth directory not found”
Solution: Run python scripts/download_funsd_samples.py first
Issue: “No text files found in snapshot”
Solution: Check that extraction succeeded. View snapshot manifest.
Issue: “Model download fails”
Solution: Check internet connection. Models download on first use:
- PaddleOCR: ~100MB
- Heron: ~150MB
- Docling: varies by model
Issue: “Out of memory”
Solution: Use smaller batch sizes or lighter models:
- Heron: Use "base" instead of "101"
- Reduce number of test documents
Issue: “Results don’t match expected performance”
Solution: Check:
- Correct ground truth files loaded
- Document types match pipeline strengths
- Dependencies installed correctly
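Several of these issues can be caught before a long benchmark run with a small pre-flight check. This is a hedged sketch based on the layout assumptions in this guide: ground truth lives under metadata/funsd_ground_truth/, and image filenames follow the <document-id>--<suffix>.png pattern shown in the dataset structure. The function name is illustrative.

```python
# Hedged pre-flight check (layout assumed from this guide): verify the
# ground truth directory exists and every corpus image has a matching
# ground truth file before kicking off a benchmark run.
from pathlib import Path

def preflight(corpus_dir, gt_subdir="metadata/funsd_ground_truth"):
    """Return document ids that have an image but no ground truth file."""
    corpus = Path(corpus_dir)
    gt_dir = corpus / gt_subdir
    if not gt_dir.is_dir():
        raise FileNotFoundError(f"Ground truth directory not found: {gt_dir}")
    gt_ids = {path.stem for path in gt_dir.glob("*.txt")}
    # Image names look like <document-id>--82092117.png in this guide.
    image_ids = {path.stem.split("--")[0] for path in corpus.glob("*.png")}
    return sorted(image_ids - gt_ids)
```

An empty return value means every image is covered; a non-empty list points at exactly the documents that would silently score zero.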
See Also
Benchmarking Documentation:
Benchmarking Overview - Platform introduction
Quickstart Guide - Step-by-step instructions
Pipeline Catalog - All available pipelines
Metrics Reference - Metric explanations
Current Results - Latest findings
Document Understanding Framework - Architecture details
Heron Implementation - Heron-specific details
Layout-Aware OCR Results - Layout analysis
External References: