OCR Pipeline Benchmarking Guide

Practical how-to guide for benchmarking OCR pipelines in Biblicus using labeled ground truth data.

New to benchmarking? Start with the Benchmarking Overview for a platform introduction, or jump to the Quickstart Guide for step-by-step instructions.

Table of Contents

  1. Overview

  2. Quick Start

  3. Benchmark Dataset

  4. Running Benchmarks

  5. Custom Pipelines

  6. Results Analysis

  7. Dependencies

  8. Troubleshooting


Overview

This guide provides practical instructions for running OCR benchmarks. It covers:

  • Setting up benchmark datasets

  • Running evaluation scripts

  • Creating custom pipelines

  • Analyzing results

  • Troubleshooting common issues

For architectural details and the multi-category framework, see Document Understanding Benchmark Framework.

Key Features:

  • Multiple evaluation metrics (F1, recall, precision, WER, sequence accuracy)

  • Support for any extraction pipeline

  • Per-document and aggregate results

  • JSON/CSV export for analysis

  • Comparison between pipeline configurations


Quick Start

# 1. Download the FUNSD benchmark dataset
python scripts/download_funsd_samples.py

# 2. Run benchmark on all built-in pipelines
python scripts/benchmark_all_pipelines.py

# 3. View results
cat results/final_benchmark.json | jq '.pipelines[] | {name, f1: .metrics.set_based.avg_f1}'

For detailed step-by-step instructions, see the Quickstart Guide.


Benchmark Dataset

FUNSD Dataset

What is FUNSD?

  • Form Understanding in Noisy Scanned Documents

  • 199 annotated scanned form images

  • 31,485 words with ground truth OCR text

  • Real-world noisy scanned documents (not born-digital)

Download:

python scripts/download_funsd_samples.py

This will:

  1. Download the FUNSD dataset from the official source

  2. Extract 20 test forms into corpora/funsd_benchmark/

  3. Create ground truth files in metadata/funsd_ground_truth/

  4. Tag documents with ["funsd", "scanned", "ground-truth"]

Dataset structure:

corpora/funsd_benchmark/
├── metadata/
│   ├── config.json
│   ├── catalog.json
│   └── funsd_ground_truth/
│       ├── <document-id>.txt                 # Ground truth OCR text
│       └── ...
├── <document-id>--82092117.png               # Scanned form image
└── ...

Using Your Own Dataset

To benchmark on custom documents:

  1. Prepare ground truth files:

    corpus_dir/metadata/ground_truth/
    ├── <document-id>.txt
    └── ...
    
  2. Ingest documents:

    from biblicus import Corpus
    from pathlib import Path
    
    corpus = Corpus(Path("my_corpus"))
    corpus.ingest_file("document.png", tags=["benchmark"])
    
  3. Run evaluation:

    from biblicus.evaluation.ocr_benchmark import OCRBenchmark
    
    benchmark = OCRBenchmark(corpus)
    report = benchmark.evaluate_extraction(
        snapshot_reference="<snapshot-id>",
        ground_truth_dir=corpus.root / "metadata" / "ground_truth"
    )
    

Available Pipelines

Biblicus includes 8+ pre-configured extraction pipelines. For complete details on each pipeline, including:

  • Performance metrics and trade-offs

  • Configuration examples

  • When to use each pipeline

  • Installation requirements

See the Pipeline Catalog.

Quick summary:

  • PaddleOCR (F1: 0.787) - Best overall accuracy

  • Docling-Smol (F1: 0.728) - Tables & formulas

  • Heron + Tesseract (Recall: 0.810) - Maximum extraction

  • Baseline Tesseract, RapidOCR, Unstructured, and more


Understanding Metrics

Biblicus uses three categories of metrics:

  • Set-based: F1, Precision, Recall (word-finding accuracy)

  • Order-aware: WER, LCS Ratio (reading order quality)

  • N-gram: Bigram, Trigram overlap (local ordering)

For detailed explanations, interpretations, and use case recommendations, see the Metrics Reference.
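The set-based metrics can be sketched in a few lines. The helper below is illustrative only (it is not the Biblicus implementation): it computes precision, recall, and F1 over the unique lowercased words of each text.

```python
def set_based_metrics(ground_truth: str, extracted: str) -> dict:
    """Precision/recall/F1 over unique lowercased words (illustrative sketch)."""
    gt_words = set(ground_truth.lower().split())
    ocr_words = set(extracted.lower().split())
    true_positives = len(gt_words & ocr_words)
    precision = true_positives / len(ocr_words) if ocr_words else 0.0
    recall = true_positives / len(gt_words) if gt_words else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# One misread word ("dve" for "due") costs both precision and recall.
print(set_based_metrics("invoice total due", "invoice total dve"))
```

Because these metrics treat the texts as word sets, they measure word-finding accuracy only; reading order is left to the order-aware metrics.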


Running Benchmarks

Benchmark All Pipelines

python scripts/benchmark_all_pipelines.py

What it does:

  1. Loads all pipeline configs from configs/

  2. Builds extraction snapshots for each pipeline

  3. Evaluates against FUNSD ground truth

  4. Generates comprehensive comparison report

Output:

  • results/final_benchmark.json - Full results with all metrics

  • Console output with summary table

Example output:

========================================
FINAL BENCHMARK RESULTS
========================================

| Rank | Pipeline          | F1    | Recall | WER   | Seq Acc |
|------|-------------------|-------|--------|-------|---------|
| 1    | paddleocr         | 0.787 | 0.782  | 0.533 | 0.031   |
| 2    | docling-smol      | 0.728 | 0.675  | 0.645 | 0.021   |
| 3    | unstructured      | 0.649 | 0.626  | 0.598 | 0.014   |
| 4    | baseline-ocr      | 0.607 | 0.599  | 0.628 | 0.013   |

Benchmark Single Pipeline

# Using config file
python scripts/evaluate_ocr_pipeline.py \
  --corpus corpora/funsd_benchmark \
  --config configs/heron-tesseract.yaml \
  --output results/heron_tesseract.json

# Or specify inline
python -c "
from pathlib import Path
from biblicus import Corpus
from biblicus.evaluation.ocr_benchmark import OCRBenchmark
from biblicus.extraction import build_extraction_snapshot

corpus = Corpus(Path('corpora/funsd_benchmark').resolve())

# Build snapshot
config = {
    'extractor_id': 'pipeline',
    'config': {
        'stages': [
            {'extractor_id': 'heron-layout', 'config': {'model_variant': '101'}},
            {'extractor_id': 'ocr-tesseract', 'config': {'use_layout_metadata': True}}
        ]
    }
}

snapshot = build_extraction_snapshot(
    corpus,
    extractor_id='pipeline',
    configuration_name='heron-tesseract',
    configuration=config['config']
)

# Evaluate
benchmark = OCRBenchmark(corpus)
report = benchmark.evaluate_extraction(
    snapshot_reference=snapshot.snapshot_id,
    pipeline_config=config
)

report.print_summary()
report.to_json(Path('results/heron_tesseract.json'))
"

Compare Two Pipelines

python scripts/compare_pipelines.py \
  --baseline configs/baseline-ocr.yaml \
  --experimental configs/heron-tesseract.yaml \
  --corpus corpora/funsd_benchmark \
  --output results/comparison.json

Custom Pipelines

Create Custom Pipeline Config

  1. Create YAML config:

# configs/my-custom-pipeline.yaml
extractor_id: pipeline
config:
  stages:
    # Step 1: Layout detection (optional)
    - extractor_id: heron-layout
      config:
        model_variant: "101"
        confidence_threshold: 0.7

    # Step 2: OCR
    - extractor_id: ocr-tesseract
      config:
        use_layout_metadata: true
        min_confidence: 0.5
        lang: eng
        psm: 3

    # Step 3: Post-processing (optional)
    - extractor_id: clean-gibberish  # If implemented
      config:
        strictness: medium
  2. Add to benchmark script:

# Edit scripts/benchmark_all_pipelines.py
PIPELINE_CONFIGS = [
    # ... existing configs ...
    "configs/my-custom-pipeline.yaml",
]
  3. Run benchmark:

python scripts/benchmark_all_pipelines.py
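Before wiring a new config into the benchmark script, it can help to sanity-check its shape after parsing the YAML. The validator below is hypothetical (it is not part of the Biblicus API); it only checks the pipeline-style structure shown above.

```python
def validate_pipeline_config(config: dict) -> list:
    """Return a list of problems found in a pipeline-style config dict."""
    problems = []
    if config.get("extractor_id") != "pipeline":
        problems.append("extractor_id should be 'pipeline'")
    stages = config.get("config", {}).get("stages")
    if not isinstance(stages, list) or not stages:
        problems.append("config.stages must be a non-empty list")
        return problems
    for i, stage in enumerate(stages):
        if "extractor_id" not in stage:
            problems.append(f"stage {i} is missing extractor_id")
    return problems

# Equivalent of a minimal single-stage YAML config after parsing.
example = {
    "extractor_id": "pipeline",
    "config": {"stages": [{"extractor_id": "ocr-tesseract", "config": {"lang": "eng"}}]},
}
print(validate_pipeline_config(example))  # an empty list means the shape looks OK
```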

Benchmark Custom Extractor

If you’ve created a new extractor:

  1. Register in __init__.py:

# src/biblicus/extractors/__init__.py
from typing import Dict

from .my_extractor import MyExtractor

extractors: Dict[str, TextExtractor] = {
    # ... existing extractors ...
    MyExtractor.extractor_id: MyExtractor(),
}
  2. Create config:

# configs/my-extractor.yaml
extractor_id: my-extractor
config:
  param1: value1
  param2: value2
  3. Benchmark:

python scripts/evaluate_ocr_pipeline.py \
  --corpus corpora/funsd_benchmark \
  --config configs/my-extractor.yaml \
  --output results/my_extractor.json

Results Analysis

JSON Output Structure

{
  "evaluation_timestamp": "2026-02-03T23:00:00Z",
  "corpus_path": "/path/to/corpus",
  "pipeline_configuration": { ... },
  "total_documents": 20,

  "aggregate_metrics": {
    "avg_precision": 0.625,
    "avg_recall": 0.599,
    "avg_f1": 0.607,
    "median_f1": 0.655,
    "avg_word_error_rate": 0.628,
    "avg_sequence_accuracy": 0.013,
    "avg_bigram_overlap": 0.350
  },

  "per_document_results": [
    {
      "document_id": "abc123...",
      "image_path": "abc123...png",
      "ground_truth_word_count": 134,
      "extracted_word_count": 135,
      "metrics": {
        "precision": 0.615,
        "recall": 0.619,
        "f1_score": 0.617,
        "word_error_rate": 3.056,
        "sequence_accuracy": 0.043
      }
    }
  ]
}
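Reports with this structure are easy to work with programmatically. The sketch below runs against an inline sample that mirrors the fields above; in practice, load a real file such as results/my_pipeline.json with json.load.

```python
import json

# Minimal report matching the structure above (values are illustrative).
report_json = '''
{
  "total_documents": 2,
  "aggregate_metrics": {"avg_f1": 0.607, "avg_recall": 0.599},
  "per_document_results": [
    {"document_id": "abc123", "metrics": {"f1_score": 0.617}},
    {"document_id": "def456", "metrics": {"f1_score": 0.597}}
  ]
}
'''
report = json.loads(report_json)

# Pull the aggregate F1 and find the worst-scoring document.
worst = min(report["per_document_results"], key=lambda d: d["metrics"]["f1_score"])
print(report["aggregate_metrics"]["avg_f1"], worst["document_id"])
```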

CSV Output

Per-document results are exported to CSV for spreadsheet analysis:

document_id,image_path,gt_word_count,ocr_word_count,precision,recall,f1_score,wer
abc123...,abc123.png,134,135,0.615,0.619,0.617,3.056
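The CSV can be consumed with the standard library alone. This sketch parses an inline copy of the row above; in practice, open the exported file and pass it to csv.DictReader.

```python
import csv
import io

# The CSV header above, with one illustrative row.
csv_text = (
    "document_id,image_path,gt_word_count,ocr_word_count,precision,recall,f1_score,wer\n"
    "abc123,abc123.png,134,135,0.615,0.619,0.617,3.056\n"
)
rows = list(csv.DictReader(io.StringIO(csv_text)))

# csv gives strings, so convert metric columns before aggregating.
avg_f1 = sum(float(r["f1_score"]) for r in rows) / len(rows)
print(f"{len(rows)} documents, mean F1 = {avg_f1:.3f}")
```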

Analyzing Results

Find best pipeline:

cat results/final_benchmark.json | jq '.pipelines | sort_by(-.metrics.set_based.avg_f1) | .[0] | {name, f1: .metrics.set_based.avg_f1}'

Compare reading order quality:

cat results/final_benchmark.json | jq '.pipelines | sort_by(-.metrics.order_aware.avg_sequence_accuracy) | .[] | {name, seq_acc: .metrics.order_aware.avg_sequence_accuracy}'

Find documents with low accuracy:

cat results/my_pipeline.json | jq '.per_document_results[] | select(.metrics.f1_score < 0.5) | {doc: .document_id[:16], f1: .metrics.f1_score}'

Export to CSV for Excel:

import json
import pandas as pd

with open('results/final_benchmark.json') as f:
    data = json.load(f)

# Create DataFrame
rows = []
for pipeline in data['pipelines']:
    rows.append({
        'name': pipeline['name'],
        'f1': pipeline['metrics']['set_based']['avg_f1'],
        'recall': pipeline['metrics']['set_based']['avg_recall'],
        'precision': pipeline['metrics']['set_based']['avg_precision'],
        'wer': pipeline['metrics']['order_aware']['avg_word_error_rate'],
        'seq_acc': pipeline['metrics']['order_aware']['avg_sequence_accuracy'],
    })

df = pd.DataFrame(rows)
df.to_csv('results/summary.csv', index=False)

Dependencies

Installing OCR Dependencies

Different pipelines require different dependencies:

Tesseract:

# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt-get install tesseract-ocr

# Python
pip install pytesseract

PaddleOCR:

pip install "paddleocr" "paddlex[ocr]"

Heron (IBM Research):

pip install "transformers>=4.40.0" "torch>=2.0.0"

Docling:

pip install docling

RapidOCR:

pip install rapidocr-onnxruntime

Unstructured:

pip install "unstructured[image]"

Evaluation dependencies:

pip install editdistance  # For WER calculation
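Word error rate, which editdistance is used for, is edit distance over word tokens divided by the number of reference words, so values above 1.0 are possible when the output diverges badly from the reference. The pure-Python dynamic-programming sketch below is equivalent for illustration (it is not the benchmark's implementation).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over word tokens, normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[len(hyp)] / len(ref) if ref else 0.0

# One substituted word out of three reference words -> WER of 1/3.
print(word_error_rate("total amount due", "total amovnt due"))
```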

Checking Dependencies

# Test Tesseract
tesseract --version

# Test PaddleOCR
python -c "from paddleocr import PPStructureV3; print('PaddleOCR OK')"

# Test Heron
python -c "from transformers import RTDetrV2ForObjectDetection; print('Heron OK')"

# Test Docling
python -c "from docling.document_converter import DocumentConverter; print('Docling OK')"

Troubleshooting

Common Issues

Issue: “Ground truth directory not found”

Solution: Run python scripts/download_funsd_samples.py first

Issue: “No text files found in snapshot”

Solution: Check that extraction succeeded. View snapshot manifest.

Issue: “Model download fails”

Solution: Check internet connection. Models download on first use:
- PaddleOCR: ~100MB
- Heron: ~150MB
- Docling: varies by model

Issue: “Out of memory”

Solution: Use smaller batch sizes or lighter models:
- Heron: Use "base" instead of "101"
- Reduce number of test documents

Issue: “Results don’t match expected performance”

Solution: Check:
- Correct ground truth files loaded
- Document types match pipeline strengths
- Dependencies installed correctly
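A quick way to catch ground-truth mismatches is to confirm that every corpus image has a matching ground-truth file. The helper below is hypothetical and assumes the <document-id>--<hash>.png layout shown in the dataset structure section.

```python
from pathlib import Path

def missing_ground_truth(corpus_root: Path, gt_dir: Path) -> list:
    """Return document ids that have an image but no ground-truth .txt file."""
    doc_ids = {p.name.split("--")[0] for p in corpus_root.glob("*.png")}
    gt_ids = {p.stem for p in gt_dir.glob("*.txt")}
    return sorted(doc_ids - gt_ids)

# Usage against a real corpus:
# root = Path("corpora/funsd_benchmark")
# print(missing_ground_truth(root, root / "metadata" / "funsd_ground_truth"))
```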
