# OCR Pipeline Benchmarking Guide

Practical how-to guide for benchmarking OCR pipelines in Biblicus using labeled ground truth data.

> **New to benchmarking?** Start with the [Benchmarking Overview](benchmarking-overview.md) for a platform introduction, or jump to the [Quickstart Guide](quickstart-benchmarking.md) for step-by-step instructions.

## Table of Contents

1. [Overview](#overview)
2. [Quick Start](#quick-start)
3. [Benchmark Dataset](#benchmark-dataset)
4. [Available Pipelines](#available-pipelines)
5. [Understanding Metrics](#understanding-metrics)
6. [Running Benchmarks](#running-benchmarks)
7. [Custom Pipelines](#custom-pipelines)
8. [Results Analysis](#results-analysis)
9. [Dependencies](#dependencies)
10. [Troubleshooting](#troubleshooting)

**Reference Documentation:**

- [Pipeline Catalog](pipeline-catalog.md) - All available pipelines with performance data
- [Metrics Reference](metrics-reference.md) - Detailed metric explanations
- [Current Results](benchmark-results.md) - Latest benchmark findings

---

## Overview

This guide provides practical instructions for running OCR benchmarks. It covers:

- Setting up benchmark datasets
- Running evaluation scripts
- Creating custom pipelines
- Analyzing results
- Troubleshooting common issues

For architectural details and the multi-category framework, see the [Document Understanding Benchmark Framework](document-understanding-benchmark.md).

**Key Features:**

- Multiple evaluation metrics (F1, recall, precision, WER, sequence accuracy)
- Support for any extraction pipeline
- Per-document and aggregate results
- JSON/CSV export for analysis
- Comparison between pipeline configurations

---

## Quick Start

```bash
# 1. Download the FUNSD benchmark dataset
python scripts/download_funsd_samples.py

# 2. Run benchmark on all built-in pipelines
python scripts/benchmark_all_pipelines.py

# 3. View results
cat results/final_benchmark.json | jq '.pipelines[] | {name, f1: .metrics.set_based.avg_f1}'
```

For detailed step-by-step instructions, see the [Quickstart Guide](quickstart-benchmarking.md).

---

## Benchmark Dataset

### FUNSD Dataset

**What is FUNSD?**

- **F**orm **U**nderstanding in **N**oisy **S**canned **D**ocuments
- 199 annotated scanned form images
- 31,485 words with ground truth OCR text
- Real-world noisy scanned documents (not born-digital)

**Download:**

```bash
python scripts/download_funsd_samples.py
```

This will:

1. Download the FUNSD dataset from the official source
2. Extract 20 test forms into `corpora/funsd_benchmark/`
3. Create ground truth files in `metadata/funsd_ground_truth/`
4. Tag documents with `["funsd", "scanned", "ground-truth"]`

**Dataset structure:**

```
corpora/funsd_benchmark/
├── metadata/
│   ├── config.json
│   ├── catalog.json
│   └── funsd_ground_truth/
│       ├── <document-id>.txt   # Ground truth OCR text
│       └── ...
├── --82092117.png              # Scanned form image
└── ...
```

### Using Your Own Dataset

To benchmark on custom documents:

1. **Prepare ground truth files:**

   ```
   corpus_dir/metadata/ground_truth/
   ├── <document-id>.txt
   └── ...
   ```

2. **Ingest documents:**

   ```python
   from pathlib import Path

   from biblicus import Corpus

   corpus = Corpus(Path("my_corpus"))
   corpus.ingest_file("document.png", tags=["benchmark"])
   ```

3. **Run evaluation:**

   ```python
   from biblicus.evaluation.ocr_benchmark import OCRBenchmark

   benchmark = OCRBenchmark(corpus)
   report = benchmark.evaluate_extraction(
       snapshot_reference="<snapshot-id>",  # ID of an extraction snapshot built beforehand (see Running Benchmarks)
       ground_truth_dir=corpus.root / "metadata" / "ground_truth",
   )
   ```
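The report's aggregate numbers follow the standard definitions summarized later in [Understanding Metrics](#understanding-metrics). As a toy illustration of what the two main metric families measure (a sketch, not Biblicus's actual implementation), set-based precision/recall/F1 and word error rate can be computed from word lists, assuming `editdistance` is installed:

```python
import editdistance  # pip install editdistance

ground_truth = "invoice number 4821 total due 17.50".split()
extracted = "invoice number 4821 total clue 17.50".split()

# Set-based metrics: compare the *sets* of words, ignoring order.
gt_set, ocr_set = set(ground_truth), set(extracted)
true_positives = len(gt_set & ocr_set)
precision = true_positives / len(ocr_set)
recall = true_positives / len(gt_set)
f1 = 2 * precision * recall / (precision + recall)

# Order-aware metric: WER is word-level edit distance
# (substitutions + insertions + deletions) over ground-truth length.
wer = editdistance.eval(ground_truth, extracted) / len(ground_truth)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} wer={wer:.3f}")
```

Set-based scores ignore ordering entirely, which is why a pipeline can post a strong F1 while still scrambling reading order; the order-aware metrics exist to catch exactly that.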
---

## Available Pipelines

Biblicus includes 8+ pre-configured extraction pipelines. For complete details on each pipeline, including:

- Performance metrics and trade-offs
- Configuration examples
- When to use each pipeline
- Installation requirements

see the **[Pipeline Catalog](pipeline-catalog.md)**.

**Quick summary:**

- **PaddleOCR** (F1: 0.787) - Best overall accuracy
- **Docling-Smol** (F1: 0.728) - Tables & formulas
- **Heron + Tesseract** (Recall: 0.810) - Maximum extraction
- **Baseline Tesseract**, **RapidOCR**, **Unstructured**, and more

---

## Understanding Metrics

Biblicus uses three categories of metrics:

- **Set-based**: F1, Precision, Recall (word-finding accuracy)
- **Order-aware**: WER, LCS Ratio (reading order quality)
- **N-gram**: Bigram, Trigram overlap (local ordering)

For detailed explanations, interpretations, and use case recommendations, see the **[Metrics Reference](metrics-reference.md)**.

---

## Running Benchmarks

### Benchmark All Pipelines

```bash
python scripts/benchmark_all_pipelines.py
```

**What it does:**

1. Loads all pipeline configs from `configs/`
2. Builds extraction snapshots for each pipeline
3. Evaluates against FUNSD ground truth
4. Generates a comprehensive comparison report

**Output:**

- `results/final_benchmark.json` - Full results with all metrics
- Console output with summary table

**Example output:**

```
========================================
FINAL BENCHMARK RESULTS
========================================

| Rank | Pipeline          | F1    | Recall | WER   | Seq Acc |
|------|-------------------|-------|--------|-------|---------|
| 1    | paddleocr         | 0.787 | 0.782  | 0.533 | 0.031   |
| 2    | docling-smol      | 0.728 | 0.675  | 0.645 | 0.021   |
| 3    | unstructured      | 0.649 | 0.626  | 0.598 | 0.014   |
| 4    | baseline-ocr      | 0.607 | 0.599  | 0.628 | 0.013   |
```

### Benchmark Single Pipeline

```bash
# Using config file
python scripts/evaluate_ocr_pipeline.py \
    --corpus corpora/funsd_benchmark \
    --config configs/heron-tesseract.yaml \
    --output results/heron_tesseract.json

# Or specify inline
python -c "
from pathlib import Path

from biblicus import Corpus
from biblicus.evaluation.ocr_benchmark import OCRBenchmark
from biblicus.extraction import build_extraction_snapshot

corpus = Corpus(Path('corpora/funsd_benchmark').resolve())

# Build snapshot
config = {
    'extractor_id': 'pipeline',
    'config': {
        'stages': [
            {'extractor_id': 'heron-layout', 'config': {'model_variant': '101'}},
            {'extractor_id': 'ocr-tesseract', 'config': {'use_layout_metadata': True}},
        ]
    }
}
snapshot = build_extraction_snapshot(
    corpus,
    extractor_id='pipeline',
    configuration_name='heron-tesseract',
    configuration=config['config'],
)

# Evaluate
benchmark = OCRBenchmark(corpus)
report = benchmark.evaluate_extraction(
    snapshot_reference=snapshot.snapshot_id,
    pipeline_config=config,
)
report.print_summary()
report.to_json(Path('results/heron_tesseract.json'))
"
```

### Compare Two Pipelines

```bash
python scripts/compare_pipelines.py \
    --baseline configs/baseline-ocr.yaml \
    --experimental configs/heron-tesseract.yaml \
    --corpus corpora/funsd_benchmark \
    --output results/comparison.json
```

---

## Custom Pipelines

### Create Custom Pipeline Config

1. **Create YAML config:**

   ```yaml
   # configs/my-custom-pipeline.yaml
   extractor_id: pipeline
   config:
     stages:
       # Step 1: Layout detection (optional)
       - extractor_id: heron-layout
         config:
           model_variant: "101"
           confidence_threshold: 0.7

       # Step 2: OCR
       - extractor_id: ocr-tesseract
         config:
           use_layout_metadata: true
           min_confidence: 0.5
           lang: eng
           psm: 3

       # Step 3: Post-processing (optional)
       - extractor_id: clean-gibberish  # If implemented
         config:
           strictness: medium
   ```
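   Before wiring the config into the benchmark, it can help to confirm the YAML parses and has the shape the pipeline extractor expects. A minimal sanity check, assuming PyYAML (`pip install pyyaml`) is available:

   ```python
   import yaml  # PyYAML; an assumption, not a stated Biblicus requirement

   with open("configs/my-custom-pipeline.yaml") as f:
       config = yaml.safe_load(f)

   # The benchmark scripts expect a pipeline extractor with a list of stages.
   assert config["extractor_id"] == "pipeline"
   assert isinstance(config["config"]["stages"], list)
   print(f"OK: {len(config['config']['stages'])} stages")
   ```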
2. **Add to benchmark script:**

   ```python
   # Edit scripts/benchmark_all_pipelines.py
   PIPELINE_CONFIGS = [
       # ... existing configs ...
       "configs/my-custom-pipeline.yaml",
   ]
   ```

3. **Run benchmark:**

   ```bash
   python scripts/benchmark_all_pipelines.py
   ```

### Benchmark Custom Extractor

If you've created a new extractor:

1. **Register in `__init__.py`:**

   ```python
   # src/biblicus/extractors/__init__.py
   from .my_extractor import MyExtractor

   extractors: Dict[str, TextExtractor] = {
       # ... existing extractors ...
       MyExtractor.extractor_id: MyExtractor(),
   }
   ```

2. **Create config:**

   ```yaml
   # configs/my-extractor.yaml
   extractor_id: my-extractor
   config:
     param1: value1
     param2: value2
   ```

3. **Benchmark:**

   ```bash
   python scripts/evaluate_ocr_pipeline.py \
       --corpus corpora/funsd_benchmark \
       --config configs/my-extractor.yaml \
       --output results/my_extractor.json
   ```

---

## Results Analysis

### JSON Output Structure

```json
{
  "evaluation_timestamp": "2026-02-03T23:00:00Z",
  "corpus_path": "/path/to/corpus",
  "pipeline_configuration": { ... },
  "total_documents": 20,
  "aggregate_metrics": {
    "avg_precision": 0.625,
    "avg_recall": 0.599,
    "avg_f1": 0.607,
    "median_f1": 0.655,
    "avg_word_error_rate": 0.628,
    "avg_sequence_accuracy": 0.013,
    "avg_bigram_overlap": 0.350
  },
  "per_document_results": [
    {
      "document_id": "abc123...",
      "image_path": "abc123...png",
      "ground_truth_word_count": 134,
      "extracted_word_count": 135,
      "metrics": {
        "precision": 0.615,
        "recall": 0.619,
        "f1_score": 0.617,
        "word_error_rate": 3.056,
        "sequence_accuracy": 0.043
      }
    }
  ]
}
```

### CSV Output

Per-document results are exported to CSV for spreadsheet analysis:

```csv
document_id,image_path,gt_word_count,ocr_word_count,precision,recall,f1_score,wer
abc123...,abc123.png,134,135,0.615,0.619,0.617,3.056
```

### Analyzing Results

**Find best pipeline:**

```bash
cat results/final_benchmark.json | jq '.pipelines | sort_by(-.metrics.set_based.avg_f1) | .[0] | {name, f1: .metrics.set_based.avg_f1}'
```

**Compare reading order quality:**

```bash
cat results/final_benchmark.json | jq '.pipelines | sort_by(-.metrics.order_aware.avg_sequence_accuracy) | .[] | {name, seq_acc: .metrics.order_aware.avg_sequence_accuracy}'
```

**Find documents with low accuracy:**

```bash
cat results/my_pipeline.json | jq '.per_document_results[] | select(.metrics.f1_score < 0.5) | {doc: .document_id[:16], f1: .metrics.f1_score}'
```

**Export to CSV for Excel:**

```python
import json

import pandas as pd

with open('results/final_benchmark.json') as f:
    data = json.load(f)

# Flatten the nested metrics into one row per pipeline
rows = []
for pipeline in data['pipelines']:
    rows.append({
        'name': pipeline['name'],
        'f1': pipeline['metrics']['set_based']['avg_f1'],
        'recall': pipeline['metrics']['set_based']['avg_recall'],
        'precision': pipeline['metrics']['set_based']['avg_precision'],
        'wer': pipeline['metrics']['order_aware']['avg_word_error_rate'],
        'seq_acc': pipeline['metrics']['order_aware']['avg_sequence_accuracy'],
    })

df = pd.DataFrame(rows)
df.to_csv('results/summary.csv', index=False)
```

---

## Dependencies

### Installing OCR Dependencies

Different pipelines require different dependencies:

**Tesseract:**

```bash
# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt-get install tesseract-ocr

# Python bindings
pip install pytesseract
```

**PaddleOCR:**

```bash
pip install "paddleocr" "paddlex[ocr]"
```

**Heron (IBM Research):**

```bash
pip install "transformers>=4.40.0" "torch>=2.0.0"
```

**Docling:**

```bash
pip install docling
```

**RapidOCR:**

```bash
pip install rapidocr-onnxruntime
```

**Unstructured:**

```bash
pip install "unstructured[image]"
```
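To check in one pass which of these backends are importable in the current environment, you can probe for each module with `importlib`. This is a convenience sketch rather than part of Biblicus, and the import names are assumptions inferred from the install commands above:

```python
import importlib.util

# Import names assumed from the pip packages listed above
backends = {
    "Tesseract (pytesseract)": "pytesseract",
    "PaddleOCR": "paddleocr",
    "Heron (transformers)": "transformers",
    "Docling": "docling",
    "RapidOCR": "rapidocr_onnxruntime",
    "Unstructured": "unstructured",
}

for label, module in backends.items():
    status = "OK" if importlib.util.find_spec(module) else "missing"
    print(f"{label:<26} {status}")
```

The shell-based checks under Checking Dependencies below exercise the actual model classes and are the more authoritative test.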
**Evaluation dependencies:**

```bash
pip install editdistance  # For WER calculation
```

### Checking Dependencies

```bash
# Test Tesseract
tesseract --version

# Test PaddleOCR
python -c "from paddleocr import PPStructureV3; print('PaddleOCR OK')"

# Test Heron
python -c "from transformers import RTDetrV2ForObjectDetection; print('Heron OK')"

# Test Docling
python -c "from docling.document_converter import DocumentConverter; print('Docling OK')"
```

---

## Troubleshooting

### Common Issues

**Issue: "Ground truth directory not found"**

```
Solution: Run python scripts/download_funsd_samples.py first.
```

**Issue: "No text files found in snapshot"**

```
Solution: Check that extraction succeeded. View the snapshot manifest.
```

**Issue: "Model download fails"**

```
Solution: Check your internet connection. Models download on first use:
- PaddleOCR: ~100MB
- Heron: ~150MB
- Docling: varies by model
```

**Issue: "Out of memory"**

```
Solution: Use smaller batch sizes or lighter models:
- Heron: use "base" instead of "101"
- Reduce the number of test documents
```

**Issue: "Results don't match expected performance"**

```
Solution: Check that:
- The correct ground truth files are loaded
- Document types match the pipeline's strengths
- Dependencies are installed correctly
```

---

## See Also

**Benchmarking Documentation:**

- [Benchmarking Overview](benchmarking-overview.md) - Platform introduction
- [Quickstart Guide](quickstart-benchmarking.md) - Step-by-step instructions
- [Pipeline Catalog](pipeline-catalog.md) - All available pipelines
- [Metrics Reference](metrics-reference.md) - Metric explanations
- [Current Results](benchmark-results.md) - Latest findings
- [Document Understanding Framework](document-understanding-benchmark.md) - Architecture details
- [Heron Implementation](heron-implementation.md) - Heron-specific details
- [Layout-Aware OCR Results](layout-aware-ocr-results.md) - Layout analysis

**External References:**

- [FUNSD Dataset](https://guillaumejaume.github.io/FUNSD/)
- [Heron Models (IBM Research)](https://huggingface.co/ds4sd/docling-layout-heron-101)
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract)