# Biblicus Benchmark Results

Current benchmark results and recommendations for choosing extraction pipelines.

**Last Updated:** 2026-02-13
**Benchmark Run:** Quick benchmark on the FUNSD dataset (20 documents)

> **Related Documentation:**
> - [Benchmarking Overview](benchmarking-overview.md) - Platform introduction
> - [Pipeline Catalog](pipeline-catalog.md) - Detailed pipeline information
> - [Metrics Reference](metrics-reference.md) - Understanding the metrics
> - [Quickstart Guide](quickstart-benchmarking.md) - Run your own benchmarks

---

## Executive Summary

This page presents current benchmark results from Biblicus extraction pipelines. The benchmarks evaluate:

- **Forms (FUNSD)**: Scanned form documents with noise and handwriting
- **Receipts (SROIE)**: Dense receipt text with entity extraction *(previous results)*
- **Academic Papers**: Multi-column layouts *(dataset pending)*

### Current Test Environment

- **Test Date:** February 13, 2026
- **Dataset:** FUNSD (Form Understanding in Noisy Scanned Documents)
- **Documents:** 20 scanned forms
- **Pipelines Tested:** 6 (6 successful, 0 failed) ✅

---

## Forms Category (FUNSD) - Latest Results

**Dataset:** FUNSD - 199 scanned form documents with word-level ground truth
**Challenge:** Noisy scans, handwriting, field extraction
**Primary Metric:** F1 Score

### Results (20 documents - Feb 13, 2026)

| Rank | Pipeline | F1 Score | Precision | Recall | WER | Bigram | Status |
|------|----------|----------|-----------|--------|-----|--------|--------|
| 1 | **PaddleOCR** | **0.787** ⭐ | 0.792 | 0.782 | **0.533** | **0.466** | ✓ |
| 2 | **Docling-Smol** | 0.728 | 0.821 | 0.675 | 0.645 | 0.430 | ✓ |
| 3 | **Docling-Granite** | 0.728 | 0.821 | 0.675 | 0.645 | 0.430 | ✓ |
| 4 | **Unstructured** | 0.631 | 0.673 | 0.597 | 0.608 | 0.368 | ✓ |
| 5 | **Baseline Tesseract** | 0.542 | 0.616 | 0.510 | 0.687 | 0.272 | ✓ |
| 6 | **RapidOCR** | 0.508 | 0.568 | 0.468 | 0.748 | 0.206 | ✓ |

**All pipelines successful!** ✅ All dependencies (Tesseract, PaddleOCR, Docling, Unstructured, RapidOCR) were installed and working.
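As a quick sanity check on the table above: F1 is the harmonic mean of precision and recall. Because the reported numbers are per-document averages, the identity need not hold exactly for every row, but PaddleOCR's figures line up closely. A minimal check in plain Python:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# PaddleOCR's forms row: precision 0.792, recall 0.782
print(round(f1_score(0.792, 0.782), 3))  # 0.787, matching the reported F1
```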
### Key Findings - Forms (Current Run)

**PaddleOCR (Winner):**
- Best F1 score (0.787) - clear winner
- Best recall (0.782) - finds the most text
- Lowest WER (0.533) - best reading order
- Best bigram overlap (0.466) - best local ordering
- Excellent all-around performance

**Docling VLMs (Smol & Granite):**
- Tied for 2nd place (F1: 0.728)
- Highest precision (0.821) - fewest false positives
- Identical scores on this dataset suggest closely related model architectures
- Good for clean, accurate extraction

**Unstructured:**
- Solid mid-tier performance (F1: 0.631)
- Multi-format support beyond just OCR
- Good balance of precision and recall

**Baseline Tesseract:**
- Simple baseline (F1: 0.542)
- Fast processing, minimal dependencies
- Acceptable for straightforward documents

**RapidOCR:**
- Lightweight alternative (F1: 0.508)
- Fastest processing
- Good for resource-constrained environments

### Comparison with Historical Results

**Note:** Previous benchmarks included Heron + Tesseract, which achieved the highest recall but was not included in this run:

| Pipeline | F1 Score | Precision | Recall | WER | Bigram | Notes |
|----------|----------|-----------|--------|-----|--------|-------|
| **PaddleOCR** | **0.787** | 0.792 | 0.782 | 0.533 | 0.466 | ✓ Confirmed Feb 2026 |
| Docling-Smol | 0.728 | 0.821 | 0.675 | 0.645 | 0.430 | ✓ Confirmed Feb 2026 |
| Docling-Granite | 0.728 | 0.821 | 0.675 | 0.645 | 0.430 | ✓ Confirmed Feb 2026 |
| Unstructured | 0.631 | 0.673 | 0.597 | 0.608 | 0.368 | ✓ Confirmed Feb 2026 |
| Baseline Tesseract | 0.542 | 0.616 | 0.510 | 0.687 | 0.272 | ✓ Confirmed Feb 2026 |
| RapidOCR | 0.508 | 0.568 | 0.468 | 0.748 | 0.206 | ✓ Confirmed Feb 2026 |
| Heron + Tesseract | 0.519 | 0.384 | **0.810** | 5.324 | **0.561** | Previous result - highest recall |

---

## Receipts Category (SROIE) - Previous Results

**Dataset:** SROIE - 626 receipt images with OCR text ground truth
**Challenge:** Dense text, small fonts, entity extraction
**Primary Metric:** F1 Score

### Results (50 documents - Previous Benchmark)

| Rank | Pipeline | F1 Score | Precision | Recall | WER | LCS Ratio |
|------|----------|----------|-----------|--------|-----|-----------|
| 1 | **PaddleOCR** | **0.897** | - | - | - | - |
| 2 | Docling-Smol | 0.856 | - | - | - | - |
| 3 | Heron + Tesseract | 0.589 | 0.509 | 0.756 | 2.808 | 0.677 |
| 4 | RapidOCR | 0.589 | - | - | - | - |

### Key Findings - Receipts

- **PaddleOCR** dominates on receipts with an F1 of 0.897
- **Docling-Smol** performs excellently (0.856) - strong VLM approach
- **Heron + Tesseract** underperforms (0.589) - layout detection is less beneficial for single-column receipts
- Receipt OCR benefits from dense text recognition capabilities

---

## Pipeline Comparison Summary

### Overall Performance by Document Type

| Pipeline | Forms F1 | Receipts F1 | Avg F1 | Best For |
|----------|----------|-------------|--------|----------|
| **PaddleOCR** | 0.787 | 0.897 | 0.842 | General-purpose, best overall (when available) |
| **Docling-Smol** | 0.728 | 0.856 | 0.792 | High precision, VLM capabilities |
| **Docling-Granite** | 0.728 | - | 0.728 | Matches Smol on forms; untested on receipts |
| **Unstructured** | 0.631 | - | 0.631 | Multi-format documents |
| **Heron + Tesseract** | 0.519 | 0.589 | 0.554 | Multi-column layouts, maximum recall |
| **RapidOCR** | 0.508 | 0.589 | 0.549 | Lightweight, fast, CPU-only |
| **Baseline Tesseract** | 0.542 | - | 0.542 | Simple baseline |

---

## Recommendations by Use Case

### Production Systems

| Use Case | Recommended Pipeline | F1 Score | Why |
|----------|---------------------|----------|-----|
| **Best Overall Accuracy** | PaddleOCR | 0.787-0.897 | Highest F1 in both tested categories, balanced performance |
| **High Precision Needs** | Docling-Smol/Granite | 0.728 | Fewest false positives (Precision: 0.821) |
| **Maximum Text Extraction** | Heron + Tesseract | 0.519 (F1) / 0.810 (Recall) | Finds 81% of all text - best when completeness matters |
| **VLM-Based Extraction** | Docling-Smol | 0.728 | Good for tables, formulas, structured documents |
| **Lightweight/Embedded** | RapidOCR | 0.508 | Fast, minimal dependencies, CPU-only |

### By Document Type

**Forms (FUNSD-like):**

1. **PaddleOCR** (0.787) - Best overall
2. **Docling-Smol** (0.728) - High precision alternative
3. **Heron + Tesseract** (0.519 F1 / 0.810 Recall) - When completeness matters

**Receipts (dense text):**

1. **PaddleOCR** (0.897) - Clear winner
2. **Docling-Smol** (0.856) - Strong alternative
3. **RapidOCR** (0.589) - Lightweight option

**Academic Papers (multi-column):**

- **Heron + Tesseract** - Layout-aware reading order (pending full evaluation)
- **Docling-Granite** - VLM layout understanding

### By Constraint

**CPU-Only (No GPU):**
- RapidOCR (F1: 0.508) - Fast and lightweight
- Baseline Tesseract (F1: 0.542) - If Tesseract is installed

**Maximum Recall (Can't miss text):**
- Heron + Tesseract (Recall: 0.810) - Finds 81% of all words

**Minimum False Positives:**
- Docling-Smol (Precision: 0.821) - Cleanest output

**Fastest Processing:**
- RapidOCR - Lightweight, optimized for speed
- Processing time: ~1 second per document

---

## Metric Interpretation Guide

### F1 Score Targets

- **F1 ≥ 0.75:** Excellent (production-ready)
- **F1 ≥ 0.65:** Good (acceptable for many use cases)
- **F1 ≥ 0.50:** Fair (may need improvement)
- **F1 < 0.50:** Poor (needs work)

### Current Standings

- **Excellent (≥0.75):** PaddleOCR (0.787-0.897)
- **Good (≥0.65):** Docling-Smol (0.728), Docling-Granite (0.728)
- **Fair (≥0.50):** Unstructured (0.631), Baseline Tesseract (0.542), Heron + Tesseract (0.519), RapidOCR (0.508)

For detailed metric explanations, see the **[Metrics Reference](metrics-reference.md)**.
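As a rough illustration of how the set-based scores on this page behave, the sketch below computes precision, recall, and F1 over whitespace tokens. This is a toy example for intuition only, not the project's evaluation code; the real implementation in `src/biblicus/evaluation/` may normalize and match tokens differently.

```python
def set_based_scores(predicted: str, truth: str) -> dict:
    """Toy set-based precision/recall/F1 over lowercased whitespace tokens."""
    pred_tokens = set(predicted.lower().split())
    true_tokens = set(truth.lower().split())
    if not pred_tokens or not true_tokens:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    overlap = len(pred_tokens & true_tokens)
    precision = overlap / len(pred_tokens)  # share of predicted tokens that are correct
    recall = overlap / len(true_tokens)     # share of ground-truth tokens that were found
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# A high-precision, lower-recall extraction: everything predicted is correct,
# but the token "paid" was missed.
print(set_based_scores("Total: $12.50 cash", "Total: $12.50 paid cash"))
```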
---

## Benchmark Reproducibility

### Running These Benchmarks Yourself

To reproduce these results:

```bash
# 1. Install dependencies (optional extras as needed)
pip install -e .
pip install "biblicus[paddleocr]"  # For PaddleOCR
pip install "biblicus[docling]"    # For Docling VLMs
pip install "biblicus[ocr]"        # For RapidOCR
brew install tesseract             # For Tesseract-based pipelines (macOS)

# 2. Download FUNSD dataset
python scripts/download_funsd_samples.py

# 3. Run benchmark
python scripts/benchmark_all_pipelines.py \
    --corpus corpora/funsd_benchmark \
    --output results/my_benchmark.json

# 4. View results
cat results/my_benchmark.json | jq '.pipelines[] | {name, f1: .metrics.set_based.avg_f1}'
```

For detailed instructions, see the **[Quickstart Guide](quickstart-benchmarking.md)**.
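If you prefer Python to `jq` for inspecting the output, a small ranking sketch is shown below. It assumes the JSON layout implied by the `jq` filter above (`pipelines[].name` and `pipelines[].metrics.set_based.avg_f1`); adjust the field names if your output differs.

```python
import json

# Load the benchmark output written by benchmark_all_pipelines.py
with open("results/my_benchmark.json") as fh:
    results = json.load(fh)

# Collect (name, avg F1) pairs, skipping pipelines without metrics
scores = [
    (p["name"], p["metrics"]["set_based"]["avg_f1"])
    for p in results["pipelines"]
    if p.get("metrics")
]

# Print a ranking, best first
for rank, (name, f1) in enumerate(sorted(scores, key=lambda s: s[1], reverse=True), start=1):
    print(f"{rank}. {name}: F1 = {f1:.3f}")
```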
### Benchmark Configuration

**Current run used:**
- **Mode:** Quick (20 documents)
- **Dataset:** FUNSD test set
- **Pipelines:** All available with installed dependencies
- **Metrics:** F1, Precision, Recall, WER, Sequence Accuracy, Bigram/Trigram overlap

**Available modes:**
- `quick.yaml` - 5-10 minutes (20 forms, 50 receipts)
- `standard.yaml` - 30-60 minutes (50 forms, 100 receipts)
- `full.yaml` - 2-4 hours (all 199 forms, all 626 receipts)

---

## Understanding Pipeline Trade-offs

### Accuracy vs. Speed

| Pipeline | F1 Score | Speed | Trade-off |
|----------|----------|-------|-----------|
| PaddleOCR | 0.787 | Medium | Best balance |
| Docling-Smol | 0.728 | Slow | Accuracy for VLM features |
| RapidOCR | 0.508 | Fast | Speed for accuracy |

### Precision vs. Recall

| Pipeline | Precision | Recall | Trade-off |
|----------|-----------|--------|-----------|
| Docling-Smol | 0.821 | 0.675 | Clean output, may miss text |
| Heron + Tesseract | 0.384 | 0.810 | Finds everything, includes noise |
| PaddleOCR | 0.792 | 0.782 | Balanced |

**Choose high precision when:** False positives are expensive (indexing, search quality)
**Choose high recall when:** Missing content is expensive (legal, compliance)

---

## Known Issues and Limitations

### Current Benchmark Run

**Gaps in this run:**
- Heron + Tesseract was not included; its figures on this page come from a previous benchmark
- Only the quick mode (20 documents) was run, not the full 199-document FUNSD test set

**Next Steps:**
- Rerun the full benchmark on the complete dataset (199 documents)
- Include Heron + Tesseract in the next run
- Rerun the SROIE receipts benchmark
- Add academic papers benchmark (dataset pending)

### General Limitations

**Dataset Coverage:**
- Forms: ✓ FUNSD available
- Receipts: ✓ SROIE available (not rerun in current test)
- Academic Papers: ✗ Dataset pending

**Pipeline Coverage:**
- Missing: MarkItDown, additional layout-aware combinations
- Incomplete: Entity-level evaluation for receipts

---

## Future Work

### Planned Updates

1. **Full FUNSD Rerun:**
   - Full 199-document FUNSD benchmark
   - Include all 8+ pipelines, including Heron + Tesseract

2. **Add SROIE Receipts:**
   - Rerun the current benchmark on receipts
   - Add entity-level metrics
   - Test all pipelines on receipts

3. **Academic Papers Category:**
   - Find a suitable dataset (Scanned ArXiv or PubLayNet)
   - Focus on reading-order metrics (LCS, bigram)
   - Test layout-aware pipelines

4. **Additional Pipelines:**
   - MarkItDown extraction
   - More layout-aware combinations
   - Custom pipeline examples

### Contributing

To contribute benchmark results:

1. Run benchmarks following the [Quickstart Guide](quickstart-benchmarking.md)
2. Share results in GitHub issues
3. Document your test environment
4. Include pipeline configurations

---

## See Also

**Benchmarking Documentation:**
- [Benchmarking Overview](benchmarking-overview.md) - Platform introduction
- [Quickstart Guide](quickstart-benchmarking.md) - Run your own benchmarks
- [Pipeline Catalog](pipeline-catalog.md) - All available pipelines
- [Metrics Reference](metrics-reference.md) - Understanding metrics
- [OCR Benchmarking Guide](ocr-benchmarking.md) - Detailed how-to
- [Document Understanding Framework](document-understanding-benchmark.md) - Architecture

**Implementation Details:**
- [Heron Implementation](heron-implementation.md) - Layout detection specifics
- [Layout-Aware OCR Results](layout-aware-ocr-results.md) - Detailed analysis

**Source Code:**
- Benchmark scripts: `scripts/benchmark_*.py`
- Evaluation module: `src/biblicus/evaluation/`
- Pipeline configs: `configs/*.yaml`