Biblicus Benchmark Results

Current benchmark results and recommendations for choosing extraction pipelines.

Last Updated: 2026-02-13

Benchmark Run: FUNSD dataset, quick mode (20 documents), all 6 pipelines


Executive Summary

This page presents current benchmark results from Biblicus extraction pipelines. The benchmarks evaluate:

  • Forms (FUNSD): Scanned form documents with noise and handwriting

  • Receipts (SROIE): Dense receipt text with entity extraction (previous results)

  • Academic Papers: Multi-column layouts (dataset pending)

Current Test Environment

  • Test Date: February 13, 2026

  • Dataset: FUNSD (Form Understanding in Noisy Scanned Documents)

  • Documents: 20 scanned forms

  • Pipelines Tested: 6 (6 successful, 0 failed) ✅


Forms Category (FUNSD) - Latest Results

Dataset: FUNSD - 199 scanned form documents with word-level ground truth

Challenge: Noisy scans, handwriting, field extraction

Primary Metric: F1 Score

Results (20 documents - Feb 13, 2026)

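| Rank | Pipeline           | F1 Score | Precision | Recall | WER   | Bigram | Status |
|------|--------------------|----------|-----------|--------|-------|--------|--------|
| 1    | PaddleOCR          | 0.787    | ~0.792    | 0.782  | 0.533 | 0.466  | ✅     |
| 2    | Docling-Smol       | 0.728    | 0.821     | 0.675  | 0.645 | 0.430  | ✅     |
| 3    | Docling-Granite    | 0.728    | 0.821     | 0.675  | 0.645 | 0.430  | ✅     |
| 4    | Unstructured       | 0.631    | ~0.673    | 0.597  | 0.608 | 0.368  | ✅     |
| 5    | Baseline Tesseract | 0.542    | ~0.616    | 0.510  | 0.687 | 0.272  | ✅     |
| 6    | RapidOCR           | 0.508    | ~0.568    | 0.468  | 0.748 | 0.206  | ✅     |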

All pipelines successful! ✅ Dependencies (Tesseract, PaddleOCR, Docling, Unstructured, RapidOCR) all installed and working.

Key Findings - Forms (Current Run)

PaddleOCR (Winner):

  • Best F1 score (0.787) - 0.059 ahead of the second-place Docling pipelines

  • Best recall (0.782) - finds most text

  • Lowest WER (0.533) - best reading order

  • Best bigram overlap (0.466) - best local ordering

  • Leads on four of the five metrics; only the Docling pipelines have higher precision (0.821 vs. ~0.792)

Docling VLMs (Smol & Granite):

  • Tied for 2nd place (F1: 0.728)

  • Highest precision (0.821) - fewest false positives

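  • Identical scores on all five metrics; this may reflect similar model architectures, but it is worth confirming that the two configurations load different models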

  • Good for clean, accurate extraction

Unstructured:

  • Solid mid-tier performance (F1: 0.631)

  • Multi-format support beyond just OCR

  • Good balance of precision and recall

Baseline Tesseract:

  • Simple baseline (F1: 0.542)

  • Fast processing, minimal dependencies

  • Acceptable for straightforward documents

RapidOCR:

  • Lightweight alternative (F1: 0.508)

  • Fastest processing

  • Good for resource-constrained environments
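The bigram figures above measure local word order, which the set-based F1 score ignores. A minimal sketch of the idea, assuming the metric is the fraction of ground-truth word pairs that also appear in the extracted text (the exact Biblicus definition may differ, and `bigram_overlap` is a hypothetical helper, not part of the library):

```python
from collections import Counter


def bigrams(words):
    """Return the multiset of adjacent word pairs in a word list."""
    return Counter(zip(words, words[1:]))


def bigram_overlap(reference: str, hypothesis: str) -> float:
    """Fraction of reference bigrams that also occur in the hypothesis.

    Assumption: whitespace tokenization and multiset matching; the
    real metric may normalize text or count pairs differently.
    """
    ref = bigrams(reference.split())
    hyp = bigrams(hypothesis.split())
    if not ref:
        return 0.0
    matched = sum((ref & hyp).values())  # multiset intersection
    return matched / sum(ref.values())


# Same four words in both strings, but only the pair ("a", "b") keeps its order.
print(bigram_overlap("a b c d", "a b d c"))
```

In this example every word is found, so a set-based score would be perfect, yet only one of the three reference bigrams survives. That gap is why a pipeline can rank well on F1 and still score low on ordering metrics.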

Comparison with Historical Results

Note: Previous benchmarks included Heron + Tesseract, which achieved the highest recall but wasn’t included in this run:

| Pipeline           | F1 Score | Precision | Recall | WER   | Bigram | Notes                            |
|--------------------|----------|-----------|--------|-------|--------|----------------------------------|
| PaddleOCR          | 0.787    | 0.792     | 0.782  | 0.533 | 0.466  | ✓ Confirmed Feb 2026             |
| Docling-Smol       | 0.728    | 0.821     | 0.675  | 0.645 | 0.430  | ✓ Confirmed Feb 2026             |
| Docling-Granite    | 0.728    | 0.821     | 0.675  | 0.645 | 0.430  | ✓ Confirmed Feb 2026             |
| Unstructured       | 0.631    | 0.673     | 0.597  | 0.608 | 0.368  | ✓ Confirmed Feb 2026             |
| Baseline Tesseract | 0.542    | 0.616     | 0.510  | 0.687 | 0.272  | ✓ Confirmed Feb 2026             |
| RapidOCR           | 0.508    | 0.568     | 0.468  | 0.748 | 0.206  | ✓ Confirmed Feb 2026             |
| Heron + Tesseract  | 0.519    | 0.384     | 0.810  | 5.324 | 0.561  | Previous result - highest recall |
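The Heron + Tesseract WER of 5.324 is not a typo: WER divides the number of word edits by the length of the ground truth, so a pipeline that emits many extra words can score well above 1. A minimal sketch, assuming the standard definition (substitutions + deletions + insertions, divided by reference word count); `word_error_rate` is a hypothetical helper and the Biblicus implementation may normalize text differently:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Both reference words are found, but five extra words are inserted.
print(word_error_rate("total due", "x total y due z w q"))  # 2.5
```

This matches the pattern in the table: high recall (0.810) with low precision (0.384) means a lot of inserted text, which inflates WER even though most ground-truth words are present.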


Receipts Category (SROIE) - Previous Results

Dataset: SROIE - 626 receipt images with OCR text ground truth

Challenge: Dense text, small fonts, entity extraction

Primary Metric: F1 Score

Results (50 documents - Previous Benchmark)

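| Rank | Pipeline          | F1 Score | Precision | Recall | WER   | LCS Ratio |
|------|-------------------|----------|-----------|--------|-------|-----------|
| 1    | PaddleOCR         | 0.897    | -         | -      | -     | -         |
| 2    | Docling-Smol      | 0.856    | -         | -      | -     | -         |
| 3    | Heron + Tesseract | 0.589    | 0.509     | 0.756  | 2.808 | 0.677     |
| 4    | RapidOCR          | 0.589    | -         | -      | -     | -         |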

Key Findings - Receipts

  • PaddleOCR dominates on receipts with F1 of 0.897

  • Docling-Smol performs excellently (0.856) - strong VLM approach

  • Heron + Tesseract underperforms (0.589) - layout detection less beneficial for single-column receipts

  • Receipt OCR benefits from dense text recognition capabilities


Pipeline Comparison Summary

Overall Performance by Document Type

| Pipeline           | Forms F1 | Receipts F1 | Avg F1 | Best For                                       |
|--------------------|----------|-------------|--------|------------------------------------------------|
| PaddleOCR          | 0.787    | 0.897       | 0.842  | General-purpose, best overall (when available) |
| Docling-Smol       | 0.728    | 0.856       | 0.792  | High precision, VLM capabilities               |
| Docling-Granite    | 0.728    | -           | 0.728  | Same forms scores as Smol in this run          |
| Heron + Tesseract  | 0.519    | 0.589       | 0.554  | Multi-column layouts, maximum recall           |
| RapidOCR           | 0.508    | 0.589       | 0.549  | Lightweight, fast, CPU-only                    |
| Unstructured       | 0.631    | -           | 0.631  | Multi-format documents                         |
| Baseline Tesseract | 0.542    | -           | 0.542  | Simple baseline                                |
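The Avg F1 column appears to be the unweighted mean of whichever category scores exist for a pipeline. A small sketch of that arithmetic, where `average_f1` is a hypothetical helper and `None` stands for a category that has not been run:

```python
def average_f1(scores):
    """Unweighted mean of the available per-category F1 scores.

    `None` marks a category with no result; returns None when no
    category has been run at all.
    """
    available = [s for s in scores if s is not None]
    if not available:
        return None
    return round(sum(available) / len(available), 3)


print(average_f1([0.787, 0.897]))  # PaddleOCR: forms and receipts
print(average_f1([0.728, None]))   # Docling-Granite: forms only
```

Two caveats follow from this. A pipeline with a single category simply repeats that score, so its average is not directly comparable with a two-category average. The mean is also not weighted by document count, even though the forms run used 20 documents and the receipts run used 50.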


Recommendations by Use Case

Production Systems

| Use Case                | Recommended Pipeline | F1 Score                    | Why                                                             |
|-------------------------|----------------------|-----------------------------|-----------------------------------------------------------------|
| Best Overall Accuracy   | PaddleOCR            | 0.787-0.897                 | Highest F1 in both benchmarked categories, balanced performance |
| High Precision Needs    | Docling-Smol/Granite | 0.728                       | Fewest false positives (Precision: 0.821)                       |
| Maximum Text Extraction | Heron + Tesseract    | 0.519 (F1) / 0.810 (Recall) | Finds 81% of all text - best when completeness matters          |
| VLM-Based Extraction    | Docling-Smol         | 0.728                       | Good for tables, formulas, structured documents                 |
| Lightweight/Embedded    | RapidOCR             | 0.508                       | Fast, minimal dependencies, CPU-only                            |

By Document Type

Forms (FUNSD-like):

  1. PaddleOCR (0.787) - Best overall

  2. Docling-Smol (0.728) - High precision alternative

  3. Heron + Tesseract (0.519 F1 / 0.810 Recall) - When completeness matters

Receipts (dense text):

  1. PaddleOCR (0.897) - Clear winner

  2. Docling-Smol (0.856) - Strong alternative

  3. RapidOCR (0.589) - Lightweight option

Academic Papers (multi-column):

  • Heron + Tesseract - Layout-aware reading order (pending full evaluation)

  • Docling-Granite - VLM layout understanding

By Constraint

CPU-Only (No GPU):

  • RapidOCR (F1: 0.508) - Fast and lightweight

  • Baseline Tesseract (F1: 0.542) - If available

Maximum Recall (Can’t miss text):

  • Heron + Tesseract (Recall: 0.810) - Finds 81% of all words

Minimum False Positives:

  • Docling-Smol (Precision: 0.821) - Cleanest output

Fastest Processing:

  • RapidOCR - Lightweight, optimized for speed

  • Processing time: ~1 second per document


Metric Interpretation Guide

F1 Score Targets

  • F1 ≥ 0.75: Excellent (production-ready)

  • F1 ≥ 0.65: Good (acceptable for many use cases)

  • F1 ≥ 0.50: Fair (may need improvement)

  • F1 < 0.50: Poor (needs work)
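The thresholds above can be applied mechanically when scanning a results file. A minimal sketch, where `f1_tier` is a hypothetical helper and not part of Biblicus:

```python
def f1_tier(f1: float) -> str:
    """Map an F1 score to the tier names used in this guide."""
    if not 0.0 <= f1 <= 1.0:
        raise ValueError(f"F1 must be between 0 and 1, got {f1}")
    if f1 >= 0.75:
        return "Excellent"
    if f1 >= 0.65:
        return "Good"
    if f1 >= 0.50:
        return "Fair"
    return "Poor"


# Forms F1 scores from the current run.
for name, score in [("PaddleOCR", 0.787), ("Docling-Smol", 0.728), ("RapidOCR", 0.508)]:
    print(f"{name}: {f1_tier(score)}")
```

The boundaries are inclusive, so a score of exactly 0.75 counts as Excellent.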

Current Standings

  • Excellent (≥0.75): PaddleOCR (0.787-0.897)

  • Good (≥0.65): Docling-Smol (0.728), Docling-Granite (0.728)

  • Fair (≥0.50): Unstructured (0.631), Baseline Tesseract (0.542), Heron + Tesseract (0.519), RapidOCR (0.508)

For detailed metric explanations, see the Metrics Reference.


Benchmark Reproducibility

Running These Benchmarks Yourself

To reproduce these results:

# 1. Install dependencies (optional extras as needed)
pip install -e .
pip install "biblicus[paddleocr]"  # For PaddleOCR
pip install "biblicus[docling]"     # For Docling VLMs
pip install "biblicus[ocr]"         # For RapidOCR
brew install tesseract              # For Tesseract-based pipelines (macOS)

# 2. Download FUNSD dataset
python scripts/download_funsd_samples.py

# 3. Run benchmark
python scripts/benchmark_all_pipelines.py \
  --corpus corpora/funsd_benchmark \
  --output results/my_benchmark.json

# 4. View results
cat results/my_benchmark.json | jq '.pipelines[] | {name, f1: .metrics.set_based.avg_f1}'

For detailed instructions, see the Quickstart Guide.
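The jq filter in step 4 can also be written in Python, which is convenient for sorting or further analysis. This is a sketch only: it assumes the JSON layout implied by that filter (a top-level `pipelines` list whose entries carry `name` and `metrics.set_based.avg_f1`), and `load_results` and `rank_pipelines` are hypothetical helper names.

```python
import json
from pathlib import Path


def load_results(path: str) -> dict:
    """Read a benchmark results file such as results/my_benchmark.json."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def rank_pipelines(results: dict) -> list[tuple[str, float]]:
    """Return (name, avg_f1) pairs sorted from best to worst F1.

    Assumes the layout implied by the jq filter:
    results["pipelines"][i]["metrics"]["set_based"]["avg_f1"].
    """
    rows = [
        (p["name"], p["metrics"]["set_based"]["avg_f1"])
        for p in results["pipelines"]
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def format_ranking(results: dict) -> str:
    """Render the ranking as one numbered line per pipeline."""
    return "\n".join(
        f"{rank}. {name}: {f1:.3f}"
        for rank, (name, f1) in enumerate(rank_pipelines(results), start=1)
    )
```

Typical use would be `print(format_ranking(load_results("results/my_benchmark.json")))` after step 3 has written the file.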

Benchmark Configuration

Current run used:

  • Mode: Quick (20 documents)

  • Dataset: FUNSD test set

  • Pipelines: All available with installed dependencies

  • Metrics: F1, Precision, Recall, WER, Sequence Accuracy, Bigram/Trigram overlap

Available modes:

  • quick.yaml - 5-10 minutes (20 forms, 50 receipts)

  • standard.yaml - 30-60 minutes (50 forms, 100 receipts)

  • full.yaml - 2-4 hours (all 199 forms, all 626 receipts)


Understanding Pipeline Trade-offs

Accuracy vs. Speed

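| Pipeline     | F1 Score | Speed  | Trade-off                                                      |
|--------------|----------|--------|----------------------------------------------------------------|
| PaddleOCR    | 0.787    | Medium | Best balance                                                   |
| Docling-Smol | 0.728    | Slow   | Slower and less accurate than PaddleOCR, but adds VLM features |
| RapidOCR     | 0.508    | Fast   | Gives up accuracy for speed                                    |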

Precision vs. Recall

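| Pipeline          | Precision | Recall | Trade-off                        |
|-------------------|-----------|--------|----------------------------------|
| Docling-Smol      | 0.821     | 0.675  | Clean output, may miss text      |
| Heron + Tesseract | 0.384     | 0.810  | Finds everything, includes noise |
| PaddleOCR         | 0.792     | 0.782  | Balanced                         |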

Choose high precision when: False positives are expensive (indexing, search quality)

Choose high recall when: Missing content is expensive (legal, compliance)
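To make the trade-off concrete, here is a minimal sketch of set-based scoring. It assumes the metrics compare the sets of unique, lowercased words in the ground truth and the extracted text; the actual Biblicus evaluation may tokenize or normalize differently, and `set_based_scores` is a hypothetical helper.

```python
def set_based_scores(reference: str, hypothesis: str):
    """Return (precision, recall, f1) over unique lowercase words."""
    ref = set(reference.lower().split())
    hyp = set(hypothesis.lower().split())
    true_positives = len(ref & hyp)
    precision = true_positives / len(hyp) if hyp else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


truth = "invoice date total amount"

# Cautious extractor: everything it outputs is right, but it misses half the words.
print(set_based_scores(truth, "invoice total"))

# Greedy extractor: finds every word, but half of its output is noise.
print(set_based_scores(truth, "invoice date total amount x1 x2 x3 x4"))
```

Both extractors end up with the same F1 (about 0.667) from opposite profiles, which is why the tables report precision and recall separately. The reported F1 values also need not equal the harmonic mean of the reported precision and recall: for Docling-Smol that mean would be about 0.741 rather than 0.728, which is what one would expect if F1 is averaged per document, as the `avg_f1` key in the results file suggests.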


Known Issues and Limitations

Current Benchmark Run

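Missing Pipelines:

  • Heron + Tesseract - Not included in this run (requires a Tesseract installation); its figures on this page come from a previous benchmark

Dependency Notes (all were installed and working for this run):

  • PaddleOCR - Requires pip install "biblicus[paddleocr]"

  • Tesseract-based pipelines - Requires brew install tesseract (macOS)

  • Unstructured - An earlier run extracted text but failed at evaluation (text directory issue); evaluation succeeded in this run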

Next Steps:

  • Install all optional dependencies

  • Rerun full benchmark on complete dataset (199 documents)

  • Add SROIE receipts benchmark

  • Add academic papers benchmark (dataset pending)

General Limitations

Dataset Coverage:

  • Forms: ✓ FUNSD available

  • Receipts: ✓ SROIE available (not rerun in current test)

  • Academic Papers: ✗ Dataset pending

Pipeline Coverage:

  • Missing: MarkItDown, additional layout-aware combinations

  • Incomplete: Entity-level evaluation for receipts


Future Work

Planned Updates

  1. Rerun with All Dependencies:

    • Install Tesseract, PaddleOCR

    • Full 199-document FUNSD benchmark

    • Include all 8+ pipelines

  2. Add SROIE Receipts:

    • Rerun current benchmark

    • Add entity-level metrics

    • Test all pipelines on receipts

  3. Academic Papers Category:

    • Find suitable dataset (Scanned ArXiv or PubLayNet)

    • Focus on reading order metrics (LCS, bigram)

    • Test layout-aware pipelines

  4. Additional Pipelines:

    • MarkItDown extraction

    • More layout-aware combinations

    • Custom pipeline examples

Contributing

To contribute benchmark results:

  1. Run benchmarks following the Quickstart Guide

  2. Share results in GitHub issues

  3. Document your test environment

  4. Include pipeline configurations


See Also

Source Code:

  • Benchmark scripts: scripts/benchmark_*.py

  • Evaluation module: src/biblicus/evaluation/

  • Pipeline configs: configs/*.yaml