Biblicus Benchmark Results
Current benchmark results and recommendations for choosing extraction pipelines.
Last Updated: 2026-02-13
Benchmark Run: Quick benchmark on FUNSD dataset (20 documents)
Related Documentation:
Benchmarking Overview - Platform introduction
Pipeline Catalog - Detailed pipeline information
Metrics Reference - Understanding the metrics
Quickstart Guide - Run your own benchmarks
Executive Summary
This page presents current benchmark results from Biblicus extraction pipelines. The benchmarks evaluate:
Forms (FUNSD): Scanned form documents with noise and handwriting
Receipts (SROIE): Dense receipt text with entity extraction (previous results)
Academic Papers: Multi-column layouts (dataset pending)
Current Test Environment
Test Date: February 13, 2026
Dataset: FUNSD (Form Understanding in Noisy Scanned Documents)
Documents: 20 scanned forms
Pipelines Tested: 6 (6 successful, 0 failed) ✅
Forms Category (FUNSD) - Latest Results
Dataset: FUNSD - 199 scanned form documents with word-level ground truth
Challenge: Noisy scans, handwriting, field extraction
Primary Metric: F1 Score
Results (20 documents - Feb 13, 2026)
| Rank | Pipeline | F1 Score | Precision | Recall | WER | Bigram | Status |
|---|---|---|---|---|---|---|---|
| 1 | PaddleOCR | 0.787 ⭐ | ~0.792 | 0.782 | 0.533 | 0.466 | ✓ |
| 2 | Docling-Smol | 0.728 | 0.821 | 0.675 | 0.645 | 0.430 | ✓ |
| 3 | Docling-Granite | 0.728 | 0.821 | 0.675 | 0.645 | 0.430 | ✓ |
| 4 | Unstructured | 0.631 | ~0.673 | 0.597 | 0.608 | 0.368 | ✓ |
| 5 | Baseline Tesseract | 0.542 | ~0.616 | 0.510 | 0.687 | 0.272 | ✓ |
| 6 | RapidOCR | 0.508 | ~0.568 | 0.468 | 0.748 | 0.206 | ✓ |
All pipelines successful! ✅ Dependencies (Tesseract, PaddleOCR, Docling, Unstructured, RapidOCR) all installed and working.
Key Findings - Forms (Current Run)
PaddleOCR (Winner):
Best F1 score (0.787) - clear winner
Best recall (0.782) - finds most text
Lowest WER (0.533) - best reading order
Best bigram overlap (0.466) - best local ordering
Excellent all-around performance
Docling VLMs (Smol & Granite):
Tied for 2nd place (F1: 0.728)
Highest precision (0.821) - fewest false positives
Identical scores across all metrics suggest the two configurations share an extraction path on these documents
Good for clean, accurate extraction
Unstructured:
Solid mid-tier performance (F1: 0.631)
Multi-format support beyond just OCR
Good balance of precision and recall
Baseline Tesseract:
Simple baseline (F1: 0.542)
Fast processing, minimal dependencies
Acceptable for straightforward documents
RapidOCR:
Lightweight alternative (F1: 0.508)
Fastest processing
Good for resource-constrained environments
Comparison with Historical Results
Note: Previous benchmarks included Heron + Tesseract, which achieved the highest recall but wasn’t included in this run:
| Pipeline | F1 Score | Precision | Recall | WER | Bigram | Notes |
|---|---|---|---|---|---|---|
| PaddleOCR | 0.787 | 0.792 | 0.782 | 0.533 | 0.466 | ✓ Confirmed Feb 2026 |
| Docling-Smol | 0.728 | 0.821 | 0.675 | 0.645 | 0.430 | ✓ Confirmed Feb 2026 |
| Docling-Granite | 0.728 | 0.821 | 0.675 | 0.645 | 0.430 | ✓ Confirmed Feb 2026 |
| Unstructured | 0.631 | 0.673 | 0.597 | 0.608 | 0.368 | ✓ Confirmed Feb 2026 |
| Baseline Tesseract | 0.542 | 0.616 | 0.510 | 0.687 | 0.272 | ✓ Confirmed Feb 2026 |
| RapidOCR | 0.508 | 0.568 | 0.468 | 0.748 | 0.206 | ✓ Confirmed Feb 2026 |
| Heron + Tesseract | 0.519 | 0.384 | 0.810 | 5.324 | 0.561 | Previous result - highest recall |
Receipts Category (SROIE) - Previous Results
Dataset: SROIE - 626 receipt images with OCR text ground truth
Challenge: Dense text, small fonts, entity extraction
Primary Metric: F1 Score
Results (50 documents - Previous Benchmark)
| Rank | Pipeline | F1 Score | Precision | Recall | WER | LCS Ratio |
|---|---|---|---|---|---|---|
| 1 | PaddleOCR | 0.897 | - | - | - | - |
| 2 | Docling-Smol | 0.856 | - | - | - | - |
| 3 | Heron + Tesseract | 0.589 | 0.509 | 0.756 | 2.808 | 0.677 |
| 4 | RapidOCR | 0.589 | - | - | - | - |
Key Findings - Receipts
PaddleOCR dominates on receipts with F1 of 0.897
Docling-Smol performs excellently (0.856) - strong VLM approach
Heron + Tesseract underperforms (0.589) - layout detection less beneficial for single-column receipts
Receipt OCR benefits from dense text recognition capabilities
Pipeline Comparison Summary
Overall Performance by Document Type
| Pipeline | Forms F1 | Receipts F1 | Avg F1 | Best For |
|---|---|---|---|---|
| PaddleOCR | 0.787 | 0.897 | 0.842 | General-purpose, best overall (when available) |
| Docling-Smol | 0.728 | 0.856 | 0.792 | High precision, VLM capabilities |
| Docling-Granite | 0.728 | - | 0.728 | Similar to Smol; identical scores in this run |
| Heron + Tesseract | 0.519 | 0.589 | 0.554 | Multi-column layouts, maximum recall |
| RapidOCR | 0.508 | 0.589 | 0.549 | Lightweight, fast, CPU-only |
| Unstructured | 0.631 | - | 0.631 | Multi-format documents |
| Baseline Tesseract | 0.542 | - | 0.542 | Simple baseline |
Recommendations by Use Case
Production Systems
| Use Case | Recommended Pipeline | F1 Score | Why |
|---|---|---|---|
| Best Overall Accuracy | PaddleOCR | 0.787-0.897 | Highest F1 across all categories, balanced performance |
| High Precision Needs | Docling-Smol/Granite | 0.728 | Fewest false positives (Precision: 0.821) |
| Maximum Text Extraction | Heron + Tesseract | 0.519 (F1) / 0.810 (Recall) | Finds 81% of all text - best when completeness matters |
| VLM-Based Extraction | Docling-Smol | 0.728 | Good for tables, formulas, structured documents |
| Lightweight/Embedded | RapidOCR | 0.508 | Fast, minimal dependencies, CPU-only |
By Document Type
Forms (FUNSD-like):
PaddleOCR (0.787) - Best overall
Docling-Smol (0.728) - High precision alternative
Heron + Tesseract (0.519 F1 / 0.810 Recall) - When completeness matters
Receipts (dense text):
PaddleOCR (0.897) - Clear winner
Docling-Smol (0.856) - Strong alternative
RapidOCR (0.589) - Lightweight option
Academic Papers (multi-column):
Heron + Tesseract - Layout-aware reading order (pending full evaluation)
Docling-Granite - VLM layout understanding
By Constraint
CPU-Only (No GPU):
RapidOCR (F1: 0.508) - Fast and lightweight
Baseline Tesseract (F1: 0.542) - If available
Maximum Recall (Can’t miss text):
Heron + Tesseract (Recall: 0.810) - Finds 81% of all words
Minimum False Positives:
Docling-Smol (Precision: 0.821) - Cleanest output
Fastest Processing:
RapidOCR - Lightweight, optimized for speed
Processing time: ~1 second per document
Metric Interpretation Guide
F1 Score Targets
F1 ≥ 0.75: Excellent (production-ready)
F1 ≥ 0.65: Good (acceptable for many use cases)
F1 ≥ 0.50: Fair (may need improvement)
F1 < 0.50: Poor (needs work)
Current Standings
Excellent (≥0.75): PaddleOCR (0.787-0.897)
Good (≥0.65): Docling-Smol (0.728), Docling-Granite (0.728)
Fair (≥0.50): Unstructured (0.631), Baseline Tesseract (0.542), Heron + Tesseract (0.519), RapidOCR (0.508)
For detailed metric explanations, see the Metrics Reference.
Benchmark Reproducibility
Running These Benchmarks Yourself
To reproduce these results:
```shell
# 1. Install dependencies (optional extras as needed)
pip install -e .
pip install "biblicus[paddleocr]"  # For PaddleOCR
pip install "biblicus[docling]"    # For Docling VLMs
pip install "biblicus[ocr]"        # For RapidOCR
brew install tesseract             # For Tesseract-based pipelines (macOS)

# 2. Download FUNSD dataset
python scripts/download_funsd_samples.py

# 3. Run benchmark
python scripts/benchmark_all_pipelines.py \
    --corpus corpora/funsd_benchmark \
    --output results/my_benchmark.json

# 4. View results
cat results/my_benchmark.json | jq '.pipelines[] | {name, f1: .metrics.set_based.avg_f1}'
```
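The jq one-liner above can also be done in Python. A minimal sketch, assuming the results file follows the schema implied by the jq query (`pipelines[].metrics.set_based.avg_f1`):

```python
import json

def rank_pipelines(results: dict) -> list[tuple[str, float]]:
    """Sort pipelines by set-based average F1, best first.

    Assumes the schema implied by the jq query above:
    {"pipelines": [{"name": ..., "metrics": {"set_based": {"avg_f1": ...}}}]}
    """
    rows = [(p["name"], p["metrics"]["set_based"]["avg_f1"])
            for p in results["pipelines"]]
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Typical use:
#   with open("results/my_benchmark.json") as fh:
#       for name, f1 in rank_pipelines(json.load(fh)):
#           print(f"{name:20s} F1={f1:.3f}")
```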
For detailed instructions, see the Quickstart Guide.
Benchmark Configuration
Current run used:
Mode: Quick (20 documents)
Dataset: FUNSD test set
Pipelines: All available with installed dependencies
Metrics: F1, Precision, Recall, WER, Sequence Accuracy, Bigram/Trigram overlap
Available modes:
`quick.yaml` - 5-10 minutes (20 forms, 50 receipts)
`standard.yaml` - 30-60 minutes (50 forms, 100 receipts)
`full.yaml` - 2-4 hours (all 199 forms, all 626 receipts)
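Two of the reported metrics can be sketched in a few lines of Python. These are minimal reference implementations for intuition only, not necessarily the exact normalization Biblicus uses: WER is word-level edit distance divided by reference length (so it can exceed 1.0, as in Heron's 5.324), and bigram overlap measures how much local word ordering survives extraction.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return float(len(hyp))
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # delete a reference word
                           cur[j - 1] + 1,       # insert a hypothesis word
                           prev[j - 1] + cost))  # substitute
        prev = cur
    return prev[-1] / len(ref)

def bigram_overlap(reference: str, hypothesis: str) -> float:
    """Fraction of reference word bigrams also present in the hypothesis."""
    def bigrams(text: str) -> set[tuple[str, str]]:
        words = text.split()
        return set(zip(words, words[1:]))
    ref_bi = bigrams(reference)
    if not ref_bi:
        return 0.0
    return len(ref_bi & bigrams(hypothesis)) / len(ref_bi)
```

Note that a pipeline that hallucinates extra text can push WER well above 1.0 even while its recall stays high, which is exactly the Heron + Tesseract pattern in the historical table.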
Understanding Pipeline Trade-offs
Accuracy vs. Speed
| Pipeline | F1 Score | Speed | Trade-off |
|---|---|---|---|
| PaddleOCR | 0.787 | Medium | Best balance |
| Docling-Smol | 0.728 | Slow | Accuracy for VLM features |
| RapidOCR | 0.508 | Fast | Speed for accuracy |
Precision vs. Recall
| Pipeline | Precision | Recall | Trade-off |
|---|---|---|---|
| Docling-Smol | 0.821 | 0.675 | Clean output, may miss text |
| Heron + Tesseract | 0.384 | 0.810 | Finds everything, includes noise |
| PaddleOCR | 0.792 | 0.782 | Balanced |
Choose high precision when: False positives are expensive (indexing, search quality)
Choose high recall when: Missing content is expensive (legal, compliance)
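F1 is the harmonic mean of precision and recall, which punishes imbalance. Plugging Heron + Tesseract's aggregate precision (0.384) and recall (0.810) into the formula gives about 0.521, close to the reported 0.519 (the small gap is plausibly because the reported score averages per-document F1 rather than aggregate counts). A balanced pipeline with the same arithmetic mean scores noticeably higher:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Heron + Tesseract: high recall cannot offset low precision.
print(round(f1(0.384, 0.810), 3))  # 0.521 from the aggregate numbers
# A balanced pipeline at the same arithmetic mean (0.597) keeps its F1:
print(round(f1(0.597, 0.597), 3))  # 0.597
```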
Known Issues and Limitations
Current Benchmark Run
Missing Pipelines:
Heron + Tesseract - Requires Tesseract installation; not included in this run
Resolved since earlier runs (all six tested pipelines succeeded this time):
PaddleOCR - Requires `pip install "biblicus[paddleocr]"`
Tesseract-based pipelines - Require `brew install tesseract` (macOS)
Unstructured - Extraction succeeded but evaluation failed (text directory issue)
Next Steps:
Install all optional dependencies
Rerun full benchmark on complete dataset (199 documents)
Add SROIE receipts benchmark
Add academic papers benchmark (dataset pending)
General Limitations
Dataset Coverage:
Forms: ✓ FUNSD available
Receipts: ✓ SROIE available (not rerun in current test)
Academic Papers: ✗ Dataset pending
Pipeline Coverage:
Missing: MarkItDown, additional layout-aware combinations
Incomplete: Entity-level evaluation for receipts
Future Work
Planned Updates
Rerun with All Dependencies:
Install Tesseract, PaddleOCR
Full 199-document FUNSD benchmark
Include all 8+ pipelines
Add SROIE Receipts:
Rerun the receipts benchmark with the current pipeline set
Add entity-level metrics
Test all pipelines on receipts
Academic Papers Category:
Find suitable dataset (Scanned ArXiv or PubLayNet)
Focus on reading order metrics (LCS, bigram)
Test layout-aware pipelines
Additional Pipelines:
MarkItDown extraction
More layout-aware combinations
Custom pipeline examples
Contributing
To contribute benchmark results:
Run benchmarks following the Quickstart Guide
Share results in GitHub issues
Document your test environment
Include pipeline configurations
See Also
Benchmarking Documentation:
Benchmarking Overview - Platform introduction
Quickstart Guide - Run your own benchmarks
Pipeline Catalog - All available pipelines
Metrics Reference - Understanding metrics
OCR Benchmarking Guide - Detailed how-to
Document Understanding Framework - Architecture
Implementation Details:
Heron Implementation - Layout detection specifics
Layout-Aware OCR Results - Detailed analysis
Source Code:
Benchmark scripts: `scripts/benchmark_*.py`
Evaluation module: `src/biblicus/evaluation/`
Pipeline configs: `configs/*.yaml`