Biblicus Benchmark Results

Current benchmark results and recommendations for choosing extraction pipelines.

Last Updated: 2026-02-13
Benchmark Run: Quick-mode benchmark on FUNSD dataset (20 documents, all 6 pipelines)

Related Documentation:

Benchmarking Overview - Platform introduction

Pipeline Catalog - Detailed pipeline information

Metrics Reference - Understanding the metrics

Quickstart Guide - Run your own benchmarks

Executive Summary

This page presents current benchmark results from Biblicus extraction pipelines. The benchmarks evaluate:

Forms (FUNSD): Scanned form documents with noise and handwriting
Receipts (SROIE): Dense receipt text with entity extraction (previous results)
Academic Papers: Multi-column layouts (dataset pending)

Current Test Environment

Test Date: February 13, 2026
Dataset: FUNSD (Form Understanding in Noisy Scanned Documents)
Documents: 20 scanned forms
Pipelines Tested: 6 (6 successful, 0 failed) ✅

Forms Category (FUNSD) - Latest Results

Dataset: FUNSD - 199 scanned form documents with word-level ground truth
Challenge: Noisy scans, handwriting, field extraction
Primary Metric: F1 Score

Results (20 documents - Feb 13, 2026)

Rank	Pipeline	F1 Score	Precision	Recall	WER	Bigram	Status
1	PaddleOCR	0.787 ⭐	~0.792	0.782	0.533	0.466	✓
2	Docling-Smol	0.728	0.821	0.675	0.645	0.430	✓
3	Docling-Granite	0.728	0.821	0.675	0.645	0.430	✓
4	Unstructured	0.631	~0.673	0.597	0.608	0.368	✓
5	Baseline Tesseract	0.542	~0.616	0.510	0.687	0.272	✓
6	RapidOCR	0.508	~0.568	0.468	0.748	0.206	✓

All pipelines successful! ✅ Dependencies (Tesseract, PaddleOCR, Docling, Unstructured, RapidOCR) all installed and working.
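As a sanity check on the table above, F1 can be recomputed as the harmonic mean of precision and recall. The sketch below does that for two rows. The helper name `f1_from_pr` is illustrative and not part of Biblicus. It also assumes the reported F1 is averaged per document (the reproduction command reads a field named `avg_f1`), in which case it need not equal the harmonic mean of the averaged precision and recall columns.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Averaged precision/recall values copied from the FUNSD results table.
rows = {
    "PaddleOCR": (0.792, 0.782),
    "Docling-Smol": (0.821, 0.675),
}

for name, (p, r) in rows.items():
    print(f"{name}: {f1_from_pr(p, r):.3f}")
# PaddleOCR: 0.787
# Docling-Smol: 0.741
```

PaddleOCR's recomputed value matches the reported 0.787. Docling-Smol's comes out at 0.741 against a reported 0.728, which is what per-document averaging would produce when precision and recall vary from page to page.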

Key Findings - Forms (Current Run)

PaddleOCR (Winner):

Best F1 score (0.787) - clear winner
Best recall (0.782) - finds most text
Lowest WER (0.533) - best reading order
Best bigram overlap (0.466) - best local ordering
Excellent all-around performance

Docling VLMs (Smol & Granite):

Tied for 2nd place (F1: 0.728)
Highest precision (0.821) - fewest false positives
Identical scores on every metric suggest the two variants behave the same on this sample (worth confirming that each model is actually loaded)
Good for clean, accurate extraction

Unstructured:

Solid mid-tier performance (F1: 0.631)
Multi-format support beyond just OCR
Good balance of precision and recall

Baseline Tesseract:

Simple baseline (F1: 0.542)
Fast processing, minimal dependencies
Acceptable for straightforward documents

RapidOCR:

Lightweight alternative (F1: 0.508)
Fastest processing
Good for resource-constrained environments

Comparison with Historical Results

Note: Previous benchmarks included Heron + Tesseract, which achieved the highest recall but wasn’t included in this run:

Pipeline	F1 Score	Precision	Recall	WER	Bigram	Notes
PaddleOCR	0.787	0.792	0.782	0.533	0.466	✓ Confirmed Feb 2026
Docling-Smol	0.728	0.821	0.675	0.645	0.430	✓ Confirmed Feb 2026
Docling-Granite	0.728	0.821	0.675	0.645	0.430	✓ Confirmed Feb 2026
Unstructured	0.631	0.673	0.597	0.608	0.368	✓ Confirmed Feb 2026
Baseline Tesseract	0.542	0.616	0.510	0.687	0.272	✓ Confirmed Feb 2026
RapidOCR	0.508	0.568	0.468	0.748	0.206	✓ Confirmed Feb 2026
Heron + Tesseract	0.519	0.384	0.810	5.324	0.561	Previous result - highest recall
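A WER of 5.324 may look like a typo, but word error rate is not capped at 1. Assuming the standard definition (substitutions + deletions + insertions, divided by the number of reference words), a pipeline that emits many extra words can score well above 1.0, which fits Heron + Tesseract's high recall and low precision. The sketch below implements that definition; `word_error_rate` is an illustrative helper, not the Biblicus implementation, and the sample strings are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Two reference words, four spurious insertions: 4 / 2 = 2.0
print(word_error_rate("total due", "total x due y z q"))
```

Every reference word is found in this example (recall would be perfect), yet the WER is 2.0 because the extra words count as insertions.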

Receipts Category (SROIE) - Previous Results

Dataset: SROIE - 626 receipt images with OCR text ground truth
Challenge: Dense text, small fonts, entity extraction
Primary Metric: F1 Score

Results (50 documents - Previous Benchmark)

Rank	Pipeline	F1 Score	Precision	Recall	WER	LCS Ratio
1	PaddleOCR	0.897	-	-	-	-
2	Docling-Smol	0.856	-	-	-	-
3	Heron + Tesseract	0.589	0.509	0.756	2.808	0.677
4	RapidOCR	0.589	-	-	-	-

Key Findings - Receipts

PaddleOCR dominates on receipts with F1 of 0.897
Docling-Smol performs excellently (0.856) - strong VLM approach
Heron + Tesseract underperforms (0.589) - layout detection less beneficial for single-column receipts
Receipt OCR benefits from dense text recognition capabilities

Pipeline Comparison Summary

Overall Performance by Document Type

Pipeline	Forms F1	Receipts F1	Avg F1	Best For
PaddleOCR	0.787	0.897	0.842	General-purpose, best overall (when available)
Docling-Smol	0.728	0.856	0.792	High precision, VLM capabilities
Docling-Granite	0.728	-	0.728	Matches Smol on forms; not yet tested on receipts
Heron + Tesseract	0.519	0.589	0.554	Multi-column layouts, maximum recall
RapidOCR	0.508	0.589	0.549	Lightweight, fast, CPU-only
Unstructured	0.631	-	0.631	Multi-format documents
Baseline Tesseract	0.542	-	0.542	Simple baseline
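The Avg F1 column appears to be the mean of whichever category scores exist for a pipeline. A minimal sketch of that calculation follows, assuming a missing category (shown as "-") is skipped rather than counted as zero; `average_f1` is an illustrative name.

```python
from statistics import mean
from typing import Optional


def average_f1(forms: Optional[float], receipts: Optional[float]) -> Optional[float]:
    """Mean of the available category scores; None if neither was measured."""
    scores = [s for s in (forms, receipts) if s is not None]
    return mean(scores) if scores else None


# (Forms F1, Receipts F1) copied from the table; None marks an untested category.
scores = {
    "PaddleOCR": (0.787, 0.897),
    "Docling-Smol": (0.728, 0.856),
    "Docling-Granite": (0.728, None),
    "Heron + Tesseract": (0.519, 0.589),
}

for name, (forms, receipts) in scores.items():
    print(f"{name}: {average_f1(forms, receipts):.3f}")
```

Under this assumption, a pipeline tested on forms only has an average equal to its forms score, so its Avg F1 is not directly comparable with pipelines tested on both categories.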

Recommendations by Use Case

Production Systems

Use Case	Recommended Pipeline	F1 Score	Why
Best Overall Accuracy	PaddleOCR	0.787-0.897	Highest F1 across all categories, balanced performance
High Precision Needs	Docling-Smol/Granite	0.728	Fewest false positives (Precision: 0.821)
Maximum Text Extraction	Heron + Tesseract	0.519 (F1) / 0.810 (Recall)	Finds 81% of all text - best when completeness matters
VLM-Based Extraction	Docling-Smol	0.728	Good for tables, formulas, structured documents
Lightweight/Embedded	RapidOCR	0.508	Fast, minimal dependencies, CPU-only

By Document Type

Forms (FUNSD-like):

PaddleOCR (0.787) - Best overall
Docling-Smol (0.728) - High precision alternative
Heron + Tesseract (0.519 F1 / 0.810 Recall) - When completeness matters

Receipts (dense text):

PaddleOCR (0.897) - Clear winner
Docling-Smol (0.856) - Strong alternative
RapidOCR (0.589) - Lightweight option

Academic Papers (multi-column):

Heron + Tesseract - Layout-aware reading order (pending full evaluation)
Docling-Granite - VLM layout understanding

By Constraint

CPU-Only (No GPU):

RapidOCR (F1: 0.508) - Fast and lightweight
Baseline Tesseract (F1: 0.542) - If Tesseract is installed

Maximum Recall (Can’t miss text):

Heron + Tesseract (Recall: 0.810) - Finds 81% of all words

Minimum False Positives:

Docling-Smol (Precision: 0.821) - Cleanest output

Fastest Processing:

RapidOCR - Lightweight, optimized for speed
Processing time: ~1 second per document

Metric Interpretation Guide

F1 Score Targets

F1 ≥ 0.75: Excellent (production-ready)
F1 ≥ 0.65: Good (acceptable for many use cases)
F1 ≥ 0.50: Fair (may need improvement)
F1 < 0.50: Poor (needs work)
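The thresholds above read as a cascade: a score falls into the first tier whose lower bound it meets. A small sketch of that mapping, with `f1_tier` as an illustrative helper name:

```python
def f1_tier(f1: float) -> str:
    """Map an F1 score to the tier labels used in this guide."""
    if not 0.0 <= f1 <= 1.0:
        raise ValueError("F1 must be between 0 and 1")
    if f1 >= 0.75:
        return "Excellent"
    if f1 >= 0.65:
        return "Good"
    if f1 >= 0.50:
        return "Fair"
    return "Poor"


print(f1_tier(0.787))  # Excellent
print(f1_tier(0.728))  # Good
print(f1_tier(0.508))  # Fair
```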

Current Standings

Excellent (≥0.75): PaddleOCR (0.787-0.897)
Good (≥0.65): Docling-Smol (0.728), Docling-Granite (0.728)
Fair (≥0.50): Unstructured (0.631), Baseline Tesseract (0.542), Heron + Tesseract (0.519), RapidOCR (0.508)

For detailed metric explanations, see the Metrics Reference.

Benchmark Reproducibility

Running These Benchmarks Yourself

To reproduce these results:

# 1. Install dependencies (optional extras as needed)
poetry install --extras "paddleocr docling ocr"
brew install tesseract              # For Tesseract-based pipelines (macOS)

# 2. Download FUNSD dataset
python scripts/download_funsd_samples.py

# 3. Run benchmark
python scripts/benchmark_all_pipelines.py \
  --corpus corpora/funsd_benchmark \
  --output results/my_benchmark.json

# 4. View results
cat results/my_benchmark.json | jq '.pipelines[] | {name, f1: .metrics.set_based.avg_f1}'

For detailed instructions, see the Quickstart Guide.
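If you prefer Python to jq for step 4, the same ranking can be sketched as below. The JSON shape is inferred only from the jq filter above (a top-level `pipelines` list whose entries have `name` and `metrics.set_based.avg_f1`); any other fields in the real file are unknown here, and the function names and sample data are illustrative.

```python
import json
from pathlib import Path


def load_results(path: str) -> dict:
    """Read a benchmark results file written by step 3."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def rank_by_f1(results: dict) -> list[tuple[str, float]]:
    """Return (pipeline name, avg F1) pairs, best first."""
    rows = [
        (p["name"], p["metrics"]["set_based"]["avg_f1"])
        for p in results["pipelines"]
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# In-memory sample with the assumed shape; replace with
# load_results("results/my_benchmark.json") after running a benchmark.
sample = {
    "pipelines": [
        {"name": "rapidocr", "metrics": {"set_based": {"avg_f1": 0.508}}},
        {"name": "paddleocr", "metrics": {"set_based": {"avg_f1": 0.787}}},
    ]
}

for rank, (name, f1) in enumerate(rank_by_f1(sample), start=1):
    print(f"{rank}. {name}: {f1:.3f}")
```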

Benchmark Configuration

Current run used:

Mode: Quick (20 documents)
Dataset: FUNSD test set
Pipelines: All available with installed dependencies
Metrics: F1, Precision, Recall, WER, Sequence Accuracy, Bigram/Trigram overlap

Available modes:

quick.yaml - 5-10 minutes (20 forms, 50 receipts)
standard.yaml - 30-60 minutes (50 forms, 100 receipts)
full.yaml - 2-4 hours (all 199 forms, all 626 receipts)

Understanding Pipeline Trade-offs

Accuracy vs. Speed

Pipeline	F1 Score	Speed	Trade-off
PaddleOCR	0.787	Medium	Best balance
Docling-Smol	0.728	Slow	Accuracy for VLM features
RapidOCR	0.508	Fast	Speed for accuracy

Precision vs. Recall

Pipeline	Precision	Recall	Trade-off
Docling-Smol	0.821	0.675	Clean output, may miss text
Heron + Tesseract	0.384	0.810	Finds everything, includes noise
PaddleOCR	0.792	0.782	Balanced

Choose high precision when: False positives are expensive (indexing, search quality)
Choose high recall when: Missing content is expensive (legal, compliance)
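One way to turn this trade-off into a selection rule is to maximize the metric you care about while enforcing a floor on the other. The sketch below uses the three rows from the table above; `pick_pipeline` and its `floor` parameter are hypothetical helpers, not a Biblicus API.

```python
# Precision and recall copied from the Precision vs. Recall table.
PIPELINES = {
    "Docling-Smol": {"precision": 0.821, "recall": 0.675},
    "Heron + Tesseract": {"precision": 0.384, "recall": 0.810},
    "PaddleOCR": {"precision": 0.792, "recall": 0.782},
}


def pick_pipeline(priority: str, floor: float = 0.0) -> str:
    """Pick the pipeline with the best `priority` metric whose other
    metric is at least `floor`."""
    if priority not in ("precision", "recall"):
        raise ValueError("priority must be 'precision' or 'recall'")
    other = "recall" if priority == "precision" else "precision"
    candidates = {
        name: m for name, m in PIPELINES.items() if m[other] >= floor
    }
    if not candidates:
        raise ValueError("no pipeline meets the floor")
    return max(candidates, key=lambda name: candidates[name][priority])


print(pick_pipeline("precision"))          # Docling-Smol
print(pick_pipeline("recall"))             # Heron + Tesseract
print(pick_pipeline("recall", floor=0.5))  # PaddleOCR
```

The last call shows why a floor matters: once precision must be at least 0.5, the recall-first choice shifts from Heron + Tesseract to PaddleOCR.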

Known Issues and Limitations

Current Benchmark Run

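Pipelines Not Included:

Heron + Tesseract - Not part of this run; the figures shown for it come from a previous benchmark

Setup Notes:

PaddleOCR - Requires pip install "biblicus[paddleocr]"
Tesseract-based pipelines (Baseline Tesseract, Heron + Tesseract) - Require a Tesseract installation (brew install tesseract on macOS)
Unstructured - An earlier run extracted text but failed evaluation (text directory issue); it evaluated successfully in this run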

Next Steps:

Re-include Heron + Tesseract, which was not part of this run
Rerun full benchmark on complete dataset (199 documents)
Add SROIE receipts benchmark
Add academic papers benchmark (dataset pending)

General Limitations

Dataset Coverage:

Forms: ✓ FUNSD available
Receipts: ✓ SROIE available (not rerun in current test)
Academic Papers: ✗ Dataset pending

Pipeline Coverage:

Missing: MarkItDown, additional layout-aware combinations
Incomplete: Entity-level evaluation for receipts

Future Work

Planned Updates

Rerun with All Dependencies:

Install Tesseract, PaddleOCR
Full 199-document FUNSD benchmark
Include all 8+ pipelines

Add SROIE Receipts:

Rerun current benchmark
Add entity-level metrics
Test all pipelines on receipts

Academic Papers Category:

Find suitable dataset (Scanned ArXiv or PubLayNet)
Focus on reading order metrics (LCS, bigram)
Test layout-aware pipelines

Additional Pipelines:

MarkItDown extraction
More layout-aware combinations
Custom pipeline examples

Contributing

To contribute benchmark results:

Run benchmarks following the Quickstart Guide
Share results in GitHub issues
Document your test environment
Include pipeline configurations