Biblicus Document Understanding Benchmark
Architecture and design documentation for the multi-category benchmark framework.
Looking for practical instructions? See the Quickstart Guide or OCR Benchmarking Guide. This document covers the architectural design of the multi-category framework.
The Biblicus Document Understanding Benchmark evaluates OCR and document extraction pipelines across diverse document types. Rather than testing on a single dataset, the benchmark measures performance across three distinct categories—forms, academic papers, and receipts—each presenting unique challenges for document processing systems.
Related Documentation:

- Benchmarking Overview - Platform introduction
- Quickstart Guide - Step-by-step instructions
- Pipeline Catalog - Available pipelines
- Metrics Reference - Detailed metric explanations
- Current Results - Latest findings
Why a Multi-Category Benchmark?
Document extraction pipelines often excel at one document type while struggling with others. A pipeline optimized for clean academic PDFs may fail on noisy scanned forms. A receipt parser tuned for dense text may miss content in multi-column layouts.
The Biblicus benchmark reveals these trade-offs by testing pipelines across:
| Category | Dataset | Documents | Challenge |
|---|---|---|---|
| Forms | FUNSD | 199 | Noise, handwriting, field extraction |
| Academic | Scanned ArXiv | 100+ | Multi-column layout, reading order |
| Receipts | SROIE | 626 | Dense text, entity extraction |
Quick Start
```shell
# Download benchmark datasets
biblicus benchmark download --datasets funsd,sroie,scanned-arxiv

# Run quick benchmark (~5-10 minutes)
biblicus benchmark run --config configs/benchmark/quick.yaml

# Run standard benchmark (~30-60 minutes)
biblicus benchmark run

# Generate markdown report
biblicus benchmark report --output docs/guides/benchmark-results.md
```
Document Categories
Forms (FUNSD)
Dataset: Form Understanding in Noisy Scanned Documents (FUNSD)
FUNSD contains 199 real scanned forms from the 1980s-1990s with word-level ground truth annotations. These documents test a pipeline’s ability to handle:
- Noise and degradation - Real scans with artifacts, skew, and varying quality
- Structured fields - Headers, questions, answers, checkboxes
- Entity extraction - Identifying form field values and their relationships
Primary Metric: F1 Score (balanced word finding)
Source: https://guillaumejaume.github.io/FUNSD/
Academic Papers (Scanned ArXiv)
Dataset: Scanned ArXiv Papers
Academic papers rendered as images (not born-digital PDFs) test layout-aware extraction:
- Multi-column layouts - Two-column academic paper format
- Reading order - Correct sequencing across columns
- Mixed content - Text, figures, tables, equations, references
Primary Metric: LCS Ratio (Longest Common Subsequence ratio measuring reading order preservation)
Source: HuggingFace IAMJB/scanned-arxiv-papers
Receipts (SROIE)
Dataset: Scanned Receipts OCR and Information Extraction (SROIE)
626 receipt images from ICDAR 2019 with both OCR text and structured entity annotations:
- Dense text - Compact layouts with small fonts
- Entity extraction - Company name, date, address, total amount
- Semantic understanding - Beyond raw OCR to structured data
Primary Metrics: F1 Score + Entity F1 (per-entity-type accuracy)
Source: ICDAR 2019 Competition (https://rrc.cvc.uab.es/?ch=13)
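The per-entity scoring idea can be illustrated with a small sketch. This is not the benchmark's implementation: the exact-match rule, the field names, and the micro-averaging are assumptions chosen for illustration.

```python
# Illustrative sketch: exact-match entity F1 for SROIE-style fields.
# Field names and the matching rule are assumptions, not the benchmark's code.
FIELDS = ("company", "date", "address", "total")

def entity_f1(predicted: dict, gold: dict) -> dict:
    """Per-field exact-match scores plus a micro-averaged overall entity F1."""
    scores = {}
    tp = fp = fn = 0
    for field in FIELDS:
        pred, true = predicted.get(field), gold.get(field)
        hit = pred is not None and pred == true
        tp += hit                                 # correct extraction
        fp += pred is not None and not hit        # extracted but wrong/extra
        fn += true is not None and not hit        # present in gold but missed
        scores[field] = 1.0 if hit else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    scores["entity_f1"] = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    return scores

pred = {"company": "ACME SDN BHD", "date": "2019-03-01", "total": "12.80"}
gold = {"company": "ACME SDN BHD", "date": "2019-03-02",
        "address": "1 JALAN X", "total": "12.80"}
print(entity_f1(pred, gold))  # company and total match; date wrong, address missed
```

Real scoring would normalize values (dates, currency) before comparing; exact string match is the simplest possible rule.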
Metrics
The benchmark uses three categories of metrics to evaluate extraction quality. For complete details on each metric including formulas, interpretations, and use case recommendations, see the Metrics Reference.
Quick summary:
Set-Based Metrics (Word Finding):

- Precision, Recall, F1 Score
- Primary metrics for forms and receipts

Order-Aware Metrics (Sequence Quality):

- LCS Ratio (primary for academic papers)
- Word Error Rate (WER)
- Sequence Accuracy, Bigram Overlap

Entity Metrics (Semantic Extraction):

- Entity F1 (for SROIE receipts)
- Per-Type F1 (date, total, company, address)
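The difference between the two headline families can be sketched in a few lines. This is an illustration, not the benchmark's own code: the tokenization (lowercase whitespace split) and the exact normalization are assumptions.

```python
def word_set_scores(predicted: str, truth: str) -> dict:
    """Set-based precision/recall/F1 over unique lowercase words (order ignored)."""
    pred, true = set(predicted.lower().split()), set(truth.lower().split())
    overlap = len(pred & true)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(true) if true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

def lcs_ratio(predicted: str, truth: str) -> float:
    """Longest common subsequence of word sequences over ground-truth length,
    so scrambled reading order is penalized even when every word is found."""
    pred, true = predicted.split(), truth.split()
    if not true:
        return 0.0
    # Classic dynamic-programming LCS, one row at a time.
    prev = [0] * (len(true) + 1)
    for p in pred:
        curr = [0]
        for j, t in enumerate(true, 1):
            curr.append(prev[j - 1] + 1 if p == t else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1] / len(true)

truth = "alpha beta gamma delta"
scrambled = "beta alpha delta gamma"
print(word_set_scores(scrambled, truth))  # f1 == 1.0: every word was found
print(lcs_ratio(scrambled, truth))        # 0.5: reading order is broken
```

This is why a pipeline can top the forms category (set-based F1) while losing the academic category (LCS Ratio) on the same raw word accuracy.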
Scoring Strategy
Per-Category Scores (Primary)
Each category reports its primary metric independently:
- Forms: F1 Score
- Academic: LCS Ratio
- Receipts: F1 Score + Entity F1
This allows comparing pipelines within each category without conflating different document types.
Weighted Aggregate (Optional)
For quick overall comparison, an optional weighted aggregate combines category scores:
Aggregate = 0.40 × Forms F1 + 0.35 × Academic LCS + 0.25 × Receipts F1
Weights are configurable in benchmark configuration files. Adjust based on your document mix.
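The aggregate is a plain weighted sum. A minimal sketch, using the default weights above and made-up category scores for illustration:

```python
# Default weights from the benchmark configuration above.
WEIGHTS = {"forms": 0.40, "academic": 0.35, "receipts": 0.25}

def weighted_aggregate(scores: dict, weights: dict = WEIGHTS) -> float:
    """Combine per-category primary-metric scores into one number."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights should sum to 1"
    return sum(weights[cat] * scores[cat] for cat in weights)

# Illustrative scores, not real benchmark results.
scores = {"forms": 0.787, "academic": 0.610, "receipts": 0.820}
print(round(weighted_aggregate(scores), 3))  # 0.733
```

Because the weights are arbitrary, the per-category scores remain the primary signal; the aggregate is only a convenience for quick ranking.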
Running Benchmarks
Benchmark Modes
| Mode | Forms | Academic | Receipts | Runtime |
|---|---|---|---|---|
| Quick | 20 docs | 20 docs | 50 docs | ~5-10 min |
| Standard | 50 docs | 100 docs | 100 docs | ~30-60 min |
| Full | 199 docs | All | 626 docs | ~2-4 hours |
Command Examples
```shell
# Quick benchmark for development iteration
biblicus benchmark run --config configs/benchmark/quick.yaml

# Standard benchmark (default)
biblicus benchmark run

# Full benchmark for release validation
biblicus benchmark run --config configs/benchmark/full.yaml

# Single category
biblicus benchmark run --category forms

# Specific pipelines
biblicus benchmark run --pipelines paddleocr,heron-tesseract,baseline-ocr

# Check dataset status
biblicus benchmark status
```
Configuration Files
Benchmark configurations in configs/benchmark/:
```yaml
# configs/benchmark/quick.yaml
benchmark_name: quick

categories:
  forms:
    dataset: funsd
    subset_size: 20
    primary_metric: f1_score
  academic:
    dataset: scanned-arxiv
    subset_size: 20
    primary_metric: lcs_ratio
  receipts:
    dataset: sroie
    subset_size: 50
    primary_metric: f1_score

pipelines:
  - configs/baseline-ocr.yaml
  - configs/ocr-paddleocr.yaml
  - configs/heron-tesseract.yaml

aggregate_weights:
  forms: 0.40
  academic: 0.35
  receipts: 0.25
```
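Configs like this are easy to sanity-check before a long run. A hypothetical validator, sketched against the structure of `quick.yaml` above (the function and its rules are illustrative, not part of the Biblicus CLI); it operates on the already-parsed dict, e.g. the output of `yaml.safe_load`:

```python
def validate_config(cfg: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the config looks sane."""
    problems = []
    weights = cfg.get("aggregate_weights", {})
    if weights and abs(sum(weights.values()) - 1.0) > 1e-9:
        problems.append("aggregate_weights do not sum to 1.0")
    for name, category in cfg.get("categories", {}).items():
        for key in ("dataset", "subset_size", "primary_metric"):
            if key not in category:
                problems.append(f"category {name!r} missing {key!r}")
    if set(weights) - set(cfg.get("categories", {})):
        problems.append("aggregate_weights reference unknown categories")
    return problems

# A deliberately incomplete config: only the forms category is defined,
# but the weights still mention academic and receipts.
cfg = {
    "benchmark_name": "quick",
    "categories": {
        "forms": {"dataset": "funsd", "subset_size": 20, "primary_metric": "f1_score"},
    },
    "aggregate_weights": {"forms": 0.40, "academic": 0.35, "receipts": 0.25},
}
print(validate_config(cfg))  # flags weights referencing missing categories
```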
Understanding Results
JSON Output Structure
```json
{
  "benchmark_version": "1.0.0",
  "timestamp": "2026-02-05T12:00:00Z",
  "categories": {
    "forms": {
      "dataset": "funsd",
      "documents_evaluated": 50,
      "pipelines": [
        {
          "name": "paddleocr",
          "metrics": {
            "f1": 0.787,
            "recall": 0.782,
            "precision": 0.792,
            "wer": 0.533
          }
        }
      ],
      "best_pipeline": "paddleocr"
    },
    "academic": {
      "dataset": "scanned-arxiv",
      "pipelines": [...],
      "best_pipeline": "heron-tesseract"
    },
    "receipts": {
      "dataset": "sroie",
      "pipelines": [...],
      "best_pipeline": "paddleocr"
    }
  },
  "aggregate": {
    "weighted_score": 0.72,
    "weights": {"forms": 0.40, "academic": 0.35, "receipts": 0.25}
  },
  "recommendations": {
    "best_overall": "paddleocr",
    "best_for_layout": "heron-tesseract",
    "best_for_speed": "rapidocr"
  }
}
```
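Because the output is plain JSON, results are easy to post-process. A small sketch using the field names from the example above (the inline string stands in for a real results file, which you would read with `open(...)` instead):

```python
import json

# Inline stand-in for a results/benchmark_*.json file; field names follow
# the example structure above, trimmed to what this sketch needs.
raw = """
{
  "categories": {
    "forms": {"best_pipeline": "paddleocr"},
    "academic": {"best_pipeline": "heron-tesseract"},
    "receipts": {"best_pipeline": "paddleocr"}
  },
  "aggregate": {"weighted_score": 0.72}
}
"""

results = json.loads(raw)
# Per-category winners, keyed by category name.
winners = {cat: info["best_pipeline"] for cat, info in results["categories"].items()}
print(winners)
print(results["aggregate"]["weighted_score"])  # 0.72
```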
Interpreting Trade-offs
High Recall, Lower Precision (e.g., Heron + Tesseract)

- Finds more words but includes more noise
- Best when missing content is costly
- Use for: completeness-critical applications, legal discovery

High Precision, Lower Recall (e.g., Docling-Smol)

- Fewer false positives but may miss some text
- Best when accuracy matters more than completeness
- Use for: automated data entry, structured extraction

High LCS Ratio (e.g., layout-aware pipelines)

- Preserves reading order in multi-column documents
- May have higher WER due to region boundary effects
- Use for: academic papers, newspapers, reports
Pipeline Recommendations
Based on benchmark results:
| Use Case | Recommended Pipeline | Why |
|---|---|---|
| General accuracy | PaddleOCR | Highest F1 across document types |
| Multi-column documents | Heron + Tesseract | Best reading order preservation |
| Receipts/forms | PaddleOCR | Strong entity extraction |
| Speed priority | RapidOCR | Fastest inference, acceptable accuracy |
| Completeness critical | Heron + Tesseract | Highest recall (0.810) |
| Low noise tolerance | Docling-Smol | Highest precision |
Adding Custom Pipelines
To benchmark your own pipeline:
Create a pipeline configuration in configs/:

```yaml
# configs/my-custom-pipeline.yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: my-layout-detector
      config:
        threshold: 0.7
    - extractor_id: ocr-tesseract
      config:
        use_layout_metadata: true
```
Run the benchmark with your pipeline:

```shell
biblicus benchmark run --pipelines my-custom-pipeline
```
Compare results:

```shell
biblicus benchmark report --input results/benchmark_*.json
```
Dataset Downloads
Automatic Download
```shell
# Download all datasets
biblicus benchmark download --datasets funsd,sroie,scanned-arxiv

# Download a specific dataset
biblicus benchmark download --datasets sroie
```
Manual Download
If automatic download fails:
FUNSD:

1. Visit https://guillaumejaume.github.io/FUNSD/
2. Download dataset.zip
3. Extract to corpora/funsd_benchmark/
SROIE:

1. Register at https://rrc.cvc.uab.es/?ch=13
2. Download the task files
3. Run:

```shell
python scripts/download_sroie_samples.py --from-local /path/to/sroie
```
Scanned ArXiv:

```python
from datasets import load_dataset

dataset = load_dataset("IAMJB/scanned-arxiv-papers")
```
Licensing
| Dataset | License | Usage |
|---|---|---|
| FUNSD | CC BY-NC-SA 4.0 | Non-commercial research |
| SROIE | Research only | ICDAR competition terms |
| Scanned ArXiv | Varies | Check per-paper license |
See Also
- OCR Benchmarking Guide - Detailed OCR pipeline evaluation
- Benchmark Results - Current benchmark results
- Heron Implementation - Layout detection details
- Extractors Overview - Available extraction pipelines