Biblicus Document Understanding Benchmark

Multi-category benchmark framework architecture and design documentation.

Looking for practical instructions? See the Quickstart Guide or OCR Benchmarking Guide. This document covers the architectural design of the multi-category framework.

The Biblicus Document Understanding Benchmark evaluates OCR and document extraction pipelines across diverse document types. Rather than testing on a single dataset, the benchmark measures performance across three distinct categories (forms, academic papers, and receipts), each presenting unique challenges for document processing systems.

Related Documentation:

Benchmarking Overview - Platform introduction
Quickstart Guide - Step-by-step instructions
Pipeline Catalog - Available pipelines
Metrics Reference - Detailed metric explanations
Current Results - Latest findings

Why a Multi-Category Benchmark?

Document extraction pipelines often excel at one document type while struggling with others. A pipeline optimized for clean academic PDFs may fail on noisy scanned forms. A receipt parser tuned for dense text may miss content in multi-column layouts.

The Biblicus benchmark reveals these trade-offs by testing pipelines across:

Category	Dataset	Documents	Challenge
Forms	FUNSD	199	Noise, handwriting, field extraction
Academic	Scanned ArXiv	100+	Multi-column layout, reading order
Receipts	SROIE	626	Dense text, entity extraction

Quick Start

# Download benchmark datasets
biblicus benchmark download --datasets funsd,sroie,scanned-arxiv

# Run quick benchmark (~5-10 minutes)
biblicus benchmark run --config configs/benchmark/quick.yaml

# Run standard benchmark (~30-60 minutes)
biblicus benchmark run

# Generate markdown report
biblicus benchmark report --output docs/guides/benchmark-results.md

Document Categories

Forms (FUNSD)

Dataset: Form Understanding in Noisy Scanned Documents (FUNSD)

FUNSD contains 199 real scanned forms from the 1980s-1990s with word-level ground truth annotations. These documents test a pipeline’s ability to handle:

Noise and degradation - Real scans with artifacts, skew, and varying quality
Structured fields - Headers, questions, answers, checkboxes
Entity extraction - Identifying form field values and their relationships

Primary Metric: F1 Score (harmonic mean of word-level precision and recall)

Source: https://guillaumejaume.github.io/FUNSD/

Academic Papers (Scanned ArXiv)

Dataset: Scanned ArXiv Papers

Academic papers rendered as images (not born-digital PDFs) test layout-aware extraction:

Multi-column layouts - Two-column academic paper format
Reading order - Correct sequencing across columns
Mixed content - Text, figures, tables, equations, references

Primary Metric: LCS Ratio (Longest Common Subsequence ratio measuring reading order preservation)

Source: HuggingFace IAMJB/scanned-arxiv-papers

Receipts (SROIE)

Dataset: Scanned Receipts OCR and Information Extraction (SROIE)

SROIE contains 626 receipt images from the ICDAR 2019 competition, with both OCR text and structured entity annotations. These documents test:

Dense text - Compact layouts with small fonts
Entity extraction - Company name, date, address, total amount
Semantic understanding - Beyond raw OCR to structured data

Primary Metrics: F1 Score + Entity F1 (F1 computed per entity type)

Source: ICDAR 2019 Competition (https://rrc.cvc.uab.es/?ch=13)

Metrics

The benchmark uses three categories of metrics to evaluate extraction quality. For complete details on each metric including formulas, interpretations, and use case recommendations, see the Metrics Reference.

Quick summary:

Set-Based Metrics (Word Finding):

Precision, Recall, F1 Score
Primary metrics for forms and receipts
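
As a rough illustration of what "word finding" means here, the sketch below computes set-based precision, recall, and F1 over whitespace tokens. The tokenization (lowercasing, splitting on whitespace) and the function name `word_set_scores` are assumptions made for this example; the benchmark's own normalization is described in the Metrics Reference and may differ.

```python
def word_set_scores(reference: str, extracted: str) -> dict[str, float]:
    """Set-based precision, recall, and F1 over lowercase whitespace tokens.

    Hypothetical helper for illustration; not the benchmark's implementation.
    """
    ref_words = set(reference.lower().split())
    out_words = set(extracted.lower().split())
    matched = len(ref_words & out_words)

    # Precision: share of extracted words that appear in the ground truth.
    precision = matched / len(out_words) if out_words else 0.0
    # Recall: share of ground-truth words that the pipeline found.
    recall = matched / len(ref_words) if ref_words else 0.0
    # F1: harmonic mean of the two, defined as 0 when nothing matched.
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = word_set_scores("invoice total 42.00 due", "invoice total 42.00 paid now")
print(scores)
```

Because the comparison is over sets, word order has no effect on these scores, which is why the order-aware metrics are reported separately.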

Order-Aware Metrics (Sequence Quality):

LCS Ratio (primary for academic papers)
Word Error Rate (WER)
Sequence Accuracy, Bigram Overlap
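
The following sketch shows how order-aware metrics can penalize a reading-order mistake that set-based metrics would miss. It assumes word-level tokens, an LCS ratio normalized by the longer of the two sequences, and WER defined as word-level edit distance divided by the reference length. These normalizations and all function names are assumptions for illustration; the Metrics Reference has the definitions the benchmark actually uses.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(reference: str, extracted: str) -> float:
    """LCS length divided by the longer sequence length (assumed normalization)."""
    ref, out = reference.split(), extracted.split()
    longest = max(len(ref), len(out))
    return lcs_length(ref, out) / longest if longest else 1.0


def word_error_rate(reference: str, extracted: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref, out = reference.split(), extracted.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(out) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, o in enumerate(out, start=1):
            substitution = prev[j - 1] + (0 if r == o else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


# Two columns (a1..a3 and b1..b3) read row by row instead of column by column.
correct_order = "a1 a2 a3 b1 b2 b3"
interleaved = "a1 b1 a2 b2 a3 b3"
print(lcs_ratio(correct_order, interleaved))
print(word_error_rate("the quick brown fox", "the quick fox"))
```

In the interleaved example every word is found, so a set-based F1 would be perfect, while the LCS ratio drops because only four of the six words can be kept in their original order.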

Entity Metrics (Semantic Extraction):

Entity F1 (for SROIE receipts)
Per-Type F1 (date, total, company, address)
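
One way to picture Per-Type F1 is sketched below. It assumes each receipt yields at most one value per entity type and that a prediction counts as correct only on an exact match after case and whitespace normalization. The matching rule, the `normalize` helper, and the sample values are all assumptions made for this example, not the benchmark's scoring code.

```python
ENTITY_TYPES = ("company", "date", "address", "total")


def normalize(value: str) -> str:
    """Uppercase and collapse whitespace so trivial differences do not count as errors."""
    return " ".join(value.upper().split())


def per_type_f1(gold_docs: list[dict], pred_docs: list[dict]) -> dict[str, float]:
    """F1 per entity type, pooled over all documents (exact match, assumed rule)."""
    scores = {}
    for etype in ENTITY_TYPES:
        tp = fp = fn = 0
        for gold, pred in zip(gold_docs, pred_docs):
            g = normalize(gold.get(etype, ""))
            p = normalize(pred.get(etype, ""))
            if p and p == g:
                tp += 1
            else:
                fp += 1 if p else 0  # a wrong value was emitted
                fn += 1 if g else 0  # the gold value was missed
        denom = 2 * tp + fp + fn
        scores[etype] = 2 * tp / denom if denom else 0.0
    return scores


# Made-up sample receipts, for illustration only.
gold = [
    {"company": "ACME SDN BHD", "date": "2019-03-01", "address": "1 Jalan A", "total": "12.50"},
    {"company": "Kedai B", "date": "2019-04-02", "address": "2 Jalan B", "total": "8.00"},
]
pred = [
    {"company": "Acme Sdn Bhd", "date": "2019-03-01", "total": "12.50"},
    {"company": "Kedai 8", "date": "2019-04-02", "address": "2 Jalan B", "total": "8.00"},
]
print(per_type_f1(gold, pred))
```

Reporting the score per type makes it visible when a pipeline reads dates and totals reliably but struggles with longer fields such as addresses.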

Scoring Strategy

Per-Category Scores (Primary)

Each category reports its primary metric independently:

Forms: F1 Score
Academic: LCS Ratio
Receipts: F1 Score + Entity F1

This makes it possible to compare pipelines within each category without conflating different document types.

Weighted Aggregate (Optional)

For quick overall comparison, an optional weighted aggregate combines category scores:

Aggregate = 0.40 × Forms F1 + 0.35 × Academic LCS + 0.25 × Receipts F1

Weights are configurable through the aggregate_weights key in the benchmark configuration files. Adjust them to match your document mix.
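
The aggregate formula above can be sketched in a few lines. The function name `weighted_aggregate` and the choice to divide by the total weight (so that custom weights need not sum to exactly 1) are assumptions for this example. With the default weights, which sum to 1.0, the result matches the formula as written. The academic and receipts scores in the usage line are placeholders, not benchmark results.

```python
DEFAULT_WEIGHTS = {"forms": 0.40, "academic": 0.35, "receipts": 0.25}


def weighted_aggregate(scores: dict[str, float], weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted mean of per-category primary scores.

    Hypothetical helper: dividing by the weight total is an assumption that
    keeps the result in [0, 1] even if custom weights do not sum to 1.
    """
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same categories")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights[c] * scores[c] for c in weights) / total


# Forms F1 is taken from the sample JSON in this document; the other two are placeholders.
example = {"forms": 0.787, "academic": 0.70, "receipts": 0.80}
print(round(weighted_aggregate(example), 4))
```

A team that processes mostly receipts could pass weights such as `{"forms": 0.2, "academic": 0.1, "receipts": 0.7}` and get an aggregate that reflects its own workload.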

Running Benchmarks

Benchmark Modes

Mode	Forms	Academic	Receipts	Runtime
quick	20 docs	20 docs	50 docs	~5-10 min
standard	50 docs	100 docs	100 docs	~30-60 min
full	199 docs	All	626 docs	~2-4 hours

Command Examples

# Quick benchmark for development iteration
biblicus benchmark run --config configs/benchmark/quick.yaml

# Standard benchmark (default)
biblicus benchmark run

# Full benchmark for release validation
biblicus benchmark run --config configs/benchmark/full.yaml

# Single category
biblicus benchmark run --category forms

# Specific pipelines
biblicus benchmark run --pipelines paddleocr,heron-tesseract,baseline-ocr

# Check dataset status
biblicus benchmark status

Configuration Files

Benchmark configurations live in configs/benchmark/. The quick configuration, for example:

# configs/benchmark/quick.yaml
benchmark_name: quick
categories:
  forms:
    dataset: funsd
    subset_size: 20
    primary_metric: f1_score
  academic:
    dataset: scanned-arxiv
    subset_size: 20
    primary_metric: lcs_ratio
  receipts:
    dataset: sroie
    subset_size: 50
    primary_metric: f1_score
pipelines:
  - configs/baseline-ocr.yaml
  - configs/ocr-paddleocr.yaml
  - configs/heron-tesseract.yaml
aggregate_weights:
  forms: 0.40
  academic: 0.35
  receipts: 0.25
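
Before a long run it can help to sanity-check a configuration. The sketch below works on a plain dictionary shaped like the YAML above, as it would look after loading with any YAML parser. The `check_config` function, the set of allowed metric names (limited to the two that appear in this document), and the rules themselves are assumptions for illustration; Biblicus may validate configurations differently.

```python
# Dataset sizes stated in this document; scanned-arxiv is only given as "100+", so it is left out.
DATASET_SIZES = {"funsd": 199, "sroie": 626}
# Only the metric names shown in the sample config; the real list may be longer.
ALLOWED_METRICS = {"f1_score", "lcs_ratio"}


def check_config(config: dict) -> list[str]:
    """Return a list of human-readable problems (empty when none were found)."""
    problems = []
    categories = config.get("categories", {})
    for name, spec in categories.items():
        metric = spec.get("primary_metric")
        if metric not in ALLOWED_METRICS:
            problems.append(f"{name}: unknown primary_metric {metric!r}")
        size = DATASET_SIZES.get(spec.get("dataset"))
        if size is not None and spec.get("subset_size", 0) > size:
            problems.append(f"{name}: subset_size exceeds the {size} documents available")
    weights = config.get("aggregate_weights", {})
    if set(weights) != set(categories):
        problems.append("aggregate_weights must name exactly the configured categories")
    return problems


quick = {
    "benchmark_name": "quick",
    "categories": {
        "forms": {"dataset": "funsd", "subset_size": 20, "primary_metric": "f1_score"},
        "academic": {"dataset": "scanned-arxiv", "subset_size": 20, "primary_metric": "lcs_ratio"},
        "receipts": {"dataset": "sroie", "subset_size": 50, "primary_metric": "f1_score"},
    },
    "aggregate_weights": {"forms": 0.40, "academic": 0.35, "receipts": 0.25},
}
print(check_config(quick))
```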

Understanding Results

JSON Output Structure

{
  "benchmark_version": "1.0.0",
  "timestamp": "2026-02-05T12:00:00Z",
  "categories": {
    "forms": {
      "dataset": "funsd",
      "documents_evaluated": 50,
      "pipelines": [
        {
          "name": "paddleocr",
          "metrics": {
            "f1": 0.787,
            "recall": 0.782,
            "precision": 0.792,
            "wer": 0.533
          }
        }
      ],
      "best_pipeline": "paddleocr"
    },
    "academic": {
      "dataset": "scanned-arxiv",
      "pipelines": [...],
      "best_pipeline": "heron-tesseract"
    },
    "receipts": {
      "dataset": "sroie",
      "pipelines": [...],
      "best_pipeline": "paddleocr"
    }
  },
  "aggregate": {
    "weighted_score": 0.72,
    "weights": {"forms": 0.40, "academic": 0.35, "receipts": 0.25}
  },
  "recommendations": {
    "best_overall": "paddleocr",
    "best_for_layout": "heron-tesseract",
    "best_for_speed": "rapidocr"
  }
}
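
A results file with this shape can be summarized with the standard library alone. The sketch below assumes the field names shown above and picks, for each requested category, the pipeline with the highest value of a chosen metric. The `best_by_metric` function is hypothetical, and the second pipeline entry in the sample uses a placeholder score that is not a benchmark result.

```python
import json


def best_by_metric(results: dict, metric_by_category: dict[str, str]) -> dict[str, tuple[str, float]]:
    """Map each category to (pipeline name, score) for its top-scoring pipeline."""
    summary = {}
    for category, metric in metric_by_category.items():
        pipelines = results["categories"][category]["pipelines"]
        best = max(pipelines, key=lambda p: p["metrics"][metric])
        summary[category] = (best["name"], best["metrics"][metric])
    return summary


# Trimmed sample: the paddleocr entry mirrors the output above,
# the baseline-ocr score is a placeholder for illustration only.
sample = json.loads("""
{
  "categories": {
    "forms": {
      "dataset": "funsd",
      "pipelines": [
        {"name": "paddleocr", "metrics": {"f1": 0.787, "recall": 0.782, "precision": 0.792}},
        {"name": "baseline-ocr", "metrics": {"f1": 0.5, "recall": 0.5, "precision": 0.5}}
      ]
    }
  }
}
""")
print(best_by_metric(sample, {"forms": "f1"}))
```

For a real run, replace the inline sample with `json.load` on a file from the results directory and pass whichever metric key each category reports.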

Interpreting Trade-offs

High Recall, Lower Precision (e.g., Heron + Tesseract)

Finds more words but includes more noise
Best when missing content is costly
Use for: Completeness-critical applications, legal discovery

High Precision, Lower Recall (e.g., Docling-Smol)

Fewer false positives but may miss some text
Best when accuracy matters more than completeness
Use for: Automated data entry, structured extraction

High LCS Ratio (e.g., Layout-Aware Pipelines)

Preserves reading order in multi-column documents
May have higher WER due to region boundary effects
Use for: Academic papers, newspapers, reports
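
When a use case clearly favors recall or precision, one common way to express that preference numerically is the F-beta score, which generalizes F1. This is a standard formula offered here as an aid to interpretation; it is not stated to be one of the benchmark's reported metrics. A beta above 1 weights recall more heavily, a beta below 1 weights precision more heavily.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Same precision/recall pair, swapped, scored with a recall-heavy beta of 2.
print(f_beta(0.5, 1.0, beta=2.0))  # high recall, lower precision
print(f_beta(1.0, 0.5, beta=2.0))  # high precision, lower recall
```

With beta set to 2, the high-recall pipeline scores higher than its high-precision mirror image, which matches the guidance above for completeness-critical applications.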

Pipeline Recommendations

Based on the benchmark results (see Current Results for the latest numbers):

Use Case	Recommended Pipeline	Why
General accuracy	PaddleOCR	Highest F1 across document types
Multi-column documents	Heron + Tesseract	Best reading order preservation
Receipts/forms	PaddleOCR	Strong entity extraction
Speed priority	RapidOCR	Fastest inference, acceptable accuracy
Completeness critical	Heron + Tesseract	Highest recall (0.810)
Low noise tolerance	Docling-Smol	Highest precision

Adding Custom Pipelines

To benchmark your own pipeline:

Create a pipeline configuration in configs/:

# configs/my-custom-pipeline.yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: my-layout-detector
      config:
        threshold: 0.7
    - extractor_id: ocr-tesseract
      config:
        use_layout_metadata: true

Run benchmark with your pipeline:

biblicus benchmark run --pipelines my-custom-pipeline

Compare results:

biblicus benchmark report --input results/benchmark_*.json

Dataset Downloads

Automatic Download

# Download all datasets
biblicus benchmark download --datasets funsd,sroie,scanned-arxiv

# Download specific dataset
biblicus benchmark download --datasets sroie

Manual Download

If automatic download fails, fetch each dataset manually:

FUNSD:

Visit https://guillaumejaume.github.io/FUNSD/
Download dataset.zip
Extract to corpora/funsd_benchmark/

SROIE:

Register at https://rrc.cvc.uab.es/?ch=13
Download task files
Run python scripts/download_sroie_samples.py --from-local /path/to/sroie

Scanned ArXiv:

from datasets import load_dataset
dataset = load_dataset("IAMJB/scanned-arxiv-papers")

Licensing

Dataset	License	Usage
FUNSD	CC BY-NC-SA 4.0	Non-commercial research
SROIE	Research only	ICDAR competition terms
Scanned ArXiv	Varies	Check per-paper license