Biblicus Document Understanding Benchmark

Multi-category benchmark framework architecture and design documentation.

Looking for practical instructions? See the Quickstart Guide or OCR Benchmarking Guide. This document covers the architectural design of the multi-category framework.

The Biblicus Document Understanding Benchmark evaluates OCR and document extraction pipelines across diverse document types. Rather than testing on a single dataset, the benchmark measures performance across three distinct categories (forms, academic papers, and receipts), each of which poses its own challenges for document processing systems.

Why a Multi-Category Benchmark?

Document extraction pipelines often excel at one document type while struggling with others. A pipeline optimized for clean academic PDFs may fail on noisy scanned forms. A receipt parser tuned for dense text may miss content in multi-column layouts.

The Biblicus benchmark reveals these trade-offs by testing pipelines across:

| Category | Dataset | Documents | Challenge |
|----------|---------|-----------|-----------|
| Forms | FUNSD | 199 | Noise, handwriting, field extraction |
| Academic | Scanned ArXiv | 100+ | Multi-column layout, reading order |
| Receipts | SROIE | 626 | Dense text, entity extraction |

Quick Start

# Download benchmark datasets
biblicus benchmark download --datasets funsd,sroie,scanned-arxiv

# Run quick benchmark (~5-10 minutes)
biblicus benchmark run --config configs/benchmark/quick.yaml

# Run standard benchmark (~30-60 minutes)
biblicus benchmark run

# Generate markdown report
biblicus benchmark report --output docs/guides/benchmark-results.md

Document Categories

Forms (FUNSD)

Dataset: Form Understanding in Noisy Scanned Documents (FUNSD)

FUNSD contains 199 real scanned forms from the 1980s-1990s with word-level ground truth annotations. These documents test a pipeline’s ability to handle:

  • Noise and degradation - Real scans with artifacts, skew, and varying quality

  • Structured fields - Headers, questions, answers, checkboxes

  • Entity extraction - Identifying form field values and their relationships

Primary Metric: F1 Score (balanced word finding)

Source: https://guillaumejaume.github.io/FUNSD/

Academic Papers (Scanned ArXiv)

Dataset: Scanned ArXiv Papers

Academic papers rendered as images (not born-digital PDFs) test layout-aware extraction:

  • Multi-column layouts - Two-column academic paper format

  • Reading order - Correct sequencing across columns

  • Mixed content - Text, figures, tables, equations, references

Primary Metric: LCS Ratio (Longest Common Subsequence ratio measuring reading order preservation)

Source: HuggingFace IAMJB/scanned-arxiv-papers

Receipts (SROIE)

Dataset: Scanned Receipts OCR and Information Extraction (SROIE)

626 receipt images from ICDAR 2019 with both OCR text and structured entity annotations:

  • Dense text - Compact layouts with small fonts

  • Entity extraction - Company name, date, address, total amount

  • Semantic understanding - Beyond raw OCR to structured data

Primary Metrics: F1 Score + Entity F1 (per-entity-type accuracy)

Source: ICDAR 2019 Competition (https://rrc.cvc.uab.es/?ch=13)

Metrics

The benchmark uses three categories of metrics to evaluate extraction quality. For complete details on each metric including formulas, interpretations, and use case recommendations, see the Metrics Reference.

Quick summary:

Set-Based Metrics (Word Finding):

  • Precision, Recall, F1 Score

  • Primary metrics for forms and receipts
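
To make the word-finding metrics concrete, here is a minimal sketch of precision, recall, and F1 computed over sets of words. The function name `word_set_scores` and the tokenization (lowercasing plus whitespace splitting) are assumptions made for illustration; the benchmark's own implementation may normalize text differently, so treat the Metrics Reference as authoritative.

```python
def word_set_scores(reference: str, extracted: str) -> dict[str, float]:
    """Precision, recall, and F1 over the sets of unique lowercase words.

    Assumption: words are lowercase, whitespace-separated tokens.
    """
    ref_words = set(reference.lower().split())
    out_words = set(extracted.lower().split())
    true_positives = len(ref_words & out_words)

    # Precision: share of extracted words that appear in the ground truth.
    precision = true_positives / len(out_words) if out_words else 0.0
    # Recall: share of ground-truth words that the pipeline found.
    recall = true_positives / len(ref_words) if ref_words else 0.0
    # F1: harmonic mean of precision and recall.
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up example: 3 of 5 extracted words are correct, 3 of 4 reference words found.
scores = word_set_scores("invoice total 42.00 due", "invoice total 42.00 paid now")
print({name: round(value, 2) for name, value in scores.items()})
# {'precision': 0.6, 'recall': 0.75, 'f1': 0.67}
```

Because the comparison is set-based, word order has no effect on these scores, which is why the order-aware metrics are reported separately.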

Order-Aware Metrics (Sequence Quality):

  • LCS Ratio (primary for academic papers)

  • Word Error Rate (WER)

  • Sequence Accuracy, Bigram Overlap
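
The sketch below illustrates two of the order-aware metrics, LCS ratio and WER, on whitespace-separated words. It is an illustration under stated assumptions rather than the benchmark's implementation: the helper names are hypothetical, the LCS ratio is normalized by the reference length, and WER is the word-level edit distance divided by the reference length. The Metrics Reference defines the exact formulas the benchmark uses.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, using row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(reference: str, extracted: str) -> float:
    """LCS length divided by the reference length (normalization is an assumption)."""
    ref, out = reference.split(), extracted.split()
    if not ref:
        return 1.0 if not out else 0.0
    return lcs_length(ref, out) / len(ref)


def word_error_rate(reference: str, extracted: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, out = reference.split(), extracted.split()
    if not ref:
        return float(bool(out))  # assumption: any output for an empty reference counts as 1.0
    prev = list(range(len(out) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, o in enumerate(out, start=1):
            substitution = prev[j - 1] + (0 if r == o else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


# Two-column page: the correct order reads the left column, then the right column.
reference = "alpha beta gamma delta epsilon zeta"
# A layout-unaware pipeline reads across the columns row by row instead.
interleaved = "alpha delta beta epsilon gamma zeta"

print(round(lcs_ratio(reference, interleaved), 3))        # 0.667
print(round(word_error_rate(reference, interleaved), 3))  # 0.667
```

Every word is found in this example, so a set-based F1 would be 1.0. Only the order-aware metrics expose the reading-order error.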

Entity Metrics (Semantic Extraction):

  • Entity F1 (for SROIE receipts)

  • Per-Type F1 (date, total, company, address)

Scoring Strategy

Per-Category Scores (Primary)

Each category reports its primary metric independently:

  • Forms: F1 Score

  • Academic: LCS Ratio

  • Receipts: F1 Score + Entity F1

This allows comparing pipelines within each category without conflating different document types.

Weighted Aggregate (Optional)

For quick overall comparison, an optional weighted aggregate combines category scores:

Aggregate = 0.40 × Forms F1 + 0.35 × Academic LCS + 0.25 × Receipts F1

Weights are configurable through the aggregate_weights key in the benchmark configuration files. Adjust them to match your document mix.
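
A worked example of the aggregate formula, as a minimal sketch: the function name `weighted_aggregate` and the category scores are illustrative assumptions, and the check that the weights sum to 1 is a design choice of this sketch rather than a documented requirement of the tool.

```python
import math

# Default weights from the formula above.
DEFAULT_WEIGHTS = {"forms": 0.40, "academic": 0.35, "receipts": 0.25}


def weighted_aggregate(primary_scores: dict[str, float],
                       weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Combine per-category primary metrics into one weighted score."""
    if set(primary_scores) != set(weights):
        raise ValueError("scores and weights must cover the same categories")
    # Assumption: weights should sum to 1 so the aggregate stays in [0, 1].
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to 1.0")
    return sum(weights[name] * primary_scores[name] for name in weights)


# Illustrative scores: Forms F1, Academic LCS ratio, Receipts F1.
example_scores = {"forms": 0.80, "academic": 0.70, "receipts": 0.90}
print(round(weighted_aggregate(example_scores), 3))
# 0.40 * 0.80 + 0.35 * 0.70 + 0.25 * 0.90 = 0.79
```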

Running Benchmarks

Benchmark Modes

| Mode | Forms | Academic | Receipts | Runtime |
|------|-------|----------|----------|---------|
| quick | 20 docs | 20 docs | 50 docs | ~5-10 min |
| standard | 50 docs | 100 docs | 100 docs | ~30-60 min |
| full | 199 docs | All | 626 docs | ~2-4 hours |

Command Examples

# Quick benchmark for development iteration
biblicus benchmark run --config configs/benchmark/quick.yaml

# Standard benchmark (default)
biblicus benchmark run

# Full benchmark for release validation
biblicus benchmark run --config configs/benchmark/full.yaml

# Single category
biblicus benchmark run --category forms

# Specific pipelines
biblicus benchmark run --pipelines paddleocr,heron-tesseract,baseline-ocr

# Check dataset status
biblicus benchmark status

Configuration Files

Benchmark configurations live in configs/benchmark/. For example, the quick configuration:

# configs/benchmark/quick.yaml
benchmark_name: quick
categories:
  forms:
    dataset: funsd
    subset_size: 20
    primary_metric: f1_score
  academic:
    dataset: scanned-arxiv
    subset_size: 20
    primary_metric: lcs_ratio
  receipts:
    dataset: sroie
    subset_size: 50
    primary_metric: f1_score
pipelines:
  - configs/baseline-ocr.yaml
  - configs/ocr-paddleocr.yaml
  - configs/heron-tesseract.yaml
aggregate_weights:
  forms: 0.40
  academic: 0.35
  receipts: 0.25
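
Before launching a long run, it can help to sanity-check a configuration. The sketch below works on a plain dictionary that mirrors quick.yaml (loading the YAML itself would need a parser such as PyYAML, which is omitted here). The function `find_config_problems` and its rules are hypothetical and not part of the biblicus CLI. The dataset sizes come from the category table in this document.

```python
# Dataset sizes stated in this document; Scanned ArXiv is listed only as "100+".
KNOWN_DATASET_SIZES = {"funsd": 199, "sroie": 626}

QUICK_CONFIG = {
    "benchmark_name": "quick",
    "categories": {
        "forms": {"dataset": "funsd", "subset_size": 20, "primary_metric": "f1_score"},
        "academic": {"dataset": "scanned-arxiv", "subset_size": 20, "primary_metric": "lcs_ratio"},
        "receipts": {"dataset": "sroie", "subset_size": 50, "primary_metric": "f1_score"},
    },
    "aggregate_weights": {"forms": 0.40, "academic": 0.35, "receipts": 0.25},
}


def find_config_problems(config: dict) -> list[str]:
    """Return human-readable problems; an empty list means no issue was found."""
    problems = []
    categories = config.get("categories", {})
    for name, category in categories.items():
        size = category.get("subset_size")
        if size is None:
            continue  # assumption: a missing subset_size means "use every document"
        limit = KNOWN_DATASET_SIZES.get(category.get("dataset"))
        if not isinstance(size, int) or size <= 0:
            problems.append(f"{name}: subset_size must be a positive integer")
        elif limit is not None and size > limit:
            problems.append(f"{name}: subset_size {size} exceeds dataset size {limit}")
    weights = config.get("aggregate_weights", {})
    if weights and set(weights) != set(categories):
        problems.append("aggregate_weights must name exactly the configured categories")
    return problems


print(find_config_problems(QUICK_CONFIG))  # []
```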

Understanding Results

JSON Output Structure

{
  "benchmark_version": "1.0.0",
  "timestamp": "2026-02-05T12:00:00Z",
  "categories": {
    "forms": {
      "dataset": "funsd",
      "documents_evaluated": 50,
      "pipelines": [
        {
          "name": "paddleocr",
          "metrics": {
            "f1": 0.787,
            "recall": 0.782,
            "precision": 0.792,
            "wer": 0.533
          }
        }
      ],
      "best_pipeline": "paddleocr"
    },
    "academic": {
      "dataset": "scanned-arxiv",
      "pipelines": [...],
      "best_pipeline": "heron-tesseract"
    },
    "receipts": {
      "dataset": "sroie",
      "pipelines": [...],
      "best_pipeline": "paddleocr"
    }
  },
  "aggregate": {
    "weighted_score": 0.72,
    "weights": {"forms": 0.40, "academic": 0.35, "receipts": 0.25}
  },
  "recommendations": {
    "best_overall": "paddleocr",
    "best_for_layout": "heron-tesseract",
    "best_for_speed": "rapidocr"
  }
}

Interpreting Trade-offs

High Recall, Lower Precision (e.g., Heron + Tesseract)

  • Finds more words but includes more noise

  • Best when missing content is costly

  • Use for: Completeness-critical applications, legal discovery

High Precision, Lower Recall (e.g., Docling-Smol)

  • Fewer false positives but may miss some text

  • Best when accuracy matters more than completeness

  • Use for: Automated data entry, structured extraction

High LCS Ratio (e.g., Layout-Aware Pipelines)

  • Preserves reading order in multi-column documents

  • May have higher WER due to region boundary effects

  • Use for: Academic papers, newspapers, reports

Pipeline Recommendations

Based on benchmark results:

| Use Case | Recommended Pipeline | Why |
|----------|----------------------|-----|
| General accuracy | PaddleOCR | Highest F1 across document types |
| Multi-column documents | Heron + Tesseract | Best reading order preservation |
| Receipts/forms | PaddleOCR | Strong entity extraction |
| Speed priority | RapidOCR | Fastest inference, acceptable accuracy |
| Completeness critical | Heron + Tesseract | Highest recall (0.810) |
| Low noise tolerance | Docling-Smol | Highest precision |

Adding Custom Pipelines

To benchmark your own pipeline:

  1. Create a pipeline configuration in configs/:

# configs/my-custom-pipeline.yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: my-layout-detector
      config:
        threshold: 0.7
    - extractor_id: ocr-tesseract
      config:
        use_layout_metadata: true

  2. Run benchmark with your pipeline:

biblicus benchmark run --pipelines my-custom-pipeline

  3. Compare results:

biblicus benchmark report --input results/benchmark_*.json

Dataset Downloads

Automatic Download

# Download all datasets
biblicus benchmark download --datasets funsd,sroie,scanned-arxiv

# Download specific dataset
biblicus benchmark download --datasets sroie

Manual Download

If automatic download fails:

FUNSD:

  1. Visit https://guillaumejaume.github.io/FUNSD/

  2. Download dataset.zip

  3. Extract to corpora/funsd_benchmark/

SROIE:

  1. Register at https://rrc.cvc.uab.es/?ch=13

  2. Download task files

  3. Run python scripts/download_sroie_samples.py --from-local /path/to/sroie

Scanned ArXiv:

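# Requires the Hugging Face `datasets` package (pip install datasets)
from datasets import load_dataset
# Downloads the dataset from the Hugging Face Hub and caches it locally
dataset = load_dataset("IAMJB/scanned-arxiv-papers")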

Licensing

| Dataset | License | Usage |
|---------|---------|-------|
| FUNSD | CC BY-NC-SA 4.0 | Non-commercial research |
| SROIE | Research only | ICDAR competition terms |
| Scanned ArXiv | Varies | Check per-paper license |
