# Biblicus Document Understanding Benchmark

Multi-category benchmark framework architecture and design documentation.

> **Looking for practical instructions?** See the [Quickstart Guide](quickstart-benchmarking.md) or [OCR Benchmarking Guide](ocr-benchmarking.md). This document covers the architectural design of the multi-category framework.

The Biblicus Document Understanding Benchmark evaluates OCR and document extraction pipelines across diverse document types. Rather than testing on a single dataset, the benchmark measures performance across three distinct categories—forms, academic papers, and receipts—each presenting unique challenges for document processing systems.

**Related Documentation:**

- [Benchmarking Overview](benchmarking-overview.md) - Platform introduction
- [Quickstart Guide](quickstart-benchmarking.md) - Step-by-step instructions
- [Pipeline Catalog](pipeline-catalog.md) - Available pipelines
- [Metrics Reference](metrics-reference.md) - Detailed metric explanations
- [Current Results](benchmark-results.md) - Latest findings

## Why a Multi-Category Benchmark?

Document extraction pipelines often excel at one document type while struggling with others. A pipeline optimized for clean academic PDFs may fail on noisy scanned forms. A receipt parser tuned for dense text may miss content in multi-column layouts.

The Biblicus benchmark reveals these trade-offs by testing pipelines across:

| Category | Dataset | Documents | Challenge |
|----------|---------|-----------|-----------|
| **Forms** | FUNSD | 199 | Noise, handwriting, field extraction |
| **Academic** | Scanned ArXiv | 100+ | Multi-column layout, reading order |
| **Receipts** | SROIE | 626 | Dense text, entity extraction |

## Quick Start

```bash
# Download benchmark datasets
biblicus benchmark download --datasets funsd,sroie,scanned-arxiv

# Run quick benchmark (~5-10 minutes)
biblicus benchmark run --config configs/benchmark/quick.yaml

# Run standard benchmark (~30-60 minutes)
biblicus benchmark run

# Generate markdown report
biblicus benchmark report --output docs/guides/benchmark-results.md
```

## Document Categories

### Forms (FUNSD)

**Dataset:** Form Understanding in Noisy Scanned Documents (FUNSD)

FUNSD contains 199 real scanned forms from the 1980s-1990s with word-level ground truth annotations.
These documents test a pipeline's ability to handle:

- **Noise and degradation** - Real scans with artifacts, skew, and varying quality
- **Structured fields** - Headers, questions, answers, checkboxes
- **Entity extraction** - Identifying form field values and their relationships

**Primary Metric:** F1 Score (balanced word finding)

**Source:** https://guillaumejaume.github.io/FUNSD/

### Academic Papers (Scanned ArXiv)

**Dataset:** Scanned ArXiv Papers

Academic papers rendered as images (not born-digital PDFs) test layout-aware extraction:

- **Multi-column layouts** - Two-column academic paper format
- **Reading order** - Correct sequencing across columns
- **Mixed content** - Text, figures, tables, equations, references

**Primary Metric:** LCS Ratio (Longest Common Subsequence ratio measuring reading order preservation)

**Source:** HuggingFace `IAMJB/scanned-arxiv-papers`

### Receipts (SROIE)

**Dataset:** Scanned Receipts OCR and Information Extraction (SROIE)

626 receipt images from ICDAR 2019 with both OCR text and structured entity annotations:

- **Dense text** - Compact layouts with small fonts
- **Entity extraction** - Company name, date, address, total amount
- **Semantic understanding** - Beyond raw OCR to structured data

**Primary Metrics:** F1 Score + Entity F1 (per-entity-type accuracy)

**Source:** ICDAR 2019 Competition (https://rrc.cvc.uab.es/?ch=13)

## Metrics

The benchmark uses three categories of metrics to evaluate extraction quality. For complete details on each metric, including formulas, interpretations, and use case recommendations, see the **[Metrics Reference](metrics-reference.md)**.

**Quick summary:**

**Set-Based Metrics (Word Finding):**
- Precision, Recall, F1 Score
- Primary metrics for forms and receipts

**Order-Aware Metrics (Sequence Quality):**
- LCS Ratio (primary for academic papers)
- Word Error Rate (WER)
- Sequence Accuracy, Bigram Overlap

**Entity Metrics (Semantic Extraction):**
- Entity F1 (for SROIE receipts)
- Per-Type F1 (date, total, company, address)

## Scoring Strategy

### Per-Category Scores (Primary)

Each category reports its primary metric independently:

- **Forms:** F1 Score
- **Academic:** LCS Ratio
- **Receipts:** F1 Score + Entity F1

This allows comparing pipelines within each category without conflating different document types.

### Weighted Aggregate (Optional)

For quick overall comparison, an optional weighted aggregate combines category scores:

```
Aggregate = 0.40 × Forms F1 + 0.35 × Academic LCS + 0.25 × Receipts F1
```

Weights are configurable in benchmark configuration files. Adjust them to match your document mix.
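As a concrete illustration of the formula above, the sketch below recomputes the weighted aggregate from per-category primary metrics in Python. The function name, input dictionary, and example scores are purely illustrative; only the default weights come from this document.

```python
# Minimal sketch of the weighted-aggregate calculation described above.
# Names and example scores are illustrative, not part of the biblicus API;
# the default weights mirror the formula in this section.

DEFAULT_WEIGHTS = {"forms": 0.40, "academic": 0.35, "receipts": 0.25}


def weighted_aggregate(category_scores, weights=DEFAULT_WEIGHTS):
    """Combine per-category primary metrics into one score.

    Normalizing by the summed weights keeps the result meaningful when
    only a subset of categories was evaluated.
    """
    total = sum(weights[c] for c in category_scores)
    return sum(weights[c] * score for c, score in category_scores.items()) / total


# Example: Forms F1 = 0.79, Academic LCS = 0.70, Receipts F1 = 0.68
score = weighted_aggregate({"forms": 0.79, "academic": 0.70, "receipts": 0.68})
print(round(score, 3))  # 0.40*0.79 + 0.35*0.70 + 0.25*0.68 ≈ 0.731
```

The same weights appear under `aggregate_weights` in the benchmark configuration files shown in the next section, so changing the mix there changes the reported aggregate.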
## Running Benchmarks

### Benchmark Modes

| Mode | Forms | Academic | Receipts | Runtime |
|------|-------|----------|----------|---------|
| `quick` | 20 docs | 20 docs | 50 docs | ~5-10 min |
| `standard` | 50 docs | 100 docs | 100 docs | ~30-60 min |
| `full` | 199 docs | All | 626 docs | ~2-4 hours |

### Command Examples

```bash
# Quick benchmark for development iteration
biblicus benchmark run --config configs/benchmark/quick.yaml

# Standard benchmark (default)
biblicus benchmark run

# Full benchmark for release validation
biblicus benchmark run --config configs/benchmark/full.yaml

# Single category
biblicus benchmark run --category forms

# Specific pipelines
biblicus benchmark run --pipelines paddleocr,heron-tesseract,baseline-ocr

# Check dataset status
biblicus benchmark status
```

### Configuration Files

Benchmark configurations live in `configs/benchmark/`:

```yaml
# configs/benchmark/quick.yaml
benchmark_name: quick

categories:
  forms:
    dataset: funsd
    subset_size: 20
    primary_metric: f1_score
  academic:
    dataset: scanned-arxiv
    subset_size: 20
    primary_metric: lcs_ratio
  receipts:
    dataset: sroie
    subset_size: 50
    primary_metric: f1_score

pipelines:
  - configs/baseline-ocr.yaml
  - configs/ocr-paddleocr.yaml
  - configs/heron-tesseract.yaml

aggregate_weights:
  forms: 0.40
  academic: 0.35
  receipts: 0.25
```

## Understanding Results

### JSON Output Structure

```json
{
  "benchmark_version": "1.0.0",
  "timestamp": "2026-02-05T12:00:00Z",
  "categories": {
    "forms": {
      "dataset": "funsd",
      "documents_evaluated": 50,
      "pipelines": [
        {
          "name": "paddleocr",
          "metrics": {
            "f1": 0.787,
            "recall": 0.782,
            "precision": 0.792,
            "wer": 0.533
          }
        }
      ],
      "best_pipeline": "paddleocr"
    },
    "academic": {
      "dataset": "scanned-arxiv",
      "pipelines": [...],
      "best_pipeline": "heron-tesseract"
    },
    "receipts": {
      "dataset": "sroie",
      "pipelines": [...],
      "best_pipeline": "paddleocr"
    }
  },
  "aggregate": {
    "weighted_score": 0.72,
    "weights": {"forms": 0.40, "academic": 0.35, "receipts": 0.25}
  },
  "recommendations": {
    "best_overall": "paddleocr",
    "best_for_layout": "heron-tesseract",
    "best_for_speed": "rapidocr"
  }
}
```

### Interpreting Trade-offs

**High Recall, Lower Precision (e.g., Heron + Tesseract)**
- Finds more words but includes more noise
- Best when missing content is costly
- Use for: Completeness-critical applications, legal discovery

**High Precision, Lower Recall (e.g., Docling-Smol)**
- Fewer false positives but may miss some text
- Best when accuracy matters more than completeness
- Use for: Automated data entry, structured extraction

**High LCS Ratio (e.g., Layout-Aware Pipelines)**
- Preserves reading order in multi-column documents
- May have higher WER due to region boundary effects
- Use for: Academic papers, newspapers, reports

## Pipeline Recommendations

Based on benchmark results:

| Use Case | Recommended Pipeline | Why |
|----------|---------------------|-----|
| **General accuracy** | PaddleOCR | Highest F1 across document types |
| **Multi-column documents** | Heron + Tesseract | Best reading order preservation |
| **Receipts/forms** | PaddleOCR | Strong entity extraction |
| **Speed priority** | RapidOCR | Fastest inference, acceptable accuracy |
| **Completeness critical** | Heron + Tesseract | Highest recall (0.810) |
| **Low noise tolerance** | Docling-Smol | Highest precision |
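If you want to consume results programmatically rather than through `biblicus benchmark report`, the JSON structure shown under Understanding Results is straightforward to summarize. The sketch below is an assumption-laden example: the `results/benchmark_*.json` pattern follows the comparison example in the next section, and the field names are taken from the sample output above rather than a guaranteed schema.

```python
import json
from glob import glob

# Sketch: summarize the most recent results file. The results/benchmark_*.json
# pattern mirrors the report command in this guide; field names follow the
# sample JSON above and may differ in your benchmark version.
latest = sorted(glob("results/benchmark_*.json"))[-1]
with open(latest) as fh:
    results = json.load(fh)

for name, category in results["categories"].items():
    best = category.get("best_pipeline", "n/a")
    print(f"{name:10s} dataset={category['dataset']:15s} best={best}")

aggregate = results.get("aggregate", {})
print(f"weighted aggregate: {aggregate.get('weighted_score')} "
      f"weights={aggregate.get('weights')}")

recommendations = results.get("recommendations", {})
print(f"best overall: {recommendations.get('best_overall', 'n/a')}")
```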
## Adding Custom Pipelines

To benchmark your own pipeline:

1. Create a pipeline configuration in `configs/`:

   ```yaml
   # configs/my-custom-pipeline.yaml
   extractor_id: pipeline
   config:
     stages:
       - extractor_id: my-layout-detector
         config:
           threshold: 0.7
       - extractor_id: ocr-tesseract
         config:
           use_layout_metadata: true
   ```

2. Run the benchmark with your pipeline:

   ```bash
   biblicus benchmark run --pipelines my-custom-pipeline
   ```

3. Compare results:

   ```bash
   biblicus benchmark report --input results/benchmark_*.json
   ```

## Dataset Downloads

### Automatic Download

```bash
# Download all datasets
biblicus benchmark download --datasets funsd,sroie,scanned-arxiv

# Download specific dataset
biblicus benchmark download --datasets sroie
```

### Manual Download

If automatic download fails:

**FUNSD:**
1. Visit https://guillaumejaume.github.io/FUNSD/
2. Download `dataset.zip`
3. Extract to `corpora/funsd_benchmark/`

**SROIE:**
1. Register at https://rrc.cvc.uab.es/?ch=13
2. Download the task files
3. Run `python scripts/download_sroie_samples.py --from-local /path/to/sroie`

**Scanned ArXiv:**

```python
from datasets import load_dataset

dataset = load_dataset("IAMJB/scanned-arxiv-papers")
```

## Licensing

| Dataset | License | Usage |
|---------|---------|-------|
| FUNSD | CC BY-NC-SA 4.0 | Non-commercial research |
| SROIE | Research only | ICDAR competition terms |
| Scanned ArXiv | Varies | Check per-paper license |

## See Also

- [OCR Benchmarking Guide](ocr-benchmarking.md) - Detailed OCR pipeline evaluation
- [Benchmark Results](benchmark-results.md) - Current benchmark results
- [Heron Implementation](heron-implementation.md) - Layout detection details
- [Extractors Overview](../extractors/index.md) - Available extraction pipelines