Benchmarking Overview

Biblicus is designed as a retrieval-augmented generation (RAG) platform where you can experiment with different extraction pipelines, retrieval backends, and configurations, then benchmark them against each other to find the best approach for your use case.

Why Benchmark?

Different documents require different approaches:

  • Forms need accurate field extraction and noise handling

  • Receipts require dense text recognition and entity extraction

  • Academic papers demand proper reading order across multi-column layouts

  • Handwritten content benefits from specialized OCR models

Biblicus lets you:

  1. Compare extraction pipelines (Tesseract, PaddleOCR, Docling VLMs, etc.)

  2. Evaluate retrieval backends (scan, SQLite FTS, TF-vector)

  3. Measure with comprehensive metrics (F1 score, WER, sequence accuracy, n-gram overlap)

  4. Reproduce results with snapshot-based evaluation

Benchmarking as a Platform

Rather than providing a single “best” configuration, Biblicus provides:

  • Multiple extraction pipelines with different speed/accuracy trade-offs

  • Standardized benchmark datasets (FUNSD forms, SROIE receipts, Scanned ArXiv papers)

  • Comprehensive metrics covering both accuracy and reading order

  • Reproducible workflows with configuration files and snapshot IDs

You can:

  • Benchmark any aspect: extraction, retrieval, or end-to-end analysis

  • Add custom pipelines and evaluate them against existing ones

  • Use quick/standard/full benchmark modes for development vs. validation

  • Export results to JSON or CSV for further analysis
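Once exported, results can be post-processed with standard tooling. The sketch below converts a results JSON document to CSV using only the standard library; note that the record layout shown (`pipeline`, `f1`, `wer` fields) is a hypothetical example, not the actual schema Biblicus emits.

```python
import csv
import io
import json

# Hypothetical results payload; the real schema written by the
# benchmark scripts may differ.
results_json = json.dumps([
    {"pipeline": "tesseract", "f1": 0.71, "wer": 0.34},
    {"pipeline": "paddleocr", "f1": 0.83, "wer": 0.21},
])

# Flatten the JSON records into CSV rows for spreadsheet analysis.
rows = json.loads(results_json)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["pipeline", "f1", "wer"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The same pattern works for any per-pipeline or per-document export: parse the JSON, pick the fields you care about, and write them row by row.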

Current Benchmarks

Document Extraction (OCR) Benchmarks

Biblicus includes multi-category benchmarks for document extraction:

1. Forms (FUNSD Dataset)

  • 199 scanned form documents with handwriting and noise

  • Primary metric: F1 Score

  • Tests field extraction, layout understanding, noise handling

2. Receipts (SROIE Dataset)

  • 626 receipt images with dense text and small fonts

  • Primary metric: F1 Score

  • Tests entity extraction, dense text recognition

3. Academic Papers (Scanned ArXiv; dataset pending)

  • Multi-column academic papers

  • Primary metric: LCS Ratio (reading order preservation)

  • Tests complex layout understanding and reading order

Speech-to-Text (STT) Benchmarks

Biblicus includes STT provider benchmarks:

1. LibriSpeech test-clean

  • 5.4 hours of read English speech (~2600 utterances)

  • Primary metric: Word Error Rate (WER)

  • Tests transcription accuracy, punctuation, and formatting

  • Providers: OpenAI Whisper, Deepgram Nova-3, Aldea

Evaluated Pipelines

We benchmark 8+ extraction pipelines including:

  • Tesseract (baseline)

  • PaddleOCR (high accuracy)

  • RapidOCR (lightweight)

  • Docling VLMs (SmolDocling, Granite)

  • Layout-aware approaches (Heron + Tesseract, PaddleOCR layout + Tesseract)

  • Unstructured.io parser

  • MarkItDown

See the Pipeline Catalog for detailed descriptions and configurations.

Getting Started

Quick Start (5-10 minutes)

Run a quick benchmark on a subset of documents:

python scripts/benchmark_all_pipelines.py \
  --corpus-path corpora/funsd \
  --config configs/benchmark/quick.yaml \
  --output results/quick_benchmark.json

See Quickstart Guide for step-by-step instructions.

Understanding Metrics

Biblicus measures extraction quality using three categories of metrics:

  • Set-based metrics: Precision, Recall, F1 Score (position-agnostic)

  • Order-aware metrics: WER, Sequence Accuracy, LCS Ratio (reading order quality)

  • N-gram overlap: Bigram and trigram overlap (local ordering)

See Metrics Reference for detailed explanations.
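To make the three metric categories concrete, here is a minimal sketch of how such metrics are typically computed over token lists. These are textbook formulations, not the implementations in `src/biblicus/evaluation/`, which may normalize or weight differently.

```python
from collections import Counter


def token_f1(predicted, reference):
    """Set-based: precision/recall/F1 over token multisets (position-agnostic)."""
    pred, ref = Counter(predicted), Counter(reference)
    overlap = sum((pred & ref).values())
    if not overlap:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return precision, recall, 2 * precision * recall / (precision + recall)


def word_error_rate(predicted, reference):
    """Order-aware: word-level edit distance divided by reference length."""
    d = list(range(len(predicted) + 1))  # d[j] = distance to predicted[:j]
    for i in range(1, len(reference) + 1):
        prev_diag, d[0] = d[0], i
        for j in range(1, len(predicted) + 1):
            cost = 0 if reference[i - 1] == predicted[j - 1] else 1
            prev_diag, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev_diag + cost)
    return d[-1] / len(reference)


def ngram_overlap(predicted, reference, n=2):
    """Local ordering: fraction of reference n-grams found in the prediction."""
    pred = Counter(zip(*(predicted[i:] for i in range(n))))
    ref = Counter(zip(*(reference[i:] for i in range(n))))
    total = sum(ref.values())
    return sum((pred & ref).values()) / total if total else 0.0
```

A prediction can score a perfect F1 yet a poor WER or bigram overlap when every word is recognized but emitted in the wrong order, which is exactly why the order-aware metrics exist.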

Current Results

For the latest benchmark results and recommendations by use case:

Current Benchmark Results

Documentation Structure

This documentation follows a hub-and-spoke model:

Core Guides:

Deep Dives:

Benchmark Modes

Biblicus supports three benchmark modes to balance speed vs. thoroughness:

OCR Benchmarks

| Mode | Duration | Use Case | Configuration |
| --- | --- | --- | --- |
| Quick | 5-10 min | Development iteration | configs/benchmark/quick.yaml |
| Standard | 30-60 min | Release validation | configs/benchmark/standard.yaml |
| Full | 2-4 hours | Comprehensive evaluation | configs/benchmark/full.yaml |

STT Benchmarks

| Mode | Duration | Audio Files | Use Case | Configuration |
| --- | --- | --- | --- | --- |
| Quick | 2-5 min | 20 | Development iteration | configs/benchmark/stt-quick.yaml |
| Standard | 10-20 min | 100 | Release validation | configs/benchmark/stt-standard.yaml |
| Full | 2-4 hours | ~2600 | Comprehensive evaluation | configs/benchmark/stt-full.yaml |

Customization

Adding Custom Pipelines

You can add your own extraction pipelines to benchmark:

  1. Create a pipeline configuration in configs/

  2. Add it to the benchmark runner

  3. Run the benchmark to compare against existing pipelines

See Pipeline Catalog for examples.
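As a rough illustration of step 1, a custom pipeline configuration might look like the fragment below. Every field name here is hypothetical; consult the Pipeline Catalog for Biblicus's actual configuration schema.

```yaml
# Illustrative only: field names are assumptions, not the real schema.
pipeline:
  name: my-custom-ocr
  extractor: tesseract
  preprocessing:
    - deskew
    - binarize
  options:
    language: eng
```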

Adding Custom Benchmarks

The benchmark framework is extensible:

  1. Create a new CategoryConfig with your dataset

  2. Define the primary metric for your document type

  3. Add it to the BenchmarkRunner

See Multi-Category Benchmark Framework for details.
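The steps above might look like the following sketch. The real `CategoryConfig` class may have a different signature; the field names used here are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Optional

# Sketch only: the actual CategoryConfig in src/biblicus/evaluation/
# may carry different or additional fields.
@dataclass
class CategoryConfig:
    name: str
    corpus_path: str
    primary_metric: str
    max_documents: Optional[int] = None

# A hypothetical custom category for invoice documents.
invoices = CategoryConfig(
    name="invoices",
    corpus_path="corpora/my_invoices",
    primary_metric="f1",
    max_documents=50,
)
```

Registering such a config with the `BenchmarkRunner` would then let your dataset be evaluated alongside the built-in FUNSD and SROIE categories.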

Architecture

Biblicus uses a three-tier benchmarking system:

Tier 1: Individual Document Evaluation

  • OCREvaluationResult for single document metrics

  • BenchmarkReport for aggregate metrics

Tier 2: Multi-Category Orchestration

  • CategoryConfig defines document categories

  • CategoryResult aggregates per-category results

  • BenchmarkResult provides multi-category aggregation

Tier 3: User-Facing Scripts

  • benchmark_all_pipelines.py - Compare all pipelines

  • benchmark_heron_vs_paddleocr.py - Direct comparison

  • quick_benchmark_layout_aware.py - Validate specific workflow

See Multi-Category Benchmark Framework for architectural details.
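The relationship between Tier 1 and Tier 2 can be sketched as a simple roll-up: per-document results aggregate into a report, and reports aggregate per category. The class shapes below are illustrative; the real `OCREvaluationResult` and `BenchmarkReport` carry more fields.

```python
from dataclasses import dataclass
from statistics import mean

# Illustrative shapes only, not the real evaluation classes.
@dataclass
class OCREvaluationResult:
    document_id: str
    f1: float
    wer: float

@dataclass
class BenchmarkReport:
    mean_f1: float
    mean_wer: float

def aggregate(results: list) -> BenchmarkReport:
    """Roll Tier-1 per-document results up into an aggregate report."""
    return BenchmarkReport(
        mean_f1=mean(r.f1 for r in results),
        mean_wer=mean(r.wer for r in results),
    )

report = aggregate([
    OCREvaluationResult("doc-1", f1=0.8, wer=0.2),
    OCREvaluationResult("doc-2", f1=0.6, wer=0.4),
])
```

Tier 3 scripts simply drive this pipeline end to end for one or more pipelines and print or export the resulting reports.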

Next Steps

  1. Run your first benchmark - 5-minute quickstart

  2. Explore pipelines - See what’s available

  3. Understand metrics - Learn how quality is measured

  4. Review current results - See how pipelines compare

Contributing

To add new benchmarks or pipelines:

  1. Follow the existing patterns in src/biblicus/evaluation/

  2. Add documentation to the appropriate guide

  3. Update this overview to link to your additions

  4. Submit a pull request

For questions or issues, see the main repository.