# Benchmarking Overview

Biblicus is designed as a retrieval-augmented generation platform where you can experiment with different extraction pipelines, retrieval backends, and configurations, then benchmark them against each other to find the best approach for your use case.
## Why Benchmark?

Different documents require different approaches:

- Forms need accurate field extraction and noise handling
- Receipts require dense text recognition and entity extraction
- Academic papers demand proper reading order across multi-column layouts
- Handwritten content benefits from specialized OCR models
Biblicus lets you:

- Compare extraction pipelines (Tesseract, PaddleOCR, Docling VLMs, etc.)
- Evaluate retrieval backends (scan, SQLite FTS, TF-vector)
- Measure with comprehensive metrics (F1 score, WER, sequence accuracy, n-gram overlap)
- Reproduce results with snapshot-based evaluation
## Benchmarking as a Platform

Rather than providing a single “best” configuration, Biblicus provides:

- Multiple extraction pipelines with different speed/accuracy trade-offs
- Standardized benchmark datasets (FUNSD forms, SROIE receipts, Scanned ArXiv papers)
- Comprehensive metrics covering both accuracy and reading order
- Reproducible workflows with configuration files and snapshot IDs
You can:

- Benchmark any aspect: extraction, retrieval, or end-to-end analysis
- Add custom pipelines and evaluate them against existing ones
- Use quick/standard/full benchmark modes for development vs. validation
- Export results to JSON or CSV for further analysis
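Exported JSON results can then be compared with a short script. The schema below (pipeline names keyed to metric dictionaries) is a hypothetical illustration, not Biblicus's actual output format:

```python
# Hypothetical exported results (illustrative schema only)
results = {
    "pipelines": {
        "tesseract": {"f1": 0.71, "wer": 0.32},
        "paddleocr": {"f1": 0.83, "wer": 0.21},
        "rapidocr": {"f1": 0.78, "wer": 0.26},
    }
}

def rank_by_f1(results: dict) -> list[tuple[str, float]]:
    """Return (pipeline, f1) pairs sorted best-first."""
    return sorted(
        ((name, m["f1"]) for name, m in results["pipelines"].items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

for name, f1 in rank_by_f1(results):
    print(f"{name}: F1={f1:.3f}")
```

The same ranking approach works for any metric in the export; swap the key to compare by WER instead.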
## Current Benchmarks

### Document Extraction (OCR) Benchmarks

Biblicus includes multi-category benchmarks for document extraction:

**1. Forms (FUNSD Dataset)**

- 199 scanned form documents with handwriting and noise
- Primary metric: F1 Score
- Tests field extraction, layout understanding, noise handling

**2. Receipts (SROIE Dataset)**

- 626 receipt images with dense text and small fonts
- Primary metric: F1 Score
- Tests entity extraction, dense text recognition

**3. Academic Papers (Scanned ArXiv)** (dataset pending)

- Multi-column academic papers
- Primary metric: LCS Ratio (reading order preservation)
- Tests complex layout understanding and reading order
### Speech-to-Text (STT) Benchmarks

Biblicus includes STT provider benchmarks:

**1. LibriSpeech test-clean**

- 5.4 hours of read English speech (~2600 utterances)
- Primary metric: Word Error Rate (WER)
- Tests transcription accuracy, punctuation, and formatting
- Providers: OpenAI Whisper, Deepgram Nova-3, Aldea
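As a rough illustration of the primary metric, WER counts word-level edits (substitutions, insertions, deletions) between a reference transcript and a hypothesis, divided by the reference length. A minimal sketch, not the scorer Biblicus actually uses:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance via dynamic programming, one row at a time
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion / 6 words
```

Note that WER can exceed 1.0 when the hypothesis contains many spurious insertions, which is why it is reported as an error rate rather than an accuracy.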
## Evaluated Pipelines

We benchmark 8+ extraction pipelines, including:

- Tesseract (baseline)
- PaddleOCR (high accuracy)
- RapidOCR (lightweight)
- Docling VLMs (SmolDocling, Granite)
- Layout-aware approaches (Heron + Tesseract, PaddleOCR layout + Tesseract)
- Unstructured.io parser
- MarkItDown

See the Pipeline Catalog for detailed descriptions and configurations.
## Getting Started

### Quick Start (5-10 minutes)

Run a quick benchmark on a subset of documents:

```shell
python scripts/benchmark_all_pipelines.py \
  --corpus-path corpora/funsd \
  --config configs/benchmark/quick.yaml \
  --output results/quick_benchmark.json
```

See the Quickstart Guide for step-by-step instructions.
## Understanding Metrics

Biblicus measures extraction quality using three categories of metrics:

- **Set-based metrics**: Precision, Recall, F1 Score (position-agnostic)
- **Order-aware metrics**: WER, Sequence Accuracy, LCS Ratio (reading order quality)
- **N-gram overlap**: Bigram and trigram overlap (local ordering)

See the Metrics Reference for detailed explanations.
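To make the order-sensitive categories concrete, here is a minimal sketch of bigram overlap and an LCS-style ratio. These are illustrative definitions under common conventions; Biblicus's exact formulas may differ, and `difflib`'s matching blocks only approximate a true longest common subsequence:

```python
from difflib import SequenceMatcher

def ngram_overlap(reference: str, extracted: str, n: int = 2) -> float:
    """Fraction of reference word n-grams also present in the extracted text."""
    def ngrams(text: str) -> set[tuple[str, ...]]:
        words = text.split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    ref, ext = ngrams(reference), ngrams(extracted)
    return len(ref & ext) / max(len(ref), 1)

def lcs_ratio(reference: str, extracted: str) -> float:
    """Approximate longest common word subsequence over reference length."""
    ref, ext = reference.split(), extracted.split()
    matcher = SequenceMatcher(None, ref, ext)
    lcs_len = sum(block.size for block in matcher.get_matching_blocks())
    return lcs_len / max(len(ref), 1)

# A column-swapped extraction keeps most local bigrams but scrambles global order,
# so n-gram overlap stays high while the LCS ratio drops.
ref = "alpha beta gamma delta"
ext = "gamma delta alpha beta"
print(ngram_overlap(ref, ext))  # 2 of 3 reference bigrams preserved
print(lcs_ratio(ref, ext))
```

This is why a multi-column paper extracted in the wrong column order can score well on set-based and n-gram metrics while failing on reading order.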
## Current Results

For the latest benchmark results and recommendations by use case, see the Current Results guide.
## Documentation Structure

This documentation follows a hub-and-spoke model:

**Core Guides:**

- Quickstart Guide: Step-by-step instructions for running benchmarks
- Pipeline Catalog: All available extraction pipelines with configurations
- Metrics Reference: Detailed metric definitions and interpretation
- Current Results: Latest benchmark findings and recommendations

**Deep Dives:**

- OCR Benchmarking Guide: Practical how-to for OCR evaluation
- STT Benchmarking Guide: Practical how-to for STT evaluation
- Multi-Category Benchmark Framework: Architecture and design
- Heron Implementation: Layout detection specifics
- Layout-Aware OCR Results: Detailed layout-aware analysis
## Benchmark Modes

Biblicus supports three benchmark modes to balance speed vs. thoroughness:

### OCR Benchmarks

| Mode | Duration | Use Case | Configuration |
|---|---|---|---|
| Quick | 5-10 min | Development iteration | |
| Standard | 30-60 min | Release validation | |
| Full | 2-4 hours | Comprehensive evaluation | |
### STT Benchmarks

| Mode | Duration | Audio Files | Use Case | Configuration |
|---|---|---|---|---|
| Quick | 2-5 min | 20 | Development iteration | |
| Standard | 10-20 min | 100 | Release validation | |
| Full | 2-4 hours | ~2600 | Comprehensive evaluation | |
## Customization

### Adding Custom Pipelines

You can add your own extraction pipelines to benchmark:

1. Create a pipeline configuration in `configs/`
2. Add it to the benchmark runner
3. Run the benchmark to compare against existing pipelines

See the Pipeline Catalog for examples.
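A pipeline configuration for step 1 might look like the following. Every key shown here is a hypothetical illustration, not Biblicus's actual schema; check the existing files under `configs/` for the real format:

```yaml
# configs/my_custom_pipeline.yaml (hypothetical example)
name: my-custom-pipeline
extractor:
  engine: tesseract        # which OCR engine to invoke
  language: eng
preprocessing:
  deskew: true             # straighten scanned pages first
  binarize: true
output:
  format: text
```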
### Adding Custom Benchmarks

The benchmark framework is extensible:

1. Create a new `CategoryConfig` with your dataset
2. Define the primary metric for your document type
3. Add it to the `BenchmarkRunner`

See the Multi-Category Benchmark Framework guide for details.
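The steps above might be sketched as follows. `CategoryConfig` and `BenchmarkRunner` are real Biblicus names, but the fields and constructor shown here are assumptions made for illustration; the actual definitions live in `src/biblicus/evaluation/` and may differ:

```python
from dataclasses import dataclass, field

# Hypothetical shape of a category definition, for illustration only.
@dataclass
class CategoryConfig:
    name: str                 # e.g. "invoices"
    dataset_path: str         # where the ground-truth corpus lives
    primary_metric: str       # metric used to rank pipelines for this category
    extra_metrics: list[str] = field(default_factory=list)

invoices = CategoryConfig(
    name="invoices",
    dataset_path="corpora/invoices",
    primary_metric="f1",
    extra_metrics=["wer", "bigram_overlap"],
)

# A runner would then receive the new category alongside the built-in ones:
# runner = BenchmarkRunner(categories=[forms, receipts, invoices])
print(invoices)
```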
## Architecture

Biblicus uses a three-tier benchmarking system:

**Tier 1: Individual Document Evaluation**

- `OCREvaluationResult` for single-document metrics
- `BenchmarkReport` for aggregate metrics

**Tier 2: Multi-Category Orchestration**

- `CategoryConfig` defines document categories
- `CategoryResult` aggregates per-category results
- `BenchmarkResult` provides multi-category aggregation

**Tier 3: User-Facing Scripts**

- `benchmark_all_pipelines.py`: compare all pipelines
- `benchmark_heron_vs_paddleocr.py`: direct comparison
- `quick_benchmark_layout_aware.py`: validate a specific workflow

See the Multi-Category Benchmark Framework guide for architectural details.
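The tiering can be pictured as a simple roll-up from per-document results to per-category aggregates. This is a hypothetical sketch reusing the class names above; the real signatures and fields in `src/biblicus/evaluation/` differ:

```python
from dataclasses import dataclass

# Hypothetical minimal shapes mirroring the Tier 1 -> Tier 2 flow.
@dataclass
class OCREvaluationResult:   # Tier 1: metrics for one document
    doc_id: str
    f1: float

@dataclass
class CategoryResult:        # Tier 2: aggregate for one category
    category: str
    mean_f1: float

def aggregate(category: str, docs: list[OCREvaluationResult]) -> CategoryResult:
    """Roll per-document metrics up into a per-category result."""
    mean = sum(d.f1 for d in docs) / len(docs)
    return CategoryResult(category=category, mean_f1=mean)

docs = [OCREvaluationResult("form_001", 0.8), OCREvaluationResult("form_002", 0.9)]
print(aggregate("forms", docs))
```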
## Next Steps

- Run your first benchmark: 5-minute quickstart
- Explore pipelines: see what’s available
- Understand metrics: learn how quality is measured
- Review current results: see how pipelines compare
## Contributing

To add new benchmarks or pipelines:

1. Follow the existing patterns in `src/biblicus/evaluation/`
2. Add documentation to the appropriate guide
3. Update this overview to link to your additions
4. Submit a pull request

For questions or issues, see the main repository.