# Benchmarking Overview

Biblicus is designed as a retrieval-augmented generation platform where you can experiment with different extraction pipelines, retrieval backends, and configurations, then benchmark them against each other to find the best approach for your use case.
## Why Benchmark?

Different documents require different approaches:

- Forms need accurate field extraction and noise handling
- Receipts require dense text recognition and entity extraction
- Academic papers demand proper reading order across multi-column layouts
- Handwritten content benefits from specialized OCR models
Biblicus lets you:

- Compare extraction pipelines (Tesseract, PaddleOCR, Docling VLMs, etc.)
- Evaluate retrieval backends (scan, SQLite FTS, TF-vector)
- Measure with comprehensive metrics (F1 score, WER, sequence accuracy, n-gram overlap)
- Reproduce results with snapshot-based evaluation
## Benchmarking as a Platform

Rather than providing a single “best” configuration, Biblicus provides:

- Multiple extraction pipelines with different speed/accuracy trade-offs
- Standardized benchmark datasets (FUNSD forms, SROIE receipts, Scanned ArXiv papers)
- Comprehensive metrics covering both accuracy and reading order
- Reproducible workflows with configuration files and snapshot IDs
You can:

- Benchmark any aspect: extraction, retrieval, or end-to-end analysis
- Add custom pipelines and evaluate them against existing ones
- Use quick/standard/full benchmark modes for development vs. validation
- Export results to JSON or CSV for further analysis
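Exported JSON results can then be compared with a short script. The schema below (pipeline names keyed to metric dictionaries) is a hypothetical illustration, not Biblicus's actual output format:

```python
# Hypothetical exported results (illustrative schema only)
results = {
    "pipelines": {
        "tesseract": {"f1": 0.71, "wer": 0.32},
        "paddleocr": {"f1": 0.83, "wer": 0.21},
        "rapidocr": {"f1": 0.78, "wer": 0.26},
    }
}

def rank_by_f1(results: dict) -> list[tuple[str, float]]:
    """Return (pipeline, f1) pairs sorted best-first."""
    return sorted(
        ((name, m["f1"]) for name, m in results["pipelines"].items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

for name, f1 in rank_by_f1(results):
    print(f"{name}: F1={f1:.3f}")
```

The same ranking approach works for any metric in the export; swap the key to compare by WER instead.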
## Current Benchmarks

### Document Extraction (OCR) Benchmarks

Biblicus includes multi-category benchmarks for document extraction:

**1. Forms (FUNSD Dataset)**

- 199 scanned form documents with handwriting and noise
- Primary metric: F1 Score
- Tests field extraction, layout understanding, noise handling

**2. Receipts (SROIE Dataset)**

- 626 receipt images with dense text and small fonts
- Primary metric: F1 Score
- Tests entity extraction, dense text recognition

**3. Academic Papers (Scanned ArXiv)** (dataset pending)

- Multi-column academic papers
- Primary metric: LCS Ratio (reading order preservation)
- Tests complex layout understanding and reading order
### Speech-to-Text (STT) Benchmarks

Biblicus includes STT provider benchmarks:

**1. LibriSpeech test-clean**

- 5.4 hours of read English speech (~2600 utterances)
- Primary metric: Word Error Rate (WER)
- Tests transcription accuracy, punctuation, and formatting
- Providers: OpenAI Whisper, Deepgram Nova-3, Aldea
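As a rough illustration of the primary metric, WER counts word-level edits (substitutions, insertions, deletions) between a reference transcript and a hypothesis, divided by the reference length. A minimal sketch, not the scorer Biblicus actually uses:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance via dynamic programming, one row at a time
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion / 6 words
```

Note that WER can exceed 1.0 when the hypothesis contains many spurious insertions, which is why it is reported as an error rate rather than an accuracy.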
## Evaluated Pipelines

We benchmark 8+ extraction pipelines, including:

- Tesseract (baseline)
- PaddleOCR (high accuracy)
- RapidOCR (lightweight)
- Docling VLMs (SmolDocling, Granite)
- Layout-aware approaches (Heron + Tesseract, PaddleOCR layout + Tesseract)
- Unstructured.io parser
- MarkItDown

See the Pipeline Catalog for detailed descriptions and configurations.
## Getting Started

### Quick Start (5-10 minutes)

Run a quick benchmark on a subset of documents:

```shell
python scripts/benchmark_all_pipelines.py \
  --corpus-path corpora/funsd \
  --config configs/benchmark/quick.yaml \
  --output results/quick_benchmark.json
```

See the Quickstart Guide for step-by-step instructions.
## Understanding Metrics

Biblicus measures extraction quality using three categories of metrics:

- **Set-based metrics**: Precision, Recall, F1 Score (position-agnostic)
- **Order-aware metrics**: WER, Sequence Accuracy, LCS Ratio (reading order quality)
- **N-gram overlap**: Bigram and trigram overlap (local ordering)

See the Metrics Reference for detailed explanations.
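To make the order-sensitive categories concrete, here is a minimal sketch of bigram overlap and an LCS-style ratio. These are illustrative definitions under common conventions; Biblicus's exact formulas may differ, and `difflib`'s matching blocks only approximate a true longest common subsequence:

```python
from difflib import SequenceMatcher

def ngram_overlap(reference: str, extracted: str, n: int = 2) -> float:
    """Fraction of reference word n-grams also present in the extracted text."""
    def ngrams(text: str) -> set[tuple[str, ...]]:
        words = text.split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    ref, ext = ngrams(reference), ngrams(extracted)
    return len(ref & ext) / max(len(ref), 1)

def lcs_ratio(reference: str, extracted: str) -> float:
    """Approximate longest common word subsequence over reference length."""
    ref, ext = reference.split(), extracted.split()
    matcher = SequenceMatcher(None, ref, ext)
    lcs_len = sum(block.size for block in matcher.get_matching_blocks())
    return lcs_len / max(len(ref), 1)

# A column-swapped extraction keeps most local bigrams but scrambles global order,
# so n-gram overlap stays high while the LCS ratio drops.
ref = "alpha beta gamma delta"
ext = "gamma delta alpha beta"
print(ngram_overlap(ref, ext))  # 2 of 3 reference bigrams preserved
print(lcs_ratio(ref, ext))
```

This is why a multi-column paper extracted in the wrong column order can score well on set-based and n-gram metrics while failing on reading order.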
## Current Results

For the latest benchmark results and recommendations by use case, see the Current Results guide.
## Documentation Structure

This documentation follows a hub-and-spoke model:

**Core Guides:**

- Quickstart Guide: Step-by-step instructions for running benchmarks
- Pipeline Catalog: All available extraction pipelines with configurations
- Metrics Reference: Detailed metric definitions and interpretation
- Current Results: Latest benchmark findings and recommendations

**Deep Dives:**

- OCR Benchmarking Guide: Practical how-to for OCR evaluation
- STT Benchmarking Guide: Practical how-to for STT evaluation
- Multi-Category Benchmark Framework: Architecture and design
- Heron Implementation: Layout detection specifics
- Layout-Aware OCR Results: Detailed layout-aware analysis
## Benchmark Modes

Biblicus supports three benchmark modes to balance speed vs. thoroughness:

### OCR Benchmarks

| Mode | Duration | Use Case | Configuration |
|---|---|---|---|
| Quick | 5-10 min | Development iteration | |
| Standard | 30-60 min | Release validation | |
| Full | 2-4 hours | Comprehensive evaluation | |
### STT Benchmarks

| Mode | Duration | Audio Files | Use Case | Configuration |
|---|---|---|---|---|
| Quick | 2-5 min | 20 | Development iteration | |
| Standard | 10-20 min | 100 | Release validation | |
| Full | 2-4 hours | ~2600 | Comprehensive evaluation | |
## Customization

### Adding Custom Pipelines

You can add your own extraction pipelines to benchmark:

1. Create a pipeline configuration in `configs/`
2. Add it to the benchmark runner
3. Run the benchmark to compare against existing pipelines

See the Pipeline Catalog for examples.
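A pipeline configuration for step 1 might look like the following. Every key shown here is a hypothetical illustration, not Biblicus's actual schema; check the existing files under `configs/` for the real format:

```yaml
# configs/my_custom_pipeline.yaml (hypothetical example)
name: my-custom-pipeline
extractor:
  engine: tesseract        # which OCR engine to invoke
  language: eng
preprocessing:
  deskew: true             # straighten scanned pages first
  binarize: true
output:
  format: text
```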
### Adding Custom Benchmarks

The benchmark framework is extensible:

1. Create a new `CategoryConfig` with your dataset
2. Define the primary metric for your document type
3. Add it to the `BenchmarkRunner`

See the Multi-Category Benchmark Framework guide for details.
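The steps above might be sketched as follows. `CategoryConfig` and `BenchmarkRunner` are real Biblicus names, but the fields and constructor shown here are assumptions made for illustration; the actual definitions live in `src/biblicus/evaluation/` and may differ:

```python
from dataclasses import dataclass, field

# Hypothetical shape of a category definition, for illustration only.
@dataclass
class CategoryConfig:
    name: str                 # e.g. "invoices"
    dataset_path: str         # where the ground-truth corpus lives
    primary_metric: str       # metric used to rank pipelines for this category
    extra_metrics: list[str] = field(default_factory=list)

invoices = CategoryConfig(
    name="invoices",
    dataset_path="corpora/invoices",
    primary_metric="f1",
    extra_metrics=["wer", "bigram_overlap"],
)

# A runner would then receive the new category alongside the built-in ones:
# runner = BenchmarkRunner(categories=[forms, receipts, invoices])
print(invoices)
```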
## Architecture

Biblicus uses a three-tier benchmarking system:

**Tier 1: Individual Document Evaluation**

- `OCREvaluationResult` for single-document metrics
- `BenchmarkReport` for aggregate metrics

**Tier 2: Multi-Category Orchestration**

- `CategoryConfig` defines document categories
- `CategoryResult` aggregates per-category results
- `BenchmarkResult` provides multi-category aggregation

**Tier 3: User-Facing Scripts**

- `benchmark_all_pipelines.py`: compare all pipelines
- `benchmark_heron_vs_paddleocr.py`: direct comparison
- `quick_benchmark_layout_aware.py`: validate a specific workflow

See the Multi-Category Benchmark Framework guide for architectural details.
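The tiering can be pictured as a simple roll-up from per-document results to per-category aggregates. This is a hypothetical sketch reusing the class names above; the real signatures and fields in `src/biblicus/evaluation/` differ:

```python
from dataclasses import dataclass

# Hypothetical minimal shapes mirroring the Tier 1 -> Tier 2 flow.
@dataclass
class OCREvaluationResult:   # Tier 1: metrics for one document
    doc_id: str
    f1: float

@dataclass
class CategoryResult:        # Tier 2: aggregate for one category
    category: str
    mean_f1: float

def aggregate(category: str, docs: list[OCREvaluationResult]) -> CategoryResult:
    """Roll per-document metrics up into a per-category result."""
    mean = sum(d.f1 for d in docs) / len(docs)
    return CategoryResult(category=category, mean_f1=mean)

docs = [OCREvaluationResult("form_001", 0.8), OCREvaluationResult("form_002", 0.9)]
print(aggregate("forms", docs))
```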
## Next Steps

- Run your first benchmark: 5-minute quickstart
- Explore pipelines: see what’s available
- Understand metrics: learn how quality is measured
- Review current results: see how pipelines compare
## Contributing

To add new benchmarks or pipelines:

1. Follow the existing patterns in `src/biblicus/evaluation/`
2. Add documentation to the appropriate guide
3. Update this overview to link to your additions
4. Submit a pull request

For questions or issues, see the main repository.