# Benchmarking Overview

Biblicus is designed as a **retrieval-augmented generation platform** where you can experiment with different extraction pipelines, retrieval backends, and configurations, then benchmark them against each other to find the best approach for your use case.

## Why Benchmark?

Different documents require different approaches:

- **Forms** need accurate field extraction and noise handling
- **Receipts** require dense text recognition and entity extraction
- **Academic papers** demand proper reading order across multi-column layouts
- **Handwritten content** benefits from specialized OCR models

Biblicus lets you:

1. **Compare extraction pipelines** (Tesseract, PaddleOCR, Docling VLMs, etc.)
2. **Evaluate retrieval backends** (scan, SQLite FTS, TF-vector)
3. **Measure with comprehensive metrics** (F1 score, WER, sequence accuracy, n-gram overlap)
4. **Reproduce results** with snapshot-based evaluation

## Benchmarking as a Platform

Rather than providing a single "best" configuration, Biblicus provides:

- **Multiple extraction pipelines** with different speed/accuracy trade-offs
- **Standardized benchmark datasets** (FUNSD forms, SROIE receipts, scanned arXiv papers)
- **Comprehensive metrics** covering both accuracy and reading order
- **Reproducible workflows** with configuration files and snapshot IDs

You can:

- Benchmark any aspect: extraction, retrieval, or end-to-end analysis
- Add custom pipelines and evaluate them against existing ones
- Use quick/standard/full benchmark modes for development vs. validation
- Export results to JSON or CSV for further analysis

## Current Benchmarks

### Document Extraction (OCR) Benchmarks

Biblicus includes multi-category benchmarks for document extraction:

**1. Forms (FUNSD Dataset)**

- 199 scanned form documents with handwriting and noise
- Primary metric: F1 Score
- Tests field extraction, layout understanding, noise handling

**2. Receipts (SROIE Dataset)**

- 626 receipt images with dense text and small fonts
- Primary metric: F1 Score
- Tests entity extraction, dense text recognition

**3. Academic Papers (Scanned ArXiv)** *(dataset pending)*

- Multi-column academic papers
- Primary metric: LCS Ratio (reading order preservation)
- Tests complex layout understanding and reading order

### Speech-to-Text (STT) Benchmarks

Biblicus includes STT provider benchmarks:

**1. LibriSpeech test-clean**

- 5.4 hours of read English speech (~2600 utterances)
- Primary metric: Word Error Rate (WER)
- Tests transcription accuracy, punctuation, and formatting
- Providers: OpenAI Whisper, Deepgram Nova-3, Aldea

### Evaluated Pipelines

We benchmark 8+ extraction pipelines, including:

- Tesseract (baseline)
- PaddleOCR (high accuracy)
- RapidOCR (lightweight)
- Docling VLMs (SmolDocling, Granite)
- Layout-aware approaches (Heron + Tesseract, PaddleOCR layout + Tesseract)
- Unstructured.io parser
- MarkItDown

See the [Pipeline Catalog](pipeline-catalog.md) for detailed descriptions and configurations.

## Getting Started

### Quick Start (5-10 minutes)

Run a quick benchmark on a subset of documents:

```bash
python scripts/benchmark_all_pipelines.py \
  --corpus-path corpora/funsd \
  --config configs/benchmark/quick.yaml \
  --output results/quick_benchmark.json
```

See the [Quickstart Guide](quickstart-benchmarking.md) for step-by-step instructions.
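Once the run finishes, the JSON output can be inspected directly. The snippet below is a minimal sketch that assumes a simple schema (a list of per-pipeline entries with `pipeline` and `f1` fields) purely for illustration; check the generated `quick_benchmark.json` for the actual structure.

```python
import json

# Inspect the results file written by the --output flag above.
# NOTE: the schema used here (a list of per-pipeline entries with
# "pipeline" and "f1" keys) is an assumption for illustration only;
# open your generated quick_benchmark.json to see the real layout.
with open("results/quick_benchmark.json") as handle:
    results = json.load(handle)

# Rank pipelines by F1, highest first.
for entry in sorted(results, key=lambda r: r.get("f1", 0.0), reverse=True):
    print(f"{entry.get('pipeline', 'unknown'):<30} F1={entry.get('f1', 0.0):.3f}")
```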
### Understanding Metrics

Biblicus measures extraction quality using three categories of metrics:

- **Set-based metrics**: Precision, Recall, F1 Score (position-agnostic)
- **Order-aware metrics**: WER, Sequence Accuracy, LCS Ratio (reading order quality)
- **N-gram overlap**: Bigram and trigram overlap (local ordering)

See the [Metrics Reference](metrics-reference.md) for detailed explanations.

### Current Results

For the latest benchmark results and recommendations by use case:

→ [Current Benchmark Results](benchmark-results.md)

## Documentation Structure

This documentation follows a hub-and-spoke model:

**Core Guides:**

- **[Quickstart Guide](quickstart-benchmarking.md)**: Step-by-step instructions for running benchmarks
- **[Pipeline Catalog](pipeline-catalog.md)**: All available extraction pipelines with configurations
- **[Metrics Reference](metrics-reference.md)**: Detailed metric definitions and interpretation
- **[Current Results](benchmark-results.md)**: Latest benchmark findings and recommendations

**Deep Dives:**

- **[OCR Benchmarking Guide](ocr-benchmarking.md)**: Practical how-to for OCR evaluation
- **[STT Benchmarking Guide](stt-benchmarking.md)**: Practical how-to for STT evaluation
- **[Multi-Category Benchmark Framework](document-understanding-benchmark.md)**: Architecture and design
- **[Heron Implementation](heron-implementation.md)**: Layout detection specifics
- **[Layout-Aware OCR Results](layout-aware-ocr-results.md)**: Detailed layout-aware analysis

## Benchmark Modes

Biblicus supports three benchmark modes to balance speed vs. thoroughness:

### OCR Benchmarks

| Mode | Duration | Use Case | Configuration |
|------|----------|----------|---------------|
| **Quick** | 5-10 min | Development iteration | `configs/benchmark/quick.yaml` |
| **Standard** | 30-60 min | Release validation | `configs/benchmark/standard.yaml` |
| **Full** | 2-4 hours | Comprehensive evaluation | `configs/benchmark/full.yaml` |

### STT Benchmarks

| Mode | Duration | Audio Files | Use Case | Configuration |
|------|----------|-------------|----------|---------------|
| **Quick** | 2-5 min | 20 | Development iteration | `configs/benchmark/stt-quick.yaml` |
| **Standard** | 10-20 min | 100 | Release validation | `configs/benchmark/stt-standard.yaml` |
| **Full** | 2-4 hours | ~2600 | Comprehensive evaluation | `configs/benchmark/stt-full.yaml` |

## Customization

### Adding Custom Pipelines

You can add your own extraction pipelines to benchmark:

1. Create a pipeline configuration in `configs/`
2. Add it to the benchmark runner
3. Run the benchmark to compare against existing pipelines

See the [Pipeline Catalog](pipeline-catalog.md) for examples.

### Adding Custom Benchmarks

The benchmark framework is extensible:

1. Create a new `CategoryConfig` with your dataset (see the sketch below)
2. Define the primary metric for your document type
3. Add it to the `BenchmarkRunner`

See the [Multi-Category Benchmark Framework](document-understanding-benchmark.md) for details.
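To make the steps above concrete, here is a rough sketch of defining and registering a custom category. `CategoryConfig` and `BenchmarkRunner` are the framework's classes, but the import path, constructor arguments, and `run()` call shown here are illustrative assumptions; see the [Multi-Category Benchmark Framework](document-understanding-benchmark.md) for the actual API.

```python
# Illustrative sketch only: CategoryConfig and BenchmarkRunner exist in the
# framework, but the import path, field names, and runner methods below
# are assumptions made for this example.
from biblicus.evaluation import BenchmarkRunner, CategoryConfig  # hypothetical import path

invoices = CategoryConfig(
    name="invoices",                 # category label used in reports
    corpus_path="corpora/invoices",  # where your ground-truth documents live
    primary_metric="f1",             # headline metric for this document type
)

runner = BenchmarkRunner(categories=[invoices])
result = runner.run(config="configs/benchmark/quick.yaml")
print(result)
```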
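The primary metric in the sketch above is the position-agnostic F1 described under Understanding Metrics. For intuition, a token-level F1 can be computed roughly as follows; this is a generic illustration of the concept, not Biblicus's implementation, and the [Metrics Reference](metrics-reference.md) has the definitions actually used.

```python
from collections import Counter

def token_f1(predicted: str, reference: str) -> float:
    """Position-agnostic F1 over extracted tokens (generic illustration)."""
    pred_counts = Counter(predicted.split())
    ref_counts = Counter(reference.split())
    # Tokens count as correct regardless of where they appear in the text.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(token_f1("invoice total 42", "total 42 due"))  # 0.667: "total" and "42" match
```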
## Architecture

Biblicus uses a three-tier benchmarking system:

**Tier 1: Individual Document Evaluation**

- `OCREvaluationResult` for single document metrics
- `BenchmarkReport` for aggregate metrics

**Tier 2: Multi-Category Orchestration**

- `CategoryConfig` defines document categories
- `CategoryResult` aggregates per-category results
- `BenchmarkResult` provides multi-category aggregation

**Tier 3: User-Facing Scripts**

- `benchmark_all_pipelines.py` - Compare all pipelines
- `benchmark_heron_vs_paddleocr.py` - Direct comparison
- `quick_benchmark_layout_aware.py` - Validate a specific workflow

See the [Multi-Category Benchmark Framework](document-understanding-benchmark.md) for architectural details.

## Next Steps

1. **[Run your first benchmark](quickstart-benchmarking.md)** - 5-minute quickstart
2. **[Explore pipelines](pipeline-catalog.md)** - See what's available
3. **[Understand metrics](metrics-reference.md)** - Learn how quality is measured
4. **[Review current results](benchmark-results.md)** - See how pipelines compare

## Contributing

To add new benchmarks or pipelines:

1. Follow the existing patterns in `src/biblicus/evaluation/`
2. Add documentation to the appropriate guide
3. Update this overview to link to your additions
4. Submit a pull request

For questions or issues, see the [main repository](https://github.com/AnthusAI/Biblicus).