Pipeline Catalog

Complete reference of all extraction pipelines available for benchmarking in Biblicus.

Overview

Biblicus includes 8+ pre-configured extraction pipelines with different speed/accuracy trade-offs. Each pipeline is defined in a YAML configuration file under configs/.

Pipeline Comparison Table

Pipeline	F1 Score	Recall	Speed	Use Case
PaddleOCR	0.787	0.782	Medium	Best overall accuracy
Docling-Smol	0.728	0.675	Slow	Tables & formulas
Unstructured	0.649	0.626	Medium	General documents
Baseline Tesseract	0.607	0.599	Fast	Simple baseline
Layout-Aware Tesseract (PaddleOCR)	0.601	0.732	Medium	High recall needs
Heron + Tesseract	0.519	0.810	Slow	Maximum extraction
RapidOCR	0.507	0.467	Fast	Lightweight/embedded

Scores are from the FUNSD forms benchmark. MarkItDown and Docling-Granite are not listed because this catalog reports no scores for them.

Basic OCR Pipelines

1. Baseline Tesseract

Simple Tesseract OCR without layout detection.

Configuration: configs/baseline-ocr.yaml

extractor_id: ocr-tesseract
config:
  min_confidence: 0.0
  lang: eng

Performance (FUNSD Forms):

F1 Score: 0.607
Recall: 0.599
Precision: 0.615
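
To make the relationship between these three numbers concrete, the sketch below computes word-level precision, recall, and F1 from a reference text and an OCR output using a simple bag-of-words overlap. This is an illustrative assumption about the metric, not the Biblicus implementation; the evaluation module may tokenize or aggregate differently, and the function name word_prf is hypothetical.

```python
from collections import Counter


def word_prf(reference: str, hypothesis: str) -> tuple[float, float, float]:
    """Return (precision, recall, f1) over lowercase whitespace-split words."""
    ref_counts = Counter(reference.lower().split())
    hyp_counts = Counter(hypothesis.lower().split())
    # Multiset intersection: a word matches at most as often as it appears in both texts.
    matched = sum((ref_counts & hyp_counts).values())
    precision = matched / sum(hyp_counts.values()) if hyp_counts else 0.0
    recall = matched / sum(ref_counts.values()) if ref_counts else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


p, r, f1 = word_prf("invoice number 12345 total due", "invoice numbr 12345 total")
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

F1 is the harmonic mean of precision and recall. Plugging the reported values into 2PR / (P + R) with P = 0.615 and R = 0.599 gives roughly 0.607, which matches the F1 score listed here.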

Strengths:

Fast processing
Minimal dependencies
Good baseline for comparison

Weaknesses:

No layout understanding
Struggles with complex formatting
Lower accuracy on forms

Best for: Simple documents, baseline comparisons, speed-critical applications

2. PaddleOCR

PaddleOCR with the VL model. It has the highest F1 score of the pipelines benchmarked here.

Configuration: configs/ocr-paddleocr.yaml

extractor_id: ocr-paddleocr-vl
config:
  lang: en

Performance (FUNSD Forms):

F1 Score: 0.787 ⭐ BEST F1
Recall: 0.782
Precision: 0.792
Word Error Rate: 0.533
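
Word Error Rate is commonly defined as the word-level edit distance between reference and output, divided by the number of reference words. The sketch below follows that common definition. It is an assumption that Biblicus computes WER the same way, and word_error_rate is a hypothetical helper name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] is the edit distance between the first i-1 ref words and the first j hyp words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            # Candidates: delete a ref word, insert a hyp word, or substitute/match.
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word ("due") and one misrecognized word ("dolars"): 2 errors over 5 words.
wer = word_error_rate("total amount due 42 dollars", "total amount 42 dolars")
print(f"WER = {wer:.3f}")
```

Lower is better. Because WER is sensitive to word order and insertions, it can be high even when F1 is good.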

Strengths:

Highest F1 score across benchmarks
Built-in layout detection
Good balance of precision and recall
Handles complex layouts

Weaknesses:

Requires PaddleOCR dependencies
Slower than Tesseract
Higher memory usage

Best for: Production systems, complex documents, when accuracy matters most

Installation:

pip install "biblicus[paddleocr]"

3. Docling-Smol

Docling with SmolDocling-256M vision-language model for document understanding.

Configuration: configs/docling-smol.yaml

extractor_id: docling-smol
config:
  output_format: markdown

Performance (FUNSD Forms):

F1 Score: 0.728
Recall: 0.675
Precision: 0.788

Strengths:

Advanced VLM-based extraction
Excellent for tables and formulas
Structured output (markdown)
Good semantic understanding

Weaknesses:

Slower processing
Higher resource requirements
May be overkill for simple forms

Best for: Academic papers, technical documents, tables and formulas

Installation:

pip install "biblicus[docling]"

4. RapidOCR

Fast, lightweight OCR library for resource-constrained environments.

Configuration: configs/ocr-rapidocr.yaml

extractor_id: ocr-rapidocr
config:
  use_det: true
  use_cls: true
  use_rec: true

Performance (FUNSD Forms):

F1 Score: 0.507
Recall: 0.467
Precision: 0.556

Strengths:

Very fast processing
Minimal dependencies
Low memory footprint
Good for embedded systems

Weaknesses:

Lower accuracy than PaddleOCR/Tesseract
Limited language support
Simpler text detection

Best for: Real-time applications, edge devices, resource constraints

Installation:

pip install "biblicus[ocr]"

5. Unstructured

Unstructured.io document parser with multi-format support.

Configuration: configs/unstructured.yaml

extractor_id: unstructured
config: {}

Performance (FUNSD Forms):

F1 Score: 0.649
Recall: 0.626
Precision: 0.673

Strengths:

Handles many document formats (PDF, Word, HTML, etc.)
Good general-purpose parser
Structured element extraction

Weaknesses:

Heavy dependency footprint
Slower than specialized OCR
May be overkill for images only

Best for: Mixed document types, production pipelines handling various formats

Installation:

pip install "biblicus[unstructured]"

8. MarkItDown

Microsoft’s MarkItDown converter for document-to-markdown conversion.

Configuration: configs/markitdown.yaml

extractor_id: markitdown
config: {}

Strengths:

Excellent markdown output
Handles Office documents
Preserves structure

Weaknesses:

Requires Python 3.10+
Not optimized for OCR

Best for: Converting Office documents, markdown workflows

Installation:

pip install "biblicus[markitdown]"

Layout-Aware Pipelines

Layout-aware pipelines use a two-stage approach:

Layout detection to identify document regions and reading order
OCR on each region in sequence

This improves reading order and can increase recall at the cost of precision.
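
The two-stage flow described above can be sketched in a few lines. The Region type, its fields, and the stand-in OCR function are all hypothetical; they only illustrate the idea of sorting detected regions by reading order before running OCR on each one, and do not reflect the actual Biblicus data structures.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Region:
    """A detected layout region; all fields are hypothetical for this sketch."""
    order: int                       # reading-order index assigned by the layout detector
    bbox: tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
    label: str                       # e.g. "title", "text", "table"


def run_layout_aware(regions: list[Region], ocr_region: Callable[[Region], str]) -> str:
    """Stage 2: OCR each region in reading order and join the non-empty results."""
    ordered = sorted(regions, key=lambda region: region.order)
    texts = [ocr_region(region) for region in ordered]
    return "\n".join(text for text in texts if text.strip())


# Stage 1 output would come from a layout model; here it is hard-coded.
detected = [
    Region(order=2, bbox=(0, 200, 600, 400), label="text"),
    Region(order=1, bbox=(0, 0, 600, 100), label="title"),
]
# Stand-in OCR function so the sketch runs without an OCR engine installed.
fake_ocr = lambda region: f"[{region.label} text]"
result = run_layout_aware(detected, fake_ocr)
print(result)
```

Running OCR per region is also where the extra recall and the extra noise tend to come from: every detected region is read, including ones a whole-page pass might skip.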

6. Layout-Aware Tesseract (PaddleOCR)

PaddleOCR PP-Structure layout detection → Tesseract OCR.

Configuration: configs/layout-aware-tesseract.yaml

extractor_id: pipeline
config:
  stages:
    - extractor_id: paddleocr-layout
      config:
        lang: en
    - extractor_id: ocr-tesseract
      config:
        use_layout_metadata: true

Performance (FUNSD Forms):

F1 Score: 0.601
Recall: 0.732 (+22.2% relative to baseline Tesseract's 0.599)
Precision: 0.503
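
The +22.2% recall figure above is a relative improvement, not a difference in percentage points. A short check of the arithmetic, using only the recall values reported in this catalog:

```python
baseline_recall = 0.599  # Baseline Tesseract
layout_recall = 0.732    # Layout-Aware Tesseract (PaddleOCR)

absolute_gain = layout_recall - baseline_recall   # difference in recall
relative_gain = absolute_gain / baseline_recall   # gain as a fraction of the baseline

print(f"absolute: +{absolute_gain:.3f}, relative: +{relative_gain:.1%}")
```

In absolute terms the gain is 13.3 recall points (0.599 to 0.732).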

Strengths:

Higher recall than baseline Tesseract
Better reading order
Handles multi-column layouts

Weaknesses:

Lower precision (more false positives)
Slower than single-stage
Requires PaddleOCR

Best for: Documents where missing content is costly, complex layouts

Trade-off: Sacrifices precision for recall - finds more text but includes more noise.

7. Heron + Tesseract

IBM Heron-101 layout detection → Tesseract OCR for maximum text extraction.

Configuration: configs/heron-tesseract.yaml

extractor_id: pipeline
config:
  stages:
    - extractor_id: heron-layout
      config:
        model_variant: "101"
        confidence_threshold: 0.6
    - extractor_id: ocr-tesseract
      config:
        use_layout_metadata: true

Performance (FUNSD Forms):

F1 Score: 0.519
Recall: 0.810 ⭐ HIGHEST RECALL
Precision: 0.384
Bigram Overlap: 0.561 (best local ordering)
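
Bigram overlap measures how well adjacent word pairs are preserved, which is why it is read as a signal of local ordering. The sketch below uses one plausible definition: the fraction of distinct reference bigrams that also appear in the output. The exact formula Biblicus uses is not given here, so treat both the definition and the helper name bigram_overlap as assumptions.

```python
def bigram_overlap(reference: str, hypothesis: str) -> float:
    """Fraction of distinct reference bigrams that also occur in the hypothesis."""
    def bigrams(text: str) -> set[tuple[str, str]]:
        words = text.lower().split()
        return set(zip(words, words[1:]))

    ref_bigrams = bigrams(reference)
    if not ref_bigrams:
        return 0.0
    return len(ref_bigrams & bigrams(hypothesis)) / len(ref_bigrams)


# The two phrases are swapped, but word order inside each phrase is intact,
# so only the bigram that spans the phrase boundary is lost.
score = bigram_overlap("date of birth place of issue", "place of issue date of birth")
print(f"bigram overlap = {score:.2f}")
```

The example shows why a pipeline can score well on bigrams while still reordering whole blocks: the metric only looks at neighbouring words.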

Strengths:

Finds 81% of all words - more than any other pipeline
Excellent local word ordering (bigrams)
Best for completeness
Strong layout understanding

Weaknesses:

Lowest precision (38.4%)
More false positives/noise
Slower processing
Lower F1 due to precision trade-off

Best for:

Applications where missing content is worse than noise
Documents requiring maximum text extraction
When completeness matters more than precision
Legal/compliance where you can’t miss text

Trade-off: Maximum recall at the cost of precision - captures more of the text than any other pipeline (81% of words) but also includes the most noise (precision 0.384).

See Heron Implementation Guide for detailed information.

Vision-Language Models

Docling-Granite

Docling with IBM Granite Docling-258M VLM for high-accuracy extraction.

Configuration: configs/docling-granite.yaml

extractor_id: docling-granite
config:
  output_format: markdown

Strengths:

Higher accuracy than SmolDocling
Excellent for technical documents
Strong table understanding

Weaknesses:

Slower than SmolDocling
Higher resource requirements

Best for: When maximum VLM accuracy is needed, complex technical documents

Installation:

pip install "biblicus[docling]"

Creating Custom Pipelines

Single-Stage Custom Pipeline

# configs/my-custom-ocr.yaml
extractor_id: ocr-tesseract
config:
  lang: eng
  psm: 6  # Assume uniform block of text
  min_confidence: 0.6
  oem: 3  # Default engine mode (based on what is available)
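
For readers more familiar with the Tesseract command line, the sketch below shows how the lang, psm, and oem settings above correspond to the CLI flags -l, --psm, and --oem. This mapping is for orientation only: how Biblicus passes these options to Tesseract is not described here, and tesseract_command is a hypothetical helper. min_confidence has no CLI equivalent and is assumed to be applied to the results afterwards.

```python
def tesseract_command(image_path: str, config: dict) -> list[str]:
    """Build a tesseract CLI invocation from a config mapping like the one above."""
    # "stdout" as the output base makes tesseract print the text instead of writing a file.
    command = ["tesseract", image_path, "stdout"]
    if "lang" in config:
        command += ["-l", str(config["lang"])]
    if "psm" in config:
        command += ["--psm", str(config["psm"])]
    if "oem" in config:
        command += ["--oem", str(config["oem"])]
    return command


cmd = tesseract_command("form.png", {"lang": "eng", "psm": 6, "oem": 3, "min_confidence": 0.6})
print(" ".join(cmd))
```

The command is only built, not executed, so the sketch runs without Tesseract installed.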

Multi-Stage Custom Pipeline

# configs/my-custom-pipeline.yaml
extractor_id: pipeline
config:
  stages:
    # Stage 1: Layout detection
    - extractor_id: heron-layout
      config:
        model_variant: "101"
        confidence_threshold: 0.7

    # Stage 2: OCR
    - extractor_id: ocr-paddleocr-vl
      config:
        use_layout_metadata: true
        lang: en

    # Stage 3: Post-processing (if available)
    - extractor_id: select-longest-text
      config: {}
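
Before running a multi-stage config, it can help to check that it has the expected shape. The validator below is a hypothetical sketch, not part of Biblicus; it only checks the structure visible in the YAML above (a top-level pipeline extractor with a non-empty list of stages, each with an extractor_id) and knows nothing about which extractor IDs are actually registered.

```python
def validate_pipeline_config(config: dict) -> list[str]:
    """Return a list of problems found; an empty list means the shape looks right."""
    problems = []
    if config.get("extractor_id") != "pipeline":
        problems.append("top-level extractor_id should be 'pipeline'")
    inner = config.get("config")
    stages = inner.get("stages") if isinstance(inner, dict) else None
    if not isinstance(stages, list) or not stages:
        problems.append("config.stages must be a non-empty list")
        return problems
    for index, stage in enumerate(stages, start=1):
        if not isinstance(stage, dict) or not stage.get("extractor_id"):
            problems.append(f"stage {index} is missing extractor_id")
        elif not isinstance(stage.get("config") or {}, dict):
            problems.append(f"stage {index} config must be a mapping")
    return problems


# Same structure as the YAML above, written as a Python dict.
pipeline = {
    "extractor_id": "pipeline",
    "config": {
        "stages": [
            {"extractor_id": "heron-layout",
             "config": {"model_variant": "101", "confidence_threshold": 0.7}},
            {"extractor_id": "ocr-paddleocr-vl",
             "config": {"use_layout_metadata": True, "lang": "en"}},
            {"extractor_id": "select-longest-text", "config": {}},
        ]
    },
}
problems = validate_pipeline_config(pipeline)
print(problems or "config shape looks OK")
```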

Testing Your Custom Pipeline

from pathlib import Path
from biblicus import Corpus
from biblicus.evaluation.ocr_benchmark import OCRBenchmark
from biblicus.extraction import build_extraction_snapshot
import yaml

# Load your config
with open("configs/my-custom-pipeline.yaml") as f:
    config = yaml.safe_load(f)

# Build extraction snapshot
corpus = Corpus(Path("corpora/funsd_benchmark"))
snapshot = build_extraction_snapshot(
    corpus,
    extractor_id=config["extractor_id"],
    configuration_name="my-custom-pipeline",
    configuration=config["config"]
)

# Evaluate
benchmark = OCRBenchmark(corpus)
report = benchmark.evaluate_extraction(
    snapshot_reference=snapshot.snapshot_id,
    pipeline_config=config
)

# View results
report.print_summary()

Adding to Benchmark Suite

Edit scripts/benchmark_all_pipelines.py:

PIPELINE_CONFIGS = [
    # ... existing configs ...
    "configs/my-custom-pipeline.yaml",
]

Then run:

python scripts/benchmark_all_pipelines.py

Pipeline Selection Guide

By Use Case

Maximum Accuracy (F1): Use PaddleOCR

Best: Forms, receipts, general documents
F1: 0.787

Maximum Recall (Completeness): Use Heron + Tesseract

Best: Legal, compliance, when missing text is critical
Recall: 0.810

Speed-Critical: Use RapidOCR or Baseline Tesseract

Best: Real-time, embedded systems
Fast processing

Tables & Formulas: Use Docling-Smol or Docling-Granite

Best: Academic papers, technical documents
VLM-based understanding

Multi-Format Documents: Use Unstructured

Best: PDF, Word, HTML, mixed formats
General-purpose parser
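
The metric-driven choices above can also be made programmatically. The sketch below hard-codes the F1 and recall values from the comparison table and picks the top pipeline for a given metric, optionally with an F1 floor. The best_pipeline helper is hypothetical, and the numbers apply to the FUNSD forms benchmark only; results on other data may differ.

```python
# (f1, recall) per pipeline, copied from the comparison table (FUNSD forms).
SCORES = {
    "PaddleOCR": (0.787, 0.782),
    "Docling-Smol": (0.728, 0.675),
    "Unstructured": (0.649, 0.626),
    "Baseline Tesseract": (0.607, 0.599),
    "Layout-Aware Tesseract (PaddleOCR)": (0.601, 0.732),
    "Heron + Tesseract": (0.519, 0.810),
    "RapidOCR": (0.507, 0.467),
}


def best_pipeline(metric: str, min_f1: float = 0.0) -> str:
    """Pick the top pipeline by 'f1' or 'recall', optionally requiring a minimum F1."""
    index = {"f1": 0, "recall": 1}[metric]
    candidates = {name: s for name, s in SCORES.items() if s[0] >= min_f1}
    return max(candidates, key=lambda name: candidates[name][index])


print(best_pipeline("f1"))
print(best_pipeline("recall"))
print(best_pipeline("recall", min_f1=0.6))
```

With an F1 floor of 0.6, the highest-recall option is no longer Heron + Tesseract, which illustrates how a precision requirement changes the recommendation.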

By Document Type

Forms (FUNSD-like):

PaddleOCR (F1: 0.787)
Docling-Smol (F1: 0.728)
Layout-Aware Tesseract (F1: 0.601, Recall: 0.732)

Receipts (dense text):

PaddleOCR (best for entity extraction)
Docling-Smol (good structure preservation)

Academic Papers (multi-column):

Docling-Granite (best layout understanding)
Docling-Smol (good tables/formulas)
Heron + Tesseract (strong reading order)

Simple Text Documents:

Baseline Tesseract (fast, sufficient)
RapidOCR (lightweight alternative)

Performance Tuning

Improving Recall

If you’re missing too much text:

Try Heron + Tesseract (highest recall: 0.810)
Lower confidence thresholds
Use layout-aware pipelines
Consider multi-model ensembles

Improving Precision

If you’re getting too much noise:

Use PaddleOCR (best balance)
Increase confidence thresholds
Add post-processing filters
Use VLM-based models for cleaner output
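
Confidence thresholds are the most direct knob for trading recall against precision. The sketch below filters a hypothetical list of (word, confidence) pairs to show the effect; the data and the filter_by_confidence helper are made up for illustration. In the configs in this catalog, the corresponding settings are min_confidence for Tesseract and confidence_threshold for Heron layout detection.

```python
def filter_by_confidence(words: list[tuple[str, float]], threshold: float) -> list[str]:
    """Keep only words whose OCR confidence is at or above the threshold."""
    return [text for text, confidence in words if confidence >= threshold]


# Hypothetical OCR output: (word, confidence in [0, 1]).
ocr_words = [("Invoice", 0.98), ("N0.", 0.41), ("12345", 0.93), ("~~", 0.12), ("Total", 0.67)]

lenient = filter_by_confidence(ocr_words, 0.3)  # keeps more text, including the dubious "N0."
strict = filter_by_confidence(ocr_words, 0.6)   # drops every token below 0.6

print("lenient:", lenient)
print("strict: ", strict)
```

A lower threshold keeps borderline words (better recall, more noise); a higher one removes them (better precision, more misses).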

Improving Speed

If processing is too slow:

Use RapidOCR or Tesseract baseline
Reduce image resolution
Skip layout detection stage
Process in parallel batches
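
As a sketch of the parallel-batch suggestion, the example below splits a list of page paths into batches and processes the batches concurrently with the standard library. extract_text is a placeholder, not a Biblicus function; replace it with a real extractor call. Threads are assumed to be adequate here because OCR engines typically do their work in native code or a subprocess; for pure-Python CPU-bound work a process pool would be the better fit.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_text(path: str) -> str:
    """Placeholder for a real extractor call; returns a dummy string."""
    return f"text from {path}"


def extract_in_batches(paths: list[str], batch_size: int = 4, workers: int = 2) -> list[str]:
    """Split paths into batches and process the batches concurrently, preserving order."""
    batches = [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]

    def run_batch(batch: list[str]) -> list[str]:
        return [extract_text(path) for path in batch]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in submission order, so output order matches input order.
        return [text for batch_result in pool.map(run_batch, batches) for text in batch_result]


pages = [f"page_{n}.png" for n in range(10)]
texts = extract_in_batches(pages, batch_size=4, workers=2)
print(len(texts), texts[0], texts[-1])
```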

Next Steps

Run benchmarks to compare pipelines on your data
Understand metrics to interpret results
View current results to see how pipelines compare
OCR Benchmarking Guide for practical how-to

References

Pipeline configurations: configs/
Benchmark scripts: scripts/benchmark_*.py
Evaluation module: src/biblicus/evaluation/ocr_benchmark.py
Heron Implementation Details
Layout-Aware OCR Analysis