Pipeline Catalog

Complete reference of all extraction pipelines available for benchmarking in Biblicus.

Overview

Biblicus includes 8+ pre-configured extraction pipelines with different speed/accuracy trade-offs. Each pipeline is defined in a YAML configuration file under configs/.

Pipeline Comparison Table

Pipeline

F1 Score

Recall

Speed

Use Case

PaddleOCR

0.787

0.782

Medium

Best overall accuracy

Docling-Smol

0.728

0.675

Slow

Tables & formulas

Unstructured

0.649

0.626

Medium

General documents

Baseline Tesseract

0.607

0.599

Fast

Simple baseline

Layout-Aware Tesseract (PaddleOCR)

0.601

0.732

Medium

High recall needs

Heron + Tesseract

0.519

0.810

Slow

Maximum extraction

RapidOCR

0.507

0.467

Fast

Lightweight/embedded

Basic OCR Pipelines

1. Baseline Tesseract

Simple Tesseract OCR without layout detection.

Configuration: configs/baseline-ocr.yaml

extractor_id: ocr-tesseract
config:
  min_confidence: 0.0
  lang: eng

Performance (FUNSD Forms):

  • F1 Score: 0.607

  • Recall: 0.599

  • Precision: 0.615

Strengths:

  • Fast processing

  • Minimal dependencies

  • Good baseline for comparison

Weaknesses:

  • No layout understanding

  • Struggles with complex formatting

  • Lower accuracy on forms

Best for: Simple documents, baseline comparisons, speed-critical applications


2. PaddleOCR

PaddleOCR with VL model - best overall performer.

Configuration: configs/ocr-paddleocr.yaml

extractor_id: ocr-paddleocr-vl
config:
  lang: en

Performance (FUNSD Forms):

  • F1 Score: 0.787BEST F1

  • Recall: 0.782

  • Precision: 0.792

  • Word Error Rate: 0.533

Strengths:

  • Highest F1 score across benchmarks

  • Built-in layout detection

  • Good balance of precision and recall

  • Handles complex layouts

Weaknesses:

  • Requires PaddleOCR dependencies

  • Slower than Tesseract

  • Higher memory usage

Best for: Production systems, complex documents, when accuracy matters most

Installation:

pip install "biblicus[paddleocr]"

3. Docling-Smol

Docling with SmolDocling-256M vision-language model for document understanding.

Configuration: configs/docling-smol.yaml

extractor_id: docling-smol
config:
  output_format: markdown

Performance (FUNSD Forms):

  • F1 Score: 0.728

  • Recall: 0.675

  • Precision: 0.788

Strengths:

  • Advanced VLM-based extraction

  • Excellent for tables and formulas

  • Structured output (markdown)

  • Good semantic understanding

Weaknesses:

  • Slower processing

  • Higher resource requirements

  • May be overkill for simple forms

Best for: Academic papers, technical documents, tables and formulas

Installation:

pip install "biblicus[docling]"

4. RapidOCR

Fast, lightweight OCR library for resource-constrained environments.

Configuration: configs/ocr-rapidocr.yaml

extractor_id: ocr-rapidocr
config:
  use_det: true
  use_cls: true
  use_rec: true

Performance (FUNSD Forms):

  • F1 Score: 0.507

  • Recall: 0.467

  • Precision: 0.556

Strengths:

  • Very fast processing

  • Minimal dependencies

  • Low memory footprint

  • Good for embedded systems

Weaknesses:

  • Lower accuracy than PaddleOCR/Tesseract

  • Limited language support

  • Simpler text detection

Best for: Real-time applications, edge devices, resource constraints

Installation:

pip install "biblicus[ocr]"

5. Unstructured

Unstructured.io document parser with multi-format support.

Configuration: configs/unstructured.yaml

extractor_id: unstructured
config: {}

Performance (FUNSD Forms):

  • F1 Score: 0.649

  • Recall: 0.626

  • Precision: 0.673

Strengths:

  • Handles many document formats

  • Good general-purpose parser

  • Structured element extraction

  • PDF, Word, HTML, etc.

Weaknesses:

  • Heavy dependency footprint

  • Slower than specialized OCR

  • May be overkill for images only

Best for: Mixed document types, production pipelines handling various formats

Installation:

pip install "biblicus[unstructured]"

8. MarkItDown

Microsoft’s MarkItDown converter for document-to-markdown conversion.

Configuration: configs/markitdown.yaml

extractor_id: markitdown
config: {}

Strengths:

  • Excellent markdown output

  • Handles Office documents

  • Preserves structure

Weaknesses:

  • Requires Python 3.10+

  • Not optimized for OCR

Best for: Converting Office documents, markdown workflows

Installation:

pip install "biblicus[markitdown]"

Layout-Aware Pipelines

Layout-aware pipelines use a two-stage approach:

  1. Layout detection to identify document regions and reading order

  2. OCR on each region in sequence

This improves reading order and can increase recall at the cost of precision.

6. Layout-Aware Tesseract (PaddleOCR)

PaddleOCR PP-Structure layout detection → Tesseract OCR.

Configuration: configs/layout-aware-tesseract.yaml

extractor_id: pipeline
config:
  stages:
    - extractor_id: paddleocr-layout
      config:
        lang: en
    - extractor_id: ocr-tesseract
      config:
        use_layout_metadata: true

Performance (FUNSD Forms):

  • F1 Score: 0.601

  • Recall: 0.732 (+22.2% vs baseline Tesseract)

  • Precision: 0.503

Strengths:

  • Higher recall than baseline Tesseract

  • Better reading order

  • Handles multi-column layouts

Weaknesses:

  • Lower precision (more false positives)

  • Slower than single-stage

  • Requires PaddleOCR

Best for: Documents where missing content is costly, complex layouts

Trade-off: Sacrifices precision for recall - finds more text but includes more noise.


7. Heron + Tesseract

IBM Heron-101 layout detection → Tesseract OCR for maximum text extraction.

Configuration: configs/heron-tesseract.yaml

extractor_id: pipeline
config:
  stages:
    - extractor_id: heron-layout
      config:
        model_variant: "101"
        confidence_threshold: 0.6
    - extractor_id: ocr-tesseract
      config:
        use_layout_metadata: true

Performance (FUNSD Forms):

  • F1 Score: 0.519

  • Recall: 0.810HIGHEST RECALL

  • Precision: 0.384

  • Bigram Overlap: 0.561 (best local ordering)

Strengths:

  • Finds 81% of all words - more than any other pipeline

  • Excellent local word ordering (bigrams)

  • Best for completeness

  • Strong layout understanding

Weaknesses:

  • Lowest precision (38.4%)

  • More false positives/noise

  • Slower processing

  • Lower F1 due to precision trade-off

Best for:

  • Applications where missing content is worse than noise

  • Documents requiring maximum text extraction

  • When completeness matters more than accuracy

  • Legal/compliance where you can’t miss text

Trade-off: Maximum recall at the cost of precision - extracts everything but includes more errors.

See Heron Implementation Guide for detailed information.


Vision-Language Models

Docling-Granite

Docling with IBM Granite Docling-258M VLM for high-accuracy extraction.

Configuration: configs/docling-granite.yaml

extractor_id: docling-granite
config:
  output_format: markdown

Strengths:

  • Higher accuracy than SmolDocling

  • Excellent for technical documents

  • Strong table understanding

Weaknesses:

  • Slower than SmolDocling

  • Higher resource requirements

Best for: When maximum VLM accuracy is needed, complex technical documents

Installation:

pip install "biblicus[docling]"

Creating Custom Pipelines

Single-Stage Custom Pipeline

# configs/my-custom-ocr.yaml
extractor_id: ocr-tesseract
config:
  lang: eng
  psm: 6  # Assume uniform block of text
  min_confidence: 0.6
  oem: 3  # Default LSTM engine

Multi-Stage Custom Pipeline

# configs/my-custom-pipeline.yaml
extractor_id: pipeline
config:
  stages:
    # Stage 1: Layout detection
    - extractor_id: heron-layout
      config:
        model_variant: "101"
        confidence_threshold: 0.7

    # Stage 2: OCR
    - extractor_id: ocr-paddleocr-vl
      config:
        use_layout_metadata: true
        lang: en

    # Stage 3: Post-processing (if available)
    - extractor_id: select-longest-text
      config: {}

Testing Your Custom Pipeline

from pathlib import Path
from biblicus import Corpus
from biblicus.evaluation.ocr_benchmark import OCRBenchmark
from biblicus.extraction import build_extraction_snapshot
import yaml

# Load your config
with open("configs/my-custom-pipeline.yaml") as f:
    config = yaml.safe_load(f)

# Build extraction snapshot
corpus = Corpus(Path("corpora/funsd_benchmark"))
snapshot = build_extraction_snapshot(
    corpus,
    extractor_id=config["extractor_id"],
    configuration_name="my-custom-pipeline",
    configuration=config["config"]
)

# Evaluate
benchmark = OCRBenchmark(corpus)
report = benchmark.evaluate_extraction(
    snapshot_reference=snapshot.snapshot_id,
    pipeline_config=config
)

# View results
report.print_summary()

Adding to Benchmark Suite

Edit scripts/benchmark_all_pipelines.py:

PIPELINE_CONFIGS = [
    # ... existing configs ...
    "configs/my-custom-pipeline.yaml",
]

Then run:

python scripts/benchmark_all_pipelines.py

Pipeline Selection Guide

By Use Case

Maximum Accuracy (F1): Use PaddleOCR

  • Best: Forms, receipts, general documents

  • F1: 0.787

Maximum Recall (Completeness): Use Heron + Tesseract

  • Best: Legal, compliance, when missing text is critical

  • Recall: 0.810

Speed-Critical: Use RapidOCR or Baseline Tesseract

  • Best: Real-time, embedded systems

  • Fast processing

Tables & Formulas: Use Docling-Smol or Docling-Granite

  • Best: Academic papers, technical documents

  • VLM-based understanding

Multi-Format Documents: Use Unstructured

  • Best: PDF, Word, HTML, mixed formats

  • General-purpose parser

By Document Type

Forms (FUNSD-like):

  1. PaddleOCR (F1: 0.787)

  2. Docling-Smol (F1: 0.728)

  3. Layout-Aware Tesseract (F1: 0.601, Recall: 0.732)

Receipts (dense text):

  1. PaddleOCR (best for entity extraction)

  2. Docling-Smol (good structure preservation)

Academic Papers (multi-column):

  1. Docling-Granite (best layout understanding)

  2. Docling-Smol (good tables/formulas)

  3. Heron + Tesseract (strong reading order)

Simple Text Documents:

  1. Baseline Tesseract (fast, sufficient)

  2. RapidOCR (lightweight alternative)

Performance Tuning

Improving Recall

If you’re missing too much text:

  • Try Heron + Tesseract (highest recall: 0.810)

  • Lower confidence thresholds

  • Use layout-aware pipelines

  • Consider multi-model ensembles

Improving Precision

If you’re getting too much noise:

  • Use PaddleOCR (best balance)

  • Increase confidence thresholds

  • Add post-processing filters

  • Use VLM-based models for cleaner output

Improving Speed

If processing is too slow:

  • Use RapidOCR or Tesseract baseline

  • Reduce image resolution

  • Skip layout detection stage

  • Process in parallel batches

Next Steps

References