# Pipeline Catalog

Complete reference of all extraction pipelines available for benchmarking in Biblicus.

## Overview

Biblicus includes 8+ pre-configured extraction pipelines with different speed/accuracy trade-offs. Each pipeline is defined in a YAML configuration file under `configs/`.
## Pipeline Comparison Table

| Pipeline | F1 Score | Recall | Speed | Use Case |
|---|---|---|---|---|
| PaddleOCR | 0.787 | 0.782 | Medium | Best overall accuracy |
| Docling-Smol | 0.728 | 0.675 | Slow | Tables & formulas |
| Unstructured | 0.649 | 0.626 | Medium | General documents |
| Baseline Tesseract | 0.607 | 0.599 | Fast | Simple baseline |
| Layout-Aware Tesseract | 0.601 | 0.732 | Medium | High recall needs |
| Heron + Tesseract | 0.519 | 0.810 | Slow | Maximum extraction |
| RapidOCR | 0.507 | 0.467 | Fast | Lightweight/embedded |
## Basic OCR Pipelines

### 1. Baseline Tesseract

Simple Tesseract OCR without layout detection.

**Configuration:** `configs/baseline-ocr.yaml`

```yaml
extractor_id: ocr-tesseract
config:
  min_confidence: 0.0
  lang: eng
```

**Performance (FUNSD Forms):**

- F1 Score: 0.607
- Recall: 0.599
- Precision: 0.615

**Strengths:**

- Fast processing
- Minimal dependencies
- Good baseline for comparison

**Weaknesses:**

- No layout understanding
- Struggles with complex formatting
- Lower accuracy on forms

**Best for:** Simple documents, baseline comparisons, speed-critical applications
### 2. PaddleOCR

PaddleOCR with the VL model, the best overall performer.

**Configuration:** `configs/ocr-paddleocr.yaml`

```yaml
extractor_id: ocr-paddleocr-vl
config:
  lang: en
```

**Performance (FUNSD Forms):**

- F1 Score: 0.787 ⭐ BEST F1
- Recall: 0.782
- Precision: 0.792
- Word Error Rate: 0.533

**Strengths:**

- Highest F1 score across benchmarks
- Built-in layout detection
- Good balance of precision and recall
- Handles complex layouts

**Weaknesses:**

- Requires PaddleOCR dependencies
- Slower than Tesseract
- Higher memory usage

**Best for:** Production systems, complex documents, when accuracy matters most

**Installation:**

```bash
pip install "biblicus[paddleocr]"
```
### 3. Docling-Smol

Docling with the SmolDocling-256M vision-language model for document understanding.

**Configuration:** `configs/docling-smol.yaml`

```yaml
extractor_id: docling-smol
config:
  output_format: markdown
```

**Performance (FUNSD Forms):**

- F1 Score: 0.728
- Recall: 0.675
- Precision: 0.788

**Strengths:**

- Advanced VLM-based extraction
- Excellent for tables and formulas
- Structured output (markdown)
- Good semantic understanding

**Weaknesses:**

- Slower processing
- Higher resource requirements
- May be overkill for simple forms

**Best for:** Academic papers, technical documents, tables and formulas

**Installation:**

```bash
pip install "biblicus[docling]"
```
### 4. RapidOCR

Fast, lightweight OCR library for resource-constrained environments.

**Configuration:** `configs/ocr-rapidocr.yaml`

```yaml
extractor_id: ocr-rapidocr
config:
  use_det: true
  use_cls: true
  use_rec: true
```

**Performance (FUNSD Forms):**

- F1 Score: 0.507
- Recall: 0.467
- Precision: 0.556

**Strengths:**

- Very fast processing
- Minimal dependencies
- Low memory footprint
- Good for embedded systems

**Weaknesses:**

- Lower accuracy than PaddleOCR/Tesseract
- Limited language support
- Simpler text detection

**Best for:** Real-time applications, edge devices, resource constraints

**Installation:**

```bash
pip install "biblicus[ocr]"
```
### 5. Unstructured

Unstructured.io document parser with multi-format support.

**Configuration:** `configs/unstructured.yaml`

```yaml
extractor_id: unstructured
config: {}
```

**Performance (FUNSD Forms):**

- F1 Score: 0.649
- Recall: 0.626
- Precision: 0.673

**Strengths:**

- Handles many document formats
- Good general-purpose parser
- Structured element extraction
- Supports PDF, Word, HTML, and more

**Weaknesses:**

- Heavy dependency footprint
- Slower than specialized OCR
- May be overkill for images only

**Best for:** Mixed document types, production pipelines handling various formats

**Installation:**

```bash
pip install "biblicus[unstructured]"
```
### 6. MarkItDown

Microsoft’s MarkItDown converter for document-to-markdown conversion.

**Configuration:** `configs/markitdown.yaml`

```yaml
extractor_id: markitdown
config: {}
```

**Strengths:**

- Excellent markdown output
- Handles Office documents
- Preserves structure

**Weaknesses:**

- Requires Python 3.10+
- Not optimized for OCR

**Best for:** Converting Office documents, markdown workflows

**Installation:**

```bash
pip install "biblicus[markitdown]"
```
## Layout-Aware Pipelines

Layout-aware pipelines use a two-stage approach:

1. Layout detection to identify document regions and reading order
2. OCR on each region in sequence

This improves reading order and can increase recall at the cost of precision.
### 7. Layout-Aware Tesseract (PaddleOCR)

PaddleOCR PP-Structure layout detection → Tesseract OCR.

**Configuration:** `configs/layout-aware-tesseract.yaml`

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: paddleocr-layout
      config:
        lang: en
    - extractor_id: ocr-tesseract
      config:
        use_layout_metadata: true
```

**Performance (FUNSD Forms):**

- F1 Score: 0.601
- Recall: 0.732 (+22.2% vs baseline Tesseract)
- Precision: 0.503

**Strengths:**

- Higher recall than baseline Tesseract
- Better reading order
- Handles multi-column layouts

**Weaknesses:**

- Lower precision (more false positives)
- Slower than single-stage
- Requires PaddleOCR

**Best for:** Documents where missing content is costly, complex layouts

**Trade-off:** Sacrifices precision for recall: finds more text but includes more noise.
### 8. Heron + Tesseract

IBM Heron-101 layout detection → Tesseract OCR for maximum text extraction.

**Configuration:** `configs/heron-tesseract.yaml`

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: heron-layout
      config:
        model_variant: "101"
        confidence_threshold: 0.6
    - extractor_id: ocr-tesseract
      config:
        use_layout_metadata: true
```

**Performance (FUNSD Forms):**

- F1 Score: 0.519
- Recall: 0.810 ⭐ HIGHEST RECALL
- Precision: 0.384
- Bigram Overlap: 0.561 (best local ordering)

**Strengths:**

- Finds 81% of all words, more than any other pipeline
- Excellent local word ordering (bigrams)
- Best for completeness
- Strong layout understanding

**Weaknesses:**

- Lowest precision (38.4%)
- More false positives/noise
- Slower processing
- Lower F1 due to precision trade-off

**Best for:**

- Applications where missing content is worse than noise
- Documents requiring maximum text extraction
- When completeness matters more than accuracy
- Legal/compliance work where you can’t miss text

**Trade-off:** Maximum recall at the cost of precision: extracts everything but includes more errors.
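The arithmetic behind this trade-off is easy to check: F1 is the harmonic mean of precision and recall, so a wide precision/recall gap drags F1 down even when recall is excellent. A quick sketch, using the rounded numbers from the tables above (recomputed F1 can therefore differ from the reported value in the last digit):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# PaddleOCR: balanced precision and recall -> high F1
balanced = f1_score(0.792, 0.782)  # ~0.787
# Heron + Tesseract: recall 0.810 cannot offset precision 0.384
skewed = f1_score(0.384, 0.810)    # ~0.52, close to the reported 0.519
```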
See Heron Implementation Guide for detailed information.
## Vision-Language Models

### Docling-Granite

Docling with the IBM Granite Docling-258M VLM for high-accuracy extraction.

**Configuration:** `configs/docling-granite.yaml`

```yaml
extractor_id: docling-granite
config:
  output_format: markdown
```

**Strengths:**

- Higher accuracy than SmolDocling
- Excellent for technical documents
- Strong table understanding

**Weaknesses:**

- Slower than SmolDocling
- Higher resource requirements

**Best for:** When maximum VLM accuracy is needed, complex technical documents

**Installation:**

```bash
pip install "biblicus[docling]"
```
## Creating Custom Pipelines

### Single-Stage Custom Pipeline

```yaml
# configs/my-custom-ocr.yaml
extractor_id: ocr-tesseract
config:
  lang: eng
  psm: 6  # Assume uniform block of text
  min_confidence: 0.6
  oem: 3  # Default LSTM engine
```
### Multi-Stage Custom Pipeline

```yaml
# configs/my-custom-pipeline.yaml
extractor_id: pipeline
config:
  stages:
    # Stage 1: Layout detection
    - extractor_id: heron-layout
      config:
        model_variant: "101"
        confidence_threshold: 0.7
    # Stage 2: OCR
    - extractor_id: ocr-paddleocr-vl
      config:
        use_layout_metadata: true
        lang: en
    # Stage 3: Post-processing (if available)
    - extractor_id: select-longest-text
      config: {}
```
### Testing Your Custom Pipeline

```python
from pathlib import Path

import yaml

from biblicus import Corpus
from biblicus.evaluation.ocr_benchmark import OCRBenchmark
from biblicus.extraction import build_extraction_snapshot

# Load your config
with open("configs/my-custom-pipeline.yaml") as f:
    config = yaml.safe_load(f)

# Build extraction snapshot
corpus = Corpus(Path("corpora/funsd_benchmark"))
snapshot = build_extraction_snapshot(
    corpus,
    extractor_id=config["extractor_id"],
    configuration_name="my-custom-pipeline",
    configuration=config["config"],
)

# Evaluate
benchmark = OCRBenchmark(corpus)
report = benchmark.evaluate_extraction(
    snapshot_reference=snapshot.snapshot_id,
    pipeline_config=config,
)

# View results
report.print_summary()
```
### Adding to Benchmark Suite

Edit `scripts/benchmark_all_pipelines.py`:

```python
PIPELINE_CONFIGS = [
    # ... existing configs ...
    "configs/my-custom-pipeline.yaml",
]
```

Then run:

```bash
python scripts/benchmark_all_pipelines.py
```
## Pipeline Selection Guide

### By Use Case

**Maximum Accuracy (F1):** Use PaddleOCR

- Best: Forms, receipts, general documents
- F1: 0.787

**Maximum Recall (Completeness):** Use Heron + Tesseract

- Best: Legal, compliance, when missing text is critical
- Recall: 0.810

**Speed-Critical:** Use RapidOCR or Baseline Tesseract

- Best: Real-time, embedded systems
- Fast processing

**Tables & Formulas:** Use Docling-Smol or Docling-Granite

- Best: Academic papers, technical documents
- VLM-based understanding

**Multi-Format Documents:** Use Unstructured

- Best: PDF, Word, HTML, mixed formats
- General-purpose parser
### By Document Type

**Forms (FUNSD-like):**

1. PaddleOCR (F1: 0.787)
2. Docling-Smol (F1: 0.728)
3. Layout-Aware Tesseract (F1: 0.601, Recall: 0.732)

**Receipts (dense text):**

1. PaddleOCR (best for entity extraction)
2. Docling-Smol (good structure preservation)

**Academic Papers (multi-column):**

1. Docling-Granite (best layout understanding)
2. Docling-Smol (good tables/formulas)
3. Heron + Tesseract (strong reading order)

**Simple Text Documents:**

1. Baseline Tesseract (fast, sufficient)
2. RapidOCR (lightweight alternative)
## Performance Tuning

### Improving Recall

If you’re missing too much text:

- Try Heron + Tesseract (highest recall: 0.810)
- Lower confidence thresholds
- Use layout-aware pipelines
- Consider multi-model ensembles
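One concrete way to lower a confidence threshold: reduce the Heron layout stage’s `confidence_threshold`, which is 0.6 in the stock `heron-tesseract` config, so the detector keeps more candidate regions. A sketch (the value 0.4 is an illustrative choice, not a tuned recommendation):

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: heron-layout
      config:
        model_variant: "101"
        confidence_threshold: 0.4  # below the stock 0.6: keeps more candidate regions
    - extractor_id: ocr-tesseract
      config:
        use_layout_metadata: true
```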
### Improving Precision

If you’re getting too much noise:

- Use PaddleOCR (best balance)
- Increase confidence thresholds
- Add post-processing filters
- Use VLM-based models for cleaner output
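For Tesseract-based pipelines, one lever for raising the confidence threshold is `min_confidence`, which is 0.0 in the baseline config. A sketch (0.7 is an illustrative value; tune it against your own data):

```yaml
extractor_id: ocr-tesseract
config:
  lang: eng
  min_confidence: 0.7  # drop words Tesseract is unsure about (baseline uses 0.0)
```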
### Improving Speed

If processing is too slow:

- Use RapidOCR or Tesseract baseline
- Reduce image resolution
- Skip layout detection stage
- Process in parallel batches
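The parallel-batch idea can be sketched with the standard library. `extract_text` here is a hypothetical stand-in for whatever single-document extraction call your pipeline exposes; for CPU-bound OCR engines, a `ProcessPoolExecutor` may scale better than threads:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def extract_text(path: Path) -> str:
    """Hypothetical stand-in for a single-document extraction call."""
    return path.name  # placeholder: replace with your pipeline invocation


def extract_batch(paths: list[Path], workers: int = 4) -> list[str]:
    """Extract many documents concurrently; results preserve input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_text, paths))
```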
## Next Steps

- Run benchmarks to compare pipelines on your data
- Understand metrics to interpret results
- View current results to see how pipelines compare
- See the OCR Benchmarking Guide for a practical how-to
## References

- Pipeline configurations: `configs/`
- Benchmark scripts: `scripts/benchmark_*.py`
- Evaluation module: `src/biblicus/evaluation/ocr_benchmark.py`