# Pipeline Catalog

Complete reference of all extraction pipelines available for benchmarking in Biblicus.

## Overview

Biblicus includes 8+ pre-configured extraction pipelines, each with a different speed/accuracy trade-off. Every pipeline is defined in a YAML configuration file under `configs/`.

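## Pipeline Comparison Table

Scores are the FUNSD forms results reported in each pipeline's section below, and rows are sorted by F1. MarkItDown and Docling-Granite are omitted because this catalog reports no scores for them.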

| Pipeline | F1 Score | Recall | Speed | Use Case |
|----------|----------|--------|-------|----------|
| [PaddleOCR](#2-paddleocr) | **0.787** | 0.782 | Medium | Best overall accuracy |
| [Docling-Smol](#3-docling-smol) | 0.728 | 0.675 | Slow | Tables & formulas |
| [Unstructured](#5-unstructured) | 0.649 | 0.626 | Medium | General documents |
| [Baseline Tesseract](#1-baseline-tesseract) | 0.607 | 0.599 | Fast | Simple baseline |
| [Layout-Aware Tesseract (PaddleOCR)](#6-layout-aware-tesseract-paddleocr) | 0.601 | 0.732 | Medium | High recall needs |
| [Heron + Tesseract](#7-heron--tesseract) | 0.519 | **0.810** | Slow | Maximum extraction |
| [RapidOCR](#4-rapidocr) | 0.507 | 0.467 | **Fast** | Lightweight/embedded |
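
To choose by a single metric, the table can be treated as data. The sketch below copies the F1 and recall values from the table into a plain dictionary and sorts by whichever metric matters. `SCORES` and `rank_by` are illustrative names, not part of the Biblicus API.

```python
# Scores copied from the comparison table above (FUNSD forms).
# SCORES and rank_by are illustrative names, not part of the Biblicus API.
SCORES = {
    "PaddleOCR": {"f1": 0.787, "recall": 0.782},
    "Docling-Smol": {"f1": 0.728, "recall": 0.675},
    "Unstructured": {"f1": 0.649, "recall": 0.626},
    "Baseline Tesseract": {"f1": 0.607, "recall": 0.599},
    "Layout-Aware Tesseract (PaddleOCR)": {"f1": 0.601, "recall": 0.732},
    "Heron + Tesseract": {"f1": 0.519, "recall": 0.810},
    "RapidOCR": {"f1": 0.507, "recall": 0.467},
}


def rank_by(metric: str) -> list[str]:
    """Return pipeline names sorted from best to worst on the given metric."""
    return sorted(SCORES, key=lambda name: SCORES[name][metric], reverse=True)


if __name__ == "__main__":
    print("By F1:    ", rank_by("f1")[:3])
    print("By recall:", rank_by("recall")[:3])
```

Ranking by recall instead of F1 moves Heron + Tesseract from sixth place to first, which is why the right choice depends on whether missed text or noisy text is the bigger problem.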

## Basic OCR Pipelines

### 1. Baseline Tesseract

Simple Tesseract OCR without layout detection.

**Configuration:** `configs/baseline-ocr.yaml`

```yaml
extractor_id: ocr-tesseract
config:
  min_confidence: 0.0
  lang: eng
```

**Performance (FUNSD Forms):**
- F1 Score: 0.607
- Recall: 0.599
- Precision: 0.615

**Strengths:**
- Fast processing
- Minimal dependencies
- Good baseline for comparison

**Weaknesses:**
- No layout understanding
- Struggles with complex formatting
- Lower accuracy on forms

**Best for:** Simple documents, baseline comparisons, speed-critical applications

---

### 2. PaddleOCR

PaddleOCR using its VL (vision-language) model; the **best overall performer** by F1.

**Configuration:** `configs/ocr-paddleocr.yaml`

```yaml
extractor_id: ocr-paddleocr-vl
config:
  lang: en
```

**Performance (FUNSD Forms):**
- F1 Score: **0.787** ⭐ **BEST F1**
- Recall: 0.782
- Precision: 0.792
- Word Error Rate: 0.533

**Strengths:**
- Highest F1 score of all pipelines in this catalog (0.787 on FUNSD)
- Built-in layout detection
- Good balance of precision and recall
- Handles complex layouts

**Weaknesses:**
- Requires PaddleOCR dependencies
- Slower than Tesseract
- Higher memory usage

**Best for:** Production systems, complex documents, when accuracy matters most

**Installation:**
```bash
pip install "biblicus[paddleocr]"
```

---

### 3. Docling-Smol

Docling with SmolDocling-256M vision-language model for document understanding.

**Configuration:** `configs/docling-smol.yaml`

```yaml
extractor_id: docling-smol
config:
  output_format: markdown
```

**Performance (FUNSD Forms):**
- F1 Score: 0.728
- Recall: 0.675
- Precision: 0.788

**Strengths:**
- Advanced VLM-based extraction
- Excellent for tables and formulas
- Structured output (markdown)
- Good semantic understanding

**Weaknesses:**
- Slower processing
- Higher resource requirements
- May be overkill for simple forms

**Best for:** Academic papers, technical documents, tables and formulas

**Installation:**
```bash
pip install "biblicus[docling]"
```

---

### 4. RapidOCR

Fast, lightweight OCR library for resource-constrained environments.

**Configuration:** `configs/ocr-rapidocr.yaml`

```yaml
extractor_id: ocr-rapidocr
config:
  use_det: true
  use_cls: true
  use_rec: true
```

**Performance (FUNSD Forms):**
- F1 Score: 0.507
- Recall: 0.467
- Precision: 0.556

**Strengths:**
- Very fast processing
- Minimal dependencies
- Low memory footprint
- Good for embedded systems

**Weaknesses:**
- Lower accuracy than PaddleOCR/Tesseract
- Limited language support
- Simpler text detection

**Best for:** Real-time applications, edge devices, resource constraints

**Installation:**
```bash
pip install "biblicus[ocr]"
```

---

### 5. Unstructured

Unstructured.io document parser with multi-format support.

**Configuration:** `configs/unstructured.yaml`

```yaml
extractor_id: unstructured
config: {}
```

**Performance (FUNSD Forms):**
- F1 Score: 0.649
- Recall: 0.626
- Precision: 0.673

**Strengths:**
- Handles many document formats (PDF, Word, HTML, and others)
- Good general-purpose parser
- Structured element extraction

**Weaknesses:**
- Heavy dependency footprint
- Slower than specialized OCR
- May be overkill for image-only inputs

**Best for:** Mixed document types, production pipelines handling various formats

**Installation:**
```bash
pip install "biblicus[unstructured]"
```

---

### 8. MarkItDown

Microsoft's MarkItDown converter for document-to-markdown conversion.

**Configuration:** `configs/markitdown.yaml`

```yaml
extractor_id: markitdown
config: {}
```

**Strengths:**
- Excellent markdown output
- Handles Office documents
- Preserves structure

**Weaknesses:**
- Requires Python 3.10+
- Not optimized for OCR

**Best for:** Converting Office documents, markdown workflows

**Installation:**
```bash
pip install "biblicus[markitdown]"
```

---

## Layout-Aware Pipelines

Layout-aware pipelines use a two-stage approach:
1. **Layout detection** to identify document regions and reading order
2. **OCR** on each region in sequence

This improves reading order and can increase recall at the cost of precision.
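
The two-stage flow can be sketched in a few lines. This is a conceptual illustration only: `Region`, `detect_layout`, and `ocr_region` are hypothetical stand-ins, not Biblicus functions, and the stub bodies return fixed data so the example runs on its own.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A detected layout region; all fields here are hypothetical."""
    order: int                       # position in the detected reading order
    bbox: tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def detect_layout(page: str) -> list[Region]:
    """Stage 1 stand-in: a real detector would return regions for the page image."""
    return [
        Region(order=2, bbox=(0, 300, 600, 500)),
        Region(order=1, bbox=(0, 0, 600, 280)),
    ]


def ocr_region(page: str, region: Region) -> str:
    """Stage 2 stand-in: a real OCR engine would read the cropped region."""
    return f"<text from {page} region {region.order}>"


def extract(page: str) -> str:
    """Run layout detection, then OCR each region in reading order."""
    regions = sorted(detect_layout(page), key=lambda r: r.order)
    return "\n".join(ocr_region(page, r) for r in regions)


if __name__ == "__main__":
    print(extract("page-1.png"))
```

Because OCR runs once per region, every extra region the detector proposes is another chance to pick up text (higher recall) and another chance to pick up noise (lower precision).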

### 6. Layout-Aware Tesseract (PaddleOCR)

PaddleOCR PP-Structure layout detection → Tesseract OCR.

**Configuration:** `configs/layout-aware-tesseract.yaml`

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: paddleocr-layout
      config:
        lang: en
    - extractor_id: ocr-tesseract
      config:
        use_layout_metadata: true
```

**Performance (FUNSD Forms):**
- F1 Score: 0.601
- Recall: 0.732 (+22.2% vs baseline Tesseract)
- Precision: 0.503
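
The +22.2% figure is a relative gain over the baseline's recall, not a difference in percentage points. A quick check using the recall values reported in this catalog:

```python
baseline_recall = 0.599      # Baseline Tesseract
layout_aware_recall = 0.732  # Layout-Aware Tesseract (PaddleOCR)

# Absolute gain is the plain difference; relative gain divides by the baseline.
absolute_gain = layout_aware_recall - baseline_recall
relative_gain = absolute_gain / baseline_recall

print(f"Absolute gain: {absolute_gain:.3f}")
print(f"Relative gain: {relative_gain:.1%}")
```

In absolute terms the improvement is about 13.3 percentage points of recall.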

**Strengths:**
- Higher recall than baseline Tesseract
- Better reading order
- Handles multi-column layouts

**Weaknesses:**
- Lower precision (more false positives)
- Slower than single-stage
- Requires PaddleOCR

**Best for:** Documents where missing content is costly, complex layouts

**Trade-off:** Sacrifices precision for recall - finds more text but includes more noise.

---

### 7. Heron + Tesseract

IBM Heron-101 layout detection → Tesseract OCR for maximum text extraction.

**Configuration:** `configs/heron-tesseract.yaml`

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: heron-layout
      config:
        model_variant: "101"
        confidence_threshold: 0.6
    - extractor_id: ocr-tesseract
      config:
        use_layout_metadata: true
```

**Performance (FUNSD Forms):**
- F1 Score: 0.519
- Recall: **0.810** ⭐ **HIGHEST RECALL**
- Precision: 0.384
- Bigram Overlap: **0.561** (best local ordering)
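
Bigram overlap rewards output whose adjacent word pairs match the ground truth, which is why it is read as a measure of local ordering. The sketch below uses one common definition (the fraction of reference bigrams that also occur in the extracted text). This is an assumption for illustration; the exact formula Biblicus uses is described in the metrics reference and may differ.

```python
def bigrams(words: list[str]) -> set[tuple[str, str]]:
    """Return the set of adjacent word pairs."""
    return set(zip(words, words[1:]))


def bigram_overlap(reference: str, extracted: str) -> float:
    """Fraction of reference bigrams that also occur in the extracted text.

    Illustrative definition only; not taken from the Biblicus source.
    """
    ref = bigrams(reference.split())
    if not ref:
        return 0.0
    return len(ref & bigrams(extracted.split())) / len(ref)


if __name__ == "__main__":
    reference = "name of applicant date of birth"
    in_order = "name of applicant date of birth"
    shuffled = "date birth of applicant name of"
    print(bigram_overlap(reference, in_order))
    print(bigram_overlap(reference, shuffled))
```

Both candidate strings contain exactly the same words, so a bag-of-words recall would score them identically; only the bigram measure separates the correctly ordered output from the shuffled one.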

**Strengths:**
- Finds 81% of all words - more than any other pipeline
- Excellent local word ordering (bigrams)
- Best for completeness
- Strong layout understanding

**Weaknesses:**
- Lowest precision (38.4%)
- More false positives/noise
- Slower processing
- Lower F1 due to precision trade-off

**Best for:**
- Applications where missing content is worse than noise
- Documents requiring maximum text extraction
- When completeness matters more than precision
- Legal/compliance where you can't miss text

**Trade-off:** Maximum recall at the cost of precision - extracts the most text of any pipeline (81% of words) but includes more noise.

See [Heron Implementation Guide](heron-implementation.md) for detailed information.
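
The low F1 for this pipeline, despite the highest recall, follows from how F1 combines the two values: as the harmonic mean of precision and recall, it is pulled toward the smaller one. A minimal check with the figures reported in this catalog is shown below. The small gap from the reported 0.519 would be expected if the benchmark averages F1 per document rather than computing it from aggregate precision and recall, which is an assumption and not something this page confirms.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall values as reported in this catalog.
heron = f1_score(precision=0.384, recall=0.810)
paddle = f1_score(precision=0.792, recall=0.782)

print(f"Heron + Tesseract: {heron:.3f}")
print(f"PaddleOCR:         {paddle:.3f}")
```

PaddleOCR's precision and recall are nearly equal, so its F1 sits close to both; Heron's F1 lands much nearer its precision than its recall.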

---

## Vision-Language Models

### Docling-Granite

Docling with IBM Granite Docling-258M VLM for high-accuracy extraction.

**Configuration:** `configs/docling-granite.yaml`

```yaml
extractor_id: docling-granite
config:
  output_format: markdown
```

**Strengths:**
- Higher accuracy than SmolDocling
- Excellent for technical documents
- Strong table understanding

**Weaknesses:**
- Slower than SmolDocling
- Higher resource requirements

**Best for:** When maximum VLM accuracy is needed, complex technical documents

**Installation:**
```bash
pip install "biblicus[docling]"
```

---

## Creating Custom Pipelines

### Single-Stage Custom Pipeline

```yaml
# configs/my-custom-ocr.yaml
extractor_id: ocr-tesseract
config:
  lang: eng
  psm: 6  # Assume uniform block of text
  min_confidence: 0.6
  oem: 3  # Default: use whichever engine is available
```

### Multi-Stage Custom Pipeline

```yaml
# configs/my-custom-pipeline.yaml
extractor_id: pipeline
config:
  stages:
    # Stage 1: Layout detection
    - extractor_id: heron-layout
      config:
        model_variant: "101"
        confidence_threshold: 0.7

    # Stage 2: OCR
    - extractor_id: ocr-paddleocr-vl
      config:
        use_layout_metadata: true
        lang: en

    # Stage 3: Post-processing (if available)
    - extractor_id: select-longest-text
      config: {}
```
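
Before running a multi-stage config, it can help to check its shape. The helper below is a hypothetical sketch, not part of Biblicus: it only verifies the structure used by the examples in this catalog (a top-level `extractor_id: pipeline` and a `stages` list whose entries each name an `extractor_id`). It takes an already-parsed dictionary, such as the result of `yaml.safe_load`.

```python
def validate_pipeline(config: dict) -> list[str]:
    """Return a list of structural problems found in a parsed pipeline config.

    Hypothetical helper; the checks mirror the YAML examples in this catalog.
    """
    problems = []
    if config.get("extractor_id") != "pipeline":
        problems.append("top-level extractor_id must be 'pipeline'")
    stages = (config.get("config") or {}).get("stages")
    if not isinstance(stages, list) or not stages:
        problems.append("config.stages must be a non-empty list")
        return problems
    for i, stage in enumerate(stages, start=1):
        if not isinstance(stage, dict) or not stage.get("extractor_id"):
            problems.append(f"stage {i} is missing extractor_id")
    return problems


# The multi-stage example above, written as the dictionary YAML parsing yields.
EXAMPLE = {
    "extractor_id": "pipeline",
    "config": {
        "stages": [
            {"extractor_id": "heron-layout",
             "config": {"model_variant": "101", "confidence_threshold": 0.7}},
            {"extractor_id": "ocr-paddleocr-vl",
             "config": {"use_layout_metadata": True, "lang": "en"}},
            {"extractor_id": "select-longest-text", "config": {}},
        ]
    },
}

if __name__ == "__main__":
    print(validate_pipeline(EXAMPLE) or "config looks structurally valid")
```

This catches indentation mistakes that silently turn a stage into a string or drop its `extractor_id`; it does not check whether an extractor is installed or whether its options are valid.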

### Testing Your Custom Pipeline

```python
from pathlib import Path

import yaml

from biblicus import Corpus
from biblicus.evaluation.ocr_benchmark import OCRBenchmark
from biblicus.extraction import build_extraction_snapshot

# Load your config
with open("configs/my-custom-pipeline.yaml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

# Build extraction snapshot
corpus = Corpus(Path("corpora/funsd_benchmark"))
snapshot = build_extraction_snapshot(
    corpus,
    extractor_id=config["extractor_id"],
    configuration_name="my-custom-pipeline",
    configuration=config["config"]
)

# Evaluate
benchmark = OCRBenchmark(corpus)
report = benchmark.evaluate_extraction(
    snapshot_reference=snapshot.snapshot_id,
    pipeline_config=config
)

# View results
report.print_summary()
```

### Adding to Benchmark Suite

Edit `scripts/benchmark_all_pipelines.py`:

```python
PIPELINE_CONFIGS = [
    # ... existing configs ...
    "configs/my-custom-pipeline.yaml",
]
```

Then run:

```bash
python scripts/benchmark_all_pipelines.py
```

## Pipeline Selection Guide

### By Use Case

**Maximum Accuracy (F1):** Use [PaddleOCR](#2-paddleocr)
- Best: Forms, receipts, general documents
- F1: 0.787

**Maximum Recall (Completeness):** Use [Heron + Tesseract](#7-heron--tesseract)
- Best: Legal, compliance, when missing text is critical
- Recall: 0.810

**Speed-Critical:** Use [RapidOCR](#4-rapidocr) or [Baseline Tesseract](#1-baseline-tesseract)
- Best: Real-time, embedded systems
- Fast processing

**Tables & Formulas:** Use [Docling-Smol](#3-docling-smol) or [Docling-Granite](#docling-granite)
- Best: Academic papers, technical documents
- VLM-based understanding

**Multi-Format Documents:** Use [Unstructured](#5-unstructured)
- Best: PDF, Word, HTML, mixed formats
- General-purpose parser

### By Document Type

**Forms (FUNSD-like):**
1. PaddleOCR (F1: 0.787)
2. Docling-Smol (F1: 0.728)
3. Layout-Aware Tesseract (F1: 0.601, Recall: 0.732)

**Receipts (dense text):**
1. PaddleOCR (best for entity extraction)
2. Docling-Smol (good structure preservation)

**Academic Papers (multi-column):**
1. Docling-Granite (best layout understanding)
2. Docling-Smol (good tables/formulas)
3. Heron + Tesseract (strong reading order)

**Simple Text Documents:**
1. Baseline Tesseract (fast, sufficient)
2. RapidOCR (lightweight alternative)

## Performance Tuning

### Improving Recall

If you're missing too much text:
- Try Heron + Tesseract (highest recall: 0.810)
- Lower confidence thresholds
- Use layout-aware pipelines
- Consider multi-model ensembles

### Improving Precision

If you're getting too much noise:
- Use PaddleOCR (best balance; highest precision at 0.792)
- Increase confidence thresholds
- Add post-processing filters
- Use VLM-based models for cleaner output
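
A confidence threshold is the simplest post-processing filter. The `ocr-tesseract` extractor already exposes `min_confidence` in its config; the sketch below shows the same idea applied after extraction. The `Word` structure and the 0.0 to 1.0 confidence scale are assumptions for illustration, not the Biblicus data model.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical OCR word record."""
    text: str
    confidence: float  # assumed to be in the range 0.0-1.0


def filter_words(words: list[Word], min_confidence: float) -> list[Word]:
    """Keep only words whose confidence is at or above the threshold."""
    return [w for w in words if w.confidence >= min_confidence]


if __name__ == "__main__":
    words = [Word("Invoice", 0.97), Word("~|", 0.21), Word("Total", 0.88)]
    kept = filter_words(words, min_confidence=0.6)
    print(" ".join(w.text for w in kept))
```

Raising the threshold removes noise but also drops genuine low-confidence words, so precision gains come at some cost in recall.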

### Improving Speed

If processing is too slow:
- Use RapidOCR or Tesseract baseline
- Reduce image resolution
- Skip layout detection stage
- Process in parallel batches
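
Parallel processing can be sketched with the standard library alone. Here `extract_text` is a placeholder for whatever extraction call you use, not a Biblicus function; the worker count is an arbitrary starting point.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_text(path: str) -> str:
    """Placeholder for a real extraction call; returns a dummy string."""
    return f"text from {path}"


def extract_all(paths: list[str], max_workers: int = 4) -> dict[str, str]:
    """Extract every document concurrently and map each path to its text."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its result.
        return dict(zip(paths, pool.map(extract_text, paths)))


if __name__ == "__main__":
    results = extract_all(["form-001.png", "form-002.png", "form-003.png"])
    for path, text in results.items():
        print(path, "->", text)
```

Threads suit engines that run in native code or an external process. For CPU-bound pure-Python work, `ProcessPoolExecutor` from the same module is usually the better fit, and GPU-backed models may gain more from batching inputs than from extra workers.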

## Next Steps

- **[Run benchmarks](quickstart-benchmarking.md)** to compare pipelines on your data
- **[Understand metrics](metrics-reference.md)** to interpret results
- **[View current results](benchmark-results.md)** to see how pipelines compare
- **[OCR Benchmarking Guide](ocr-benchmarking.md)** for practical how-to

## References

- Pipeline configurations: `configs/`
- Benchmark scripts: `scripts/benchmark_*.py`
- Evaluation module: `src/biblicus/evaluation/ocr_benchmark.py`
- [Heron Implementation Details](heron-implementation.md)
- [Layout-Aware OCR Analysis](layout-aware-ocr-results.md)