Pipeline Extractor

Extractor ID: pipeline

Category: Pipeline Utilities

Overview

The pipeline extractor is a configuration shim that enables multi-stage extraction workflows. It allows you to compose multiple extractors into a sequential pipeline where each stage can build upon or choose from the results of previous stages.

Pipelines are the fundamental composition mechanism in Biblicus, enabling sophisticated extraction strategies like fallback chains, parallel extraction with selection, and media type-specific routing.

Installation

No additional dependencies required. This extractor is part of the core Biblicus installation.

pip install biblicus

Supported Media Types

All media types are supported. The pipeline delegates to the configured extractors, each of which handles its own media types.

Configuration

Config Schema

class PipelineStageSpec(BaseModel):
    extractor_id: str
    config: Dict[str, Any] = {}

class PipelineExtractorConfig(BaseModel):
    stages: List[PipelineStageSpec]

Configuration Options

Option                 Type  Required  Description
stages                 list  Yes       Ordered list of extractor stages
stages[].extractor_id  str   Yes       Extractor identifier for this stage
stages[].config        dict  No        Configuration for this extractor (defaults to {})

Constraints

  • Must have at least one stage

  • Cannot include pipeline as a stage (no nested pipelines)

  • Stages are executed in order
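
The constraints above can be checked before starting a run. The following is a minimal sketch in plain Python, not the Biblicus validator: `validate_stages` is a hypothetical helper that works on the same dictionary shape as the `stages` option.

```python
from typing import Any, Dict, List


def validate_stages(stages: List[Dict[str, Any]]) -> None:
    """Raise ValueError if a stage list breaks the documented constraints.

    Hypothetical helper for illustration; not part of Biblicus.
    """
    if not stages:
        raise ValueError("pipeline must have at least one stage")
    for index, stage in enumerate(stages):
        extractor_id = stage.get("extractor_id")
        if not isinstance(extractor_id, str) or not extractor_id:
            raise ValueError(f"stage {index} is missing extractor_id")
        if extractor_id == "pipeline":
            raise ValueError(f"stage {index} nests a pipeline, which is not allowed")


# A valid two-stage pipeline passes silently.
validate_stages([{"extractor_id": "pdf-text"}, {"extractor_id": "select-text"}])
```

Running a check like this on a configuration file before extraction surfaces an empty or nested pipeline early, rather than partway through a corpus.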

Usage

Command Line

Simple Pipeline

biblicus extract my-corpus --extractor pipeline \
  --config 'stages=[{"extractor_id":"pdf-text"},{"extractor_id":"select-text"}]'

Configuration File

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: ocr-rapidocr
    - extractor_id: select-text

Run the pipeline with the configuration file:

biblicus extract my-corpus --configuration configuration.yml

Python API

from biblicus import Corpus

corpus = Corpus.from_directory("my-corpus")

results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pass-through-text"},
            {"extractor_id": "pdf-text"},
            {"extractor_id": "ocr-rapidocr"},
            {"extractor_id": "select-text"}
        ]
    }
)

With Per-Stage Configuration

results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {
                "extractor_id": "pdf-text",
                "config": {"max_pages": 100}
            },
            {
                "extractor_id": "ocr-rapidocr",
                "config": {"min_confidence": 0.7}
            },
            {"extractor_id": "select-longest-text"}
        ]
    }
)

Pipeline Patterns

Fallback Chain

Try extractors in order, use first success:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text  # Try text first
    - extractor_id: pdf-text           # Then PDF
    - extractor_id: markitdown         # Then Office docs
    - extractor_id: ocr-rapidocr       # Then OCR
    - extractor_id: select-text        # Use first success

Parallel Extraction + Selection

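Run every extractor on each item, then choose the best result (here, the longest). Stages still execute sequentially; "parallel" means each extractor produces an independent candidate for the selector: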

extractor_id: pipeline
config:
  stages:
    - extractor_id: ocr-rapidocr
    - extractor_id: ocr-paddleocr-vl
    - extractor_id: docling-smol
    - extractor_id: select-longest-text  # Choose longest

Media Type Routing

Route different types to different extractors:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: ocr-rapidocr
    - extractor_id: stt-openai
    - extractor_id: select-text

  • Text files → pass-through-text

  • PDFs → pdf-text

  • Images → ocr-rapidocr

  • Audio → stt-openai
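
The routing above falls out of each extractor handling only its own media types. As a rough illustration, the sketch below maps a media type to the stages that would produce output for it. The `ACCEPTS` table and its patterns are assumptions made for this example; they are not read from Biblicus.

```python
from fnmatch import fnmatch
from typing import Dict, List

# Hypothetical mapping of stage -> media type patterns it accepts.
# The patterns mirror the routing list above and are assumptions.
ACCEPTS: Dict[str, List[str]] = {
    "pass-through-text": ["text/*"],
    "pdf-text": ["application/pdf"],
    "ocr-rapidocr": ["image/*"],
    "stt-openai": ["audio/*"],
}


def stages_for(media_type: str) -> List[str]:
    """List the stages that would produce output for a media type."""
    return [
        stage
        for stage, patterns in ACCEPTS.items()
        if any(fnmatch(media_type, pattern) for pattern in patterns)
    ]


print(stages_for("image/png"))        # ['ocr-rapidocr']
print(stages_for("application/pdf"))  # ['pdf-text']
```

Every other stage contributes nothing for that item, so the final selector only has one candidate to pick.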

Smart Override

Intelligent quality-based routing:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text           # Fast
    - extractor_id: docling-smol       # Accurate
    - extractor_id: select-smart-override
      config:
        media_type_patterns: ["application/pdf"]
        min_confidence_threshold: 0.7

Examples

Simple Text Extraction

Handle text and PDFs:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: select-text

Comprehensive Document Processing

Maximum format coverage:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: markitdown
    - extractor_id: unstructured
    - extractor_id: select-longest-text

OCR with Fallback

Try fast OCR, fall back to VLM:

extractor_id: pipeline
config:
  stages:
    - extractor_id: ocr-rapidocr
    - extractor_id: docling-smol
    - extractor_id: select-smart-override
      config:
        media_type_patterns: ["image/*"]
        min_confidence_threshold: 0.75

Multilingual Pipeline

Handle multiple languages:

from biblicus import Corpus

corpus = Corpus.from_directory("multilingual")

results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pass-through-text"},
            {
                "extractor_id": "ocr-paddleocr-vl",
                "config": {"lang": "ch"}  # Chinese
            },
            {
                "extractor_id": "stt-openai",
                "config": {"language": "zh"}  # Chinese audio
            },
            {"extractor_id": "select-text"}
        ]
    }
)

Cost-Optimized Pipeline

Try free methods before paid APIs:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text    # Free
    - extractor_id: pdf-text             # Free
    - extractor_id: markitdown           # Free
    - extractor_id: ocr-rapidocr         # Free
    - extractor_id: stt-deepgram         # Paid
    - extractor_id: select-text

Behavior Details

Sequential Execution

Stages execute in order. Each stage can access results from all previous stages.

Per-Item Processing

The pipeline runs completely for each item before moving to the next. It does not process all items through stage 1, then all through stage 2.

Previous Extractions

Selector extractors (select-text, etc.) receive all previous stage outputs for the current item.
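
The three behaviors above (sequential stages, per-item processing, and access to previous outputs) can be sketched with a toy runner. This is an illustration under simplifying assumptions, not the Biblicus implementation: items are plain strings, and `run_pipeline`, `only_long`, `upper` and `first_usable` are hypothetical names.

```python
from typing import Callable, List, Optional

# A stage receives the item and the outputs of earlier stages for that item.
Stage = Callable[[str, List[Optional[str]]], Optional[str]]


def run_pipeline(items: List[str], stages: List[Stage]) -> List[List[Optional[str]]]:
    """Run every stage on one item before moving on to the next item."""
    all_results = []
    for item in items:
        previous: List[Optional[str]] = []
        for stage in stages:
            # Each stage sees a copy of everything produced so far for this item.
            previous.append(stage(item, list(previous)))
        all_results.append(previous)
    return all_results


def only_long(item: str, previous: List[Optional[str]]) -> Optional[str]:
    # Toy extractor that declines short items by returning None.
    return item if len(item) > 3 else None


def upper(item: str, previous: List[Optional[str]]) -> Optional[str]:
    return item.upper()


def first_usable(item: str, previous: List[Optional[str]]) -> Optional[str]:
    # Toy selector: the first non-empty output from earlier stages.
    return next((text for text in previous if text), None)


results = run_pipeline(["ab", "hello"], [only_long, upper, first_usable])
print(results[0])  # [None, 'AB', 'AB']
print(results[1])  # ['hello', 'HELLO', 'hello']
```

The outer loop is over items and the inner loop is over stages, which is what "per-item processing" means here.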

Short-Circuiting

Some patterns enable short-circuiting:

  • select-text stops at first success

  • Extractors can return None to skip an item

Error Handling

Errors in individual stages are recorded but don’t halt the pipeline. The pipeline continues with remaining stages.
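
A minimal sketch of this record-and-continue behavior, assuming extractors are plain callables that may raise. `run_stages` and the two toy extractors are hypothetical and only illustrate the control flow; how Biblicus stores recorded errors is not shown here.

```python
from typing import Callable, List, Optional, Tuple

Extractor = Callable[[str], Optional[str]]


def run_stages(item: str, stages: List[Tuple[str, Extractor]]):
    """Run each stage, recording failures instead of raising."""
    outputs: List[Tuple[str, Optional[str]]] = []
    errors: List[Tuple[str, str]] = []
    for name, extractor in stages:
        try:
            outputs.append((name, extractor(item)))
        except Exception as error:
            # Record the failure, then carry on with the remaining stages.
            errors.append((name, str(error)))
            outputs.append((name, None))
    return outputs, errors


def broken(item: str) -> Optional[str]:
    raise RuntimeError("model not installed")


def echo(item: str) -> Optional[str]:
    return item


outputs, errors = run_stages("hello", [("broken", broken), ("echo", echo)])
print(outputs)  # [('broken', None), ('echo', 'hello')]
print(errors)   # [('broken', 'model not installed')]
```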

No Nested Pipelines

Pipelines cannot contain other pipeline extractors. This prevents infinite recursion and keeps configuration manageable.

Performance Considerations

Extraction Order

Order matters for performance:

# Fast to slow (efficient)
stages:
  - extractor_id: pass-through-text  # Instant
  - extractor_id: pdf-text           # Fast
  - extractor_id: ocr-rapidocr       # Moderate
  - extractor_id: docling-smol       # Slow
  - extractor_id: select-text        # Stop at first success

# Slow to fast (inefficient)
stages:
  - extractor_id: docling-smol       # Runs for everything!
  - extractor_id: pass-through-text
  - extractor_id: select-text
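
To make the ordering comparison above concrete, the sketch below adds up the work done for one item when processing stops at the first successful stage, as described for select-text. The cost figures are made-up relative units for illustration, not measurements, and `cost_until_first_success` is a hypothetical helper.

```python
from typing import List, Tuple


def cost_until_first_success(stages: List[Tuple[str, float, bool]]) -> float:
    """Sum stage costs up to and including the first stage that succeeds."""
    total = 0.0
    for _name, cost, succeeds in stages:
        total += cost
        if succeeds:
            break
    return total


# Hypothetical relative costs for a single plain-text item.
fast_first = [("pass-through-text", 1.0, True), ("docling-smol", 100.0, True)]
slow_first = [("docling-smol", 100.0, True), ("pass-through-text", 1.0, True)]

print(cost_until_first_success(fast_first))  # 1.0
print(cost_until_first_success(slow_first))  # 100.0
```

The same stages produce the same text in both orders; only the amount of work before the first success changes.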

Selector Choice

  • select-text: Stops at first success (efficient)

  • select-longest-text: Runs all extractors (thorough but slow)

  • select-smart-override: Runs all but intelligently chooses
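
The difference between the first two selectors can be sketched as two small functions over the candidate outputs for one item. This is an approximation under stated assumptions: "usable" is taken to mean non-blank text, ties go to the earlier stage, and neither function is the Biblicus implementation. select-smart-override is omitted because its rules depend on its configuration.

```python
from typing import List, Optional


def select_first(candidates: List[Optional[str]]) -> Optional[str]:
    """Return the first candidate that contains non-whitespace text."""
    for text in candidates:
        if text and text.strip():
            return text
    return None


def select_longest(candidates: List[Optional[str]]) -> Optional[str]:
    """Return the longest usable candidate; earlier stages win ties."""
    usable = [text for text in candidates if text and text.strip()]
    return max(usable, key=len) if usable else None


candidates = [None, "short", "a much longer result"]
print(select_first(candidates))    # short
print(select_longest(candidates))  # a much longer result
```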

API Costs

Pipeline order affects API costs:

# Cost-optimized
stages:
  - extractor_id: pass-through-text  # Free
  - extractor_id: pdf-text           # Free
  - extractor_id: stt-openai         # Paid - only runs if free methods fail
  - extractor_id: select-text

# Expensive
stages:
  - extractor_id: stt-openai         # Paid - runs for everything!
  - extractor_id: pass-through-text
  - extractor_id: select-longest-text

Best Practices

Always Include a Selector

End pipelines with a selection stage:

stages:
  - extractor_id: extractor-1  # Placeholder ID
  - extractor_id: extractor-2  # Placeholder ID
  - extractor_id: select-text  # Always include

Order by Speed or Priority

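# By speed (recommended); extractor IDs are placeholders
stages:
  - extractor_id: fast-extractor
  - extractor_id: moderate-extractor
  - extractor_id: slow-extractor
  - extractor_id: select-text

# By accuracy; extractor IDs are placeholders
stages:
  - extractor_id: best-extractor
  - extractor_id: good-extractor
  - extractor_id: fallback-extractor
  - extractor_id: select-text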

Configure Stages Appropriately

Provide per-stage configuration when needed:

stages:
  - extractor_id: pdf-text
    config:
      max_pages: 100
  - extractor_id: ocr-rapidocr
    config:
      min_confidence: 0.7
  - extractor_id: select-longest-text

Use Configuration Files

For complex pipelines, always use configuration files:

# configuration.yml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
      config:
        max_pages: 200
    - extractor_id: markitdown
      config:
        enable_plugins: false
    - extractor_id: ocr-rapidocr
      config:
        min_confidence: 0.6
    - extractor_id: select-smart-override
      config:
        media_type_patterns: ["application/pdf", "image/*"]
        min_confidence_threshold: 0.7
        min_text_length: 20

Test on Samples

Always test pipelines on representative samples:

# Test on small corpus first
biblicus extract test-corpus --configuration pipeline.yml

Common Pipeline Recipes

Universal Pipeline

Handle any document type:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: markitdown
    - extractor_id: ocr-rapidocr
    - extractor_id: stt-openai
    - extractor_id: select-text

Quality-First Pipeline

Prioritize accuracy:

extractor_id: pipeline
config:
  stages:
    - extractor_id: docling-granite
    - extractor_id: docling-smol
    - extractor_id: ocr-rapidocr
    - extractor_id: select-text

Speed-First Pipeline

Prioritize performance:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: metadata-text
    - extractor_id: select-text

Research Pipeline

Maximum extraction quality:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: markitdown
    - extractor_id: ocr-paddleocr-vl
    - extractor_id: docling-granite
    - extractor_id: select-longest-text

Limitations

No Nested Pipelines

This is invalid:

# ❌ Invalid - nested pipelines not allowed
extractor_id: pipeline
config:
  stages:
    - extractor_id: pipeline  # Not allowed!
      config:
        stages: [...]

Linear Flow Only

Pipelines execute linearly. They support no branching or conditional logic; use selectors to choose between results instead.

No Stage Communication

Stages cannot communicate with each other directly. They share data only through the list of extraction results.

See Also