# Pipeline Extractor

**Extractor ID:** `pipeline`

**Category:** [Pipeline Utilities](index.md)

## Overview

The pipeline extractor is a configuration shim that enables multi-stage extraction workflows. It allows you to compose multiple extractors into a sequential pipeline where each stage can build upon or choose from the results of previous stages.

Pipelines are the fundamental composition mechanism in Biblicus, enabling sophisticated extraction strategies like fallback chains, parallel extraction with selection, and media type-specific routing.

## Installation

No additional dependencies required. This extractor is part of the core Biblicus installation.

```bash
pip install biblicus
```

## Supported Media Types

All media types are supported. The pipeline delegates to configured extractors, each handling their own media types.

## Configuration

### Config Schema

```python
class PipelineStageSpec(BaseModel):
    extractor_id: str
    config: Dict[str, Any] = {}

class PipelineExtractorConfig(BaseModel):
    stages: List[PipelineStageSpec]
```

### Configuration Options

| Option | Type | Required | Description |
|--------|------|----------|-------------|
| `stages` | list | ✅ | Ordered list of extractor stages |
| `stages[].extractor_id` | str | ✅ | Extractor identifier for this stage |
| `stages[].config` | dict | ❌ | Configuration for this extractor |

### Constraints

- Must have at least one stage
- Cannot include `pipeline` as a stage (no nested pipelines)
- Stages are executed in order

## Usage

### Command Line

#### Simple Pipeline

```bash
biblicus extract my-corpus --extractor pipeline \
  --config 'stages=[{"extractor_id":"pdf-text"},{"extractor_id":"select-text"}]'
```

#### Configuration File

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: ocr-rapidocr
    - extractor_id: select-text
```

```bash
biblicus extract my-corpus --configuration configuration.yml
```

### Python API

```python
from biblicus import Corpus

corpus = Corpus.from_directory("my-corpus")

results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pass-through-text"},
            {"extractor_id": "pdf-text"},
            {"extractor_id": "ocr-rapidocr"},
            {"extractor_id": "select-text"}
        ]
    }
)
```

### With Per-Step Configuration

```python
results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {
                "extractor_id": "pdf-text",
                "config": {"max_pages": 100}
            },
            {
                "extractor_id": "ocr-rapidocr",
                "config": {"min_confidence": 0.7}
            },
            {"extractor_id": "select-longest-text"}
        ]
    }
)
```

## Pipeline Patterns

### Fallback Chain

Try extractors in order, use first success:

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text  # Try text first
    - extractor_id: pdf-text           # Then PDF
    - extractor_id: markitdown         # Then Office docs
    - extractor_id: ocr-rapidocr       # Then OCR
    - extractor_id: select-text        # Use first success
```

### Parallel Extraction + Selection

Run all extractors, choose best:

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: ocr-rapidocr
    - extractor_id: ocr-paddleocr-vl
    - extractor_id: docling-smol
    - extractor_id: select-longest-text  # Choose longest
```

### Media Type Routing

Route different types to different extractors:

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: ocr-rapidocr
    - extractor_id: stt-openai
    - extractor_id: select-text
```

Text files → pass-through-text
PDFs → pdf-text
Images → ocr-rapidocr
Audio → stt-openai

### Smart Override

Intelligent quality-based routing:

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text           # Fast
    - extractor_id: docling-smol       # Accurate
    - extractor_id: select-smart-override
      config:
        media_type_patterns: ["application/pdf"]
        min_confidence_threshold: 0.7
```

## Examples

### Simple Text Extraction

Handle text and PDFs:

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: select-text
```

### Comprehensive Document Processing

Maximum format coverage:

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: markitdown
    - extractor_id: unstructured
    - extractor_id: select-longest-text
```

### OCR with Fallback

Try fast OCR, fall back to VLM:

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: ocr-rapidocr
    - extractor_id: docling-smol
    - extractor_id: select-smart-override
      config:
        media_type_patterns: ["image/*"]
        min_confidence_threshold: 0.75
```

### Multilingual Pipeline

Handle multiple languages:

```python
from biblicus import Corpus

corpus = Corpus.from_directory("multilingual")

results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pass-through-text"},
            {
                "extractor_id": "ocr-paddleocr-vl",
                "config": {"lang": "ch"}  # Chinese
            },
            {
                "extractor_id": "stt-openai",
                "config": {"language": "zh"}  # Chinese audio
            },
            {"extractor_id": "select-text"}
        ]
    }
)
```

### Cost-Optimized Pipeline

Try free methods before paid APIs:

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text    # Free
    - extractor_id: pdf-text             # Free
    - extractor_id: markitdown           # Free
    - extractor_id: ocr-rapidocr         # Free
    - extractor_id: stt-deepgram         # Paid
    - extractor_id: select-text
```

## Behavior Details

### Sequential Execution

Stages execute in order. Each stage can access results from all previous stages.

### Per-Item Processing

The pipeline runs completely for each item before moving to the next. It does not process all items through stage 1, then all through stage 2.

### Previous Extractions

Selector extractors (select-text, etc.) receive all previous stage outputs for the current item.

### Short-Circuiting

Some patterns enable short-circuiting:
- select-text stops at first success
- Extractors can return None to skip

### Error Handling

Errors in individual stages are recorded but don't halt the pipeline. The pipeline continues with remaining stages.

### No Nested Pipelines

Pipelines cannot contain other pipeline extractors. This prevents infinite recursion and keeps configuration manageable.

## Performance Considerations

### Extraction Order

Order matters for performance:

```yaml
# Fast to slow (efficient)
stages:
  - pass-through-text  # Instant
  - pdf-text           # Fast
  - ocr-rapidocr       # Moderate
  - docling-smol       # Slow
  - select-text        # Stop at first success

# Slow to fast (inefficient)
stages:
  - docling-smol       # Runs for everything!
  - pass-through-text
  - select-text
```

### Selector Choice

- **select-text**: Stops at first success (efficient)
- **select-longest-text**: Runs all extractors (thorough but slow)
- **select-smart-override**: Runs all but intelligently chooses

### API Costs

Pipeline order affects API costs:

```yaml
# Cost-optimized
stages:
  - pass-through-text  # Free
  - pdf-text           # Free
  - stt-openai         # Paid - only runs if free methods fail
  - select-text

# Expensive
stages:
  - stt-openai         # Paid - runs for everything!
  - pass-through-text
  - select-longest-text
```

## Best Practices

### Always Include a Selector

End pipelines with a selection stage:

```yaml
stages:
  - extractor-1
  - extractor-2
  - select-text  # Always include
```

### Order by Speed or Priority

```yaml
# By speed (recommended)
stages:
  - fast-extractor
  - moderate-extractor
  - slow-extractor
  - select-text

# By accuracy
stages:
  - best-extractor
  - good-extractor
  - fallback-extractor
  - select-text
```

### Configure Steps Appropriately

Provide per-stage configuration when needed:

```yaml
stages:
  - extractor_id: pdf-text
    config:
      max_pages: 100
  - extractor_id: ocr-rapidocr
    config:
      min_confidence: 0.7
  - extractor_id: select-longest-text
```

### Use Configuration Files

For complex pipelines, always use configuration files:

```yaml
# configuration.yml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
      config:
        max_pages: 200
    - extractor_id: markitdown
      config:
        enable_plugins: false
    - extractor_id: ocr-rapidocr
      config:
        min_confidence: 0.6
    - extractor_id: select-smart-override
      config:
        media_type_patterns: ["application/pdf", "image/*"]
        min_confidence_threshold: 0.7
        min_text_length: 20
```

### Test on Samples

Always test pipelines on representative samples:

```bash
# Test on small corpus first
biblicus extract test-corpus --configuration pipeline.yml
```

## Common Pipeline Recipes

### Universal Pipeline

Handle any document type:

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: markitdown
    - extractor_id: ocr-rapidocr
    - extractor_id: stt-openai
    - extractor_id: select-text
```

### Quality-First Pipeline

Prioritize accuracy:

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: docling-granite
    - extractor_id: docling-smol
    - extractor_id: ocr-rapidocr
    - extractor_id: select-text
```

### Speed-First Pipeline

Prioritize performance:

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: metadata-text
    - extractor_id: select-text
```

### Research Pipeline

Maximum extraction quality:

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: markitdown
    - extractor_id: ocr-paddleocr-vl
    - extractor_id: docling-granite
    - extractor_id: select-longest-text
```

## Limitations

### No Nested Pipelines

This is invalid:

```yaml
# ❌ Invalid - nested pipelines not allowed
extractor_id: pipeline
config:
  stages:
    - extractor_id: pipeline  # Not allowed!
      config:
        stages: [...]
```

### Linear Flow Only

Pipelines execute linearly. No branching or conditional logic (use selectors instead).

### No Step Communication

Steps cannot directly communicate. They only share via the extraction results list.

## Related Extractors

### Selection Utilities

- [select-text](select-text.md) - First non-empty selection
- [select-longest-text](select-longest.md) - Longest output selection
- [select-override](select-override.md) - Simple override selection
- [select-smart-override](select-smart-override.md) - Intelligent routing

### Frequently Combined Extractors

- [pass-through-text](../text-document/pass-through.md) - Text files
- [pdf-text](../text-document/pdf.md) - PDF extraction
- [markitdown](../text-document/markitdown.md) - Office documents
- [ocr-rapidocr](../ocr/rapidocr.md) - Fast OCR
- [stt-openai](../speech-to-text/openai.md) - Audio transcription
- [docling-smol](../vlm-document/docling-smol.md) - VLM extraction

## See Also

- [Pipeline Utilities Overview](index.md)
- [Extractors Index](../index.md)
- [extraction.md](../../extraction.md) - Extraction pipeline concepts
- [Configuration File Format](../../extraction.md#configuration-files)