# Pipeline Utilities

Meta-extractors for combining, selecting, and orchestrating extraction strategies.

```{toctree}
:maxdepth: 1
:caption: Pipeline Utilities

select-text
select-longest
select-override
select-smart-override
pipeline
```

## Overview

Pipeline utility extractors don't extract text themselves. Instead, they coordinate other extractors to:

- Try multiple extractors and select the best result
- Override extraction for specific items
- Chain extractors sequentially
- Implement fallback strategies
- Route items to different extractors

## Available Extractors

### Selection Extractors

#### [select-text](select-text.md)

Selects the first successful extractor from a list.

**Use case**: Fallback chain (try extractor A, fall back to B, then C)

```yaml
extractor_id: select-text
config:
  extractors:
    - pdf-text
    - ocr-rapidocr
    - markitdown
```

#### [select-longest-text](select-longest.md)

Runs all extractors and selects the longest output.

**Use case**: Maximize extracted content when multiple strategies work

```yaml
extractor_id: select-longest-text
config:
  extractors:
    - pdf-text
    - markitdown
    - unstructured
```

### Override Extractors

#### [select-override](select-override.md)

Overrides extraction for specific items by ID.

**Use case**: Manual overrides for problematic documents

```yaml
extractor_id: select-override
config:
  default_extractor: pdf-text
  overrides:
    - item_id: abc123
      extractor: ocr-rapidocr
```

#### [select-smart-override](select-smart-override.md)

Routes items to different extractors based on media type patterns.

**Use case**: Automatic routing by document type

```yaml
extractor_id: select-smart-override
config:
  default_extractor: pdf-text
  overrides:
    - media_type_pattern: "image/.*"
      extractor: ocr-rapidocr
    - media_type_pattern: "audio/.*"
      extractor: stt-deepgram
```

### Composition Extractor

#### [pipeline](pipeline.md)

Chains multiple extractors sequentially, concatenating results.

**Use case**: Combine metadata with content, or process in stages

```yaml
extractor_id: pipeline
config:
  extractors:
    - metadata-text
    - pdf-text
```

## Common Patterns

### PDF Fallback Strategy

Try text extraction first, fall back to OCR:

```yaml
extractor_id: select-text
config:
  extractors:
    - pdf-text          # Fast for PDFs with text
    - docling-smol       # VLM for complex layouts
    - ocr-rapidocr       # Traditional OCR fallback
```

### Media Type Routing

Route different media types to specialized extractors:

```yaml
extractor_id: select-smart-override
config:
  default_extractor: pass-through-text
  overrides:
    - media_type_pattern: "application/pdf"
      extractor: pdf-text
    - media_type_pattern: "image/.*"
      extractor: ocr-rapidocr
    - media_type_pattern: "audio/.*"
      extractor: stt-deepgram
    - media_type_pattern: "application/vnd\\.openxmlformats-officedocument\\..*"
      extractor: markitdown
```

### Maximum Coverage

Extract as much as possible by selecting longest output:

```yaml
extractor_id: select-longest-text
config:
  extractors:
    - pdf-text
    - markitdown
    - unstructured
    - ocr-rapidocr
```

### Metadata + Content

Combine metadata with extracted content:

```yaml
extractor_id: pipeline
config:
  extractors:
    - metadata-text    # Extract title, tags
    - pdf-text         # Extract document content
```

### Selective Override

Use default strategy with exceptions:

```yaml
extractor_id: select-override
config:
  default_extractor: pdf-text
  overrides:
    - item_id: problematic-doc-123
      extractor: ocr-rapidocr
    - item_id: complex-layout-456
      extractor: docling-granite
```

## Decision Tree

### Which Utility Extractor to Use?

1. **Need fallback behavior?** → Use [select-text](select-text.md)
   - Tries extractors in order, stops at first success

2. **Want to maximize output?** → Use [select-longest-text](select-longest.md)
   - Runs all extractors, picks longest result

3. **Need per-document overrides?** → Use [select-override](select-override.md)
   - Override specific items by ID

4. **Want automatic routing?** → Use [select-smart-override](select-smart-override.md)
   - Route by media type pattern

5. **Need sequential processing?** → Use [pipeline](pipeline.md)
   - Chain extractors, concatenate results

## Performance Considerations

### select-text (Short-Circuit)

- Runs extractors sequentially
- Stops at first success
- **Fast**: Only runs what's needed
- **Cost-effective**: Minimal API calls

### select-longest-text (Parallel)

- Runs all extractors
- Compares all outputs
- **Slower**: Runs everything
- **Higher cost**: All API calls executed

### select-override (Conditional)

- Routes to appropriate extractor
- **Fast**: Single extractor per item
- **Efficient**: No redundant processing

### select-smart-override (Pattern-Based)

- Routes by media type
- **Fast**: Single extractor per item
- **Flexible**: Pattern-based routing

### pipeline (Sequential)

- Runs extractors in order
- Concatenates all results
- **Predictable**: Always runs all stages
- **Comprehensive**: Captures all outputs

## See Also

- [Extractors Overview](../index.md)
- [Text & Document Processing](../text-document/index.md)
- [OCR Extractors](../ocr/index.md)
- [VLM Extractors](../vlm-document/index.md)
- [Speech-to-Text](../speech-to-text/index.md)