Pipeline Utilities

Meta-extractors for combining, selecting, and orchestrating extraction strategies.

Overview

Pipeline utility extractors don’t extract text themselves. Instead, they coordinate other extractors to:

Try multiple extractors and select the best result
Override extraction for specific items
Chain extractors sequentially
Implement fallback strategies
Route items to different extractors

Available Extractors

Selection Extractors

select-text 

Tries the listed extractors in order and uses the output of the first one that succeeds.

Use case: Fallback chain (try extractor A, fall back to B, then C)

extractor_id: select-text
config:
  extractors:
    - pdf-text
    - ocr-rapidocr
    - markitdown
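
To make the fallback behavior concrete, here is a minimal Python sketch of first-success selection. It is not the tool's implementation: the `Extractor` callable type, the function name `select_first_text`, and the rule that "success" means non-blank text returned without an exception are all assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence

# Hypothetical type: an extractor takes raw bytes and returns text (or None).
Extractor = Callable[[bytes], Optional[str]]


def select_first_text(extractors: Sequence[Extractor], data: bytes) -> Optional[str]:
    """Return the output of the first extractor that succeeds.

    Assumption: "success" means no exception and non-blank text.
    """
    for extract in extractors:
        try:
            text = extract(data)
        except Exception:
            continue  # treat a crash as a failure and try the next extractor
        if text and text.strip():
            return text  # short-circuit: later extractors never run
    return None


# Stub extractors standing in for pdf-text, markitdown, and ocr-rapidocr.
def failing(data: bytes) -> Optional[str]:
    raise ValueError("no text layer")


def blank(data: bytes) -> Optional[str]:
    return "   "


def ocr(data: bytes) -> Optional[str]:
    return "scanned page text"


result = select_first_text([failing, blank, ocr], b"%PDF-")
```

With these stubs, the first two extractors are rejected and the third supplies the result. List order therefore encodes priority: put the cheapest or most reliable extractor first.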

select-longest-text 

Runs all extractors and selects the longest output.

Use case: Maximize extracted content when multiple strategies work

extractor_id: select-longest-text
config:
  extractors:
    - pdf-text
    - markitdown
    - unstructured
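
The selection rule can be sketched in a few lines of Python. This is an illustrative sketch rather than the actual implementation: the function name `select_longest_text`, the handling of failed extractors (skipped), and the tie-breaking rule (first in list order) are assumptions.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical type: an extractor takes raw bytes and returns text (or None).
Extractor = Callable[[bytes], Optional[str]]


def select_longest_text(extractors: Sequence[Extractor], data: bytes) -> Optional[str]:
    """Run every extractor and keep the longest non-empty output."""
    candidates: List[str] = []
    for extract in extractors:
        try:
            text = extract(data)
        except Exception:
            continue  # assumption: a failing extractor is simply skipped
        if text:
            candidates.append(text)
    # max() returns the first of several equally long candidates,
    # so list order breaks ties.
    return max(candidates, key=len, default=None)


# Stub extractors with outputs of different lengths.
outputs = ["short", "a much longer extraction", "medium text"]
stubs = [lambda data, t=t: t for t in outputs]

longest = select_longest_text(stubs, b"")
```

Length is only a proxy for quality. An extractor that emits noise, for example OCR artifacts, can win on length alone, so it may be worth spot-checking results when mixing very different strategies.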

Override Extractors

select-override 

Overrides extraction for specific items by ID.

Use case: Manual overrides for problematic documents

extractor_id: select-override
config:
  default_extractor: pdf-text
  overrides:
    - item_id: abc123
      extractor: ocr-rapidocr
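
The lookup implied by this config is essentially a dictionary lookup with a default. The sketch below mirrors the YAML keys shown above as a plain Python dict. The function name `extractor_for_item` is hypothetical, and the real extractor may resolve overrides differently.

```python
from typing import Any, Dict

# The YAML config above, expressed as a Python dict.
config: Dict[str, Any] = {
    "default_extractor": "pdf-text",
    "overrides": [
        {"item_id": "abc123", "extractor": "ocr-rapidocr"},
    ],
}


def extractor_for_item(item_id: str, config: Dict[str, Any]) -> str:
    """Return the override for this item ID, or the default extractor."""
    by_id = {rule["item_id"]: rule["extractor"] for rule in config["overrides"]}
    return by_id.get(item_id, config["default_extractor"])


chosen_override = extractor_for_item("abc123", config)
chosen_default = extractor_for_item("some-other-item", config)
```

Here `abc123` resolves to `ocr-rapidocr`, and every other item falls through to `pdf-text`.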

select-smart-override 

Routes items to different extractors based on media type patterns (regular expressions matched against each item's media type).

Use case: Automatic routing by document type

extractor_id: select-smart-override
config:
  default_extractor: pdf-text
  overrides:
    - media_type_pattern: "image/.*"
      extractor: ocr-rapidocr
    - media_type_pattern: "audio/.*"
      extractor: stt-deepgram
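
Pattern-based routing can be sketched with Python's standard `re` module. Two details here are assumptions, not documented behavior: that a pattern must match the whole media type (`re.fullmatch`), and that the first matching rule wins. The function name `extractor_for_media_type` is hypothetical.

```python
import re
from typing import Any, Dict

# The YAML config above, expressed as a Python dict.
config: Dict[str, Any] = {
    "default_extractor": "pdf-text",
    "overrides": [
        {"media_type_pattern": "image/.*", "extractor": "ocr-rapidocr"},
        {"media_type_pattern": "audio/.*", "extractor": "stt-deepgram"},
    ],
}


def extractor_for_media_type(media_type: str, config: Dict[str, Any]) -> str:
    """Return the extractor of the first rule whose pattern matches."""
    for rule in config["overrides"]:
        # Assumption: the pattern must match the entire media type string.
        if re.fullmatch(rule["media_type_pattern"], media_type):
            return rule["extractor"]
    return config["default_extractor"]


routes = {
    media_type: extractor_for_media_type(media_type, config)
    for media_type in ("image/png", "audio/mpeg", "application/pdf")
}
```

If rules are evaluated in order, as assumed here, more specific patterns should be listed before broader ones.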

Composition Extractor

pipeline 

Chains multiple extractors sequentially, concatenating results.

Use case: Combine metadata with content, or process in stages

extractor_id: pipeline
config:
  extractors:
    - metadata-text
    - pdf-text
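
As a rough illustration of "chain and concatenate", the sketch below runs every stage and joins the outputs. The separator (a blank line), the choice to hand each stage the original input, and the function name `run_pipeline` are assumptions for illustration. The actual extractor may pass data between stages differently.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical type: an extractor takes raw bytes and returns text (or None).
Extractor = Callable[[bytes], Optional[str]]


def run_pipeline(
    extractors: Sequence[Extractor], data: bytes, separator: str = "\n\n"
) -> str:
    """Run every stage in order and concatenate the non-empty outputs."""
    parts: List[str] = []
    for extract in extractors:
        text = extract(data)
        if text:
            parts.append(text)
    return separator.join(parts)


# Stubs standing in for metadata-text and pdf-text.
def metadata_text(data: bytes) -> Optional[str]:
    return "Title: Quarterly Report"


def pdf_text(data: bytes) -> Optional[str]:
    return "Revenue grew in the third quarter."


combined = run_pipeline([metadata_text, pdf_text], b"%PDF-")
```

Unlike the selection extractors, nothing is discarded here: the order of the list becomes the order of the sections in the combined text.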

Common Patterns

PDF Fallback Strategy

Try direct text extraction first, then a VLM-based extractor for complex layouts, and fall back to traditional OCR last:

extractor_id: select-text
config:
  extractors:
    - pdf-text           # Fast for PDFs with a text layer
    - docling-smol       # VLM for complex layouts
    - ocr-rapidocr       # Traditional OCR fallback

Media Type Routing

Route different media types to specialized extractors:

extractor_id: select-smart-override
config:
  default_extractor: pass-through-text
  overrides:
    - media_type_pattern: "application/pdf"
      extractor: pdf-text
    - media_type_pattern: "image/.*"
      extractor: ocr-rapidocr
    - media_type_pattern: "audio/.*"
      extractor: stt-deepgram
    - media_type_pattern: "application/vnd\\.openxmlformats-officedocument\\..*"
      extractor: markitdown
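
The doubled backslashes in the last pattern are YAML escaping: inside a double-quoted YAML string, `\\` becomes a single backslash, so the regular expression the extractor sees contains `\.` (a literal dot). The short check below uses Python's `re` module to confirm what that pattern matches. Whether the tool uses full-string matching is an assumption.

```python
import re

# The pattern after YAML parsing: each "\\" in the double-quoted
# YAML string collapses to a single backslash.
pattern = r"application/vnd\.openxmlformats-officedocument\..*"

docx = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
xlsx = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
odt = "application/vnd.oasis.opendocument.text"

# Assumption: the pattern is applied to the whole media type string.
matches = {mt: bool(re.fullmatch(pattern, mt)) for mt in (docx, xlsx, odt)}
```

The Office Open XML types (`.docx`, `.xlsx`) match, while the OpenDocument type does not and would fall through to `default_extractor` under this config.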

Maximum Coverage

Extract as much as possible by selecting the longest output:

extractor_id: select-longest-text
config:
  extractors:
    - pdf-text
    - markitdown
    - unstructured
    - ocr-rapidocr

Metadata + Content

Combine metadata with extracted content:

extractor_id: pipeline
config:
  extractors:
    - metadata-text    # Extract title, tags
    - pdf-text         # Extract document content

Selective Override

Use a default extractor, with per-item exceptions:

extractor_id: select-override
config:
  default_extractor: pdf-text
  overrides:
    - item_id: problematic-doc-123
      extractor: ocr-rapidocr
    - item_id: complex-layout-456
      extractor: docling-granite

Decision Tree

Which Utility Extractor to Use?

Need fallback behavior? → Use select-text
- Tries extractors in order, stops at first success
Want to maximize output? → Use select-longest-text
- Runs all extractors, picks the longest result
Need per-document overrides? → Use select-override
- Override specific items by ID
Want automatic routing? → Use select-smart-override
- Route by media type pattern
Need sequential processing? → Use pipeline
- Chain extractors, concatenate results

Performance Considerations

select-text (Short-Circuit)

Runs extractors sequentially
Stops at first success
Fast: later extractors are skipped once one succeeds
Cost-effective: API-backed extractors further down the list are called only when earlier ones fail

select-longest-text (Parallel)

Runs all extractors
Compares all outputs
Slower: Runs everything
Higher cost: All API calls executed
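
The cost difference between select-text and select-longest-text comes down to how many extractors are invoked per item. The sketch below counts invocations for a given sequence of success/failure outcomes. The function names are hypothetical and the outcomes are made-up inputs, not measurements.

```python
from typing import Sequence


def calls_select_text(outcomes: Sequence[bool]) -> int:
    """Invocations when stopping at the first success (short-circuit)."""
    for count, succeeded in enumerate(outcomes, start=1):
        if succeeded:
            return count
    return len(outcomes)  # nothing succeeded, so everything was tried


def calls_select_longest(outcomes: Sequence[bool]) -> int:
    """Invocations when every extractor always runs."""
    return len(outcomes)


# Example: the first extractor fails, the remaining three would succeed.
outcomes = [False, True, True, True]
short_circuit_calls = calls_select_text(outcomes)
exhaustive_calls = calls_select_longest(outcomes)
```

For this example the short-circuit strategy invokes two extractors and the exhaustive strategy invokes all four. The gap widens as more extractors are listed or as earlier ones succeed more often.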

select-override (Conditional)

Routes to appropriate extractor
Fast: Single extractor per item
Efficient: No redundant processing

select-smart-override (Pattern-Based)

Routes by media type
Fast: Single extractor per item
Flexible: Pattern-based routing

pipeline (Sequential)

Runs extractors in order
Concatenates all results
Predictable: Always runs all stages
Comprehensive: Captures all outputs