Pipeline Utilities
Meta-extractors for combining, selecting, and orchestrating extraction strategies.
Overview
Pipeline utility extractors don’t extract text themselves. Instead, they coordinate other extractors to:
Try multiple extractors and select the best result
Override extraction for specific items
Chain extractors sequentially
Implement fallback strategies
Route items to different extractors
Available Extractors
Selection Extractors
select-text
Tries a list of extractors in order and returns the output of the first one that succeeds.
Use case: Fallback chain (try extractor A, fall back to B, then C)
extractor_id: select-text
config:
  extractors:
    - pdf-text
    - ocr-rapidocr
    - markitdown
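The fallback behavior described above can be sketched in a few lines of Python. This is a minimal illustration of the selection logic, not the library's implementation: the `Extractor` signature, the `select_first_text` name, and the rule that `None`, blank text, or a raised exception counts as failure are all assumptions made for the example.

```python
from typing import Callable, Optional, Sequence

# Hypothetical extractor signature: bytes in, text out (None or "" on failure).
Extractor = Callable[[bytes], Optional[str]]


def select_first_text(extractors: Sequence[Extractor], data: bytes) -> Optional[str]:
    """Return the text from the first extractor that succeeds, or None."""
    for extract in extractors:
        try:
            text = extract(data)
        except Exception:
            # Assumption: a raised exception counts as a failed attempt.
            continue
        if text and text.strip():
            # Short-circuit: extractors later in the list never run.
            return text
    return None
```

Because the loop returns as soon as one extractor produces text, ordering the list from cheapest to most expensive keeps the common case fast.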
select-longest-text
Runs all extractors and selects the longest output.
Use case: Maximize extracted content when multiple strategies work
extractor_id: select-longest-text
config:
  extractors:
    - pdf-text
    - markitdown
    - unstructured
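As a rough sketch of the "run everything, keep the longest" rule, the function below collects every non-empty output and compares lengths. The names, the treatment of exceptions as failures, and the tie-breaking rule (earliest extractor wins) are assumptions for illustration; the real extractor may behave differently on ties.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical extractor signature: bytes in, text out (None or "" on failure).
Extractor = Callable[[bytes], Optional[str]]


def select_longest_text(extractors: Sequence[Extractor], data: bytes) -> Optional[str]:
    """Run every extractor and return the longest non-empty output, or None."""
    outputs: List[str] = []
    for extract in extractors:
        try:
            text = extract(data)
        except Exception:
            # Assumption: a failing extractor is skipped, not fatal.
            continue
        if text:
            outputs.append(text)
    # max() returns the first maximal element, so ties go to the earlier extractor.
    return max(outputs, key=len, default=None)
```

Note that length is only a proxy for quality: an OCR pass that emits noise can be longer than a clean text extraction.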
Override Extractors
select-override
Uses a default extractor for every item, except for items whose ID is listed under overrides.
Use case: Manual overrides for problematic documents
extractor_id: select-override
config:
  default_extractor: pdf-text
  overrides:
    - item_id: abc123
      extractor: ocr-rapidocr
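The lookup implied by this configuration can be sketched as follows. The dictionary mirrors the YAML above; the `resolve_override` function is a hypothetical helper written for this example, and "first matching rule wins" is an assumption.

```python
from typing import Any, Dict

# Python equivalent of the YAML config shown above.
OVERRIDE_CONFIG: Dict[str, Any] = {
    "default_extractor": "pdf-text",
    "overrides": [
        {"item_id": "abc123", "extractor": "ocr-rapidocr"},
    ],
}


def resolve_override(config: Dict[str, Any], item_id: str) -> str:
    """Return the extractor ID to use for an item."""
    for rule in config.get("overrides", []):
        if rule["item_id"] == item_id:
            # Assumption: the first rule that names this item wins.
            return rule["extractor"]
    return config["default_extractor"]
```

With this config, item `abc123` resolves to `ocr-rapidocr` and every other item resolves to `pdf-text`.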
select-smart-override
Routes items to different extractors based on media type patterns, using default_extractor when no pattern matches.
Use case: Automatic routing by document type
extractor_id: select-smart-override
config:
  default_extractor: pdf-text
  overrides:
    - media_type_pattern: "image/.*"
      extractor: ocr-rapidocr
    - media_type_pattern: "audio/.*"
      extractor: stt-deepgram
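Pattern-based routing can be illustrated with Python's `re` module. This sketch assumes that the patterns are regular expressions, that a pattern must match the entire media type, and that rules are checked in order with the first match winning; none of these details are confirmed by the configuration alone, and `route_by_media_type` is a hypothetical name.

```python
import re
from typing import Any, Dict

# Python equivalent of the YAML config shown above.
ROUTING_CONFIG: Dict[str, Any] = {
    "default_extractor": "pdf-text",
    "overrides": [
        {"media_type_pattern": "image/.*", "extractor": "ocr-rapidocr"},
        {"media_type_pattern": "audio/.*", "extractor": "stt-deepgram"},
    ],
}


def route_by_media_type(config: Dict[str, Any], media_type: str) -> str:
    """Return the extractor ID for a media type, or the default if no rule matches."""
    for rule in config.get("overrides", []):
        # Assumption: full-string regex match, first matching rule wins.
        if re.fullmatch(rule["media_type_pattern"], media_type):
            return rule["extractor"]
    return config["default_extractor"]
```

Under these assumptions, `image/png` routes to `ocr-rapidocr`, `audio/mpeg` to `stt-deepgram`, and `application/pdf` falls through to the default `pdf-text`.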
Composition Extractor
pipeline
Chains multiple extractors sequentially, concatenating results.
Use case: Combine metadata with content, or process in stages
extractor_id: pipeline
config:
  extractors:
    - metadata-text
    - pdf-text
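A minimal sketch of sequential composition is shown below. It assumes that every stage receives the original item, that empty outputs are dropped, and that results are joined with a blank line; the separator and the `run_pipeline` name are illustrative choices, not documented behavior.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical extractor signature: bytes in, text out (None or "" if nothing found).
Extractor = Callable[[bytes], Optional[str]]


def run_pipeline(
    extractors: Sequence[Extractor], data: bytes, separator: str = "\n\n"
) -> str:
    """Run every stage in order and concatenate the non-empty outputs."""
    parts: List[str] = []
    for extract in extractors:
        text = extract(data)  # Every stage runs; there is no short-circuit.
        if text:
            parts.append(text)
    return separator.join(parts)
```

Unlike the selection extractors, the output here contains contributions from every stage, in the order the stages are listed.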
Common Patterns
PDF Fallback Strategy
Try direct text extraction first, then a VLM-based extractor, and fall back to traditional OCR last:
extractor_id: select-text
config:
  extractors:
    - pdf-text       # Fast for PDFs with text
    - docling-smol   # VLM for complex layouts
    - ocr-rapidocr   # Traditional OCR fallback
Media Type Routing
Route different media types to specialized extractors:
extractor_id: select-smart-override
config:
  default_extractor: pass-through-text
  overrides:
    - media_type_pattern: "application/pdf"
      extractor: pdf-text
    - media_type_pattern: "image/.*"
      extractor: ocr-rapidocr
    - media_type_pattern: "audio/.*"
      extractor: stt-deepgram
    - media_type_pattern: "application/vnd\\.openxmlformats-officedocument\\..*"
      extractor: markitdown
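The last pattern in this config escapes its dots. In a YAML double-quoted string, `\\` decodes to a single backslash, so the regex engine sees `\.` and matches a literal dot. The check below illustrates the difference, assuming the patterns are regular expressions matched against the whole media type; the `matches` helper is hypothetical.

```python
import re

# What the YAML string "application/vnd\\.openxmlformats-officedocument\\..*" decodes to.
OOXML_PATTERN = r"application/vnd\.openxmlformats-officedocument\..*"
# The same pattern without escapes: each "." now matches any character.
UNESCAPED_PATTERN = "application/vnd.openxmlformats-officedocument..*"

DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
XLSX = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
LEGACY_DOC = "application/msword"
# A made-up media type that differs from the OOXML prefix only where the dots are.
LOOKALIKE = "application/vndXopenxmlformats-officedocumentYfoo"


def matches(pattern: str, media_type: str) -> bool:
    """Assumption: a pattern must match the entire media type."""
    return re.fullmatch(pattern, media_type) is not None
```

Here `matches(OOXML_PATTERN, DOCX)` and `matches(OOXML_PATTERN, XLSX)` are true, while `LEGACY_DOC` and `LOOKALIKE` are rejected. The unescaped pattern accepts `LOOKALIKE`, which is why the escapes are worth keeping.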
Maximum Coverage
Extract as much as possible by running every extractor and selecting the longest output:
extractor_id: select-longest-text
config:
  extractors:
    - pdf-text
    - markitdown
    - unstructured
    - ocr-rapidocr
Metadata + Content
Combine metadata with extracted content:
extractor_id: pipeline
config:
  extractors:
    - metadata-text  # Extract title, tags
    - pdf-text       # Extract document content
Selective Override
Use a default strategy, with exceptions for specific items:
extractor_id: select-override
config:
  default_extractor: pdf-text
  overrides:
    - item_id: problematic-doc-123
      extractor: ocr-rapidocr
    - item_id: complex-layout-456
      extractor: docling-granite
Decision Tree
Which Utility Extractor to Use?
Need fallback behavior? → Use select-text
Tries extractors in order, stops at first success
Want to maximize output? → Use select-longest-text
Runs all extractors, picks longest result
Need per-document overrides? → Use select-override
Override specific items by ID
Want automatic routing? → Use select-smart-override
Route by media type pattern
Need sequential processing? → Use pipeline
Chain extractors, concatenate results
Performance Considerations
select-text (Short-Circuit)
Runs extractors sequentially
Stops at first success
Fast: Only runs what’s needed
Cost-effective: Minimal API calls
select-longest-text (Parallel)
Runs all extractors
Compares all outputs
Slower: Runs everything
Higher cost: All API calls executed
select-override (Conditional)
Routes to appropriate extractor
Fast: Single extractor per item
Efficient: No redundant processing
select-smart-override (Pattern-Based)
Routes by media type
Fast: Single extractor per item
Flexible: Pattern-based routing
pipeline (Sequential)
Runs extractors in order
Concatenates all results
Predictable: Always runs all stages
Comprehensive: Captures all outputs
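The cost differences above come down to how many extractors run per item. The helper below is a back-of-the-envelope model based only on the behavior described in this section; `extractor_runs` and its parameters are hypothetical and not part of any API.

```python
def extractor_runs(utility: str, configured: int, first_success_at: int = 1) -> int:
    """Estimate extractor invocations for one item.

    configured: number of extractors listed in the config.
    first_success_at: 1-based position of the first extractor that succeeds
        (only used by select-text; pass `configured` if none succeeds).
    """
    if utility == "select-text":
        # Stops at the first success, so later extractors are never invoked.
        return min(first_success_at, configured)
    if utility in ("select-longest-text", "pipeline"):
        # Both always run every configured extractor.
        return configured
    if utility in ("select-override", "select-smart-override"):
        # Routing picks exactly one extractor per item.
        return 1
    raise ValueError(f"unknown utility: {utility}")
```

For a four-extractor list where the second extractor succeeds, `select-text` costs two runs, `select-longest-text` costs four, and either override extractor costs one.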