Pipeline Extractor
Extractor ID: pipeline
Category: Pipeline Utilities
Overview
The pipeline extractor is a configuration shim that enables multi-stage extraction workflows. It allows you to compose multiple extractors into a sequential pipeline where each stage can build upon or choose from the results of previous stages.
Pipelines are the fundamental composition mechanism in Biblicus, enabling sophisticated extraction strategies like fallback chains, parallel extraction with selection, and media type-specific routing.
Installation
No additional dependencies required. This extractor is part of the core Biblicus installation.
pip install biblicus
Supported Media Types
All media types are supported. The pipeline delegates to the configured extractors, each of which handles its own media types.
Configuration
Config Schema
from typing import Any, Dict, List
from pydantic import BaseModel

class PipelineStageSpec(BaseModel):
    extractor_id: str
    config: Dict[str, Any] = {}

class PipelineExtractorConfig(BaseModel):
    stages: List[PipelineStageSpec]
Configuration Options
| Option | Type | Required | Description |
|---|---|---|---|
| stages | list | ✅ | Ordered list of extractor stages |
| extractor_id | str | ✅ | Extractor identifier for this stage |
| config | dict | ❌ | Configuration for this extractor |
Constraints
Must have at least one stage
Cannot include pipeline as a stage (no nested pipelines)
Stages are executed in order
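The constraints above can be checked before a configuration is handed to Biblicus. The following is a minimal sketch that works on plain dictionaries. `validate_stages` is a hypothetical helper, not part of the Biblicus API, and the library's own validation may check more than this or report errors differently.

```python
from typing import Any, Dict, List


def validate_stages(stages: List[Dict[str, Any]]) -> None:
    """Raise ValueError if the stage list breaks a documented constraint.

    Hypothetical helper for illustration only; not part of Biblicus.
    """
    if not stages:
        raise ValueError("pipeline must have at least one stage")
    for index, stage in enumerate(stages):
        if "extractor_id" not in stage:
            raise ValueError(f"stage {index} is missing extractor_id")
        if stage["extractor_id"] == "pipeline":
            raise ValueError(f"stage {index} nests a pipeline, which is not allowed")


# A valid two-stage configuration passes silently.
validate_stages([{"extractor_id": "pdf-text"}, {"extractor_id": "select-text"}])
```

Running a check like this early gives a clear error message before any extraction work starts.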
Usage
Command Line
Simple Pipeline
biblicus extract my-corpus --extractor pipeline \
--config 'stages=[{"extractor_id":"pdf-text"},{"extractor_id":"select-text"}]'
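Quoting JSON by hand on the command line is error-prone. As a sketch, the `--config` argument shown above can be generated from a Python list instead. This assumes only the `stages=<JSON>` form used in the command above; the variable names are illustrative.

```python
import json
import shlex

stages = [{"extractor_id": "pdf-text"}, {"extractor_id": "select-text"}]

# Compact JSON keeps the argument short; shlex.quote makes it safe for a POSIX shell.
config_arg = "stages=" + json.dumps(stages, separators=(",", ":"))
command = (
    "biblicus extract my-corpus --extractor pipeline --config "
    + shlex.quote(config_arg)
)
print(command)
```

The printed command can be pasted into a shell or passed to a process runner.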
Configuration File
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: ocr-rapidocr
    - extractor_id: select-text
biblicus extract my-corpus --configuration configuration.yml
Python API
from biblicus import Corpus

corpus = Corpus.from_directory("my-corpus")
results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pass-through-text"},
            {"extractor_id": "pdf-text"},
            {"extractor_id": "ocr-rapidocr"},
            {"extractor_id": "select-text"}
        ]
    }
)
With Per-Stage Configuration
results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {
                "extractor_id": "pdf-text",
                "config": {"max_pages": 100}
            },
            {
                "extractor_id": "ocr-rapidocr",
                "config": {"min_confidence": 0.7}
            },
            {"extractor_id": "select-longest-text"}
        ]
    }
)
Pipeline Patterns
Fallback Chain
Try extractors in order and use the first successful result:
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text  # Try text first
    - extractor_id: pdf-text  # Then PDF
    - extractor_id: markitdown  # Then Office docs
    - extractor_id: ocr-rapidocr  # Then OCR
    - extractor_id: select-text  # Use first success
Parallel Extraction + Selection
Run all extractors and choose the best result:
extractor_id: pipeline
config:
  stages:
    - extractor_id: ocr-rapidocr
    - extractor_id: ocr-paddleocr-vl
    - extractor_id: docling-smol
    - extractor_id: select-longest-text  # Choose longest
Media Type Routing
Route different types to different extractors:
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: ocr-rapidocr
    - extractor_id: stt-openai
    - extractor_id: select-text
Text files → pass-through-text
PDFs → pdf-text
Images → ocr-rapidocr
Audio → stt-openai
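The routing above works without any branching because each extractor simply produces nothing for media types it does not handle. A toy sketch of that idea follows. The stage functions are stand-ins that only check a media type prefix; they are assumptions for illustration, not the real extractors.

```python
from typing import Callable, List, Optional, Tuple

# Toy stand-ins for real extractors: each returns text for the media types
# it handles and None otherwise. Names and behavior are illustrative only.
Stage = Tuple[str, Callable[[str, bytes], Optional[str]]]


def make_stage(name: str, prefix: str) -> Stage:
    def extract(media_type: str, data: bytes) -> Optional[str]:
        if media_type.startswith(prefix):
            return f"{name} handled {len(data)} bytes"
        return None  # not this extractor's media type, so skip the item

    return (name, extract)


STAGES: List[Stage] = [
    make_stage("pass-through-text", "text/"),
    make_stage("pdf-text", "application/pdf"),
    make_stage("ocr-rapidocr", "image/"),
    make_stage("stt-openai", "audio/"),
]


def route(media_type: str, data: bytes) -> Optional[str]:
    """Run every stage in order and return the first non-empty output."""
    outputs = [extract(media_type, data) for _, extract in STAGES]
    return next((text for text in outputs if text), None)


print(route("image/png", b"\x89PNG"))
```

Only the OCR stand-in returns text for the PNG input, so the final selection step picks its output.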
Smart Override
Intelligent quality-based routing:
extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text  # Fast
    - extractor_id: docling-smol  # Accurate
    - extractor_id: select-smart-override
      config:
        media_type_patterns: ["application/pdf"]
        min_confidence_threshold: 0.7
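Values in media_type_patterns can be exact media types or wildcard forms such as image/*, which appears in other examples on this page. The sketch below shows one plausible way such patterns could be evaluated, assuming glob-style matching. That assumption and the helper name `matches_any` are not taken from Biblicus; check the selector's own documentation for the exact rules.

```python
from fnmatch import fnmatchcase
from typing import Iterable


def matches_any(media_type: str, patterns: Iterable[str]) -> bool:
    """Return True if media_type matches at least one glob-style pattern.

    Assumes glob semantics; the real selector may match differently.
    """
    return any(fnmatchcase(media_type, pattern) for pattern in patterns)


patterns = ["application/pdf", "image/*"]
print(matches_any("image/png", patterns))
print(matches_any("text/plain", patterns))
```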
Examples
Simple Text Extraction
Handle text and PDFs:
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: select-text
Comprehensive Document Processing
Maximum format coverage:
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: markitdown
    - extractor_id: unstructured
    - extractor_id: select-longest-text
OCR with Fallback
Try fast OCR first, then fall back to a vision-language model (VLM):
extractor_id: pipeline
config:
  stages:
    - extractor_id: ocr-rapidocr
    - extractor_id: docling-smol
    - extractor_id: select-smart-override
      config:
        media_type_patterns: ["image/*"]
        min_confidence_threshold: 0.75
Multilingual Pipeline
Handle multiple languages:
from biblicus import Corpus

corpus = Corpus.from_directory("multilingual")
results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pass-through-text"},
            {
                "extractor_id": "ocr-paddleocr-vl",
                "config": {"lang": "ch"}  # Chinese
            },
            {
                "extractor_id": "stt-openai",
                "config": {"language": "zh"}  # Chinese audio
            },
            {"extractor_id": "select-text"}
        ]
    }
)
Cost-Optimized Pipeline
Try free methods before paid APIs:
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text  # Free
    - extractor_id: pdf-text  # Free
    - extractor_id: markitdown  # Free
    - extractor_id: ocr-rapidocr  # Free
    - extractor_id: stt-deepgram  # Paid
    - extractor_id: select-text
Behavior Details
Sequential Execution
Stages execute in order. Each stage can access results from all previous stages.
Per-Item Processing
The pipeline runs completely for each item before moving to the next. It does not process all items through stage 1, then all through stage 2.
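The difference between per-item and per-stage processing comes down to loop nesting. This small sketch prints the visit order for a per-item pipeline; `execution_order` is an illustrative helper, not a Biblicus function.

```python
from typing import List, Tuple


def execution_order(items: List[str], stages: List[str]) -> List[Tuple[str, str]]:
    """Return (item, stage) pairs in the order a per-item pipeline visits them."""
    order = []
    for item in items:  # outer loop: one item at a time
        for stage in stages:  # inner loop: every stage for that item
            order.append((item, stage))
    return order


for item, stage in execution_order(
    ["a.pdf", "b.png"], ["pdf-text", "ocr-rapidocr", "select-text"]
):
    print(item, stage)
```

All three stages are listed for a.pdf before any stage is listed for b.png.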
Previous Extractions
Selector extractors (select-text, etc.) receive all previous stage outputs for the current item.
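To make the data flow concrete, here is a toy model in which every stage receives the outputs of the stages before it. The function names and signatures are assumptions made for this sketch and do not mirror the Biblicus extractor interface.

```python
from typing import Callable, List, Optional

# A stage receives the item and the outputs of all earlier stages for that item.
StageFn = Callable[[str, List[Optional[str]]], Optional[str]]


def run_stages(item: str, stages: List[StageFn]) -> List[Optional[str]]:
    previous: List[Optional[str]] = []
    for stage in stages:
        # Pass a copy so a stage cannot mutate the shared history.
        previous.append(stage(item, list(previous)))
    return previous


def empty(item: str, previous: List[Optional[str]]) -> Optional[str]:
    return None  # this stage produced nothing for the item


def upper(item: str, previous: List[Optional[str]]) -> Optional[str]:
    return item.upper()


def pick_first(item: str, previous: List[Optional[str]]) -> Optional[str]:
    # Toy selector: first non-empty output from earlier stages.
    return next((text for text in previous if text), None)


print(run_stages("hello", [empty, upper, pick_first]))
```

The selector never looks at the item itself here; it works purely from what earlier stages produced.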
Short-Circuiting
Some patterns enable short-circuiting:
select-text stops at first success
Extractors can return None to skip an item
Error Handling
Errors in individual stages are recorded but don’t halt the pipeline. The pipeline continues with remaining stages.
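The record-and-continue behavior described above can be pictured with a small sketch. It is a simplified model under assumed names (`run_with_error_capture`, a plain dictionary of error messages); how Biblicus actually stores stage errors is not shown here.

```python
from typing import Callable, Dict, List, Optional, Tuple

NamedStage = Tuple[str, Callable[[str], Optional[str]]]


def run_with_error_capture(
    item: str, stages: List[NamedStage]
) -> Tuple[List[Optional[str]], Dict[str, str]]:
    outputs: List[Optional[str]] = []
    errors: Dict[str, str] = {}
    for name, extract in stages:
        try:
            outputs.append(extract(item))
        except Exception as exc:  # record the failure and keep going
            errors[name] = f"{type(exc).__name__}: {exc}"
            outputs.append(None)
    return outputs, errors


def broken(item: str) -> str:
    raise RuntimeError("model not installed")


outputs, errors = run_with_error_capture(
    "doc", [("ocr", broken), ("echo", lambda text: text)]
)
print(outputs)
print(errors)
```

The failing stage contributes an empty result, and the later stage still runs and produces output.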
No Nested Pipelines
Pipelines cannot contain other pipeline extractors. This prevents infinite recursion and keeps configuration manageable.
Performance Considerations
Extraction Order
Order matters for performance:
# Fast to slow (efficient)
stages:
  - extractor_id: pass-through-text  # Instant
  - extractor_id: pdf-text  # Fast
  - extractor_id: ocr-rapidocr  # Moderate
  - extractor_id: docling-smol  # Slow
  - extractor_id: select-text  # Stop at first success

# Slow to fast (inefficient)
stages:
  - extractor_id: docling-smol  # Runs for everything!
  - extractor_id: pass-through-text
  - extractor_id: select-text
Selector Choice
select-text: Stops at first success (efficient)
select-longest-text: Runs all extractors (thorough but slow)
select-smart-override: Runs all extractors, then chooses using configurable rules such as min_confidence_threshold
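The practical difference between first-success and longest-text selection is easiest to see on a list of candidate outputs. The two functions below are simplified stand-ins written for this comparison; the real select-text and select-longest-text extractors may use different criteria for what counts as usable text.

```python
from typing import List, Optional


def select_first(candidates: List[Optional[str]]) -> Optional[str]:
    """First candidate that contains non-whitespace text."""
    return next((c for c in candidates if c and c.strip()), None)


def select_longest(candidates: List[Optional[str]]) -> Optional[str]:
    """Longest candidate by stripped length; None if all are empty."""
    usable = [c for c in candidates if c and c.strip()]
    return max(usable, key=lambda c: len(c.strip()), default=None)


candidates = [None, "Invoice", "Invoice 2024-01: total due 42.00"]
print(select_first(candidates))
print(select_longest(candidates))
```

With these candidates the first-success rule returns the short early result, while the longest-text rule returns the fuller one from the later stage.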
API Costs
Pipeline order affects API costs:
# Cost-optimized
stages:
  - extractor_id: pass-through-text  # Free
  - extractor_id: pdf-text  # Free
  - extractor_id: stt-openai  # Paid - only runs if free methods fail
  - extractor_id: select-text

# Expensive
stages:
  - extractor_id: stt-openai  # Paid - runs for everything!
  - extractor_id: pass-through-text
  - extractor_id: select-longest-text
Best Practices
Always Include a Selector
End pipelines with a selection stage:
stages:
  - extractor_id: extractor-1
  - extractor_id: extractor-2
  - extractor_id: select-text  # Always include
Order by Speed or Priority
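# By speed (recommended)
stages:
  - extractor_id: fast-extractor
  - extractor_id: moderate-extractor
  - extractor_id: slow-extractor
  - extractor_id: select-text

# By accuracy
stages:
  - extractor_id: best-extractor
  - extractor_id: good-extractor
  - extractor_id: fallback-extractor
  - extractor_id: select-text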
Configure Stages Appropriately
Provide per-stage configuration when needed:
stages:
  - extractor_id: pdf-text
    config:
      max_pages: 100
  - extractor_id: ocr-rapidocr
    config:
      min_confidence: 0.7
  - extractor_id: select-longest-text
Use Configuration Files
For complex pipelines, always use configuration files:
# configuration.yml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
      config:
        max_pages: 200
    - extractor_id: markitdown
      config:
        enable_plugins: false
    - extractor_id: ocr-rapidocr
      config:
        min_confidence: 0.6
    - extractor_id: select-smart-override
      config:
        media_type_patterns: ["application/pdf", "image/*"]
        min_confidence_threshold: 0.7
        min_text_length: 20
Test on Samples
Always test pipelines on representative samples:
# Test on small corpus first
biblicus extract test-corpus --configuration pipeline.yml
Common Pipeline Recipes
Universal Pipeline
Handle any document type:
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: markitdown
    - extractor_id: ocr-rapidocr
    - extractor_id: stt-openai
    - extractor_id: select-text
Quality-First Pipeline
Prioritize accuracy:
extractor_id: pipeline
config:
  stages:
    - extractor_id: docling-granite
    - extractor_id: docling-smol
    - extractor_id: ocr-rapidocr
    - extractor_id: select-text
Speed-First Pipeline
Prioritize performance:
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: metadata-text
    - extractor_id: select-text
Research Pipeline
Maximum extraction quality:
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: markitdown
    - extractor_id: ocr-paddleocr-vl
    - extractor_id: docling-granite
    - extractor_id: select-longest-text
Limitations
No Nested Pipelines
This is invalid:
# ❌ Invalid - nested pipelines not allowed
extractor_id: pipeline
config:
  stages:
    - extractor_id: pipeline  # Not allowed!
      config:
        stages: [...]
Linear Flow Only
Pipelines execute linearly. There is no branching or conditional logic; use selectors instead.
No Stage Communication
Stages cannot communicate directly. They share data only through the list of extraction results.
See Also
extraction.md - Extraction pipeline concepts