Select Text Extractor
Extractor ID: select-text
Category: Pipeline Utilities
Overview
The select-text extractor chooses the first usable extracted text from previous pipeline stages. It implements a simple, deterministic selection policy for fallback chains where multiple extractors may produce results for the same item.
This extractor is fundamental to pipeline composition, enabling graceful fallback patterns where you try fast extractors first and fall back to more powerful (but slower) alternatives.
Installation
No additional dependencies required. This extractor is part of the core Biblicus installation.
pip install biblicus
Supported Media Types
All media types are supported. This extractor operates on previous extraction results, not the raw item.
Configuration
Config Schema
from pydantic import BaseModel

class SelectTextExtractorConfig(BaseModel):
    # This extractor requires no configuration
    pass
Configuration Options
This extractor is intentionally minimal and accepts no configuration options.
Selection Rules
The extractor selects text using the following rules:
Usable text: Select the first extraction with non-empty text (after stripping whitespace)
Any text: If no usable text exists, select the first extraction even if empty
No extractions: If no previous extractions exist, return None
This ensures deterministic behavior: given the same pipeline order and inputs, the same extraction is always selected.
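The three rules above can be sketched in a few lines of Python. This is a minimal illustration, not the Biblicus implementation: the `Extraction` dataclass and the `select_first_usable` function are hypothetical names, and the stage index values are arbitrary. Only the field names `producer_extractor_id` and `source_stage_index` are taken from this page.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Extraction:
    """Hypothetical stand-in for the output of one earlier pipeline stage."""
    text: str
    producer_extractor_id: str
    source_stage_index: int


def select_first_usable(extractions: Sequence[Extraction]) -> Optional[Extraction]:
    """Apply the three selection rules in order (illustrative only)."""
    # Rule 1: first extraction whose text is non-empty after stripping whitespace.
    for extraction in extractions:
        if extraction.text.strip():
            return extraction
    # Rule 2: no usable text, so fall back to the first extraction even if empty.
    if extractions:
        return extractions[0]
    # Rule 3: there were no previous extractions at all.
    return None


previous = [
    Extraction(text="   ", producer_extractor_id="pdf-text", source_stage_index=1),
    Extraction(text="Scanned page text", producer_extractor_id="ocr-rapidocr", source_stage_index=2),
]
selected = select_first_usable(previous)
print(selected.producer_extractor_id)
```

Because the function only walks the list in order, the result depends on nothing but the stage order and the stage outputs, which is the determinism property described above.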
Usage
Command Line
Select-text is always used within a pipeline:
biblicus extract my-corpus --extractor pipeline \
--config 'stages=[{"extractor_id":"pdf-text"},{"extractor_id":"ocr-rapidocr"},{"extractor_id":"select-text"}]'
Configuration File
extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text
    - extractor_id: ocr-rapidocr
    - extractor_id: select-text  # Select first usable result
biblicus extract my-corpus --configuration configuration.yml
Python API
from biblicus import Corpus

corpus = Corpus.from_directory("my-corpus")
results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pdf-text"},
            {"extractor_id": "ocr-rapidocr"},
            {"extractor_id": "select-text"},
        ]
    },
)
Examples
Fast-to-Slow Fallback
Try fast extraction first, fall back to slower methods:
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text  # Fastest
    - extractor_id: pdf-text           # Fast
    - extractor_id: ocr-rapidocr       # Moderate
    - extractor_id: docling-smol       # Slower
    - extractor_id: select-text        # Select first result
Text-First Strategy
Prefer text extraction over OCR:
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: select-text  # Prefer text over OCR
Multi-Format Corpus
Handle diverse document types:
from biblicus import Corpus

corpus = Corpus.from_directory("mixed-corpus")
results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pass-through-text"},
            {"extractor_id": "pdf-text"},
            {"extractor_id": "markitdown"},
            {"extractor_id": "unstructured"},
            {"extractor_id": "select-text"},
        ]
    },
)
OCR Fallback Chain
Try multiple OCR approaches:
extractor_id: pipeline
config:
  stages:
    - extractor_id: ocr-rapidocr
    - extractor_id: ocr-paddleocr-vl
    - extractor_id: docling-smol
    - extractor_id: select-text
Behavior Details
Pipeline Position
Select-text should be the last step in a pipeline. All extraction attempts should come before it.
Empty Results
If all previous extractors produce empty text, select-text returns the first empty result (not None). This distinguishes “processed but empty” from “not processed.”
Source Tracking
The selected extraction retains its producer_extractor_id and source_stage_index, allowing you to identify which extractor produced the final text.
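One practical use of these fields is to tally which extractor ended up supplying the final text across a corpus. The sketch below assumes each selected extraction can be read as a simple record. The `SelectedText` dataclass, the `count_by_producer` helper, and the sample values are hypothetical and do not reflect the actual shape of Biblicus results.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SelectedText:
    """Hypothetical record for one selected extraction."""
    text: str
    producer_extractor_id: str
    source_stage_index: int


def count_by_producer(selections: Iterable[SelectedText]) -> Counter:
    """Tally how many items each upstream extractor ended up supplying."""
    return Counter(selection.producer_extractor_id for selection in selections)


# Made-up sample data for illustration.
selections = [
    SelectedText("Chapter 1", "pdf-text", 1),
    SelectedText("Invoice total", "ocr-rapidocr", 2),
    SelectedText("Chapter 2", "pdf-text", 1),
]
for extractor_id, count in count_by_producer(selections).most_common():
    print(f"{extractor_id}: {count}")
```

A tally like this shows how often the fallback stages are actually needed, which can inform how the pipeline is ordered.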
Determinism
Given the same pipeline configuration and inputs, select-text always produces the same result. This makes it suitable for reproducible research and testing.
When to Use Select-Text
Use select-text when:
You want the first successful extraction
Order matters (try cheap extractors first)
You need deterministic, predictable selection
Simplicity is preferred
Use select-longest-text when:
Multiple extractors may succeed
You want the most complete output
Order doesn’t matter
Use select-override when:
You want to override specific media types
Last extractor should win for certain items
Use select-smart-override when:
You need intelligent routing by media type
Quality metrics (confidence, length) matter
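The difference between the first two selectors is easiest to see on the same set of stage outputs. The sketch below is illustrative only: `first_usable` follows the selection rules documented on this page, while `longest` is an assumed reading of what select-longest-text does, based solely on its name and the description above. Both function names are hypothetical.

```python
from typing import List, Optional


def first_usable(texts: List[str]) -> Optional[str]:
    """First text that is non-empty after stripping (the select-text policy)."""
    for text in texts:
        if text.strip():
            return text
    # No usable text: return the first entry even if empty, or None if no entries.
    return texts[0] if texts else None


def longest(texts: List[str]) -> Optional[str]:
    """Longest stripped text; an assumed reading of select-longest-text."""
    return max(texts, key=lambda text: len(text.strip()), default=None)


stage_outputs = ["Title only", "Title plus the full body of the document"]
print(first_usable(stage_outputs))
print(longest(stage_outputs))
```

Reversing `stage_outputs` changes what `first_usable` returns but not what `longest` returns, which is why stage order matters for select-text.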
Best Practices
Order Extractors by Speed
Put faster extractors first:
stages:
  - extractor_id: pass-through-text  # Instant
  - extractor_id: pdf-text           # Fast
  - extractor_id: markitdown         # Moderate
  - extractor_id: docling-smol       # Slow
  - extractor_id: select-text
Order Extractors by Accuracy
Or prioritize accuracy:
stages:
  - extractor_id: docling-granite  # Best accuracy
  - extractor_id: docling-smol     # Good accuracy
  - extractor_id: ocr-rapidocr     # Basic OCR
  - extractor_id: select-text
Always Place Last
Select-text should always be the final step:
stages:
  - extractor_id: extractor-1
  - extractor_id: extractor-2
  - extractor_id: extractor-3
  - extractor_id: select-text  # Always last
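When pipeline configurations are built in code, a small helper can enforce this rule so that select-text cannot end up in the middle of the stage list. The following is a sketch under assumptions: `build_fallback_config` and `SELECTOR_ID` are hypothetical names that are not part of Biblicus, and the returned dictionary simply mirrors the `config` shape used in the Python API example on this page.

```python
from typing import Dict, List

SELECTOR_ID = "select-text"


def build_fallback_config(extractor_ids: List[str]) -> Dict[str, List[Dict[str, str]]]:
    """Return a pipeline config whose final stage is select-text (hypothetical helper)."""
    if not extractor_ids:
        raise ValueError("at least one extractor is required before select-text")
    if SELECTOR_ID in extractor_ids:
        raise ValueError("list only the extractors; select-text is appended automatically")
    # One stage per extractor, in the order given, followed by the selector.
    stages = [{"extractor_id": extractor_id} for extractor_id in extractor_ids]
    stages.append({"extractor_id": SELECTOR_ID})
    return {"stages": stages}


config = build_fallback_config(["pdf-text", "ocr-rapidocr"])
print(config)
```

The resulting dictionary could then be passed as the `config` argument alongside `extractor_id="pipeline"`.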
Combine with Media Type Routing
Use select-text within smart routing:
extractor_id: select-smart-override
config:
  default_extractor: pipeline
  default_config:
    stages:
      - extractor_id: pass-through-text
      - extractor_id: select-text
  overrides:
    - media_type_pattern: "application/pdf"
      extractor: pipeline
      config:
        stages:
          - extractor_id: pdf-text
          - extractor_id: docling-smol
          - extractor_id: select-text
Use Cases
Heterogeneous Corpus
Process corpora with mixed document types:
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: markitdown
    - extractor_id: ocr-rapidocr
    - extractor_id: select-text
Cost Optimization
Try free/cheap methods before expensive APIs:
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text  # Free
    - extractor_id: pdf-text           # Free
    - extractor_id: stt-openai         # Paid API
    - extractor_id: select-text
Graceful Degradation
Provide fallbacks for when preferred extractors fail:
extractor_id: pipeline
config:
  stages:
    - extractor_id: docling-granite  # Preferred
    - extractor_id: docling-smol     # Fallback 1
    - extractor_id: ocr-rapidocr     # Fallback 2
    - extractor_id: metadata-text    # Last resort
    - extractor_id: select-text
Comparison with Other Selectors
| Feature | select-text | select-longest-text | select-override | select-smart-override |
|---|---|---|---|---|
| Selection | First usable | Longest text | Last for pattern | Intelligent |
| Order matters | ✅ | ❌ | Partial | Partial |
| Media type aware | ❌ | ❌ | ✅ | ✅ |
| Confidence aware | ❌ | ❌ | ❌ | ✅ |
| Complexity | Simple | Simple | Moderate | Complex |
See Also
extraction.md - Extraction pipeline concepts