Select Override Extractor
Extractor ID: select-override
Category: Pipeline Utilities
Overview
The select-override extractor routes by media type. For items whose media type matches a configured pattern, it always selects the last extraction produced by the preceding pipeline stages. Non-matching items also get the last extraction unless fallback_to_first is enabled, in which case they get the first.
This extractor is useful when you want to override extraction results for specific media types, such as always using OCR output for images or VLM output for PDFs, regardless of what other extractors produced.
Installation
No additional dependencies required. This extractor is part of the core Biblicus installation.
pip install biblicus
Supported Media Types
All media types are supported. Selection behavior depends on configured media type patterns.
Configuration
Config Schema
class SelectOverrideConfig(BaseModel):
    media_type_patterns: List[str] = ["*/*"]
    fallback_to_first: bool = False
Configuration Options
| Option | Type | Default | Description |
|---|---|---|---|
| media_type_patterns | list[str] | ["*/*"] | Glob patterns for media types to override |
| fallback_to_first | bool | false | If true, use first extraction for non-matching types |
Pattern Matching
Patterns use standard glob syntax:
*/* - Matches all media types
image/* - Matches all image types
application/pdf - Matches only PDF
audio/* - Matches all audio types
Selection Rules
The extractor selects text using the following rules:
Pattern match: If item media type matches any pattern, use last extraction
No match + fallback_to_first=true: Use first extraction
No match + fallback_to_first=false: Use last extraction
No extractions: Return None
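The rules above can be sketched in plain Python. This illustrates the documented behavior and is not the Biblicus source: the function name select_override and its signature are hypothetical, and fnmatchcase stands in for whichever fnmatch variant the extractor actually calls.

```python
from fnmatch import fnmatchcase
from typing import Optional, Sequence


def select_override(
    media_type: str,
    extractions: Sequence[str],
    media_type_patterns: Sequence[str] = ("*/*",),
    fallback_to_first: bool = False,
) -> Optional[str]:
    """Hypothetical helper mirroring the four documented selection rules."""
    if not extractions:
        return None  # No extractions: nothing to select
    matches = any(fnmatchcase(media_type, p) for p in media_type_patterns)
    if matches:
        return extractions[-1]  # Pattern match: last extraction
    if fallback_to_first:
        return extractions[0]  # No match + fallback: first extraction
    return extractions[-1]  # No match, no fallback: last extraction


# Stand-in strings for the text produced by two earlier pipeline stages.
outputs = ["text from pdf-text", "text from ocr-rapidocr"]
print(select_override("image/png", outputs, ["image/*"], fallback_to_first=True))
print(select_override("application/pdf", outputs, ["image/*"], fallback_to_first=True))
```

Note that with fallback_to_first left at false, the patterns do not change the result: every item gets the last extraction either way.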
Usage
Command Line
Select-override is always used within a pipeline:
biblicus extract my-corpus --extractor pipeline \
--config 'stages=[{"extractor_id":"pdf-text"},{"extractor_id":"ocr-rapidocr"},{"extractor_id":"select-override","config":{"media_type_patterns":["image/*"]}}]'
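Writing the inline stages value by hand is error-prone. As a minimal sketch, assuming the CLI accepts exactly the stages=<JSON> form shown above, the argument can be generated from a Python list with the standard library. The variable names here are illustrative only.

```python
import json
import shlex

stages = [
    {"extractor_id": "pdf-text"},
    {"extractor_id": "ocr-rapidocr"},
    {
        "extractor_id": "select-override",
        "config": {"media_type_patterns": ["image/*"]},
    },
]

# Compact separators reproduce the spacing used in the command above.
config_arg = "stages=" + json.dumps(stages, separators=(",", ":"))

command = [
    "biblicus", "extract", "my-corpus",
    "--extractor", "pipeline",
    "--config", config_arg,
]

# shlex.join quotes the JSON so it can be pasted into a POSIX shell.
print(shlex.join(command))
```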
Configuration File
extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text
    - extractor_id: ocr-rapidocr
    - extractor_id: select-override
      config:
        media_type_patterns: ["image/*"]
        fallback_to_first: true
biblicus extract my-corpus --configuration configuration.yml
Python API
from biblicus import Corpus

corpus = Corpus.from_directory("my-corpus")
results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pdf-text"},
            {"extractor_id": "ocr-rapidocr"},
            {
                "extractor_id": "select-override",
                "config": {
                    "media_type_patterns": ["image/*"],
                    "fallback_to_first": True
                }
            }
        ]
    }
)
Examples
Override Images Only
Use OCR for images, text extraction for everything else:
extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text
    - extractor_id: ocr-rapidocr
    - extractor_id: select-override
      config:
        media_type_patterns: ["image/*"]
        fallback_to_first: true
For images: Uses ocr-rapidocr (last)
For PDFs: Uses pdf-text (first, due to fallback_to_first=true)
Override PDFs
Use VLM for PDFs, basic extraction for everything else:
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: docling-smol
    - extractor_id: select-override
      config:
        media_type_patterns: ["application/pdf"]
        fallback_to_first: true
For PDFs: Uses docling-smol (last)
For text files: Uses pass-through-text (first, due to fallback_to_first=true)
Override Multiple Types
Override specific types with different extractors:
from biblicus import Corpus

corpus = Corpus.from_directory("mixed-corpus")
results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pass-through-text"},
            {"extractor_id": "pdf-text"},
            {"extractor_id": "markitdown"},
            {"extractor_id": "ocr-rapidocr"},
            {
                "extractor_id": "select-override",
                "config": {
                    "media_type_patterns": ["image/*", "application/pdf"],
                    "fallback_to_first": True
                }
            }
        ]
    }
)
Always Use Last
Default behavior (no fallback):
extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text
    - extractor_id: docling-smol
    - extractor_id: select-override  # Uses last for all types
Behavior Details
Pattern Matching
Uses Python’s fnmatch for glob pattern matching:
# Exact match
"application/pdf" matches "application/pdf" only
# Wildcard
"image/*" matches "image/png", "image/jpeg", etc.
# Universal
"*/*" matches all media types
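The matches listed above can be checked directly with the standard library. This sketch uses fnmatch.fnmatchcase, which compares case-sensitively on every platform. Whether the extractor calls fnmatchcase or fnmatch.fnmatch (which normalizes case on some operating systems) is not stated here, so treat that choice as an assumption.

```python
from fnmatch import fnmatchcase

# (pattern, media type) pairs taken from the cases described above,
# plus one non-matching pair for contrast.
checks = [
    ("application/pdf", "application/pdf"),
    ("image/*", "image/png"),
    ("image/*", "application/pdf"),
    ("*/*", "audio/mpeg"),
]

for pattern, media_type in checks:
    # fnmatchcase takes the name first and the pattern second.
    result = fnmatchcase(media_type, pattern)
    print(f"{pattern!r} vs {media_type!r}: {result}")
```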
Last Wins
For matching types, the last extraction is always used, regardless of whether earlier extractions exist or are non-empty.
Fallback Behavior
When fallback_to_first=true and media type doesn’t match:
Use first extraction instead of last
Useful for preferring fast extractors for non-override types
When fallback_to_first=false (default):
Use last extraction for everything
Simpler logic, fewer cases
Pipeline Position
Select-override should be the last step in a pipeline. All extraction attempts should come before it.
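A pipeline configuration can be checked for this before it is run. The helper below is a hypothetical sketch, not part of Biblicus: it assumes stages are dictionaries with an extractor_id key, as in the Python API examples, and only inspects their order.

```python
from typing import Any, Dict, List


def selector_is_last(
    stages: List[Dict[str, Any]], selector_id: str = "select-override"
) -> bool:
    """Return True when selector_id is the final stage and appears nowhere earlier."""
    ids = [stage.get("extractor_id") for stage in stages]
    return bool(ids) and ids[-1] == selector_id and selector_id not in ids[:-1]


good = [
    {"extractor_id": "pdf-text"},
    {"extractor_id": "ocr-rapidocr"},
    {"extractor_id": "select-override"},
]
misplaced = [
    {"extractor_id": "pdf-text"},
    {"extractor_id": "select-override"},
    {"extractor_id": "ocr-rapidocr"},
]

print(selector_is_last(good))
print(selector_is_last(misplaced))
```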
When to Use Select-Override
Use select-override when:
You want simple media type-based routing
Last extractor should always win for specific types
You need basic override logic
Simplicity is important
Use select-smart-override when:
You need confidence-based selection
Intelligent fallback is desired
Quality metrics matter
Use select-text when:
Order matters (fast first)
First success is preferred
No media type routing needed
Use select-longest-text when:
Longest output is preferred
No routing needed
Best Practices
Place Override Extractors Last
Put the extractor you want to use for overrides at the end:
stages:
  - extractor_id: pdf-text        # Default for PDFs
  - extractor_id: docling-smol    # Override for PDFs
  - extractor_id: select-override
    config:
      media_type_patterns: ["application/pdf"]
Use Fallback for Efficiency
Enable fallback to prefer fast extractors for non-override types:
config:
  media_type_patterns: ["image/*"]
  fallback_to_first: true  # Use fast extractors for non-images
Be Specific with Patterns
Use specific patterns to avoid unintended matches:
# Good - specific
media_type_patterns: ["image/png", "image/jpeg"]
# Careful - broad
media_type_patterns: ["image/*"]
# Very broad
media_type_patterns: ["*/*"]
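To see how far a pattern list reaches before using it, the patterns can be tested against the media types present in a corpus. This is a minimal sketch with a hypothetical helper name and a made-up list of media types. It uses fnmatchcase as a stand-in for the extractor's matching.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def overridden_types(media_types: Iterable[str], patterns: List[str]) -> List[str]:
    """Return the media types that at least one pattern would match."""
    return [m for m in media_types if any(fnmatchcase(m, p) for p in patterns)]


# Illustrative media types; substitute the ones found in your corpus.
corpus_types = [
    "image/png",
    "image/jpeg",
    "image/svg+xml",
    "application/pdf",
    "text/plain",
]

for patterns in (["image/png", "image/jpeg"], ["image/*"], ["*/*"]):
    print(patterns, "->", overridden_types(corpus_types, patterns))
```

In this example image/* also captures image/svg+xml, which an OCR-oriented override may not be intended to handle.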
Always Place Last
Select-override should always be the final step:
stages:
  - extractor-1
  - extractor-2
  - extractor-3
  - select-override  # Always last
Use Cases
Image-Specific Processing
Use advanced OCR for images, basic extraction for documents:
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: ocr-paddleocr-vl  # For images
    - extractor_id: select-override
      config:
        media_type_patterns: ["image/*"]
        fallback_to_first: true
PDF Override
Use VLM for PDFs, simpler extractors for other types:
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: markitdown
    - extractor_id: docling-smol  # For PDFs
    - extractor_id: select-override
      config:
        media_type_patterns: ["application/pdf"]
        fallback_to_first: true
Multi-Type Override
Override multiple specific types:
from biblicus import Corpus

corpus = Corpus.from_directory("corpus")
results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pass-through-text"},
            {"extractor_id": "pdf-text"},
            {"extractor_id": "ocr-rapidocr"},
            {
                "extractor_id": "select-override",
                "config": {
                    "media_type_patterns": ["image/*", "application/pdf"],
                    "fallback_to_first": True
                }
            }
        ]
    }
)
Comparison with Other Selectors
| Feature | select-override | select-text | select-longest | select-smart-override |
|---|---|---|---|---|
| Selection | Last for pattern | First usable | Longest | Intelligent |
| Media type aware | ✅ | ❌ | ❌ | ✅ |
| Confidence aware | ❌ | ❌ | ❌ | ✅ |
| Quality aware | ❌ | ❌ | ✅ | ✅ |
| Complexity | Simple | Simple | Simple | Complex |
| Override control | Last only | None | None | Configurable |
See Also
extraction.md - Extraction pipeline concepts