Select Smart Override Extractor
Extractor ID: select-smart-override
Category: Pipeline Utilities
Overview
The select-smart-override extractor routes by media type and applies quality-aware selection: for items whose media type matches the configured patterns, it compares the pipeline's extraction results by confidence score and text length to decide which output to use.
This is the most sophisticated selector in Biblicus, combining media type routing with quality assessment. It’s ideal for production pipelines where you want to override specific types with higher-quality extractors while falling back intelligently when those extractors fail or produce poor results.
Installation
No additional dependencies required. This extractor is part of the core Biblicus installation.
pip install biblicus
Supported Media Types
All media types are supported. Selection behavior depends on configured media type patterns.
Configuration
Config Schema
class SelectSmartOverrideConfig(BaseModel):
    media_type_patterns: List[str] = ["*/*"]
    min_confidence_threshold: float = 0.7
    min_text_length: int = 10
Configuration Options
| Option | Type | Default | Description |
|---|---|---|---|
| media_type_patterns | list[str] | ["*/*"] | Glob patterns for media types to consider |
| min_confidence_threshold | float | 0.7 | Minimum confidence to consider extraction good (0.0-1.0) |
| min_text_length | int | 10 | Minimum text length for meaningful content |
Selection Rules
For items matching configured patterns, the extractor applies smart selection:
Last is meaningful: If the last extraction has meaningful content, use it
Previous is better: If last is empty/low-quality but a previous extraction is good, use the previous one
Use last anyway: Otherwise, use the last extraction (even if empty)
For non-matching items:
Always use the last extraction
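The rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Biblicus implementation: the `Candidate` class, `is_meaningful`, and `select` are hypothetical names, and the fallback scans earlier results from most recent to oldest, as described under Comparison with Last.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Optional, Sequence


@dataclass
class Candidate:
    """Hypothetical stand-in for one pipeline stage's extraction result."""
    producer_extractor_id: str
    text: str
    confidence: Optional[float] = None  # None: the extractor reported no score


def is_meaningful(candidate: Candidate, min_confidence: float, min_length: int) -> bool:
    """Apply the length and confidence checks described in this document."""
    long_enough = len(candidate.text.strip()) >= min_length
    confident = candidate.confidence is None or candidate.confidence >= min_confidence
    return long_enough and confident


def select(
    media_type: str,
    candidates: Sequence[Candidate],
    patterns: Sequence[str] = ("*/*",),
    min_confidence: float = 0.7,
    min_length: int = 10,
) -> Candidate:
    """Pick one candidate; `candidates` is in pipeline order and non-empty."""
    last = candidates[-1]
    # Non-matching media types: always use the last extraction.
    if not any(fnmatch(media_type, pattern) for pattern in patterns):
        return last
    # Rule 1: the last extraction wins when it is meaningful.
    if is_meaningful(last, min_confidence, min_length):
        return last
    # Rule 2: otherwise prefer the most recent meaningful earlier extraction.
    for earlier in reversed(candidates[:-1]):
        if is_meaningful(earlier, min_confidence, min_length):
            return earlier
    # Rule 3: nothing qualified, so use the last extraction anyway.
    return last


stages = [
    Candidate("pdf-text", "Quarterly report for the board of directors."),
    Candidate("ocr-rapidocr", "", confidence=0.0),
]
chosen = select("application/pdf", stages, patterns=("application/pdf",))
print(chosen.producer_extractor_id)  # pdf-text
```

In this example the last stage produced empty text, so the sketch falls back to the earlier pdf-text result. For a media type that matches no pattern, the same call would return the last candidate unchanged.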
Meaningful Content
Content is considered meaningful when:
Text length (stripped) >= min_text_length, AND
Confidence >= min_confidence_threshold, OR confidence is not available
This allows confident, substantial results to override less complete attempts.
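As a minimal sketch of this check, assuming a result exposes its text and an optional confidence value (the function name `has_meaningful_content` is hypothetical; the parameter names follow the config options):

```python
from typing import Optional


def has_meaningful_content(
    text: str,
    confidence: Optional[float],
    min_confidence_threshold: float = 0.7,
    min_text_length: int = 10,
) -> bool:
    """Return True when the text is long enough and the confidence check passes."""
    # Length is measured after stripping surrounding whitespace.
    long_enough = len(text.strip()) >= min_text_length
    # A missing confidence score passes the confidence check.
    confident_enough = confidence is None or confidence >= min_confidence_threshold
    return long_enough and confident_enough


print(has_meaningful_content("Invoice total: 42 EUR", 0.91))  # True
print(has_meaningful_content("Invoice total: 42 EUR", 0.40))  # False: low confidence
print(has_meaningful_content("Invoice total: 42 EUR", None))  # True: no score reported
print(has_meaningful_content("   ok   ", 0.99))               # False: 2 characters after stripping
```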
Usage
Command Line
Select-smart-override is always used within a pipeline:
biblicus extract my-corpus --extractor pipeline \
--config 'stages=[{"extractor_id":"pdf-text"},{"extractor_id":"ocr-rapidocr"},{"extractor_id":"select-smart-override","config":{"media_type_patterns":["application/pdf"]}}]'
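Hand-writing the JSON inside `--config` is error-prone. One option is to build the argument in Python and let the standard library handle serialization and shell quoting. This sketch assumes only the `stages=<JSON list>` form shown in the command above; the variable names are illustrative.

```python
import json
import shlex

# The same three stages as in the command above.
stages = [
    {"extractor_id": "pdf-text"},
    {"extractor_id": "ocr-rapidocr"},
    {
        "extractor_id": "select-smart-override",
        "config": {"media_type_patterns": ["application/pdf"]},
    },
]

# Compact separators keep the JSON free of spaces.
config_arg = "stages=" + json.dumps(stages, separators=(",", ":"))

command = [
    "biblicus", "extract", "my-corpus",
    "--extractor", "pipeline",
    "--config", config_arg,
]

# shlex.join quotes each argument so the line can be pasted into a POSIX shell.
print(shlex.join(command))
```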
Configuration File
extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text
    - extractor_id: ocr-rapidocr
    - extractor_id: select-smart-override
      config:
        media_type_patterns: ["application/pdf"]
        min_confidence_threshold: 0.7
        min_text_length: 10
biblicus extract my-corpus --configuration configuration.yml
Python API
from biblicus import Corpus
corpus = Corpus.from_directory("my-corpus")
results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pdf-text"},
            {"extractor_id": "ocr-rapidocr"},
            {
                "extractor_id": "select-smart-override",
                "config": {
                    "media_type_patterns": ["application/pdf"],
                    "min_confidence_threshold": 0.75,
                    "min_text_length": 20
                }
            }
        ]
    }
)
Examples
Smart PDF Processing
Use OCR only when text extraction fails:
extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text # Fast, works for digital PDFs
    - extractor_id: ocr-rapidocr # Slower, for scanned PDFs
    - extractor_id: select-smart-override
      config:
        media_type_patterns: ["application/pdf"]
        min_confidence_threshold: 0.6
        min_text_length: 10
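Note that the selector prefers the last stage: per the Selection Rules, the ocr-rapidocr output is used when it is meaningful; if it is empty or below the thresholds, the selector falls back to the pdf-text result.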
Image with VLM Fallback
Try fast OCR first, fall back to VLM for poor results:
extractor_id: pipeline
config:
  stages:
    - extractor_id: ocr-rapidocr # Fast
    - extractor_id: docling-smol # Accurate but slower
    - extractor_id: select-smart-override
      config:
        media_type_patterns: ["image/*"]
        min_confidence_threshold: 0.7
        min_text_length: 20
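Note that the selector prefers the last stage: per the Selection Rules, the docling-smol (VLM) output is used when it is meaningful; otherwise the selector falls back to the RapidOCR result if that meets the thresholds.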
High-Confidence Override
Only use expensive extractor when cheap one fails:
from biblicus import Corpus
corpus = Corpus.from_directory("documents")
results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "markitdown"},  # Fast
            {"extractor_id": "docling-granite"},  # Expensive
            {
                "extractor_id": "select-smart-override",
                "config": {
                    "media_type_patterns": ["application/pdf", "image/*"],
                    "min_confidence_threshold": 0.8,  # High bar
                    "min_text_length": 50
                }
            }
        ]
    }
)
Mixed Corpus with Smart Routing
Different strategies for different media types:
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: ocr-rapidocr
    - extractor_id: docling-smol
    - extractor_id: select-smart-override
      config:
        media_type_patterns: ["application/pdf", "image/*"]
        min_confidence_threshold: 0.7
        min_text_length: 15
Behavior Details
Pattern Matching
Uses Python’s fnmatch for glob pattern matching:
# Specific
"application/pdf" matches PDFs only
# Wildcard
"image/*" matches all images
# Multiple types
["application/pdf", "image/*"] matches PDFs and images
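The matching described above can be tried directly with the standard library. This is a small sketch; the helper name `matches_any` is hypothetical, and how Biblicus calls `fnmatch` internally is not shown in this document.

```python
from fnmatch import fnmatch
from typing import Sequence


def matches_any(media_type: str, patterns: Sequence[str]) -> bool:
    """Return True when the media type matches at least one glob pattern."""
    return any(fnmatch(media_type, pattern) for pattern in patterns)


print(matches_any("application/pdf", ["application/pdf"]))        # True
print(matches_any("image/png", ["image/*"]))                      # True
print(matches_any("text/plain", ["application/pdf", "image/*"]))  # False
print(matches_any("audio/mpeg", ["*/*"]))                         # True
```

`fnmatch.fnmatch` normalizes case according to the operating system; `fnmatch.fnmatchcase` is the platform-independent, case-sensitive variant if that distinction matters for your media types.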
Confidence Handling
With confidence: Must meet min_confidence_threshold
No confidence: Passes confidence check (assume good)
Mixed: An earlier extraction with a confidence score can override a later one without a score
Text Length
Text length is measured after stripping leading and trailing whitespace:
"   Hello   " → length 5 (not 11)
Pipeline Position
Select-smart-override should be the last step in a pipeline. All extraction attempts should come before it.
Comparison with Last
The smart override compares the last extraction against all previous extractions:
Check if last is meaningful
If not, find the most recent meaningful previous extraction
Use that one if it has good confidence
Otherwise, use last anyway
When to Use Select-Smart-Override
Use select-smart-override when:
You want intelligent quality-based selection
Confidence scores are available
Some extractors may fail or produce poor output
Cost/speed optimization matters
Use select-override when:
Simple last-wins logic is sufficient
No quality assessment needed
All extractors are equally reliable
Use select-text when:
First success is preferred
No quality comparison needed
Speed is critical
Use select-longest-text when:
Length is the only quality metric
No confidence scores available
Best Practices
Order Extractors by Cost/Speed
Put cheaper/faster extractors first:
stages:
  - extractor_id: pdf-text # Free, fast
  - extractor_id: ocr-rapidocr # Free, moderate
  - extractor_id: docling-granite # Expensive, slow
  - extractor_id: select-smart-override
Tune Thresholds
Adjust thresholds based on your needs:
# Conservative - prefer high quality
config:
  min_confidence_threshold: 0.8
  min_text_length: 50

# Aggressive - prefer fast extractors
config:
  min_confidence_threshold: 0.5
  min_text_length: 10
Use with Confidence-Producing Extractors
Smart override works best with extractors that produce confidence scores:
OCR extractors (RapidOCR, PaddleOCR-VL)
VLM extractors (potentially)
Monitor Selection Behavior
Track which extractors are being selected:
# The selected extraction retains producer_extractor_id
# Use this to analyze selection patterns
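A simple tally is often enough to see which extractors win. The sketch below works on hypothetical records: how to read producer_extractor_id from Biblicus results is not shown in this document, so the `selected` list stands in for whatever your pipeline reports.

```python
from collections import Counter

# Hypothetical records: one per corpus item, naming the extractor whose
# output was selected. Replace with values read from your own results.
selected = [
    {"item_id": "a.pdf", "producer_extractor_id": "pdf-text"},
    {"item_id": "b.pdf", "producer_extractor_id": "ocr-rapidocr"},
    {"item_id": "c.pdf", "producer_extractor_id": "pdf-text"},
]

counts = Counter(record["producer_extractor_id"] for record in selected)

# Print extractors from most to least frequently selected.
for extractor_id, count in counts.most_common():
    print(f"{extractor_id}: {count} ({count / len(selected):.0%})")
```

A fallback extractor that is selected far more often than expected can indicate thresholds that are too strict for the preferred stage.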
Always Place Last
Select-smart-override should always be the final step:
stages:
  - extractor-1
  - extractor-2
  - extractor-3
  - select-smart-override # Always last
Use Cases
Cost Optimization
Try free methods before expensive APIs:
extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text # Free
    - extractor_id: stt-openai # Paid
    - extractor_id: select-smart-override
      config:
        media_type_patterns: ["application/pdf", "audio/*"]
        min_confidence_threshold: 0.6
Quality Fallback
Use fast extractors when they work, expensive when they don’t:
extractor_id: pipeline
config:
  stages:
    - extractor_id: ocr-rapidocr # Fast
    - extractor_id: docling-granite # Accurate
    - extractor_id: select-smart-override
      config:
        media_type_patterns: ["image/*"]
        min_confidence_threshold: 0.75
        min_text_length: 20
Scanned Document Detection
Auto-detect scanned PDFs and apply OCR:
extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text # Fails on scanned PDFs
    - extractor_id: ocr-rapidocr # Works on scanned PDFs
    - extractor_id: select-smart-override
      config:
        media_type_patterns: ["application/pdf"]
        min_text_length: 50 # PDF text extraction should get substantial content
Production Pipeline
Robust multi-format processing:
from biblicus import Corpus
corpus = Corpus.from_directory("production-corpus")
results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pass-through-text"},
            {"extractor_id": "pdf-text"},
            {"extractor_id": "markitdown"},
            {"extractor_id": "ocr-rapidocr"},
            {"extractor_id": "docling-smol"},
            {
                "extractor_id": "select-smart-override",
                "config": {
                    "media_type_patterns": ["*/*"],
                    "min_confidence_threshold": 0.7,
                    "min_text_length": 15
                }
            }
        ]
    }
)
Tuning Guidelines
Confidence Threshold
0.5-0.6: Permissive (accept most results)
0.7: Balanced (default)
0.8-0.9: Strict (only high-confidence)
Text Length
5-10: Short snippets acceptable
10-20: Moderate content required
50+: Substantial content required
Combined Tuning
# For screenshots/images (often short text)
config:
  min_confidence_threshold: 0.7
  min_text_length: 5

# For documents (expect longer text)
config:
  min_confidence_threshold: 0.7
  min_text_length: 50
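Before settling on values, it can help to count how many sample results would pass under each setting. The sketch below reapplies the meaningful-content check from this document to a handful of made-up (text, confidence) pairs; the sample data and the `accepted` helper are hypothetical.

```python
from typing import List, Optional, Tuple

# Made-up sample results: (extracted text, confidence or None).
samples: List[Tuple[str, Optional[float]]] = [
    ("Save changes?", 0.92),                       # short UI text from a screenshot
    ("Total due: $1,240.00 by 2024-03-01", 0.74),  # one line from a document
    ("l1 |||", 0.35),                              # OCR noise
    ("", None),                                    # empty extraction
]


def accepted(min_confidence_threshold: float, min_text_length: int) -> int:
    """Count the samples that pass both the length and confidence checks."""
    count = 0
    for text, confidence in samples:
        long_enough = len(text.strip()) >= min_text_length
        confident = confidence is None or confidence >= min_confidence_threshold
        if long_enough and confident:
            count += 1
    return count


# Sweep the threshold and length values listed in the tuning guidelines.
for threshold in (0.5, 0.7, 0.9):
    for length in (5, 10, 50):
        passed = accepted(threshold, length)
        print(f"confidence>={threshold}, length>={length}: {passed}/{len(samples)} pass")
```

With these samples, a minimum length of 50 rejects everything, which illustrates why the screenshot profile uses a much lower length than the document profile.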
Comparison with Other Selectors
| Feature | select-smart-override | select-override | select-text | select-longest-text |
|---|---|---|---|---|
| Selection | Intelligent | Last for pattern | First usable | Longest |
| Media type aware | ✅ | ✅ | ❌ | ❌ |
| Confidence aware | ✅ | ❌ | ❌ | ❌ |
| Length aware | ✅ | ❌ | ❌ | ✅ |
| Complexity | High | Low | Low | Low |
| Best for | Production | Simple override | Fast fallback | Quality comparison |
See Also
extraction.md - Extraction pipeline concepts