Select Smart Override Extractor

Extractor ID: select-smart-override

Overview

The select-smart-override extractor implements intelligent media type-based routing with quality-aware selection. It compares extraction results using confidence scores and text length to make smart decisions about which output to use.

This is the most sophisticated selector in Biblicus, combining media type routing with quality assessment. It’s ideal for production pipelines where you want to override specific types with higher-quality extractors while falling back intelligently when those extractors fail or produce poor results.

Installation

No additional dependencies required. This extractor is part of the core Biblicus installation.

pip install biblicus

Supported Media Types

All media types are supported. Selection behavior depends on configured media type patterns.

Configuration

Config Schema

from typing import List
from pydantic import BaseModel

class SelectSmartOverrideConfig(BaseModel):
    media_type_patterns: List[str] = ["*/*"]
    min_confidence_threshold: float = 0.7
    min_text_length: int = 10

Configuration Options

Option	Type	Default	Description
`media_type_patterns`	list[str]	`["*/*"]`	Glob patterns for media types to consider
`min_confidence_threshold`	float	`0.7`	Minimum confidence to consider extraction good (0.0-1.0)
`min_text_length`	int	`10`	Minimum text length for meaningful content

Selection Rules

For items matching configured patterns, the extractor applies smart selection:

Last is meaningful: If the last extraction has meaningful content, use it
Previous is better: If last is empty/low-quality but a previous extraction is good, use the previous one
Use last anyway: Otherwise, use the last extraction (even if empty)

For non-matching items:

Always use the last extraction

Meaningful Content

Content is considered meaningful when:

Its text length (after stripping whitespace) is at least min_text_length, AND
Either its confidence is at least min_confidence_threshold, or no confidence score is available

This allows confident, substantial results to override less complete attempts.
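
The two conditions above can be written as a small predicate. This is a minimal sketch, not the Biblicus implementation: the function name `is_meaningful` and its signature are assumptions made for illustration, and only the thresholds and their defaults come from the config schema.

```python
from typing import Optional


def is_meaningful(
    text: Optional[str],
    confidence: Optional[float],
    min_confidence_threshold: float = 0.7,
    min_text_length: int = 10,
) -> bool:
    """Hypothetical helper: True when text is long enough and confidence is acceptable."""
    # Length is measured on the stripped text, so padding does not count.
    stripped = (text or "").strip()
    if len(stripped) < min_text_length:
        return False
    # A missing confidence score passes the confidence check.
    if confidence is None:
        return True
    return confidence >= min_confidence_threshold


print(is_meaningful("A full paragraph of text.", 0.9))   # long and confident
print(is_meaningful("A full paragraph of text.", None))  # long, no score available
print(is_meaningful("A full paragraph of text.", 0.4))   # long, low confidence
print(is_meaningful("   Hi   ", 0.99))                   # too short once stripped
```

Note that the length check and the confidence check must both pass; a high confidence score does not rescue a result that is too short.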

Usage

Command Line

Select-smart-override is always used within a pipeline:

biblicus extract my-corpus --extractor pipeline \
  --config 'stages=[{"extractor_id":"pdf-text"},{"extractor_id":"ocr-rapidocr"},{"extractor_id":"select-smart-override","config":{"media_type_patterns":["application/pdf"]}}]'
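
Hand-writing the JSON inside the --config argument is easy to get wrong. The sketch below builds the same `stages=[...]` string from a Python list and prints a shell-quoted command. It assumes only the argument format shown in the command above; it does not call Biblicus.

```python
import json
import shlex

# Same stages as the command-line example above.
stages = [
    {"extractor_id": "pdf-text"},
    {"extractor_id": "ocr-rapidocr"},
    {
        "extractor_id": "select-smart-override",
        "config": {"media_type_patterns": ["application/pdf"]},
    },
]

# Compact separators keep the JSON free of spaces.
config_arg = "stages=" + json.dumps(stages, separators=(",", ":"))

command = [
    "biblicus", "extract", "my-corpus",
    "--extractor", "pipeline",
    "--config", config_arg,
]

# shlex.quote wraps the JSON argument in single quotes for the shell.
print(" ".join(shlex.quote(part) for part in command))
```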

Configuration File

extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text
    - extractor_id: ocr-rapidocr
    - extractor_id: select-smart-override
      config:
        media_type_patterns: ["application/pdf"]
        min_confidence_threshold: 0.7
        min_text_length: 10

biblicus extract my-corpus --configuration configuration.yml

Python API

from biblicus import Corpus

corpus = Corpus.from_directory("my-corpus")

results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pdf-text"},
            {"extractor_id": "ocr-rapidocr"},
            {
                "extractor_id": "select-smart-override",
                "config": {
                    "media_type_patterns": ["application/pdf"],
                    "min_confidence_threshold": 0.75,
                    "min_text_length": 20
                }
            }
        ]
    }
)

Examples

Smart PDF Processing

Use OCR only when text extraction fails:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text      # Fast, works for digital PDFs
    - extractor_id: ocr-rapidocr  # Slower, for scanned PDFs
    - extractor_id: select-smart-override
      config:
        media_type_patterns: ["application/pdf"]
        min_confidence_threshold: 0.6
        min_text_length: 10

If pdf-text produces good content, use it. If empty/failed, use ocr-rapidocr.

Image with VLM Fallback

Try fast OCR first, fall back to VLM for poor results:

extractor_id: pipeline
config:
  stages:
    - extractor_id: ocr-rapidocr   # Fast
    - extractor_id: docling-smol   # Accurate but slower
    - extractor_id: select-smart-override
      config:
        media_type_patterns: ["image/*"]
        min_confidence_threshold: 0.7
        min_text_length: 20

If RapidOCR succeeds with high confidence, use it. Otherwise, use VLM.

High-Confidence Override

Only use the expensive extractor when the cheap one fails:

from biblicus import Corpus

corpus = Corpus.from_directory("documents")

results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "markitdown"},       # Fast
            {"extractor_id": "docling-granite"},  # Expensive
            {
                "extractor_id": "select-smart-override",
                "config": {
                    "media_type_patterns": ["application/pdf", "image/*"],
                    "min_confidence_threshold": 0.8,  # High bar
                    "min_text_length": 50
                }
            }
        ]
    }
)

Mixed Corpus with Smart Routing

Different strategies for different media types:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: ocr-rapidocr
    - extractor_id: docling-smol
    - extractor_id: select-smart-override
      config:
        media_type_patterns: ["application/pdf", "image/*"]
        min_confidence_threshold: 0.7
        min_text_length: 15

Behavior Details

Pattern Matching

Uses Python’s fnmatch for glob pattern matching:

# Specific
"application/pdf" matches PDFs only

# Wildcard
"image/*" matches all images

# Multiple types
["application/pdf", "image/*"] matches PDFs and images
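
The matching described above can be tried directly with the standard library. This is a small sketch; the helper name `matches_any` is an assumption made for illustration, and only the use of fnmatch comes from this page.

```python
from fnmatch import fnmatch
from typing import List


def matches_any(media_type: str, patterns: List[str]) -> bool:
    """Hypothetical helper: True if the media type matches at least one glob pattern."""
    return any(fnmatch(media_type, pattern) for pattern in patterns)


patterns = ["application/pdf", "image/*"]
for media_type in ["application/pdf", "image/png", "text/plain"]:
    print(media_type, matches_any(media_type, patterns))

# The default pattern matches every media type.
print(matches_any("text/plain", ["*/*"]))
```

In fnmatch, `*` matches any run of characters, including `/`, so `*/*` accepts any type/subtype pair.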

Confidence Handling

With confidence: Must meet min_confidence_threshold
No confidence: Passes confidence check (assume good)
Mixed: An earlier extraction with a confidence score can override a later one without a score

Text Length

Text length is measured after stripping whitespace:

"   Hello   " → length 5 (not 11)

Pipeline Position

Select-smart-override should be the last step in a pipeline. All extraction attempts should come before it.

Comparison with Last

The smart override compares the last extraction against all previous extractions:

Check if last is meaningful
If not, find the most recent meaningful previous extraction
Use that one if it has good confidence
Otherwise, use last anyway
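
The steps above, together with the rule for non-matching media types, can be sketched as follows. This is an illustration under stated assumptions, not the Biblicus source: the `Extraction` dataclass, `is_meaningful`, and `select` are hypothetical names, and only the `producer_extractor_id` field name and the config defaults appear on this page.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import List, Optional


@dataclass
class Extraction:
    """Hypothetical stand-in for one stage's output; not the Biblicus type."""
    producer_extractor_id: str
    text: str
    confidence: Optional[float] = None


def is_meaningful(item: Extraction, min_conf: float, min_len: int) -> bool:
    long_enough = len(item.text.strip()) >= min_len
    confident = item.confidence is None or item.confidence >= min_conf
    return long_enough and confident


def select(
    media_type: str,
    extractions: List[Extraction],
    patterns: List[str],
    min_conf: float = 0.7,
    min_len: int = 10,
) -> Optional[Extraction]:
    if not extractions:
        return None
    last = extractions[-1]
    # Non-matching media types always take the last extraction.
    if not any(fnmatch(media_type, p) for p in patterns):
        return last
    if is_meaningful(last, min_conf, min_len):
        return last
    # Walk backwards so the most recent meaningful earlier result wins.
    for earlier in reversed(extractions[:-1]):
        if is_meaningful(earlier, min_conf, min_len):
            return earlier
    # Nothing qualified: use the last extraction anyway, even if empty.
    return last


stages = [
    Extraction("pdf-text", "Chapter 1. A digital PDF with real text.", None),
    Extraction("ocr-rapidocr", "Ch4pt3r", 0.31),
]
chosen = select("application/pdf", stages, ["application/pdf"])
print(chosen.producer_extractor_id)
```

In this sample the OCR output is too short and low in confidence, so the selector falls back to the earlier pdf-text result.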

When to Use Select-Smart-Override

Use select-smart-override when:

You want intelligent quality-based selection
Confidence scores are available
Some extractors may fail or produce poor output
Cost/speed optimization matters

Use select-override when:

Simple last-wins logic is sufficient
No quality assessment needed
All extractors are equally reliable

Use select-text when:

First success is preferred
No quality comparison needed
Speed is critical

Use select-longest-text when:

Length is the only quality metric
No confidence scores available

Best Practices

Order Extractors by Cost/Speed

Put cheaper/faster extractors first:

stages:
  - extractor_id: pdf-text         # Free, fast
  - extractor_id: ocr-rapidocr     # Free, moderate
  - extractor_id: docling-granite  # Expensive, slow
  - extractor_id: select-smart-override

Tune Thresholds

Adjust thresholds based on your needs:

# Conservative - prefer high quality
config:
  min_confidence_threshold: 0.8
  min_text_length: 50

# Aggressive - prefer fast extractors
config:
  min_confidence_threshold: 0.5
  min_text_length: 10
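
To see how the two profiles above differ, the same sample result can be checked against both. This is a rough sketch: `is_meaningful` is a hypothetical helper that mirrors the meaningful-content rule, and the sample text and confidence value are made up for illustration.

```python
from typing import Optional

# The two threshold profiles from the configuration snippets above.
PROFILES = {
    "conservative": {"min_confidence_threshold": 0.8, "min_text_length": 50},
    "aggressive": {"min_confidence_threshold": 0.5, "min_text_length": 10},
}


def is_meaningful(
    text: str,
    confidence: Optional[float],
    min_confidence_threshold: float,
    min_text_length: int,
) -> bool:
    """Hypothetical helper mirroring the meaningful-content rule."""
    if len(text.strip()) < min_text_length:
        return False
    return confidence is None or confidence >= min_confidence_threshold


# Made-up sample: 34 characters of text with a middling confidence score.
sample_text = "Invoice 2041: total due 318.50 EUR"
sample_confidence = 0.65

for name, profile in PROFILES.items():
    print(name, is_meaningful(sample_text, sample_confidence, **profile))
```

The conservative profile rejects this result on both length and confidence, while the aggressive profile accepts it.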

Use with Confidence-Producing Extractors

Smart override works best with extractors that produce confidence scores:

OCR extractors (RapidOCR, PaddleOCR-VL)
VLM extractors (potentially)

Monitor Selection Behavior

Track which extractors are being selected:

# The selected extraction retains producer_extractor_id
# Use this to analyze selection patterns
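
One way to analyze selection patterns is to tally `producer_extractor_id` across the selected extractions. The sketch below assumes each selected result is available as a dict with that key. That shape is hypothetical, so adapt the lookup to however your extraction results are actually stored.

```python
from collections import Counter
from typing import Dict, Iterable


def count_selected_producers(results: Iterable[Dict[str, str]]) -> Counter:
    """Tally which extractor produced each selected result (hypothetical record shape)."""
    return Counter(r.get("producer_extractor_id", "unknown") for r in results)


# Made-up sample records for illustration only.
sample_results = [
    {"item_id": "a", "producer_extractor_id": "pdf-text"},
    {"item_id": "b", "producer_extractor_id": "ocr-rapidocr"},
    {"item_id": "c", "producer_extractor_id": "pdf-text"},
    {"item_id": "d"},
]

counts = count_selected_producers(sample_results)
for producer_id, count in counts.most_common():
    print(f"{producer_id}: {count}")
```

A producer that is almost never selected may be a candidate for removal from the pipeline, and one that is selected far more often than expected may indicate that an earlier stage is failing.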

Always Place Last

Select-smart-override should always be the final step:

stages:
  - extractor_id: extractor-1
  - extractor_id: extractor-2
  - extractor_id: extractor-3
  - extractor_id: select-smart-override  # Always last

Use Cases

Cost Optimization

Try free methods before expensive APIs:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text           # Free
    - extractor_id: stt-openai         # Paid
    - extractor_id: select-smart-override
      config:
        media_type_patterns: ["application/pdf", "audio/*"]
        min_confidence_threshold: 0.6

Quality Fallback

Use fast extractors when they work, expensive when they don’t:

extractor_id: pipeline
config:
  stages:
    - extractor_id: ocr-rapidocr    # Fast
    - extractor_id: docling-granite # Accurate
    - extractor_id: select-smart-override
      config:
        media_type_patterns: ["image/*"]
        min_confidence_threshold: 0.75
        min_text_length: 20

Scanned Document Detection

Auto-detect scanned PDFs and apply OCR:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text         # Fails on scanned PDFs
    - extractor_id: ocr-rapidocr     # Works on scanned PDFs
    - extractor_id: select-smart-override
      config:
        media_type_patterns: ["application/pdf"]
        min_text_length: 50  # PDF text extraction should get substantial content

Production Pipeline

Robust multi-format processing:

from biblicus import Corpus

corpus = Corpus.from_directory("production-corpus")

results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pass-through-text"},
            {"extractor_id": "pdf-text"},
            {"extractor_id": "markitdown"},
            {"extractor_id": "ocr-rapidocr"},
            {"extractor_id": "docling-smol"},
            {
                "extractor_id": "select-smart-override",
                "config": {
                    "media_type_patterns": ["*/*"],
                    "min_confidence_threshold": 0.7,
                    "min_text_length": 15
                }
            }
        ]
    }
)

Tuning Guidelines

Confidence Threshold

0.5-0.6: Permissive (accept most results)
0.7: Balanced (default)
0.8-0.9: Strict (only high-confidence)

Text Length

5-10: Short snippets acceptable
10-20: Moderate content required
50+: Substantial content required

Combined Tuning

# For screenshots/images (often short text)
config:
  min_confidence_threshold: 0.7
  min_text_length: 5

# For documents (expect longer text)
config:
  min_confidence_threshold: 0.7
  min_text_length: 50

Comparison with Other Selectors

Feature	select-smart-override	select-override	select-text	select-longest-text
Selection	Intelligent	Last for pattern	First usable	Longest
Media type aware	✅	✅	❌	❌
Confidence aware	✅	❌	❌	❌
Length aware	✅	❌	❌	✅
Complexity	High	Low	Low	Low
Best for	Production	Simple override	Fast fallback	Quality comparison
