Select Override Extractor

Extractor ID: select-override

Overview

The select-override extractor implements simple media type-based routing by always using the last extraction for matching items. It provides basic override logic where specific media types get special handling while others follow default behavior.

This extractor is useful when you want to override extraction results for specific media types, such as always using OCR output for images or VLM output for PDFs, regardless of what other extractors produced.

Installation

No additional dependencies required. This extractor is part of the core Biblicus installation.

pip install biblicus

Supported Media Types

All media types are supported. Selection behavior depends on configured media type patterns.

Configuration

Config Schema

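from typing import List
from pydantic import BaseModel

class SelectOverrideConfig(BaseModel):
    media_type_patterns: List[str] = ["*/*"]  # glob patterns for media types to override
    fallback_to_first: bool = False           # use the first extraction for non-matching types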

Configuration Options

Option	Type	Default	Description
`media_type_patterns`	list[str]	`["*/*"]`	Glob patterns for media types to override
`fallback_to_first`	bool	`false`	If true, use first extraction for non-matching types

Pattern Matching

Patterns use standard glob syntax:

*/* - Matches all media types
image/* - Matches all image types
application/pdf - Matches only PDF
audio/* - Matches all audio types

Selection Rules

The extractor selects text using the following rules:

Pattern match: If the item's media type matches any pattern, use the last extraction
No match + fallback_to_first=true: Use the first extraction
No match + fallback_to_first=false: Use the last extraction
No extractions: Return None
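
The four rules above can be sketched as a small standalone function. This is an illustrative sketch, not the Biblicus implementation: the function name `select_override` and its signature are hypothetical, and it assumes the extractions for one item arrive as an ordered list of text strings.

```python
from fnmatch import fnmatchcase
from typing import Optional, Sequence


def select_override(
    media_type: str,
    extractions: Sequence[str],
    media_type_patterns: Sequence[str] = ("*/*",),
    fallback_to_first: bool = False,
) -> Optional[str]:
    """Hypothetical helper: pick one extraction using the documented rules."""
    # No extractions: there is nothing to select.
    if not extractions:
        return None
    # Pattern match: a matching media type always takes the last extraction.
    if any(fnmatchcase(media_type, pattern) for pattern in media_type_patterns):
        return extractions[-1]
    # No match + fallback_to_first=true: take the first extraction.
    if fallback_to_first:
        return extractions[0]
    # No match + fallback_to_first=false: the last extraction still wins.
    return extractions[-1]


texts = ["first extraction", "last extraction"]
print(select_override("image/png", texts, ["image/*"], fallback_to_first=True))
print(select_override("application/pdf", texts, ["image/*"], fallback_to_first=True))
print(select_override("application/pdf", texts, ["image/*"]))
print(select_override("image/png", [], ["image/*"]))
```

With `fallback_to_first` left at its default, the pattern list has no visible effect on the result, because both the matching and non-matching branches return the last extraction.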

Usage

Command Line

Select-override is always used within a pipeline:

biblicus extract my-corpus --extractor pipeline \
  --config 'stages=[{"extractor_id":"pdf-text"},{"extractor_id":"ocr-rapidocr"},{"extractor_id":"select-override","config":{"media_type_patterns":["image/*"]}}]'

Configuration File

extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text
    - extractor_id: ocr-rapidocr
    - extractor_id: select-override
      config:
        media_type_patterns: ["image/*"]
        fallback_to_first: true

biblicus extract my-corpus --configuration configuration.yml

Python API

from biblicus import Corpus

corpus = Corpus.from_directory("my-corpus")

results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pdf-text"},
            {"extractor_id": "ocr-rapidocr"},
            {
                "extractor_id": "select-override",
                "config": {
                    "media_type_patterns": ["image/*"],
                    "fallback_to_first": True
                }
            }
        ]
    }
)

Examples

Override Images Only

Use OCR for images, text extraction for everything else:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text
    - extractor_id: ocr-rapidocr
    - extractor_id: select-override
      config:
        media_type_patterns: ["image/*"]
        fallback_to_first: true

For images: Uses ocr-rapidocr (last)
For PDFs: Uses pdf-text (first, due to fallback_to_first=true)

Override PDFs

Use VLM for PDFs, basic extraction for everything else:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: docling-smol
    - extractor_id: select-override
      config:
        media_type_patterns: ["application/pdf"]
        fallback_to_first: true

For PDFs: Uses docling-smol (last matching)
For text files: Uses pass-through-text (first, due to fallback)

Override Multiple Types

Override several media types at once. Every matching type uses the last extraction:

from biblicus import Corpus

corpus = Corpus.from_directory("mixed-corpus")

results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pass-through-text"},
            {"extractor_id": "pdf-text"},
            {"extractor_id": "markitdown"},
            {"extractor_id": "ocr-rapidocr"},
            {
                "extractor_id": "select-override",
                "config": {
                    "media_type_patterns": ["image/*", "application/pdf"],
                    "fallback_to_first": True
                }
            }
        ]
    }
)

Always Use Last

Default behavior (no fallback):

extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text
    - extractor_id: docling-smol
    - extractor_id: select-override  # Uses last for all types

Behavior Details

Pattern Matching

Uses Python’s fnmatch for glob pattern matching:

# Exact match
"application/pdf" matches "application/pdf" only

# Wildcard
"image/*" matches "image/png", "image/jpeg", etc.

# Universal
"*/*" matches all media types

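The three pattern forms can be checked directly with the standard library. The sketch below assumes case-sensitive matching and therefore uses `fnmatch.fnmatchcase`; whether the extractor calls `fnmatch.fnmatch` or `fnmatchcase` internally is not stated here (plain `fnmatch.fnmatch` additionally normalizes case on Windows).

```python
from fnmatch import fnmatchcase

# (pattern, media type) pairs covering the exact, wildcard, and universal forms.
cases = [
    ("application/pdf", "application/pdf"),
    ("application/pdf", "application/json"),
    ("image/*", "image/png"),
    ("image/*", "image/jpeg"),
    ("image/*", "application/pdf"),
    ("*/*", "text/plain"),
    ("*", "text/plain"),
]

# fnmatchcase takes the name first and the pattern second.
results = {case: fnmatchcase(case[1], case[0]) for case in cases}

for (pattern, media_type), matched in results.items():
    print(f"{pattern!r} vs {media_type!r}: {matched}")
```

Note that `*` in `fnmatch` is not path-aware and also matches `/`, so a bare `*` behaves like `*/*` for media types.
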
Last Wins

For matching types, the last extraction is always used, regardless of whether earlier extractions exist or are non-empty.

Fallback Behavior

When fallback_to_first=true and media type doesn’t match:

Use first extraction instead of last
Useful for preferring fast extractors for non-override types

When fallback_to_first=false (default):

Use last extraction for everything
Simpler logic, fewer cases

Pipeline Position

Select-override should be the last step in a pipeline. All extraction attempts should come before it.
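
A quick pre-flight check can catch a misplaced selector before a run. The helper below is a hypothetical sketch, not a Biblicus function; it only assumes the pipeline config is a dict with a `stages` list of `{"extractor_id": ...}` entries, as in the examples on this page.

```python
from typing import Any, Dict, List

SELECTOR_ID = "select-override"


def selector_position_problems(pipeline_config: Dict[str, Any]) -> List[str]:
    """Hypothetical check: return problems found; an empty list means none."""
    stage_ids = [
        stage.get("extractor_id") for stage in pipeline_config.get("stages", [])
    ]
    problems: List[str] = []
    if SELECTOR_ID not in stage_ids:
        problems.append(f"{SELECTOR_ID} is not in the pipeline")
    elif stage_ids[-1] != SELECTOR_ID:
        position = stage_ids.index(SELECTOR_ID) + 1
        problems.append(
            f"{SELECTOR_ID} is stage {position} of {len(stage_ids)}; "
            "it should be the final stage"
        )
    return problems


good = {
    "stages": [
        {"extractor_id": "pdf-text"},
        {"extractor_id": "ocr-rapidocr"},
        {"extractor_id": "select-override"},
    ]
}
misplaced = {
    "stages": [
        {"extractor_id": "pdf-text"},
        {"extractor_id": "select-override"},
        {"extractor_id": "ocr-rapidocr"},
    ]
}

print(selector_position_problems(good))
print(selector_position_problems(misplaced))
```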

When to Use Select-Override

Use select-override when:

You want simple media type-based routing
Last extractor should always win for specific types
You need basic override logic
Simplicity is important

Use select-smart-override when:

You need confidence-based selection
Intelligent fallback is desired
Quality metrics matter

Use select-text when:

Order matters (fast first)
First success is preferred
No media type routing needed

Use select-longest-text when:

Longest output is preferred
No routing needed

Best Practices

Place Override Extractors Last

Put the extractor you want to use for overrides at the end:

stages:
  - extractor_id: pdf-text        # Default for PDFs
  - extractor_id: docling-smol    # Override for PDFs
  - extractor_id: select-override
    config:
      media_type_patterns: ["application/pdf"]

Use Fallback for Efficiency

Enable fallback to prefer fast extractors for non-override types:

config:
  media_type_patterns: ["image/*"]
  fallback_to_first: true  # Use fast extractors for non-images

Be Specific with Patterns

Use specific patterns to avoid unintended matches:

# Good - specific
media_type_patterns: ["image/png", "image/jpeg"]

# Careful - broad
media_type_patterns: ["image/*"]

# Very broad
media_type_patterns: ["*/*"]
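
To see how much a pattern list would override before committing to it, the patterns can be tried against the media types present in a corpus. This is a sketch under assumptions: `overridden_types` is a hypothetical helper, the media type list is made up for illustration, and matching uses `fnmatch.fnmatchcase`.

```python
from fnmatch import fnmatchcase
from typing import List, Sequence


def overridden_types(media_types: Sequence[str], patterns: Sequence[str]) -> List[str]:
    """Hypothetical helper: list the media types that any pattern would match."""
    return [
        media_type
        for media_type in media_types
        if any(fnmatchcase(media_type, pattern) for pattern in patterns)
    ]


# Illustrative inventory of media types, not taken from a real corpus.
corpus_types = [
    "image/png",
    "image/jpeg",
    "image/svg+xml",
    "application/pdf",
    "text/plain",
]

print(overridden_types(corpus_types, ["image/png", "image/jpeg"]))
print(overridden_types(corpus_types, ["image/*"]))
print(overridden_types(corpus_types, ["*/*"]))
```

In this inventory the broad `image/*` pattern also captures `image/svg+xml`, which the two specific patterns leave alone.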

Always Place Last

Select-override should always be the final step:

stages:
  - extractor-1
  - extractor-2
  - extractor-3
  - select-override  # Always last

Use Cases

Image-Specific Processing

Use advanced OCR for images, basic extraction for documents:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: ocr-paddleocr-vl  # For images
    - extractor_id: select-override
      config:
        media_type_patterns: ["image/*"]
        fallback_to_first: true

PDF Override

Use VLM for PDFs, simpler extractors for other types:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: markitdown
    - extractor_id: docling-smol      # For PDFs
    - extractor_id: select-override
      config:
        media_type_patterns: ["application/pdf"]
        fallback_to_first: true

Multi-Type Override

Override multiple specific types:

from biblicus import Corpus

corpus = Corpus.from_directory("corpus")

results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pass-through-text"},
            {"extractor_id": "pdf-text"},
            {"extractor_id": "ocr-rapidocr"},
            {
                "extractor_id": "select-override",
                "config": {
                    "media_type_patterns": ["image/*", "application/pdf"],
                    "fallback_to_first": True
                }
            }
        ]
    }
)

Comparison with Other Selectors

Feature	select-override	select-text	select-longest-text	select-smart-override
Selection	Last for pattern	First usable	Longest	Intelligent
Media type aware	✅	❌	❌	✅
Confidence aware	❌	❌	❌	✅
Quality aware	❌	❌	✅	✅
Complexity	Simple	Simple	Simple	Complex
Override control	Last only	None	None	Configurable
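
The "Selection" row can be illustrated by applying three simple strategies to the same list of extractions. The functions below are hypothetical stand-ins written for this comparison, not the selectors' actual code; in particular, treating "usable" as "non-blank" is an assumption, and select-smart-override is left out because its logic is not described here.

```python
from typing import Optional, Sequence


def pick_last(texts: Sequence[str]) -> Optional[str]:
    """Last extraction, as select-override does for a matching media type."""
    return texts[-1] if texts else None


def pick_first_usable(texts: Sequence[str]) -> Optional[str]:
    """First non-blank extraction (assumed meaning of 'first usable')."""
    for text in texts:
        if text.strip():
            return text
    return None


def pick_longest(texts: Sequence[str]) -> Optional[str]:
    """Longest extraction by character count."""
    return max(texts, key=len) if texts else None


texts = ["", "short text", "a much longer block of extracted text", "tail"]

print(pick_last(texts))
print(pick_first_usable(texts))
print(pick_longest(texts))
```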
