Select Override Extractor

Extractor ID: select-override

Category: Pipeline Utilities

Overview

The select-override extractor implements simple media type-based routing by always using the last extraction for matching items. It provides basic override logic where specific media types get special handling while others follow default behavior.

This extractor is useful when you want to override extraction results for specific media types, such as always using OCR output for images or VLM output for PDFs, regardless of what other extractors produced.

Installation

No additional dependencies required. This extractor is part of the core Biblicus installation.

pip install biblicus

Supported Media Types

All media types are supported. Selection behavior depends on configured media type patterns.

Configuration

Config Schema

from typing import List
from pydantic import BaseModel

class SelectOverrideConfig(BaseModel):
    media_type_patterns: List[str] = ["*/*"]
    fallback_to_first: bool = False

Configuration Options

| Option | Type | Default | Description |
|---|---|---|---|
| media_type_patterns | list[str] | ["*/*"] | Glob patterns for media types to override |
| fallback_to_first | bool | false | If true, use first extraction for non-matching types |

Pattern Matching

Patterns use standard glob syntax:

  • */* - Matches all media types

  • image/* - Matches all image types

  • application/pdf - Matches only PDF

  • audio/* - Matches all audio types
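The patterns above can be tried out with Python's standard `fnmatch` module. This is a minimal sketch; the helper name `matches_any` is hypothetical and not part of Biblicus.

```python
from fnmatch import fnmatchcase

def matches_any(media_type: str, patterns: list[str]) -> bool:
    """Return True if media_type matches at least one glob pattern."""
    return any(fnmatchcase(media_type, pattern) for pattern in patterns)

print(matches_any("image/png", ["image/*"]))                   # True
print(matches_any("application/pdf", ["image/*", "audio/*"]))  # False
print(matches_any("text/plain", ["*/*"]))                      # True
```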

Selection Rules

The extractor selects text using the following rules:

  1. Pattern match: If the item's media type matches any pattern, use the last extraction

  2. No match + fallback_to_first=true: Use first extraction

  3. No match + fallback_to_first=false: Use last extraction

  4. No extractions: Return None
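The four rules can be written as one small function. This is an illustrative sketch of the logic described here, not the Biblicus source; the function name, its signature, and the representation of extractions as a list of strings in stage order are assumptions.

```python
from fnmatch import fnmatchcase
from typing import Optional, Sequence

def select_override(
    media_type: str,
    extractions: Sequence[str],
    media_type_patterns: Sequence[str] = ("*/*",),
    fallback_to_first: bool = False,
) -> Optional[str]:
    """Pick one extraction according to the four selection rules."""
    # Rule 4: nothing was extracted.
    if not extractions:
        return None
    # Rule 1: the media type matches a pattern, so the last extraction wins.
    if any(fnmatchcase(media_type, p) for p in media_type_patterns):
        return extractions[-1]
    # Rule 2: no match, but fallback is enabled, so use the first extraction.
    if fallback_to_first:
        return extractions[0]
    # Rule 3: no match and no fallback, so the last extraction is used anyway.
    return extractions[-1]

outputs = ["text from first stage", "text from last stage"]
print(select_override("image/png", outputs, ["image/*"], fallback_to_first=True))
print(select_override("application/pdf", outputs, ["image/*"], fallback_to_first=True))
```

Rules 1 and 3 return the same extraction, so `media_type_patterns` only changes the outcome when `fallback_to_first` is true.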

Usage

Command Line

Select-override is always used within a pipeline:

biblicus extract my-corpus --extractor pipeline \
  --config 'stages=[{"extractor_id":"pdf-text"},{"extractor_id":"ocr-rapidocr"},{"extractor_id":"select-override","config":{"media_type_patterns":["image/*"]}}]'
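Writing the inline JSON by hand invites quoting mistakes. One option is to generate the `--config` value from a Python structure. This sketch only builds and prints the command shown above; it does not run it, and the variable names are arbitrary.

```python
import json
import shlex

stages = [
    {"extractor_id": "pdf-text"},
    {"extractor_id": "ocr-rapidocr"},
    {"extractor_id": "select-override", "config": {"media_type_patterns": ["image/*"]}},
]

# Compact separators reproduce the whitespace-free JSON used in the command above.
config_arg = "stages=" + json.dumps(stages, separators=(",", ":"))

command = ["biblicus", "extract", "my-corpus", "--extractor", "pipeline", "--config", config_arg]

# shlex.join quotes each argument so the line can be pasted into a POSIX shell.
print(shlex.join(command))
```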

Configuration File

extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text
    - extractor_id: ocr-rapidocr
    - extractor_id: select-override
      config:
        media_type_patterns: ["image/*"]
        fallback_to_first: true

Save the configuration as configuration.yml, then run:

biblicus extract my-corpus --configuration configuration.yml

Python API

from biblicus import Corpus

corpus = Corpus.from_directory("my-corpus")

results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pdf-text"},
            {"extractor_id": "ocr-rapidocr"},
            {
                "extractor_id": "select-override",
                "config": {
                    "media_type_patterns": ["image/*"],
                    "fallback_to_first": True
                }
            }
        ]
    }
)

Examples

Override Images Only

Use OCR for images, text extraction for everything else:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text
    - extractor_id: ocr-rapidocr
    - extractor_id: select-override
      config:
        media_type_patterns: ["image/*"]
        fallback_to_first: true

For images: Uses ocr-rapidocr (last)

For PDFs: Uses pdf-text (first, due to fallback_to_first=true)

Override PDFs

Use VLM for PDFs, basic extraction for everything else:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: docling-smol
    - extractor_id: select-override
      config:
        media_type_patterns: ["application/pdf"]
        fallback_to_first: true

For PDFs: Uses docling-smol (last)

For text files: Uses pass-through-text (first, due to fallback_to_first=true)
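To trace which stage wins for a given media type in the configuration above, the selection can be simulated over the stage names. This sketch assumes every stage produced an extraction for every item, which may not hold in a real pipeline; `chosen_stage` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# Stages that run before select-override in the example configuration.
STAGES = ["pass-through-text", "pdf-text", "docling-smol"]
PATTERNS = ["application/pdf"]

def chosen_stage(media_type, stages=STAGES, patterns=PATTERNS, fallback_to_first=True):
    """Name the stage whose output is selected (assumes all stages produced output)."""
    matched = any(fnmatchcase(media_type, p) for p in patterns)
    if matched or not fallback_to_first:
        return stages[-1]
    return stages[0]

for media_type in ("application/pdf", "text/plain", "image/png"):
    print(f"{media_type:16} -> {chosen_stage(media_type)}")
```

Every non-PDF type, not only text files, falls back to the first stage in this configuration.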

Override Multiple Types

Override specific types with different extractors:

from biblicus import Corpus

corpus = Corpus.from_directory("mixed-corpus")

results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pass-through-text"},
            {"extractor_id": "pdf-text"},
            {"extractor_id": "markitdown"},
            {"extractor_id": "ocr-rapidocr"},
            {
                "extractor_id": "select-override",
                "config": {
                    "media_type_patterns": ["image/*", "application/pdf"],
                    "fallback_to_first": True
                }
            }
        ]
    }
)
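When several pipelines differ only in their stage list and patterns, the configuration dictionary can be generated. The helper below is a hypothetical convenience, not a Biblicus function; it produces the same structure as the `config` argument in the example above.

```python
from typing import Any, Iterable

def build_override_pipeline(
    extractor_ids: Iterable[str],
    media_type_patterns: Iterable[str],
    fallback_to_first: bool = False,
) -> dict[str, Any]:
    """Build a pipeline config with select-override appended as the final stage."""
    stages: list[dict[str, Any]] = [{"extractor_id": eid} for eid in extractor_ids]
    stages.append(
        {
            "extractor_id": "select-override",
            "config": {
                "media_type_patterns": list(media_type_patterns),
                "fallback_to_first": fallback_to_first,
            },
        }
    )
    return {"stages": stages}

config = build_override_pipeline(
    ["pass-through-text", "pdf-text", "markitdown", "ocr-rapidocr"],
    ["image/*", "application/pdf"],
    fallback_to_first=True,
)
print(config["stages"][-1])
```

Appending the selector inside the helper keeps it in the final position regardless of how the stage list is assembled.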

Always Use Last

Default behavior (no fallback):

extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text
    - extractor_id: docling-smol
    - extractor_id: select-override  # Uses last for all types

Behavior Details

Pattern Matching

Uses Python’s fnmatch for glob pattern matching:

# Exact match
"application/pdf" matches "application/pdf" only

# Wildcard
"image/*" matches "image/png", "image/jpeg", etc.

# Universal
"*/*" matches all media types
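A few `fnmatch` behaviors are worth knowing when writing patterns. The sketch below uses the standard library directly. Whether Biblicus calls `fnmatch` or `fnmatchcase`, and whether it normalizes media types before matching, is not stated here, so treat these results as properties of the Python module only.

```python
from fnmatch import fnmatchcase

# "*" is not limited to one segment, so it can cross the "/" separator.
print(fnmatchcase("image/png", "image*"))                      # True

# A string without a slash does not match "*/*".
print(fnmatchcase("image", "*/*"))                             # False

# fnmatchcase compares case-sensitively on every platform.
print(fnmatchcase("IMAGE/PNG", "image/*"))                     # False

# Parameters after the subtype defeat an exact pattern but not a wildcard.
print(fnmatchcase("text/plain; charset=utf-8", "text/plain"))  # False
print(fnmatchcase("text/plain; charset=utf-8", "text/*"))      # True
```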

Last Wins

For matching types, the last extraction is always used, regardless of whether earlier extractions exist or are non-empty.

Fallback Behavior

When fallback_to_first=true and the media type doesn’t match any pattern:

  • Use first extraction instead of last

  • Useful for preferring fast extractors for non-override types

When fallback_to_first=false (default):

  • Use last extraction for everything

  • Simpler logic, fewer cases

Pipeline Position

Select-override should be the last step in a pipeline. All extraction attempts should come before it.
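A configuration can be checked for this before it is run. The function below is a hypothetical pre-flight check over the plain configuration dictionary; it is not something Biblicus is known to provide.

```python
def select_override_is_last(config: dict) -> bool:
    """Return True if the pipeline has exactly one select-override stage, in final position."""
    ids = [stage.get("extractor_id") for stage in config.get("stages", [])]
    return bool(ids) and ids[-1] == "select-override" and ids.count("select-override") == 1

good = {"stages": [{"extractor_id": "pdf-text"}, {"extractor_id": "select-override"}]}
bad = {"stages": [{"extractor_id": "select-override"}, {"extractor_id": "pdf-text"}]}

print(select_override_is_last(good))  # True
print(select_override_is_last(bad))   # False
```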

When to Use Select-Override

Use select-override when:

  • You want simple media type-based routing

  • Last extractor should always win for specific types

  • You need basic override logic

  • Simplicity is important

Use select-smart-override when:

  • You need confidence-based selection

  • Intelligent fallback is desired

  • Quality metrics matter

Use select-text when:

  • Order matters (fast first)

  • First success is preferred

  • No media type routing needed

Use select-longest-text when:

  • Longest output is preferred

  • No routing needed

Best Practices

Place Override Extractors Last

Put the extractor you want to use for overrides at the end:

stages:
  - extractor_id: pdf-text        # Default for PDFs
  - extractor_id: docling-smol    # Override for PDFs
  - extractor_id: select-override
    config:
      media_type_patterns: ["application/pdf"]

Use Fallback for Efficiency

Enable fallback to prefer fast extractors for non-override types:

config:
  media_type_patterns: ["image/*"]
  fallback_to_first: true  # Use fast extractors for non-images

Be Specific with Patterns

Use specific patterns to avoid unintended matches:

# Good - specific
media_type_patterns: ["image/png", "image/jpeg"]

# Careful - broad
media_type_patterns: ["image/*"]

# Very broad
media_type_patterns: ["*/*"]
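To see how far a pattern reaches, it can be tested against the media types actually present in a corpus. The list of types below is made up for illustration, and `overridden_types` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

def overridden_types(media_types, patterns):
    """Return the media types that match at least one override pattern."""
    return [mt for mt in media_types if any(fnmatchcase(mt, p) for p in patterns)]

# Illustrative sample; substitute the media types found in your own corpus.
corpus_types = ["image/png", "image/jpeg", "image/svg+xml", "application/pdf", "text/plain"]

for patterns in (["image/png", "image/jpeg"], ["image/*"], ["*/*"]):
    print(patterns, "->", overridden_types(corpus_types, patterns))
```

In this sample, "image/*" also captures image/svg+xml, which is usually text-based markup that an OCR-oriented override may handle poorly.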

Always Place Last

Select-override should always be the final step:

stages:
  - extractor_id: extractor-1
  - extractor_id: extractor-2
  - extractor_id: extractor-3
  - extractor_id: select-override  # Always last

Use Cases

Image-Specific Processing

Use advanced OCR for images, basic extraction for documents:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: ocr-paddleocr-vl  # For images
    - extractor_id: select-override
      config:
        media_type_patterns: ["image/*"]
        fallback_to_first: true

PDF Override

Use VLM for PDFs, simpler extractors for other types:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: markitdown
    - extractor_id: docling-smol      # For PDFs
    - extractor_id: select-override
      config:
        media_type_patterns: ["application/pdf"]
        fallback_to_first: true

Multi-Type Override

Override multiple specific types:

from biblicus import Corpus

corpus = Corpus.from_directory("corpus")

results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pass-through-text"},
            {"extractor_id": "pdf-text"},
            {"extractor_id": "ocr-rapidocr"},
            {
                "extractor_id": "select-override",
                "config": {
                    "media_type_patterns": ["image/*", "application/pdf"],
                    "fallback_to_first": True
                }
            }
        ]
    }
)

Comparison with Other Selectors

| Feature | select-override | select-text | select-longest-text | select-smart-override |
|---|---|---|---|---|
| Selection | Last for pattern | First usable | Longest | Intelligent |
| Media type aware | | | | |
| Confidence aware | | | | |
| Quality aware | | | | |
| Complexity | Simple | Simple | Simple | Complex |
| Override control | Last only | None | None | Configurable |

See Also