Select Text Extractor

Extractor ID: select-text

Overview

The select-text extractor chooses the first usable extracted text from previous pipeline stages. It implements a simple, deterministic selection policy for fallback chains where multiple extractors may produce results for the same item.

This extractor is fundamental to pipeline composition, enabling graceful fallback patterns where you try fast extractors first and fall back to more powerful (but slower) alternatives.

Installation

No additional dependencies required. This extractor is part of the core Biblicus installation.

pip install biblicus

Supported Media Types

All media types are supported. This extractor operates on previous extraction results, not the raw item.

Configuration

Config Schema

from pydantic import BaseModel


class SelectTextExtractorConfig(BaseModel):
    # This extractor requires no configuration
    pass

Configuration Options

This extractor is intentionally minimal and accepts no configuration options.

Selection Rules

The extractor applies the following rules in order, and the first one that matches decides the result:

Usable text: Select the first extraction whose text is non-empty after stripping whitespace
Any text: If no extraction has usable text, select the first extraction even if its text is empty
No extractions: If no previous extractions exist, return None

This ensures deterministic behavior: given the same pipeline order and inputs, the same extraction is always selected.
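
The rules above can be sketched in a few lines of Python. This is a minimal illustration, not the Biblicus implementation: the Extraction class and the select_first_usable function are hypothetical names, and the stage indexes are illustrative values only.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Extraction:
    """Hypothetical stand-in for the output of one earlier pipeline stage."""

    producer_extractor_id: str
    source_stage_index: int  # illustrative numbering, assumed for this sketch
    text: str


def select_first_usable(extractions: Sequence[Extraction]) -> Optional[Extraction]:
    """Apply the three selection rules to extractions given in pipeline order."""
    # Usable text: first extraction that is non-empty after stripping whitespace.
    for extraction in extractions:
        if extraction.text.strip():
            return extraction
    # Any text: nothing usable, so fall back to the first extraction, even if empty.
    if extractions:
        return extractions[0]
    # No extractions: there is nothing to select.
    return None


candidates = [
    Extraction("pdf-text", 0, "   \n"),  # whitespace only, so not usable
    Extraction("ocr-rapidocr", 1, "Scanned page text"),
]
selected = select_first_usable(candidates)
```

Here selected is the ocr-rapidocr extraction, because the pdf-text result contains only whitespace. Because the function scans in pipeline order and never compares text length or quality, reordering the stages is the only way to change which result wins.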

Usage

Command Line

Select-text is always used within a pipeline:

biblicus extract my-corpus --extractor pipeline \
  --config 'stages=[{"extractor_id":"pdf-text"},{"extractor_id":"ocr-rapidocr"},{"extractor_id":"select-text"}]'

Configuration File

extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text
    - extractor_id: ocr-rapidocr
    - extractor_id: select-text  # Select first usable result

biblicus extract my-corpus --configuration configuration.yml

Python API

from biblicus import Corpus

corpus = Corpus.from_directory("my-corpus")

results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pdf-text"},
            {"extractor_id": "ocr-rapidocr"},
            {"extractor_id": "select-text"}
        ]
    }
)
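
When many pipelines share the same shape, the config dictionary can be generated rather than written by hand. The helper below is a hypothetical convenience (build_fallback_config is not part of Biblicus); it only builds the same dictionary structure shown above and appends select-text as the final stage.

```python
from typing import Any, Dict, List, Sequence


def build_fallback_config(extractor_ids: Sequence[str]) -> Dict[str, Any]:
    """Return a pipeline config: the given extractors in order, then select-text."""
    if not extractor_ids:
        raise ValueError("at least one extractor is needed before select-text")
    stages: List[Dict[str, str]] = [{"extractor_id": eid} for eid in extractor_ids]
    # The selector goes last so it can see every earlier stage's result.
    stages.append({"extractor_id": "select-text"})
    return {"stages": stages}


config = build_fallback_config(["pdf-text", "ocr-rapidocr"])
```

The resulting config has the same structure as the dictionary passed as config= in the call above.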

Examples

Fast-to-Slow Fallback

Try fast extraction first, fall back to slower methods:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text  # Fastest
    - extractor_id: pdf-text           # Fast
    - extractor_id: ocr-rapidocr       # Moderate
    - extractor_id: docling-smol       # Slower
    - extractor_id: select-text        # Select first usable result

Text-First Strategy

Use only text-based extractors, with no OCR stage in the chain:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: select-text  # First usable text result; no OCR stage runs

Multi-Format Corpus

Handle diverse document types:

from biblicus import Corpus

corpus = Corpus.from_directory("mixed-corpus")

results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pass-through-text"},
            {"extractor_id": "pdf-text"},
            {"extractor_id": "markitdown"},
            {"extractor_id": "unstructured"},
            {"extractor_id": "select-text"}
        ]
    }
)

OCR Fallback Chain

Try multiple OCR approaches:

extractor_id: pipeline
config:
  stages:
    - extractor_id: ocr-rapidocr
    - extractor_id: ocr-paddleocr-vl
    - extractor_id: docling-smol
    - extractor_id: select-text

Behavior Details

Pipeline Position

Select-text should be the last stage in a pipeline. It operates only on results from earlier stages, so every extraction attempt must come before it.

Empty Results

If all previous extractors produce empty text, select-text returns the first empty result (not None). This distinguishes “processed but empty” from “not processed.”
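
Downstream code can use this distinction to decide what to retry. The sketch below assumes the caller ends up with either None or a text string for each item; describe_selection is a hypothetical helper, not a Biblicus function.

```python
from typing import Optional


def describe_selection(selected_text: Optional[str]) -> str:
    """Classify a selection result for reporting or retry logic."""
    if selected_text is None:
        # No earlier stage produced an extraction for this item.
        return "not processed"
    if not selected_text.strip():
        # At least one stage ran, but every result was empty or whitespace.
        return "processed but empty"
    return "has text"


statuses = [describe_selection(value) for value in (None, "  ", "Chapter 1")]
```

Items reported as "processed but empty" may be candidates for a heavier extractor, while "not processed" points at a pipeline or media type problem.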

Source Tracking

The selected extraction retains its producer_extractor_id and source_stage_index, allowing you to identify which extractor produced the final text.
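
One use of these fields is to measure how often each stage in a fallback chain actually supplies the final text. The sketch below assumes each selected result exposes producer_extractor_id as an attribute; the SelectedText class and the sample data are hypothetical stand-ins, not Biblicus types.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class SelectedText:
    """Hypothetical stand-in for a selected extraction."""

    producer_extractor_id: str
    source_stage_index: int  # illustrative numbering, assumed for this sketch
    text: str


def count_producers(selections: Iterable[Optional[SelectedText]]) -> Counter:
    """Count how many items each extractor supplied, skipping unprocessed items."""
    return Counter(s.producer_extractor_id for s in selections if s is not None)


selections = [
    SelectedText("pdf-text", 0, "Born-digital page"),
    SelectedText("ocr-rapidocr", 1, "Scanned page"),
    SelectedText("pdf-text", 0, "Another born-digital page"),
    None,  # an item with no previous extractions
]
producer_counts = count_producers(selections)
```

A tally like this shows whether the expensive stages at the end of the chain are earning their cost.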

Determinism

Given the same pipeline configuration and inputs, select-text always produces the same result. This makes it suitable for reproducible research and testing.

When to Use Select-Text

Use select-text when:

You want the first successful extraction
Order matters (try cheap extractors first)
You need deterministic, predictable selection
Simplicity is preferred

Use select-longest-text when:

Multiple extractors may succeed
You want the most complete output
Order doesn’t matter

Use select-override when:

You want to override specific media types
Last extractor should win for certain items

Use select-smart-override when:

You need intelligent routing by media type
Quality metrics (confidence, length) matter
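
The difference between the first two policies is easiest to see on the same inputs. The sketch below is illustrative only: first_usable_text follows the selection rules documented on this page, while longest_text is an assumption about what a longest-text policy does, inferred from the selector's name rather than from its implementation.

```python
from typing import Optional, Sequence


def first_usable_text(texts: Sequence[str]) -> Optional[str]:
    """First non-blank text in pipeline order, else the first text, else None."""
    for text in texts:
        if text.strip():
            return text
    return texts[0] if texts else None


def longest_text(texts: Sequence[str]) -> Optional[str]:
    """Assumed longest-text policy: pick the text with the most stripped characters."""
    if not texts:
        return None
    return max(texts, key=lambda text: len(text.strip()))


texts = ["Title only", "Title only, followed by the full body text"]
first_choice = first_usable_text(texts)
longest_choice = longest_text(texts)
```

With these inputs the first-usable policy keeps the short result from the earlier stage, while the longest policy prefers the fuller result from the later stage.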

Best Practices

Order Extractors by Speed

Put faster extractors first:

stages:
  - extractor_id: pass-through-text  # Instant
  - extractor_id: pdf-text           # Fast
  - extractor_id: markitdown         # Moderate
  - extractor_id: docling-smol       # Slow
  - extractor_id: select-text

Order Extractors by Accuracy

Or prioritize accuracy:

stages:
  - extractor_id: docling-granite    # Best accuracy
  - extractor_id: docling-smol       # Good accuracy
  - extractor_id: ocr-rapidocr       # Basic OCR
  - extractor_id: select-text

Always Place Last

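Select-text should always be the final stage (the extractor-N names below are placeholders):

stages:
  - extractor_id: extractor-1
  - extractor_id: extractor-2
  - extractor_id: extractor-3
  - extractor_id: select-text  # Always last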
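
A small check can catch a misplaced selector before a long extraction run starts. The function below is a hypothetical helper, not part of Biblicus; it assumes stages are given as a list of mappings with an extractor_id key, as in the Python API example on this page.

```python
from typing import Mapping, Sequence

SELECTOR_ID = "select-text"


def check_select_text_is_last(stages: Sequence[Mapping[str, object]]) -> None:
    """Raise ValueError unless select-text is present and is the final stage."""
    ids = [stage.get("extractor_id") for stage in stages]
    if SELECTOR_ID not in ids:
        raise ValueError("pipeline has no select-text stage")
    if ids[-1] != SELECTOR_ID:
        raise ValueError("select-text must be the final stage")


good_stages = [
    {"extractor_id": "pdf-text"},
    {"extractor_id": "ocr-rapidocr"},
    {"extractor_id": "select-text"},
]
check_select_text_is_last(good_stages)  # passes silently
```

Running the check when the configuration is loaded keeps the failure close to its cause instead of surfacing later as unexpectedly missing text.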

Combine with Media Type Routing

Use select-text within smart routing:

extractor_id: select-smart-override
config:
  default_extractor: pipeline
  default_config:
    stages:
      - extractor_id: pass-through-text
      - extractor_id: select-text
  overrides:
    - media_type_pattern: "application/pdf"
      extractor: pipeline
      config:
        stages:
          - extractor_id: pdf-text
          - extractor_id: docling-smol
          - extractor_id: select-text

Use Cases

Heterogeneous Corpus

Process corpora with mixed document types:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: markitdown
    - extractor_id: ocr-rapidocr
    - extractor_id: select-text

Cost Optimization

Try free/cheap methods before expensive APIs:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text  # Free
    - extractor_id: pdf-text           # Free
    - extractor_id: stt-openai         # Paid API
    - extractor_id: select-text

Graceful Degradation

Provide fallbacks for when preferred extractors fail:

extractor_id: pipeline
config:
  stages:
    - extractor_id: docling-granite  # Preferred
    - extractor_id: docling-smol     # Fallback 1
    - extractor_id: ocr-rapidocr     # Fallback 2
    - extractor_id: metadata-text    # Last resort
    - extractor_id: select-text

Comparison with Other Selectors

| Feature | select-text | select-longest-text | select-override | select-smart-override |
| --- | --- | --- | --- | --- |
| Selection | First usable | Longest text | Last for pattern | Intelligent |
| Order matters | ✅ | ❌ | Partial | Partial |
| Media type aware | ❌ | ❌ | ✅ | ✅ |
| Confidence aware | ❌ | ❌ | ❌ | ✅ |
| Complexity | Simple | Simple | Moderate | Complex |