Select Text Extractor

Extractor ID: select-text

Category: Pipeline Utilities

Overview

The select-text extractor chooses the first usable extracted text from previous pipeline stages. It implements a simple, deterministic selection policy for fallback chains where multiple extractors may produce results for the same item.

This extractor is fundamental to pipeline composition, enabling graceful fallback patterns where you try fast extractors first and fall back to more powerful (but slower) alternatives.

Installation

No additional dependencies required. This extractor is part of the core Biblicus installation.

pip install biblicus

Supported Media Types

All media types are supported. This extractor operates on previous extraction results, not the raw item.

Configuration

Config Schema

from pydantic import BaseModel

class SelectTextExtractorConfig(BaseModel):
    # This extractor requires no configuration
    pass

Configuration Options

This extractor is intentionally minimal and accepts no configuration options.

Selection Rules

The extractor selects text using the following rules:

  1. Usable text: Select the first extraction with non-empty text (after stripping whitespace)

  2. Any text: If no usable text exists, select the first extraction even if empty

  3. No extractions: If no previous extractions exist, return None

This ensures deterministic behavior: given the same pipeline order and inputs, the same extraction is always selected.
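
The three rules above can be sketched in a few lines of Python. This is an illustration only, not the Biblicus implementation: the `Extraction` dataclass and the `select_first_usable` function are hypothetical names, the field names simply mirror those mentioned under Source Tracking, and the stage index values are arbitrary.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Extraction:
    """Hypothetical stand-in for the output of one earlier stage."""
    text: str
    producer_extractor_id: str
    source_stage_index: int


def select_first_usable(extractions: Sequence[Extraction]) -> Optional[Extraction]:
    """Apply the three selection rules in pipeline order."""
    # Rule 3: nothing was extracted upstream.
    if not extractions:
        return None
    # Rule 1: first extraction whose text is non-empty after stripping.
    for extraction in extractions:
        if extraction.text.strip():
            return extraction
    # Rule 2: every extraction was empty, so keep the first one.
    return extractions[0]


candidates = [
    Extraction("   ", "pdf-text", 1),
    Extraction("Scanned page text", "ocr-rapidocr", 2),
]
selected = select_first_usable(candidates)
print(selected.producer_extractor_id)  # ocr-rapidocr
```

Because the function only walks the list in order, its result depends on nothing but the stage order and the stage outputs.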

Usage

Command Line

Select-text is always used within a pipeline:

biblicus extract my-corpus --extractor pipeline \
  --config 'stages=[{"extractor_id":"pdf-text"},{"extractor_id":"ocr-rapidocr"},{"extractor_id":"select-text"}]'
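
When the stage list is generated by a script, the `stages=[...]` value can be produced with the standard `json` module instead of being typed by hand. The helper name `stages_argument` is hypothetical; the sketch only reproduces the string shown above and makes no claim about how Biblicus parses it.

```python
import json
import shlex


def stages_argument(extractor_ids):
    """Build the value passed to --config for a pipeline (hypothetical helper)."""
    stages = [{"extractor_id": extractor_id} for extractor_id in extractor_ids]
    # Compact separators keep the JSON free of spaces, matching the command above.
    return "stages=" + json.dumps(stages, separators=(",", ":"))


argument = stages_argument(["pdf-text", "ocr-rapidocr", "select-text"])
# shlex.quote wraps the value in single quotes so the shell passes it through intact.
print("biblicus extract my-corpus --extractor pipeline --config", shlex.quote(argument))
```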

Configuration File

extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text
    - extractor_id: ocr-rapidocr
    - extractor_id: select-text  # Select first usable result

biblicus extract my-corpus --configuration configuration.yml

Python API

from biblicus import Corpus

corpus = Corpus.from_directory("my-corpus")

results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pdf-text"},
            {"extractor_id": "ocr-rapidocr"},
            {"extractor_id": "select-text"}
        ]
    }
)

Examples

Fast-to-Slow Fallback

Try fast extraction first, fall back to slower methods:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text  # Fastest
    - extractor_id: pdf-text           # Fast
    - extractor_id: ocr-rapidocr       # Moderate
    - extractor_id: docling-smol       # Slower
    - extractor_id: select-text        # Select first usable result

Text-First Strategy

Prefer text extraction over OCR:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: select-text  # Prefer text over OCR

Multi-Format Corpus

Handle diverse document types:

from biblicus import Corpus

corpus = Corpus.from_directory("mixed-corpus")

results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pass-through-text"},
            {"extractor_id": "pdf-text"},
            {"extractor_id": "markitdown"},
            {"extractor_id": "unstructured"},
            {"extractor_id": "select-text"}
        ]
    }
)

OCR Fallback Chain

Try multiple OCR approaches:

extractor_id: pipeline
config:
  stages:
    - extractor_id: ocr-rapidocr
    - extractor_id: ocr-paddleocr-vl
    - extractor_id: docling-smol
    - extractor_id: select-text

Behavior Details

Pipeline Position

Select-text should be the last step in a pipeline. All extraction attempts should come before it.
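
One way to make this hard to get wrong is to build the stage list programmatically and append the selector at the end. The helper below is a hypothetical convenience, not part of Biblicus; it assumes only the `{"stages": [...]}` shape used in the Python API example.

```python
SELECTOR_ID = "select-text"


def build_fallback_pipeline(extractor_ids):
    """Return a pipeline config with select-text appended as the final stage."""
    if SELECTOR_ID in extractor_ids:
        raise ValueError("list only the extractors; select-text is appended last")
    stages = [{"extractor_id": extractor_id} for extractor_id in extractor_ids]
    stages.append({"extractor_id": SELECTOR_ID})
    return {"stages": stages}


config = build_fallback_pipeline(["pdf-text", "ocr-rapidocr"])
print([stage["extractor_id"] for stage in config["stages"]])
```

The returned dictionary has the same shape as the `config` argument passed to `corpus.extract_text` in the Python API section.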

Empty Results

If all previous extractors produce empty text, select-text returns the first empty result (not None). This distinguishes “processed but empty” from “not processed.”
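
Downstream code can rely on that distinction. The sketch below assumes the caller sees the selected text as either a string or `None`; `describe_outcome` is a hypothetical helper, and the way Biblicus actually exposes results may differ.

```python
from typing import Optional


def describe_outcome(selected_text: Optional[str]) -> str:
    """Map a selection result onto the three possible cases."""
    if selected_text is None:
        # No earlier stage produced an extraction for this item.
        return "not processed"
    if not selected_text.strip():
        # At least one stage ran, but none produced usable text.
        return "processed but empty"
    return "has text"


for value in (None, "  \n", "Chapter 1"):
    print(repr(value), "->", describe_outcome(value))
```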

Source Tracking

The selected extraction retains its producer_extractor_id and source_stage_index, allowing you to identify which extractor produced the final text.
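
These fields make it easy to see how often each stage actually supplies the final text, for example when deciding whether a slow extractor is worth keeping in the chain. In the sketch below, `SelectedExtraction` is a hypothetical record that carries only the two field names mentioned above, and the sample values are made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SelectedExtraction:
    """Hypothetical record carrying the two tracking fields."""
    producer_extractor_id: str
    source_stage_index: int


def count_producers(selections):
    """Count how many items each upstream extractor ended up supplying."""
    return Counter(selection.producer_extractor_id for selection in selections)


selections = [
    SelectedExtraction("pdf-text", 1),
    SelectedExtraction("pdf-text", 1),
    SelectedExtraction("ocr-rapidocr", 2),
]
producer_counts = count_producers(selections)
print(producer_counts.most_common())
```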

Determinism

Given the same pipeline configuration and inputs, select-text always produces the same result. This makes it suitable for reproducible research and testing.

When to Use Select-Text

Use select-text when:

  • You want the first successful extraction

  • Order matters (try cheap extractors first)

  • You need deterministic, predictable selection

  • Simplicity is preferred

Use select-longest-text when:

  • Multiple extractors may succeed

  • You want the most complete output

  • Order doesn’t matter

Use select-override when:

  • You want to override specific media types

  • Last extractor should win for certain items

Use select-smart-override when:

  • You need intelligent routing by media type

  • Quality metrics (confidence, length) matter
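
The difference between the first two selectors is easiest to see on a single item for which two stages both return text. The functions below are simplified illustrations, not the Biblicus implementations: `first_usable` omits the empty-result fallback described under Selection Rules, and `longest` assumes that select-longest-text compares stripped text length.

```python
def first_usable(texts):
    """select-text style: the first text that is non-empty after stripping."""
    return next((text for text in texts if text.strip()), None)


def longest(texts):
    """Assumed select-longest-text style: the text with the longest stripped length."""
    return max(texts, key=lambda text: len(text.strip()), default=None)


# Outputs of two hypothetical stages for the same item, in pipeline order.
outputs = ["Title only", "Title only, followed by the full body text."]
print(first_usable(outputs))
print(longest(outputs))
```

With select-text, swapping the two stages changes the result; with a longest-text policy it does not, as long as the lengths differ.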

Best Practices

Order Extractors by Speed

Put faster extractors first:

stages:
  - extractor_id: pass-through-text  # Instant
  - extractor_id: pdf-text           # Fast
  - extractor_id: markitdown         # Moderate
  - extractor_id: docling-smol       # Slow
  - extractor_id: select-text

Order Extractors by Accuracy

Or prioritize accuracy:

stages:
  - extractor_id: docling-granite    # Best accuracy
  - extractor_id: docling-smol       # Good accuracy
  - extractor_id: ocr-rapidocr       # Basic OCR
  - extractor_id: select-text

Always Place Last

Select-text should always be the final step:

stages:
  - extractor_id: extractor-1
  - extractor_id: extractor-2
  - extractor_id: extractor-3
  - extractor_id: select-text  # Always last

Combine with Media Type Routing

Use select-text within smart routing:

extractor_id: select-smart-override
config:
  default_extractor: pipeline
  default_config:
    stages:
      - extractor_id: pass-through-text
      - extractor_id: select-text
  overrides:
    - media_type_pattern: "application/pdf"
      extractor: pipeline
      config:
        stages:
          - extractor_id: pdf-text
          - extractor_id: docling-smol
          - extractor_id: select-text

Use Cases

Heterogeneous Corpus

Process corpora with mixed document types:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: markitdown
    - extractor_id: ocr-rapidocr
    - extractor_id: select-text

Cost Optimization

Try free/cheap methods before expensive APIs:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text  # Free
    - extractor_id: pdf-text           # Free
    - extractor_id: stt-openai         # Paid API
    - extractor_id: select-text

Graceful Degradation

Provide fallbacks for when preferred extractors fail:

extractor_id: pipeline
config:
  stages:
    - extractor_id: docling-granite  # Preferred
    - extractor_id: docling-smol     # Fallback 1
    - extractor_id: ocr-rapidocr     # Fallback 2
    - extractor_id: metadata-text    # Last resort
    - extractor_id: select-text

Comparison with Other Selectors

Feature           select-text   select-longest  select-override   select-smart-override
Selection         First usable  Longest text    Last for pattern  Intelligent
Order matters     Yes           Partial         Yes               Partial
Media type aware  No            No              Yes               Yes
Confidence aware  No            No              No                Yes
Complexity        Simple        Simple          Moderate          Complex

See Also