Select Longest Text Extractor

Extractor ID: select-longest-text

Overview

The select-longest-text extractor chooses the extraction with the most text from previous pipeline stages. It implements a length-based selection policy for scenarios where multiple extractors may produce different outputs for the same item.

This extractor is useful when you want to maximize extracted content, on the assumption that longer outputs are more complete. It suits comparing extraction methods on the same items and keeping whichever one produces the most text.

Installation

No additional dependencies required. This extractor is part of the core Biblicus installation.

pip install biblicus

Supported Media Types

All media types are supported. This extractor operates on previous extraction results, not the raw item.

Configuration

Config Schema

from pydantic import BaseModel


class SelectLongestTextExtractorConfig(BaseModel):
    # This extractor requires no configuration
    pass

Configuration Options

This extractor currently accepts no configuration options.

Selection Rules

The extractor selects text using the following rules:

Longest usable text: Select the extraction with the greatest character count (after stripping whitespace)
Tie breaking: If multiple extractions have the same length, select the earliest (lowest stage index)
No usable text: If all extractions are empty, select the earliest extraction
No extractions: If no previous extractions exist, return None

This provides deterministic selection that favors completeness.
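The four rules can be sketched in a few lines of Python. This is a minimal illustration, not the Biblicus implementation: `StageOutput` and `select_longest` are hypothetical names, the stage indices are illustrative, and "stripping whitespace" is assumed to mean `str.strip()` (leading and trailing whitespace only).

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class StageOutput:
    """Hypothetical stand-in for one earlier stage's extraction result."""
    producer_extractor_id: str
    source_stage_index: int
    text: str


def select_longest(outputs: Sequence[StageOutput]) -> Optional[StageOutput]:
    """Apply the selection rules to outputs listed in stage order."""
    if not outputs:
        return None  # No extractions
    best = outputs[0]  # The earliest output wins unless a later one is strictly longer
    for candidate in outputs[1:]:
        # Strict '>' keeps the earliest output on ties and when every text is empty.
        if len(candidate.text.strip()) > len(best.text.strip()):
            best = candidate
    return best


outputs = [
    StageOutput("ocr-rapidocr", 1, "Hello"),
    StageOutput("docling-smol", 2, "Hello, world"),
    StageOutput("pdf-text", 3, "Hello, earth"),
]
chosen = select_longest(outputs)
print(chosen.producer_extractor_id)  # docling-smol (ties with pdf-text, earlier stage wins)
```

The strict comparison is what makes the result deterministic: a later stage can only displace an earlier one by being strictly longer.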

Usage

Command Line

Select-longest-text is always used within a pipeline:

biblicus extract my-corpus --extractor pipeline \
  --config 'stages=[{"extractor_id":"ocr-rapidocr"},{"extractor_id":"docling-smol"},{"extractor_id":"select-longest-text"}]'
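Hand-writing the JSON inside a shell-quoted argument is easy to get wrong. The sketch below builds the same `stages=` value programmatically and prints a shell-safe command line. It only reproduces the argument format shown above and does not call Biblicus; the variable names are illustrative.

```python
import json
import shlex

stage_ids = ["ocr-rapidocr", "docling-smol", "select-longest-text"]
stages = [{"extractor_id": stage_id} for stage_id in stage_ids]

# Compact separators reproduce the no-space JSON used in the command above.
config_arg = "stages=" + json.dumps(stages, separators=(",", ":"))

command = [
    "biblicus", "extract", "my-corpus",
    "--extractor", "pipeline",
    "--config", config_arg,
]

# shlex.join quotes the JSON argument so the shell passes it through unchanged.
print(shlex.join(command))
```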

Configuration File

extractor_id: pipeline
config:
  stages:
    - extractor_id: ocr-rapidocr
    - extractor_id: docling-smol
    - extractor_id: select-longest-text  # Select longest result

biblicus extract my-corpus --configuration configuration.yml

Python API

from biblicus import Corpus

corpus = Corpus.from_directory("my-corpus")

results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "ocr-rapidocr"},
            {"extractor_id": "docling-smol"},
            {"extractor_id": "select-longest-text"}
        ]
    }
)

Examples

Compare OCR Methods

Try multiple OCR approaches and keep the best:

extractor_id: pipeline
config:
  stages:
    - extractor_id: ocr-rapidocr
    - extractor_id: ocr-paddleocr-vl
    - extractor_id: select-longest-text  # Keep most complete

Compare VLM Models

Test different VLM models and select the most thorough:

extractor_id: pipeline
config:
  stages:
    - extractor_id: docling-smol
    - extractor_id: docling-granite
    - extractor_id: select-longest-text

Maximize Extraction

Try all available methods and keep the most complete:

from biblicus import Corpus

corpus = Corpus.from_directory("complex-docs")

results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pdf-text"},
            {"extractor_id": "markitdown"},
            {"extractor_id": "docling-smol"},
            {"extractor_id": "select-longest-text"}
        ]
    }
)

Hybrid Extraction

Combine text extraction with OCR:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text
    - extractor_id: ocr-rapidocr
    - extractor_id: select-longest-text

Behavior Details

Length Calculation

Text length is calculated after stripping whitespace. This prevents padding or formatting differences from affecting selection.
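A small illustration of why stripping matters, assuming `str.strip()` semantics (only leading and trailing whitespace is removed). The helper name `usable_length` is hypothetical.

```python
def usable_length(text: str) -> int:
    """Length used for comparison: characters left after stripping the ends."""
    return len(text.strip())


padded = "\n\n   Invoice 42   \n\n"   # short content surrounded by padding
clean = "Invoice 42 total"            # more content, no padding

# By raw length the padded output looks longer...
print(len(padded) > len(clean))                      # True
# ...but after stripping, the output with more content wins.
print(usable_length(padded) < usable_length(clean))  # True
```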

Pipeline Position

Select-longest-text should be the last step in a pipeline. All extraction attempts should come before it.

Parallel Extraction

All extractors in the pipeline run on the same item. This differs from select-text, which stops at the first success.

Source Tracking

The selected extraction retains its producer_extractor_id and source_stage_index, allowing you to identify which extractor produced the final text.

Performance Consideration

Since all extractors run (not just until first success), this approach is slower but more thorough than select-text.

When to Use Select-Longest-Text

Use select-longest-text when:

You want the most complete extraction
Multiple extractors may produce different results
Completeness is more important than speed
You’re comparing extractor quality

Use select-text when:

Order matters (fast extractors first)
You want to stop at first success
Speed is more important than completeness

Use select-override when:

You want media type-based routing
The last extractor matching a pattern should win

Use select-smart-override when:

You need intelligent routing
Quality metrics (confidence, length) matter

Best Practices

Combine Similar Extractors

Group extractors that target the same content type:

# OCR comparison
stages:
  - extractor_id: ocr-rapidocr
  - extractor_id: ocr-paddleocr-vl
  - extractor_id: select-longest-text

# VLM comparison
stages:
  - extractor_id: docling-smol
  - extractor_id: docling-granite
  - extractor_id: select-longest-text

Consider Performance Trade-offs

Running all extractors is expensive:

# Expensive but thorough
stages:
  - extractor_id: pdf-text           # Fast
  - extractor_id: markitdown         # Moderate
  - extractor_id: docling-smol       # Slow
  - extractor_id: docling-granite    # Slower
  - extractor_id: select-longest-text

Consider using select-text for performance:

# Faster fallback chain
stages:
  - extractor_id: pdf-text
  - extractor_id: markitdown
  - extractor_id: docling-smol
  - extractor_id: select-text  # Stop at first success

Always Place Last

Select-longest-text should always be the final step:

stages:
  - extractor_id: extractor-1
  - extractor_id: extractor-2
  - extractor_id: extractor-3
  - extractor_id: select-longest-text  # Always last
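A pipeline configuration can be checked for this before it runs. The helper below is a hypothetical sketch, not part of Biblicus, and it assumes stages are given as a list of dicts with an `extractor_id` key, as in the Python API examples. Whether Biblicus performs a similar check itself is not stated here.

```python
from typing import Any, Mapping, Sequence

SELECTOR_ID = "select-longest-text"


def check_selector_is_last(stages: Sequence[Mapping[str, Any]]) -> None:
    """Raise ValueError unless the selector appears exactly once, as the final stage."""
    ids = [stage["extractor_id"] for stage in stages]
    # The count check runs first, so an empty list never reaches ids[-1].
    if ids.count(SELECTOR_ID) != 1 or ids[-1] != SELECTOR_ID:
        raise ValueError(
            f"{SELECTOR_ID} must appear exactly once, as the last stage; got {ids}"
        )


good = [
    {"extractor_id": "pdf-text"},
    {"extractor_id": "ocr-rapidocr"},
    {"extractor_id": "select-longest-text"},
]
check_selector_is_last(good)  # passes silently

misplaced = [good[2], good[0], good[1]]
try:
    check_selector_is_last(misplaced)
except ValueError as error:
    print(error)
```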

Monitor Extraction Statistics

Track which extractors produce the longest outputs:

from biblicus import Corpus

corpus = Corpus.from_directory("my-corpus")

results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "ocr-rapidocr"},
            {"extractor_id": "docling-smol"},
            {"extractor_id": "select-longest-text"}
        ]
    }
)

# Check which extractor was selected most often
# (This requires inspecting extraction metadata)
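Once the per-item metadata is available, the tally itself is simple. The sketch below assumes one record per item carrying the `producer_extractor_id` and `source_stage_index` fields described under Source Tracking. The dict shape, the `item_id` key, and the sample values are hypothetical, and how such records are obtained from `results` is not shown here.

```python
from collections import Counter

# Hypothetical records: one per item, describing the extraction that was selected.
selected = [
    {"item_id": "a.pdf", "producer_extractor_id": "docling-smol", "source_stage_index": 2},
    {"item_id": "b.pdf", "producer_extractor_id": "ocr-rapidocr", "source_stage_index": 1},
    {"item_id": "c.pdf", "producer_extractor_id": "docling-smol", "source_stage_index": 2},
]

# Count how often each extractor produced the selected (longest) text.
wins = Counter(record["producer_extractor_id"] for record in selected)

for extractor_id, count in wins.most_common():
    print(f"{extractor_id}: selected for {count} of {len(selected)} items")
```

An extractor that is rarely or never selected is a candidate for removal from the pipeline, which reduces the cost of running every stage.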

Use Cases

Quality Comparison

Compare different extractors to find the best:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text
    - extractor_id: markitdown
    - extractor_id: unstructured
    - extractor_id: docling-smol
    - extractor_id: select-longest-text

Scanned PDF Processing

Try both text extraction and OCR:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text       # Works for digital PDFs
    - extractor_id: ocr-rapidocr   # Works for scanned PDFs
    - extractor_id: select-longest-text

Maximize Content Extraction

Extract as much text as possible:

from biblicus import Corpus

corpus = Corpus.from_directory("documents")

# Try everything, keep the best
results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pass-through-text"},
            {"extractor_id": "pdf-text"},
            {"extractor_id": "markitdown"},
            {"extractor_id": "ocr-rapidocr"},
            {"extractor_id": "docling-smol"},
            {"extractor_id": "select-longest-text"}
        ]
    }
)

Benchmark Extractors

Systematically compare extractor performance:

extractor_id: pipeline
config:
  stages:
    - extractor_id: markitdown
    - extractor_id: unstructured
    - extractor_id: docling-smol
    - extractor_id: select-longest-text

Comparison with Other Selectors

Feature	select-longest-text	select-text	select-override	select-smart-override
Selection	Longest text	First usable	Last for pattern	Intelligent
All run	✅	❌	✅	✅
Order matters	Tie-break only	✅	Partial	Partial
Performance	Slow	Fast	Moderate	Moderate
Use case	Quality comparison	Fast fallback	Media routing	Smart routing
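The difference between the first two columns can be shown on the same set of outputs. The sketch models only the selection policies, not which stages actually execute. The function names are hypothetical, and returning `None` from the first-usable policy when nothing is usable is an assumption.

```python
from typing import List, Optional, Tuple

Output = Tuple[str, str]  # (extractor_id, text), listed in stage order


def pick_first_usable(outputs: List[Output]) -> Optional[Output]:
    """select-text style policy: the first output with non-whitespace text."""
    for output in outputs:
        if output[1].strip():
            return output
    return None


def pick_longest(outputs: List[Output]) -> Optional[Output]:
    """select-longest-text style policy: greatest stripped length, earliest on ties."""
    if not outputs:
        return None
    # max() returns the first maximal element, which gives the earliest-stage tie-break.
    return max(outputs, key=lambda output: len(output[1].strip()))


outputs = [
    ("pdf-text", "Page 1"),
    ("ocr-rapidocr", "Page 1: Quarterly report, revenue by region"),
]
print(pick_first_usable(outputs)[0])  # pdf-text
print(pick_longest(outputs)[0])       # ocr-rapidocr
```

When an early extractor returns a short but non-empty fragment, the first-usable policy accepts it, while the length-based policy keeps looking for a fuller result.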

Performance Considerations

All Extractors Run

Unlike select-text, all extractors run regardless of which produces the longest output. This means:

Extraction takes at least as long as the slowest extractor, and as long as all extractors combined when stages run one after another
API costs are incurred for all API-based extractors
Computational resources are used for all local extractors
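A back-of-the-envelope comparison makes the trade-off concrete. The per-stage durations below are placeholders invented for illustration, not measurements, and the helper names are hypothetical. Total work is the sum of all stages however they are scheduled, and the slowest stage is only a lower bound on wall-clock time.

```python
from typing import List, Sequence, Tuple

# Placeholder durations in milliseconds, in pipeline order (not measurements).
stage_ms: List[Tuple[str, int]] = [
    ("pdf-text", 200),
    ("markitdown", 1_500),
    ("docling-smol", 12_000),
]


def total_work_ms(stages: Sequence[Tuple[str, int]]) -> int:
    """Work done when every stage runs, as with select-longest-text."""
    return sum(ms for _, ms in stages)


def work_until_first_success_ms(
    stages: Sequence[Tuple[str, int]], succeeded: Sequence[bool]
) -> int:
    """Work done if processing stops at the first successful stage."""
    total = 0
    for (_, ms), ok in zip(stages, succeeded):
        total += ms
        if ok:
            break
    return total


all_run = total_work_ms(stage_ms)
early_stop = work_until_first_success_ms(stage_ms, [False, True, True])
slowest = max(ms for _, ms in stage_ms)

print(all_run)     # 13700: every stage runs
print(early_stop)  # 1700: stops after markitdown succeeds
print(slowest)     # 12000: lower bound on wall-clock time when all stages run
```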

When Performance Matters

If speed is critical, consider:

Using select-text instead
Reducing the number of extractors in the pipeline
Using only fast extractors

When Completeness Matters

If completeness is critical, select-longest-text is a good fit:

All extraction methods are attempted
The longest result is selected
An extractor that recovers text the others miss can still win, although only one output is kept and outputs are not merged