Select Longest Text Extractor
Extractor ID: select-longest-text
Category: Pipeline Utilities
Overview
The select-longest-text extractor chooses the extraction with the most text from previous pipeline stages. It implements a length-based selection policy for scenarios where multiple extractors may produce different outputs for the same item.
This extractor is useful when you want to maximize extracted content, assuming that longer outputs are more complete. It’s ideal for comparing different extraction methods and choosing the one that extracts the most information.
Installation
No additional dependencies required. This extractor is part of the core Biblicus installation.
pip install biblicus
Supported Media Types
All media types are supported. This extractor operates on previous extraction results, not the raw item.
Configuration
Config Schema
class SelectLongestTextExtractorConfig(BaseModel):
# This extractor requires no configuration
pass
Configuration Options
This extractor currently accepts no configuration options.
Selection Rules
The extractor selects text using the following rules:
Longest usable text: Select the extraction with the greatest character count (after stripping whitespace)
Tie breaking: If multiple extractions have the same length, select the earliest (lowest stage index)
No usable text: If all extractions are empty, select the earliest extraction
No extractions: If no previous extractions exist, return
None
This provides deterministic selection that favors completeness.
Usage
Command Line
Select-longest-text is always used within a pipeline:
biblicus extract my-corpus --extractor pipeline \
--config 'stages=[{"extractor_id":"ocr-rapidocr"},{"extractor_id":"docling-smol"},{"extractor_id":"select-longest-text"}]'
Configuration File
extractor_id: pipeline
config:
stages:
- extractor_id: ocr-rapidocr
- extractor_id: docling-smol
- extractor_id: select-longest-text # Select longest result
biblicus extract my-corpus --configuration configuration.yml
Python API
from biblicus import Corpus
corpus = Corpus.from_directory("my-corpus")
results = corpus.extract_text(
extractor_id="pipeline",
config={
"stages": [
{"extractor_id": "ocr-rapidocr"},
{"extractor_id": "docling-smol"},
{"extractor_id": "select-longest-text"}
]
}
)
Examples
Compare OCR Methods
Try multiple OCR approaches and keep the best:
extractor_id: pipeline
config:
stages:
- extractor_id: ocr-rapidocr
- extractor_id: ocr-paddleocr-vl
- extractor_id: select-longest-text # Keep most complete
Compare VLM Models
Test different VLM models and select the most thorough:
extractor_id: pipeline
config:
stages:
- extractor_id: docling-smol
- extractor_id: docling-granite
- extractor_id: select-longest-text
Maximize Extraction
Try all available methods and keep the most complete:
from biblicus import Corpus
corpus = Corpus.from_directory("complex-docs")
results = corpus.extract_text(
extractor_id="pipeline",
config={
"stages": [
{"extractor_id": "pdf-text"},
{"extractor_id": "markitdown"},
{"extractor_id": "docling-smol"},
{"extractor_id": "select-longest-text"}
]
}
)
Hybrid Extraction
Combine text extraction with OCR:
extractor_id: pipeline
config:
stages:
- extractor_id: pdf-text
- extractor_id: ocr-rapidocr
- extractor_id: select-longest-text
Behavior Details
Length Calculation
Text length is calculated after stripping whitespace. This prevents padding or formatting differences from affecting selection.
Pipeline Position
Select-longest-text should be the last step in a pipeline. All extraction attempts should come before it.
Parallel Extraction
All extractors in the pipeline run on the same item. This differs from select-text which stops at the first success.
Source Tracking
The selected extraction retains its producer_extractor_id and source_stage_index, allowing you to identify which extractor produced the final text.
Performance Consideration
Since all extractors run (not just until first success), this approach is slower but more thorough than select-text.
When to Use Select-Longest-Text
Use select-longest-text when:
You want the most complete extraction
Multiple extractors may produce different results
Completeness is more important than speed
You’re comparing extractor quality
Use select-text when:
Order matters (fast extractors first)
You want to stop at first success
Speed is more important than completeness
Use select-override when:
You want media type-based routing
Last extractor should win for patterns
Use select-smart-override when:
You need intelligent routing
Quality metrics (confidence, length) matter
Best Practices
Combine Similar Extractors
Group extractors that target the same content type:
# OCR comparison
stages:
- ocr-rapidocr
- ocr-paddleocr-vl
- select-longest-text
# VLM comparison
stages:
- docling-smol
- docling-granite
- select-longest-text
Consider Performance Trade-offs
Running all extractors is expensive:
# Expensive but thorough
stages:
- pdf-text # Fast
- markitdown # Moderate
- docling-smol # Slow
- docling-granite # Slower
- select-longest-text
Consider using select-text for performance:
# Faster fallback chain
stages:
- pdf-text
- markitdown
- docling-smol
- select-text # Stop at first success
Always Place Last
Select-longest-text should always be the final step:
stages:
- extractor-1
- extractor-2
- extractor-3
- select-longest-text # Always last
Monitor Extraction Statistics
Track which extractors produce the longest outputs:
results = corpus.extract_text(
extractor_id="pipeline",
config={
"stages": [
{"extractor_id": "ocr-rapidocr"},
{"extractor_id": "docling-smol"},
{"extractor_id": "select-longest-text"}
]
}
)
# Check which extractor was selected most often
# (This requires inspecting extraction metadata)
Use Cases
Quality Comparison
Compare different extractors to find the best:
extractor_id: pipeline
config:
stages:
- extractor_id: pdf-text
- extractor_id: markitdown
- extractor_id: unstructured
- extractor_id: docling-smol
- extractor_id: select-longest-text
Scanned PDF Processing
Try both text extraction and OCR:
extractor_id: pipeline
config:
stages:
- extractor_id: pdf-text # Works for digital PDFs
- extractor_id: ocr-rapidocr # Works for scanned PDFs
- extractor_id: select-longest-text
Maximize Content Extraction
Extract as much text as possible:
from biblicus import Corpus
corpus = Corpus.from_directory("documents")
# Try everything, keep the best
results = corpus.extract_text(
extractor_id="pipeline",
config={
"stages": [
{"extractor_id": "pass-through-text"},
{"extractor_id": "pdf-text"},
{"extractor_id": "markitdown"},
{"extractor_id": "ocr-rapidocr"},
{"extractor_id": "docling-smol"},
{"extractor_id": "select-longest-text"}
]
}
)
Benchmark Extractors
Systematically compare extractor performance:
extractor_id: pipeline
config:
stages:
- extractor_id: markitdown
- extractor_id: unstructured
- extractor_id: docling-smol
- extractor_id: select-longest-text
Comparison with Other Selectors
Feature |
select-longest |
select-text |
select-override |
select-smart-override |
|---|---|---|---|---|
Selection |
Longest text |
First usable |
Last for pattern |
Intelligent |
All run |
✅ |
❌ |
✅ |
✅ |
Order matters |
Tie-break only |
✅ |
Partial |
Partial |
Performance |
Slow |
Fast |
Moderate |
Moderate |
Use case |
Quality comparison |
Fast fallback |
Media routing |
Smart routing |
Performance Considerations
All Extractors Run
Unlike select-text, all extractors run regardless of which produces the longest output. This means:
Extraction takes as long as the slowest extractor
API costs are incurred for all API-based extractors
Computational resources are used for all local extractors
When Performance Matters
If speed is critical, consider:
Using select-text instead
Reducing the number of extractors in the pipeline
Using only fast extractors
When Completeness Matters
If quality is critical, select-longest-text is ideal:
All extraction methods are attempted
The most thorough result is selected
No potential text is missed
See Also
extraction.md - Extraction pipeline concepts