# Select Text Extractor

**Extractor ID:** `select-text`

**Category:** [Pipeline Utilities](index.md)

## Overview

The select-text extractor chooses the first usable extracted text from previous pipeline stages. It implements a simple, deterministic selection policy for fallback chains where multiple extractors may produce results for the same item.

This extractor is fundamental to pipeline composition, enabling graceful fallback patterns where you try fast extractors first and fall back to more powerful (but slower) alternatives.

## Installation

No additional dependencies required. This extractor is part of the core Biblicus installation.

```bash
pip install biblicus
```

## Supported Media Types

All media types are supported. This extractor operates on previous extraction results, not the raw item.

## Configuration

### Config Schema

```python
class SelectTextExtractorConfig(BaseModel):
    # This extractor requires no configuration
    pass
```

### Configuration Options

This extractor is intentionally minimal and accepts no configuration options.

## Selection Rules

The extractor selects text using the following rules:

1. **Usable text**: Select the first extraction with non-empty text (after stripping whitespace)
2. **Any text**: If no usable text exists, select the first extraction even if empty
3. **No extractions**: If no previous extractions exist, return `None`

This ensures deterministic behavior: given the same pipeline order and inputs, the same extraction is always selected.
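
The three rules above can be sketched in a few lines of Python. This is an illustration only, not the Biblicus implementation: the `Extraction` dataclass and the `select_first_usable` function are hypothetical names introduced here, and the two tracking fields simply mirror the `producer_extractor_id` and `source_stage_index` attributes described under Source Tracking.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Extraction:
    """Hypothetical stand-in for one stage's output (not a Biblicus class)."""

    text: str
    producer_extractor_id: str
    source_stage_index: int


def select_first_usable(extractions: Sequence[Extraction]) -> Optional[Extraction]:
    """Apply the three selection rules to results listed in pipeline order."""
    # Rule 3: no earlier stage produced an extraction.
    if not extractions:
        return None
    # Rule 1: first extraction whose text is non-empty after stripping whitespace.
    for extraction in extractions:
        if extraction.text.strip():
            return extraction
    # Rule 2: every extraction is empty, so fall back to the first one.
    return extractions[0]


previous = [
    Extraction(text="   ", producer_extractor_id="pdf-text", source_stage_index=0),
    Extraction(text="Scanned page text", producer_extractor_id="ocr-rapidocr", source_stage_index=1),
]
selected = select_first_usable(previous)
print(selected.producer_extractor_id)  # prints "ocr-rapidocr"
```

In this sketch the whitespace-only `pdf-text` result is skipped, so the OCR result is chosen. Because the function only scans the list in order and never compares lengths or scores, the same inputs in the same order always yield the same selection.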
## Usage

### Command Line

Select-text is always used within a pipeline:

```bash
biblicus extract my-corpus --extractor pipeline \
  --config 'stages=[{"extractor_id":"pdf-text"},{"extractor_id":"ocr-rapidocr"},{"extractor_id":"select-text"}]'
```

### Configuration File

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text
    - extractor_id: ocr-rapidocr
    - extractor_id: select-text  # Select first usable result
```

```bash
biblicus extract my-corpus --configuration configuration.yml
```

### Python API

```python
from biblicus import Corpus

corpus = Corpus.from_directory("my-corpus")

results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pdf-text"},
            {"extractor_id": "ocr-rapidocr"},
            {"extractor_id": "select-text"}
        ]
    }
)
```

## Examples

### Fast-to-Slow Fallback

Try fast extraction first, fall back to slower methods:

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text  # Fastest
    - extractor_id: pdf-text           # Fast
    - extractor_id: ocr-rapidocr       # Moderate
    - extractor_id: docling-smol       # Slower
    - extractor_id: select-text        # Select first result
```

### Text-First Strategy

Prefer text extraction over OCR:

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: select-text  # Prefer text over OCR
```

### Multi-Format Corpus

Handle diverse document types:

```python
from biblicus import Corpus

corpus = Corpus.from_directory("mixed-corpus")

results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pass-through-text"},
            {"extractor_id": "pdf-text"},
            {"extractor_id": "markitdown"},
            {"extractor_id": "unstructured"},
            {"extractor_id": "select-text"}
        ]
    }
)
```

### OCR Fallback Chain

Try multiple OCR approaches:

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: ocr-rapidocr
    - extractor_id: ocr-paddleocr-vl
    - extractor_id: docling-smol
    - extractor_id: select-text
```

## Behavior Details

### Pipeline Position

Select-text should be the **last step** in a pipeline. All extraction attempts should come before it.

### Empty Results

If all previous extractors produce empty text, select-text returns the first empty result (not `None`). This distinguishes "processed but empty" from "not processed."

### Source Tracking

The selected extraction retains its `producer_extractor_id` and `source_stage_index`, allowing you to identify which extractor produced the final text.

### Determinism

Given the same pipeline configuration and inputs, select-text always produces the same result. This makes it suitable for reproducible research and testing.

## When to Use Select-Text

### Use select-text when:

- You want the first successful extraction
- Order matters (try cheap extractors first)
- You need deterministic, predictable selection
- Simplicity is preferred

### Use select-longest-text when:

- Multiple extractors may succeed
- You want the most complete output
- Order doesn't matter

### Use select-override when:

- You want to override specific media types
- Last extractor should win for certain items

### Use select-smart-override when:

- You need intelligent routing by media type
- Quality metrics (confidence, length) matter

## Best Practices

### Order Extractors by Speed

Put faster extractors first:

```yaml
stages:
  - pass-through-text  # Instant
  - pdf-text           # Fast
  - markitdown         # Moderate
  - docling-smol       # Slow
  - select-text
```

### Order Extractors by Accuracy

Or prioritize accuracy:

```yaml
stages:
  - docling-granite  # Best accuracy
  - docling-smol     # Good accuracy
  - ocr-rapidocr     # Basic OCR
  - select-text
```

### Always Place Last

Select-text should always be the final step:

```yaml
stages:
  - extractor-1
  - extractor-2
  - extractor-3
  - select-text  # Always last
```

### Combine with Media Type Routing

Use select-text within smart routing:

```yaml
extractor_id: select-smart-override
config:
  default_extractor: pipeline
  default_config:
    stages:
      - extractor_id: pass-through-text
      - extractor_id: select-text
  overrides:
    - media_type_pattern: "application/pdf"
      extractor: pipeline
      config:
        stages:
          - extractor_id: pdf-text
          - extractor_id: docling-smol
          - extractor_id: select-text
```

## Use Cases

### Heterogeneous Corpus

Process corpora with mixed document types:

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: markitdown
    - extractor_id: ocr-rapidocr
    - extractor_id: select-text
```

### Cost Optimization

Try free/cheap methods before expensive APIs:

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text  # Free
    - extractor_id: pdf-text           # Free
    - extractor_id: stt-openai         # Paid API
    - extractor_id: select-text
```

### Graceful Degradation

Provide fallbacks for when preferred extractors fail:

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: docling-granite  # Preferred
    - extractor_id: docling-smol     # Fallback 1
    - extractor_id: ocr-rapidocr     # Fallback 2
    - extractor_id: metadata-text    # Last resort
    - extractor_id: select-text
```

## Comparison with Other Selectors

| Feature | select-text | select-longest | select-override | select-smart-override |
|---------|-------------|----------------|-----------------|----------------------|
| Selection | First usable | Longest text | Last for pattern | Intelligent |
| Order matters | ✅ | ❌ | Partial | Partial |
| Media type aware | ❌ | ❌ | ✅ | ✅ |
| Confidence aware | ❌ | ❌ | ❌ | ✅ |
| Complexity | Simple | Simple | Moderate | Complex |

## Related Extractors

### Same Category

- [select-longest-text](select-longest.md) - Select longest output
- [select-override](select-override.md) - Simple override selection
- [select-smart-override](select-smart-override.md) - Intelligent routing
- [pipeline](pipeline.md) - Multi-step extraction

### Frequently Combined With

- [pass-through-text](../text-document/pass-through.md) - Text file reading
- [pdf-text](../text-document/pdf.md) - PDF extraction
- [markitdown](../text-document/markitdown.md) - Office documents
- [ocr-rapidocr](../ocr/rapidocr.md) - Fast OCR
- [docling-smol](../vlm-document/docling-smol.md) - Fast VLM

## See Also

- [Pipeline Utilities Overview](index.md)
- [Extractors Index](../index.md)
- [extraction.md](../../extraction.md) - Extraction pipeline concepts
- [Pipeline Configuration](../../extraction.md#pipelines)