# Select Longest Text Extractor

**Extractor ID:** `select-longest-text`

**Category:** [Pipeline Utilities](index.md)

## Overview

The select-longest-text extractor chooses the extraction with the most text from previous pipeline stages. It implements a length-based selection policy for scenarios where multiple extractors may produce different outputs for the same item.

This extractor is useful when you want to maximize extracted content, assuming that longer outputs are more complete. It's ideal for comparing different extraction methods and choosing the one that extracts the most information.

## Installation

No additional dependencies required. This extractor is part of the core Biblicus installation.

```bash
pip install biblicus
```

## Supported Media Types

All media types are supported. This extractor operates on previous extraction results, not the raw item.

## Configuration

### Config Schema

```python
class SelectLongestTextExtractorConfig(BaseModel):
    # This extractor requires no configuration
    pass
```

### Configuration Options

This extractor currently accepts no configuration options.

## Selection Rules

The extractor selects text using the following rules:

1. **Longest usable text**: Select the extraction with the greatest character count (after stripping whitespace)
2. **Tie breaking**: If multiple extractions have the same length, select the earliest (lowest stage index)
3. **No usable text**: If all extractions are empty, select the earliest extraction
4. **No extractions**: If no previous extractions exist, return `None`

This provides deterministic selection that favors completeness.

## Usage

### Command Line

Select-longest-text is always used within a pipeline:

```bash
biblicus extract my-corpus --extractor pipeline \
  --config 'stages=[{"extractor_id":"ocr-rapidocr"},{"extractor_id":"docling-smol"},{"extractor_id":"select-longest-text"}]'
```

### Configuration File

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: ocr-rapidocr
    - extractor_id: docling-smol
    - extractor_id: select-longest-text  # Select longest result
```

```bash
biblicus extract my-corpus --configuration configuration.yml
```

### Python API

```python
from biblicus import Corpus

corpus = Corpus.from_directory("my-corpus")

results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "ocr-rapidocr"},
            {"extractor_id": "docling-smol"},
            {"extractor_id": "select-longest-text"}
        ]
    }
)
```

## Examples

### Compare OCR Methods

Try multiple OCR approaches and keep the best:

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: ocr-rapidocr
    - extractor_id: ocr-paddleocr-vl
    - extractor_id: select-longest-text  # Keep most complete
```

### Compare VLM Models

Test different VLM models and select the most thorough:

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: docling-smol
    - extractor_id: docling-granite
    - extractor_id: select-longest-text
```

### Maximize Extraction

Try all available methods and keep the most complete:

```python
from biblicus import Corpus

corpus = Corpus.from_directory("complex-docs")

results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pdf-text"},
            {"extractor_id": "markitdown"},
            {"extractor_id": "docling-smol"},
            {"extractor_id": "select-longest-text"}
        ]
    }
)
```

### Hybrid Extraction

Combine text extraction with OCR:

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text
    - extractor_id: ocr-rapidocr
    - extractor_id: select-longest-text
```

## Behavior Details

### Length Calculation

Text length is calculated after stripping whitespace. This prevents padding or formatting differences from affecting selection.

### Pipeline Position

Select-longest-text should be the **last step** in a pipeline. All extraction attempts should come before it.

### Parallel Extraction

All extractors in the pipeline run on the same item. This differs from select-text which stops at the first success.

### Source Tracking

The selected extraction retains its `producer_extractor_id` and `source_stage_index`, allowing you to identify which extractor produced the final text.

### Performance Consideration

Since all extractors run (not just until first success), this approach is slower but more thorough than select-text.

## When to Use Select-Longest-Text

### Use select-longest-text when:
- You want the most complete extraction
- Multiple extractors may produce different results
- Completeness is more important than speed
- You're comparing extractor quality

### Use select-text when:
- Order matters (fast extractors first)
- You want to stop at first success
- Speed is more important than completeness

### Use select-override when:
- You want media type-based routing
- Last extractor should win for patterns

### Use select-smart-override when:
- You need intelligent routing
- Quality metrics (confidence, length) matter

## Best Practices

### Combine Similar Extractors

Group extractors that target the same content type:

```yaml
# OCR comparison
stages:
  - ocr-rapidocr
  - ocr-paddleocr-vl
  - select-longest-text

# VLM comparison
stages:
  - docling-smol
  - docling-granite
  - select-longest-text
```

### Consider Performance Trade-offs

Running all extractors is expensive:

```yaml
# Expensive but thorough
stages:
  - pdf-text           # Fast
  - markitdown         # Moderate
  - docling-smol       # Slow
  - docling-granite    # Slower
  - select-longest-text
```

Consider using select-text for performance:

```yaml
# Faster fallback chain
stages:
  - pdf-text
  - markitdown
  - docling-smol
  - select-text  # Stop at first success
```

### Always Place Last

Select-longest-text should always be the final step:

```yaml
stages:
  - extractor-1
  - extractor-2
  - extractor-3
  - select-longest-text  # Always last
```

### Monitor Extraction Statistics

Track which extractors produce the longest outputs:

```python
results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "ocr-rapidocr"},
            {"extractor_id": "docling-smol"},
            {"extractor_id": "select-longest-text"}
        ]
    }
)

# Check which extractor was selected most often
# (This requires inspecting extraction metadata)
```

## Use Cases

### Quality Comparison

Compare different extractors to find the best:

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text
    - extractor_id: markitdown
    - extractor_id: unstructured
    - extractor_id: docling-smol
    - extractor_id: select-longest-text
```

### Scanned PDF Processing

Try both text extraction and OCR:

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text       # Works for digital PDFs
    - extractor_id: ocr-rapidocr   # Works for scanned PDFs
    - extractor_id: select-longest-text
```

### Maximize Content Extraction

Extract as much text as possible:

```python
from biblicus import Corpus

corpus = Corpus.from_directory("documents")

# Try everything, keep the best
results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pass-through-text"},
            {"extractor_id": "pdf-text"},
            {"extractor_id": "markitdown"},
            {"extractor_id": "ocr-rapidocr"},
            {"extractor_id": "docling-smol"},
            {"extractor_id": "select-longest-text"}
        ]
    }
)
```

### Benchmark Extractors

Systematically compare extractor performance:

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: markitdown
    - extractor_id: unstructured
    - extractor_id: docling-smol
    - extractor_id: select-longest-text
```

## Comparison with Other Selectors

| Feature | select-longest | select-text | select-override | select-smart-override |
|---------|----------------|-------------|-----------------|----------------------|
| Selection | Longest text | First usable | Last for pattern | Intelligent |
| All run | ✅ | ❌ | ✅ | ✅ |
| Order matters | Tie-break only | ✅ | Partial | Partial |
| Performance | Slow | Fast | Moderate | Moderate |
| Use case | Quality comparison | Fast fallback | Media routing | Smart routing |

## Performance Considerations

### All Extractors Run

Unlike select-text, **all extractors run** regardless of which produces the longest output. This means:

- Extraction takes as long as the slowest extractor
- API costs are incurred for all API-based extractors
- Computational resources are used for all local extractors

### When Performance Matters

If speed is critical, consider:
- Using select-text instead
- Reducing the number of extractors in the pipeline
- Using only fast extractors

### When Completeness Matters

If quality is critical, select-longest-text is ideal:
- All extraction methods are attempted
- The most thorough result is selected
- No potential text is missed

## Related Extractors

### Same Category

- [select-text](select-text.md) - First non-empty selection
- [select-override](select-override.md) - Simple override selection
- [select-smart-override](select-smart-override.md) - Intelligent routing
- [pipeline](pipeline.md) - Multi-step extraction

### Frequently Combined With

- [pdf-text](../text-document/pdf.md) - Fast PDF extraction
- [markitdown](../text-document/markitdown.md) - Office documents
- [ocr-rapidocr](../ocr/rapidocr.md) - Fast OCR
- [ocr-paddleocr-vl](../ocr/paddleocr-vl.md) - Advanced OCR
- [docling-smol](../vlm-document/docling-smol.md) - Fast VLM
- [docling-granite](../vlm-document/docling-granite.md) - High-accuracy VLM

## See Also

- [Pipeline Utilities Overview](index.md)
- [Extractors Index](../index.md)
- [extraction.md](../../extraction.md) - Extraction pipeline concepts
- [Pipeline Configuration](../../extraction.md#pipelines)