# Select Override Extractor

**Extractor ID:** `select-override`

**Category:** [Pipeline Utilities](index.md)

## Overview

The select-override extractor implements simple media type-based routing by always using the last extraction for matching items. It provides basic override logic where specific media types get special handling while others follow default behavior.

This extractor is useful when you want to override extraction results for specific media types, such as always using OCR output for images or VLM output for PDFs, regardless of what other extractors produced.

## Installation

No additional dependencies required. This extractor is part of the core Biblicus installation.

```bash
pip install biblicus
```

## Supported Media Types

All media types are supported. Selection behavior depends on configured media type patterns.

## Configuration

### Config Schema

```python
class SelectOverrideConfig(BaseModel):
    media_type_patterns: List[str] = ["*/*"]
    fallback_to_first: bool = False
```

### Configuration Options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `media_type_patterns` | list[str] | `["*/*"]` | Glob patterns for media types to override |
| `fallback_to_first` | bool | `false` | If true, use first extraction for non-matching types |

### Pattern Matching

Patterns use standard glob syntax:
- `*/*` - Matches all media types
- `image/*` - Matches all image types
- `application/pdf` - Matches only PDF
- `audio/*` - Matches all audio types

## Selection Rules

The extractor selects text using the following rules:

1. **Pattern match**: If item media type matches any pattern, use **last** extraction
2. **No match + fallback_to_first=true**: Use **first** extraction
3. **No match + fallback_to_first=false**: Use **last** extraction
4. **No extractions**: Return `None`

## Usage

### Command Line

Select-override is always used within a pipeline:

```bash
biblicus extract my-corpus --extractor pipeline \
  --config 'stages=[{"extractor_id":"pdf-text"},{"extractor_id":"ocr-rapidocr"},{"extractor_id":"select-override","config":{"media_type_patterns":["image/*"]}}]'
```

### Configuration File

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text
    - extractor_id: ocr-rapidocr
    - extractor_id: select-override
      config:
        media_type_patterns: ["image/*"]
        fallback_to_first: true
```

```bash
biblicus extract my-corpus --configuration configuration.yml
```

### Python API

```python
from biblicus import Corpus

corpus = Corpus.from_directory("my-corpus")

results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pdf-text"},
            {"extractor_id": "ocr-rapidocr"},
            {
                "extractor_id": "select-override",
                "config": {
                    "media_type_patterns": ["image/*"],
                    "fallback_to_first": True
                }
            }
        ]
    }
)
```

## Examples

### Override Images Only

Use OCR for images, text extraction for everything else:

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text
    - extractor_id: ocr-rapidocr
    - extractor_id: select-override
      config:
        media_type_patterns: ["image/*"]
        fallback_to_first: true
```

For images: Uses ocr-rapidocr (last)
For PDFs: Uses pdf-text (first, due to fallback_to_first=true)

### Override PDFs

Use VLM for PDFs, basic extraction for everything else:

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: docling-smol
    - extractor_id: select-override
      config:
        media_type_patterns: ["application/pdf"]
        fallback_to_first: true
```

For PDFs: Uses docling-smol (last matching)
For text files: Uses pass-through-text (first, due to fallback)

### Override Multiple Types

Override specific types with different extractors:

```python
from biblicus import Corpus

corpus = Corpus.from_directory("mixed-corpus")

results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pass-through-text"},
            {"extractor_id": "pdf-text"},
            {"extractor_id": "markitdown"},
            {"extractor_id": "ocr-rapidocr"},
            {
                "extractor_id": "select-override",
                "config": {
                    "media_type_patterns": ["image/*", "application/pdf"],
                    "fallback_to_first": True
                }
            }
        ]
    }
)
```

### Always Use Last

Default behavior (no fallback):

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text
    - extractor_id: docling-smol
    - extractor_id: select-override  # Uses last for all types
```

## Behavior Details

### Pattern Matching

Uses Python's `fnmatch` for glob pattern matching:

```python
# Exact match
"application/pdf" matches "application/pdf" only

# Wildcard
"image/*" matches "image/png", "image/jpeg", etc.

# Universal
"*/*" matches all media types
```

### Last Wins

For matching types, the **last** extraction is always used, regardless of whether earlier extractions exist or are non-empty.

### Fallback Behavior

When `fallback_to_first=true` and media type doesn't match:
- Use **first** extraction instead of last
- Useful for preferring fast extractors for non-override types

When `fallback_to_first=false` (default):
- Use **last** extraction for everything
- Simpler logic, fewer cases

### Pipeline Position

Select-override should be the **last step** in a pipeline. All extraction attempts should come before it.

## When to Use Select-Override

### Use select-override when:
- You want simple media type-based routing
- Last extractor should always win for specific types
- You need basic override logic
- Simplicity is important

### Use select-smart-override when:
- You need confidence-based selection
- Intelligent fallback is desired
- Quality metrics matter

### Use select-text when:
- Order matters (fast first)
- First success is preferred
- No media type routing needed

### Use select-longest-text when:
- Longest output is preferred
- No routing needed

## Best Practices

### Place Override Extractors Last

Put the extractor you want to use for overrides at the end:

```yaml
stages:
  - extractor_id: pdf-text        # Default for PDFs
  - extractor_id: docling-smol    # Override for PDFs
  - extractor_id: select-override
    config:
      media_type_patterns: ["application/pdf"]
```

### Use Fallback for Efficiency

Enable fallback to prefer fast extractors for non-override types:

```yaml
config:
  media_type_patterns: ["image/*"]
  fallback_to_first: true  # Use fast extractors for non-images
```

### Be Specific with Patterns

Use specific patterns to avoid unintended matches:

```yaml
# Good - specific
media_type_patterns: ["image/png", "image/jpeg"]

# Careful - broad
media_type_patterns: ["image/*"]

# Very broad
media_type_patterns: ["*/*"]
```

### Always Place Last

Select-override should always be the final step:

```yaml
stages:
  - extractor-1
  - extractor-2
  - extractor-3
  - select-override  # Always last
```

## Use Cases

### Image-Specific Processing

Use advanced OCR for images, basic extraction for documents:

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: ocr-paddleocr-vl  # For images
    - extractor_id: select-override
      config:
        media_type_patterns: ["image/*"]
        fallback_to_first: true
```

### PDF Override

Use VLM for PDFs, simpler extractors for other types:

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: markitdown
    - extractor_id: docling-smol      # For PDFs
    - extractor_id: select-override
      config:
        media_type_patterns: ["application/pdf"]
        fallback_to_first: true
```

### Multi-Type Override

Override multiple specific types:

```python
from biblicus import Corpus

corpus = Corpus.from_directory("corpus")

results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pass-through-text"},
            {"extractor_id": "pdf-text"},
            {"extractor_id": "ocr-rapidocr"},
            {
                "extractor_id": "select-override",
                "config": {
                    "media_type_patterns": ["image/*", "application/pdf"],
                    "fallback_to_first": True
                }
            }
        ]
    }
)
```

## Comparison with Other Selectors

| Feature | select-override | select-text | select-longest | select-smart-override |
|---------|----------------|-------------|----------------|----------------------|
| Selection | Last for pattern | First usable | Longest | Intelligent |
| Media type aware | ✅ | ❌ | ❌ | ✅ |
| Confidence aware | ❌ | ❌ | ❌ | ✅ |
| Quality aware | ❌ | ❌ | ✅ | ✅ |
| Complexity | Simple | Simple | Simple | Complex |
| Override control | Last only | None | None | Configurable |

## Related Extractors

### Same Category

- [select-text](select-text.md) - First non-empty selection
- [select-longest-text](select-longest.md) - Longest output selection
- [select-smart-override](select-smart-override.md) - Intelligent routing
- [pipeline](pipeline.md) - Multi-step extraction

### Frequently Combined With

- [pass-through-text](../text-document/pass-through.md) - Text files
- [pdf-text](../text-document/pdf.md) - PDF extraction
- [markitdown](../text-document/markitdown.md) - Office documents
- [ocr-rapidocr](../ocr/rapidocr.md) - Fast OCR
- [docling-smol](../vlm-document/docling-smol.md) - VLM extraction

## See Also

- [Pipeline Utilities Overview](index.md)
- [Extractors Index](../index.md)
- [extraction.md](../../extraction.md) - Extraction pipeline concepts
- [Pipeline Configuration](../../extraction.md#pipelines)