# Metadata Text Extractor

**Extractor ID:** `metadata-text`

**Category:** [Text/Document Extractors](index.md)

## Overview

The metadata text extractor generates searchable text representations from catalog item metadata. Instead of processing file content, it creates small, stable text artifacts from titles and tags stored in the corpus catalog.

This extractor is designed for retrieval over non-text items like images, audio, or binary files where the metadata provides the primary semantic signal. It's also useful for comparing retrieval backends while keeping extraction deterministic and stable.

## Installation

No additional dependencies required. This extractor is part of the core Biblicus installation.

```bash
pip install biblicus
```

## Supported Media Types

All media types are supported. The extractor processes any catalog item that has metadata (title or tags).

## Configuration

### Config Schema

```python
class MetadataTextExtractorConfig(BaseModel):
    include_title: bool = True   # Include item title
    include_tags: bool = True    # Include tags line
```

### Configuration Options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `include_title` | bool | `true` | Include the item title as the first line |
| `include_tags` | bool | `true` | Include a `tags: ...` line if tags exist |

## Usage

### Command Line

#### Basic Usage

```bash
# Extract metadata as text
biblicus extract my-corpus --extractor metadata-text
```

#### Custom Configuration

```bash
# Only include titles, skip tags
biblicus extract my-corpus --extractor metadata-text \
  --config include_tags=false
```

#### Configuration File

```yaml
extractor_id: metadata-text
config:
  include_title: true
  include_tags: true
```

```bash
biblicus extract my-corpus --configuration configuration.yml
```

### Python API

```python
from biblicus import Corpus

# Load corpus
corpus = Corpus.from_directory("my-corpus")

# Extract with defaults
results = corpus.extract_text(extractor_id="metadata-text")

# Extract with custom config
results = corpus.extract_text(
    extractor_id="metadata-text",
    config={
        "include_title": True,
        "include_tags": False
    }
)
```

### In Pipeline

#### Metadata Fallback

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: metadata-text  # Fallback for non-text items
    - extractor_id: select-text
```

#### Image Metadata Retrieval

```yaml
extractor_id: select-smart-override
config:
  default_extractor: metadata-text
  overrides:
    - media_type_pattern: "image/.*"
      extractor: metadata-text
```

## Examples

### Basic Metadata Extraction

Given a catalog item with metadata:

```yaml
id: photo-001
media_type: image/jpeg
title: "Sunset over mountains"
tags: ["nature", "landscape", "golden-hour"]
```

Extracted text:
```
Sunset over mountains
tags: nature, landscape, golden-hour
```

### Title-Only Extraction

Extract just titles for minimal overhead:

```bash
biblicus extract photos-corpus --extractor metadata-text \
  --config include_tags=false
```

Output:
```
Sunset over mountains
```

### Image Corpus Retrieval

Create searchable text for an image collection:

```python
from biblicus import Corpus

# Corpus of photos with descriptive metadata
corpus = Corpus.from_directory("photos")

# Extract metadata as text for retrieval
results = corpus.extract_text(extractor_id="metadata-text")
```

### Mixed Media Pipeline

Use metadata for items that can't be processed:

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: ocr-rapidocr
    - extractor_id: metadata-text    # Catch-all for remaining items
    - extractor_id: select-text
```

## Output Format

The output is plain text formatted as:

1. **Title line** (if `include_title` is true and title exists): The item title as-is
2. **Tags line** (if `include_tags` is true and tags exist): `tags: tag1, tag2, tag3`

Both elements are optional. If neither title nor tags exist, the extractor returns `None`.

## Behavior Details

### Empty Metadata

Items without title or tags (or with both disabled) return `None`, causing the extractor to skip them.

### Whitespace Handling

- Titles are stripped of leading/trailing whitespace
- Empty titles (only whitespace) are treated as missing
- Tags are individually stripped and empty tags are filtered out

### Tag Formatting

Tags are joined with commas and spaces: `tags: tag1, tag2, tag3`

### Deterministic Output

Output is completely deterministic based on catalog metadata. No file I/O is performed, making this extractor extremely stable for benchmarking retrieval systems.

## Performance

- **Speed**: Near-instant (no file I/O)
- **Memory**: Minimal (metadata only)
- **Consistency**: 100% deterministic

This is one of the fastest extractors as it only accesses in-memory catalog metadata.

## Error Handling

### Missing Metadata

Items without applicable metadata are silently skipped (returns `None`).

### Invalid Metadata Types

Non-string titles or tags are filtered out. The extractor is defensive against malformed catalog data.

## Use Cases

### Image Retrieval

Build text indices for image collections:

```bash
biblicus extract photos --extractor metadata-text
```

### Audio Library Search

Create searchable text for music or podcast libraries:

```bash
biblicus extract music-library --extractor metadata-text
```

### Retrieval Benchmarking

Compare retrieval backends with stable extraction:

```python
from biblicus import Corpus

corpus = Corpus.from_directory("benchmark-corpus")

# Stable extraction for fair comparison
results = corpus.extract_text(extractor_id="metadata-text")
```

### Non-Text Fallback

Provide searchable text for items that can't be processed:

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: docling-smol
    - extractor_id: metadata-text  # Fallback
    - extractor_id: select-text
```

## Best Practices

### When to Use Metadata Extractor

**Use metadata-text when:**
- Items have rich, descriptive metadata
- You need deterministic extraction for benchmarking
- File content is unavailable or unreliable
- You want minimal processing overhead

**Don't use metadata-text when:**
- File content provides the primary signal
- Metadata is missing or poor quality
- You need full document understanding

### Metadata Quality

The effectiveness of this extractor depends entirely on metadata quality:

- **Good**: Descriptive titles, relevant tags
- **Poor**: Generic titles ("IMG_001"), no tags

### Catalog Preparation

Ensure your catalog has good metadata:

```yaml
# Good metadata
id: research-paper-001
title: "Neural Networks for Document Understanding"
tags: ["ml", "nlp", "research", "deep-learning"]

# Poor metadata
id: doc001
title: "Document 1"
tags: []
```

## Related Extractors

### Same Category

- [pass-through-text](pass-through.md) - Direct text file reading
- [pdf-text](pdf.md) - PDF text extraction
- [markitdown](markitdown.md) - Office document conversion
- [unstructured](unstructured.md) - Universal document parser

### Alternatives

- [ocr-rapidocr](../ocr/rapidocr.md) - Image text extraction
- [stt-openai](../speech-to-text/openai.md) - Audio transcription
- [docling-smol](../vlm-document/docling-smol.md) - VLM document understanding

### Pipeline Utilities

- [select-text](../pipeline-utilities/select-text.md) - First non-empty selection
- [select-longest-text](../pipeline-utilities/select-longest.md) - Longest output selection
- [pipeline](../pipeline-utilities/pipeline.md) - Multi-step extraction

## See Also

- [Text/Document Extractors Overview](index.md)
- [Extractors Index](../index.md)
- [extraction.md](../../extraction.md) - Extraction pipeline concepts
- [Catalog Specification](../../extraction.md#catalog-format)