Metadata Text Extractor

Extractor ID: metadata-text

Overview

The metadata text extractor generates searchable text representations from catalog item metadata. Instead of processing file content, it creates small, stable text artifacts from titles and tags stored in the corpus catalog.

This extractor is designed for retrieval over non-text items like images, audio, or binary files where the metadata provides the primary semantic signal. It’s also useful for comparing retrieval backends while keeping extraction deterministic and stable.

Installation

No additional dependencies are required. This extractor is part of the core Biblicus installation.

pip install biblicus

Supported Media Types

All media types are supported. The extractor processes any catalog item that has metadata (title or tags).

Configuration

Config Schema

from pydantic import BaseModel

class MetadataTextExtractorConfig(BaseModel):
    include_title: bool = True   # Include item title
    include_tags: bool = True    # Include tags line

Configuration Options

Option	Type	Default	Description
`include_title`	bool	`true`	Include the item title as the first line
`include_tags`	bool	`true`	Include a `tags: ...` line if tags exist

Usage

Command Line

Basic Usage

# Extract metadata as text
biblicus extract my-corpus --extractor metadata-text

Custom Configuration

# Only include titles, skip tags
biblicus extract my-corpus --extractor metadata-text \
  --config include_tags=false

Configuration File

extractor_id: metadata-text
config:
  include_title: true
  include_tags: true

biblicus extract my-corpus --configuration configuration.yml

Python API

from biblicus import Corpus

# Load corpus
corpus = Corpus.from_directory("my-corpus")

# Extract with defaults
results = corpus.extract_text(extractor_id="metadata-text")

# Extract with custom config
results = corpus.extract_text(
    extractor_id="metadata-text",
    config={
        "include_title": True,
        "include_tags": False
    }
)

In Pipeline

Metadata Fallback

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: metadata-text  # Fallback for non-text items
    - extractor_id: select-text

Image Metadata Retrieval

extractor_id: select-smart-override
config:
  default_extractor: metadata-text
  overrides:
    - media_type_pattern: "image/.*"
      extractor: metadata-text

Examples

Basic Metadata Extraction

Given a catalog item with metadata:

id: photo-001
media_type: image/jpeg
title: "Sunset over mountains"
tags: ["nature", "landscape", "golden-hour"]

Extracted text:

Sunset over mountains
tags: nature, landscape, golden-hour

Title-Only Extraction

Extract just titles for minimal overhead:

biblicus extract photos-corpus --extractor metadata-text \
  --config include_tags=false

Output:

Sunset over mountains

Image Corpus Retrieval

Create searchable text for an image collection:

from biblicus import Corpus

# Corpus of photos with descriptive metadata
corpus = Corpus.from_directory("photos")

# Extract metadata as text for retrieval
results = corpus.extract_text(extractor_id="metadata-text")

Mixed Media Pipeline

Use metadata for items that content-based extractors can’t process:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: ocr-rapidocr
    - extractor_id: metadata-text    # Catch-all for remaining items
    - extractor_id: select-text

Output Format

The output is plain text formatted as:

Title line (if include_title is true and title exists): The item title, with leading and trailing whitespace stripped
Tags line (if include_tags is true and tags exist): tags: tag1, tag2, tag3

Both lines are optional. If neither is produced, because the item has no usable title or tags or because both options are disabled, the extractor returns None.
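The format above can be sketched in a few lines of Python. This is a minimal illustration, not the Biblicus implementation: the function name `render_metadata_text` and its signature are hypothetical, the inputs are assumed to be normalized already, and joining the two lines with a single newline (no trailing newline) is an assumption.

```python
from typing import List, Optional


def render_metadata_text(
    title: Optional[str],
    tags: List[str],
    include_title: bool = True,
    include_tags: bool = True,
) -> Optional[str]:
    """Assemble the text artifact from an already-normalized title and tag list.

    Hypothetical helper: it mirrors the documented format, not the library's code.
    """
    lines: List[str] = []
    if include_title and title:
        lines.append(title)
    if include_tags and tags:
        lines.append("tags: " + ", ".join(tags))
    # No lines means there is nothing to emit, so the item would be skipped.
    return "\n".join(lines) if lines else None


print(render_metadata_text("Sunset over mountains", ["nature", "landscape", "golden-hour"]))
```

With both flags left at their defaults, the call reproduces the two-line artifact from the Basic Metadata Extraction example.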

Behavior Details

Empty Metadata

Items with no title and no tags, or with both options disabled, produce no text: the extractor returns None and the item is skipped.

Whitespace Handling

Titles are stripped of leading/trailing whitespace
Empty titles (only whitespace) are treated as missing
Tags are individually stripped and empty tags are filtered out
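These three rules could look roughly like the following. The helper names `normalize_title` and `normalize_tags` are hypothetical and only illustrate the described behavior.

```python
from typing import Iterable, List, Optional


def normalize_title(title: str) -> Optional[str]:
    """Strip surrounding whitespace; a whitespace-only title counts as missing."""
    stripped = title.strip()
    return stripped or None


def normalize_tags(tags: Iterable[str]) -> List[str]:
    """Strip each tag and drop the ones that end up empty, keeping the order."""
    stripped = (tag.strip() for tag in tags)
    return [tag for tag in stripped if tag]


print(normalize_title("  Sunset over mountains  "))
print(normalize_tags([" nature ", "", "   ", "landscape"]))
```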

Tag Formatting

Tags are joined with commas and spaces: tags: tag1, tag2, tag3

Deterministic Output

Output depends only on catalog metadata and the extractor configuration, so the same inputs always produce the same text. No file I/O is performed, which makes this extractor a stable baseline for benchmarking retrieval systems.
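One way to check this property in a benchmark harness is to fingerprint the extracted text of two runs and compare the digests. The sketch below assumes the results have already been collected into a plain dict that maps item ID to extracted text (or None for skipped items); that shape and the `fingerprint` helper are assumptions for illustration, not part of the Biblicus API.

```python
import hashlib
from typing import Dict, Optional


def fingerprint(texts: Dict[str, Optional[str]]) -> str:
    """Hash item IDs and their extracted text in a stable (sorted) order."""
    digest = hashlib.sha256()
    for item_id in sorted(texts):
        text = texts[item_id]
        digest.update(item_id.encode("utf-8"))
        digest.update(b"\x00")
        if text is None:
            # Skipped items get their own marker so they still affect the hash.
            digest.update(b"N")
        else:
            digest.update(b"T" + text.encode("utf-8"))
        digest.update(b"\x00")
    return digest.hexdigest()


# Two runs over the same metadata, collected in a different order.
run_a = {"photo-001": "Sunset over mountains\ntags: nature", "photo-002": None}
run_b = {"photo-002": None, "photo-001": "Sunset over mountains\ntags: nature"}
print(fingerprint(run_a) == fingerprint(run_b))
```

Matching digests indicate that both runs produced identical artifacts, so any difference in retrieval quality can be attributed to the backend rather than to extraction.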

Performance

Speed: Near-instant (no file I/O)
Memory: Minimal (metadata only)
Consistency: 100% deterministic

This is one of the fastest extractors as it only accesses in-memory catalog metadata.

Error Handling

Missing Metadata

Items without applicable metadata are silently skipped (the extractor returns None).

Invalid Metadata Types

Non-string titles or tags are filtered out. The extractor is defensive against malformed catalog data.
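A defensive filter of this kind might be sketched as follows. The helper names are hypothetical, and treating a tags value that is not a list or tuple (for example a bare string) as having no tags is an assumption; the actual extractor may handle that case differently.

```python
from typing import Any, List, Optional


def usable_title(value: Any) -> Optional[str]:
    """Return the title only if it is a string; anything else counts as missing."""
    return value if isinstance(value, str) else None


def usable_tags(value: Any) -> List[str]:
    """Keep the string entries of a list or tuple and drop everything else."""
    if not isinstance(value, (list, tuple)):
        # Assumption: a malformed tags field is treated as "no tags".
        return []
    return [tag for tag in value if isinstance(tag, str)]


print(usable_title(42))
print(usable_tags(["nature", 7, None, "landscape"]))
```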

Use Cases

Image Retrieval

Build text indices for image collections:

biblicus extract photos --extractor metadata-text

Audio Library Search

Create searchable text for music or podcast libraries:

biblicus extract music-library --extractor metadata-text

Retrieval Benchmarking

Compare retrieval backends with stable extraction:

from biblicus import Corpus

corpus = Corpus.from_directory("benchmark-corpus")

# Stable extraction for fair comparison
results = corpus.extract_text(extractor_id="metadata-text")

Non-Text Fallback

Provide searchable text for items whose content can’t be processed:

extractor_id: pipeline
config:
  stages:
    - extractor_id: docling-smol
    - extractor_id: metadata-text  # Fallback
    - extractor_id: select-text

Best Practices

When to Use Metadata Extractor

Use metadata-text when:

Items have rich, descriptive metadata
You need deterministic extraction for benchmarking
File content is unavailable or unreliable
You want minimal processing overhead

Don’t use metadata-text when:

File content provides the primary signal
Metadata is missing or poor quality
You need full document understanding

Metadata Quality

The effectiveness of this extractor depends entirely on metadata quality:

Good: Descriptive titles, relevant tags
Poor: Generic titles (“IMG_001”), no tags

Catalog Preparation

Ensure your catalog has good metadata:

# Good metadata
id: research-paper-001
title: "Neural Networks for Document Understanding"
tags: ["ml", "nlp", "research", "deep-learning"]

# Poor metadata
id: doc001
title: "Document 1"
tags: []
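Before running the extractor, it can help to flag entries like the second one. The check below is a rough, hypothetical heuristic rather than a Biblicus feature: the `GENERIC_TITLE` pattern and the `metadata_issues` helper are illustrative, and the list of placeholder prefixes should be adapted to the corpus.

```python
import re
from typing import List

# Hypothetical pattern for placeholder titles such as "IMG_001" or "Document 1".
GENERIC_TITLE = re.compile(
    r"^(img|dsc|scan|doc|document|untitled)[\s_-]*\d*$", re.IGNORECASE
)


def metadata_issues(title: str, tags: List[str]) -> List[str]:
    """Return a list of human-readable problems found in one item's metadata."""
    issues: List[str] = []
    cleaned = title.strip()
    if not cleaned:
        issues.append("missing title")
    elif GENERIC_TITLE.match(cleaned):
        issues.append("generic title")
    if not [tag for tag in tags if tag.strip()]:
        issues.append("no tags")
    return issues


print(metadata_issues("Neural Networks for Document Understanding", ["ml", "nlp"]))
print(metadata_issues("Document 1", []))
```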