Metadata Text Extractor

Extractor ID: metadata-text

Category: Text/Document Extractors

Overview

The metadata text extractor generates searchable text representations from catalog item metadata. Instead of processing file content, it creates small, stable text artifacts from titles and tags stored in the corpus catalog.

This extractor is designed for retrieval over non-text items like images, audio, or binary files where the metadata provides the primary semantic signal. It’s also useful for comparing retrieval backends while keeping extraction deterministic and stable.

Installation

No additional dependencies required. This extractor is part of the core Biblicus installation.

pip install biblicus

Supported Media Types

All media types are supported. The extractor processes any catalog item that has metadata (title or tags).

Configuration

Config Schema

from pydantic import BaseModel


class MetadataTextExtractorConfig(BaseModel):
    include_title: bool = True   # Include the item title as the first line
    include_tags: bool = True    # Include a "tags: ..." line when tags exist

Configuration Options

Option          Type   Default   Description
include_title   bool   true      Include the item title as the first line
include_tags    bool   true      Include a tags: ... line if tags exist

Usage

Command Line

Basic Usage

# Extract metadata as text
biblicus extract my-corpus --extractor metadata-text

Custom Configuration

# Only include titles, skip tags
biblicus extract my-corpus --extractor metadata-text \
  --config include_tags=false

Configuration File

extractor_id: metadata-text
config:
  include_title: true
  include_tags: true

Save the configuration as configuration.yml, then run:

biblicus extract my-corpus --configuration configuration.yml

Python API

from biblicus import Corpus

# Load corpus
corpus = Corpus.from_directory("my-corpus")

# Extract with defaults
results = corpus.extract_text(extractor_id="metadata-text")

# Extract with custom config
results = corpus.extract_text(
    extractor_id="metadata-text",
    config={
        "include_title": True,
        "include_tags": False
    }
)

In Pipeline

Metadata Fallback

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: metadata-text  # Fallback for non-text items
    - extractor_id: select-text

Image Metadata Retrieval

extractor_id: select-smart-override
config:
  default_extractor: metadata-text
  overrides:
    - media_type_pattern: "image/.*"
      extractor: metadata-text

Examples

Basic Metadata Extraction

Given a catalog item with metadata:

id: photo-001
media_type: image/jpeg
title: "Sunset over mountains"
tags: ["nature", "landscape", "golden-hour"]

Extracted text:

Sunset over mountains
tags: nature, landscape, golden-hour

Title-Only Extraction

Extract just titles for minimal overhead:

biblicus extract photos-corpus --extractor metadata-text \
  --config include_tags=false

Output:

Sunset over mountains

Image Corpus Retrieval

Create searchable text for an image collection:

from biblicus import Corpus

# Corpus of photos with descriptive metadata
corpus = Corpus.from_directory("photos")

# Extract metadata as text for retrieval
results = corpus.extract_text(extractor_id="metadata-text")

Mixed Media Pipeline

Fall back to metadata for items that the earlier stages cannot process:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: ocr-rapidocr
    - extractor_id: metadata-text    # Catch-all for remaining items
    - extractor_id: select-text

Output Format

The output is plain text formatted as:

  1. Title line (if include_title is true and title exists): The item title as-is

  2. Tags line (if include_tags is true and tags exist): tags: tag1, tag2, tag3

Both lines are optional. If neither is produced (the item has no usable title or tags, or both options are disabled), the extractor returns None.
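The format above can be sketched in a few lines of Python. This is an illustrative sketch, not the Biblicus implementation: the function name build_metadata_text and its signature are hypothetical, and joining the two lines with a single newline is an assumption based on the example output shown earlier.

```python
from typing import List, Optional, Sequence


def build_metadata_text(
    title: Optional[str],
    tags: Sequence[str],
    include_title: bool = True,
    include_tags: bool = True,
) -> Optional[str]:
    """Hypothetical helper mirroring the documented output format."""
    lines: List[str] = []
    if include_title and title:
        # Title line: the item title as-is.
        lines.append(title)
    if include_tags and tags:
        # Tags line: "tags: tag1, tag2, tag3".
        lines.append("tags: " + ", ".join(tags))
    # Assumption: lines are newline-separated; no lines means no output.
    return "\n".join(lines) if lines else None


print(build_metadata_text("Sunset over mountains", ["nature", "landscape", "golden-hour"]))
```

Called with the photo-001 metadata from the Basic Metadata Extraction example, the sketch prints the same two lines shown there.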

Behavior Details

Empty Metadata

For items without a title or tags (or with both options disabled), the extractor returns None and the item is skipped.

Whitespace Handling

  • Titles are stripped of leading/trailing whitespace

  • Empty titles (only whitespace) are treated as missing

  • Tags are individually stripped and empty tags are filtered out
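The three whitespace rules above could look like the following. The helper names normalize_title and normalize_tags are hypothetical and only illustrate the described behavior; the actual extractor may structure this differently.

```python
from typing import List, Optional, Sequence


def normalize_title(title: Optional[str]) -> Optional[str]:
    """Strip the title; a whitespace-only title counts as missing."""
    if title is None:
        return None
    stripped = title.strip()
    return stripped or None


def normalize_tags(tags: Sequence[str]) -> List[str]:
    """Strip each tag individually and drop tags that end up empty."""
    stripped = (tag.strip() for tag in tags)
    return [tag for tag in stripped if tag]


print(normalize_title("  Sunset over mountains  "))
print(normalize_tags([" nature ", "", "   ", "landscape"]))
```

Normalizing before formatting means a whitespace-only title and an all-empty tag list are handled the same way as missing metadata.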

Tag Formatting

Tags are joined with commas and spaces: tags: tag1, tag2, tag3

Deterministic Output

Output depends only on the catalog metadata and the extractor configuration, so the same inputs always produce the same text. No file I/O is performed, which makes this extractor a stable baseline for benchmarking retrieval systems.
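A benchmark harness can check this stability by fingerprinting the extracted text and comparing fingerprints across runs. The sketch below assumes the extracted text has already been collected into a plain dict mapping item ID to text (or None for skipped items). That dict shape and the fingerprint function are hypothetical and are not the return type of any Biblicus call.

```python
import hashlib
from typing import Dict, Optional


def fingerprint(texts: Dict[str, Optional[str]]) -> str:
    """Hash item IDs and their extracted text in a stable (sorted) order."""
    digest = hashlib.sha256()
    for item_id in sorted(texts):
        text = texts[item_id] or ""  # Skipped items contribute an empty string.
        digest.update(item_id.encode("utf-8"))
        digest.update(b"\x00")  # Separator so adjacent fields cannot run together.
        digest.update(text.encode("utf-8"))
        digest.update(b"\x00")
    return digest.hexdigest()


run_a = {"photo-001": "Sunset over mountains\ntags: nature", "photo-002": None}
run_b = {"photo-002": None, "photo-001": "Sunset over mountains\ntags: nature"}
print(fingerprint(run_a) == fingerprint(run_b))
```

Two runs over an unchanged catalog should yield identical fingerprints regardless of iteration order; a mismatch indicates that the metadata or configuration changed between runs.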

Performance

  • Speed: Near-instant (no file I/O)

  • Memory: Minimal (metadata only)

  • Consistency: 100% deterministic

This is one of the fastest extractors as it only accesses in-memory catalog metadata.

Error Handling

Missing Metadata

Items without applicable metadata are silently skipped (the extractor returns None).

Invalid Metadata Types

Non-string titles or tags are filtered out. The extractor is defensive against malformed catalog data.
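One way to express this defensive filtering is with isinstance checks, as in the sketch below. The helper names are hypothetical, and treating a tags value that is not a list or tuple as "no tags" is an assumption of this sketch rather than documented behavior.

```python
from typing import Any, List, Optional


def keep_string_title(title: Any) -> Optional[str]:
    """Return the title only if it is a string; otherwise treat it as missing."""
    return title if isinstance(title, str) else None


def keep_string_tags(tags: Any) -> List[str]:
    """Keep only string entries from a list-like tags value."""
    # Assumption: anything other than a list or tuple is treated as no tags.
    if not isinstance(tags, (list, tuple)):
        return []
    return [tag for tag in tags if isinstance(tag, str)]


print(keep_string_title(42))
print(keep_string_tags(["ml", 7, None, "nlp"]))
```

Filtering instead of raising keeps a single malformed catalog entry from interrupting extraction for the rest of the corpus.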

Use Cases

Image Retrieval

Build text indices for image collections:

biblicus extract photos --extractor metadata-text

Retrieval Benchmarking

Compare retrieval backends with stable extraction:

from biblicus import Corpus

corpus = Corpus.from_directory("benchmark-corpus")

# Stable extraction for fair comparison
results = corpus.extract_text(extractor_id="metadata-text")

Non-Text Fallback

Provide searchable text for items that the content extractor cannot process:

extractor_id: pipeline
config:
  stages:
    - extractor_id: docling-smol
    - extractor_id: metadata-text  # Fallback
    - extractor_id: select-text

Best Practices

When to Use Metadata Extractor

Use metadata-text when:

  • Items have rich, descriptive metadata

  • You need deterministic extraction for benchmarking

  • File content is unavailable or unreliable

  • You want minimal processing overhead

Don’t use metadata-text when:

  • File content provides the primary signal

  • Metadata is missing or poor quality

  • You need full document understanding

Metadata Quality

The effectiveness of this extractor depends entirely on metadata quality:

  • Good: Descriptive titles, relevant tags

  • Poor: Generic titles (“IMG_001”), no tags

Catalog Preparation

Ensure your catalog has good metadata:

# Good metadata
id: research-paper-001
title: "Neural Networks for Document Understanding"
tags: ["ml", "nlp", "research", "deep-learning"]

# Poor metadata
id: doc001
title: "Document 1"
tags: []
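Before extraction, a quick audit can flag items like the poor example above. The sketch below is a heuristic only: the metadata_issues function and the GENERIC_TITLE pattern are hypothetical, and the list of placeholder prefixes is an assumption that should be adapted to your own catalog.

```python
import re
from typing import List, Sequence

# Hypothetical pattern for placeholder titles such as "IMG_001" or "Document 1".
GENERIC_TITLE = re.compile(
    r"^(img|dsc|doc|document|scan|file|untitled)[\s_-]*\d*$",
    re.IGNORECASE,
)


def metadata_issues(title: str, tags: Sequence[str]) -> List[str]:
    """Return a list of human-readable problems found in an item's metadata."""
    issues: List[str] = []
    cleaned = title.strip()
    if not cleaned:
        issues.append("missing title")
    elif GENERIC_TITLE.match(cleaned):
        issues.append("generic title")
    if not any(tag.strip() for tag in tags):
        issues.append("no tags")
    return issues


print(metadata_issues("Document 1", []))
print(metadata_issues("Neural Networks for Document Understanding", ["ml", "nlp"]))
```

An empty list means the heuristic found nothing to flag; it does not guarantee that the metadata is descriptive enough for good retrieval.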

See Also