Metadata Text Extractor
Extractor ID: metadata-text
Category: Text/Document Extractors
Overview
The metadata text extractor generates searchable text representations from catalog item metadata. Instead of processing file content, it creates small, stable text artifacts from titles and tags stored in the corpus catalog.
This extractor is designed for retrieval over non-text items like images, audio, or binary files where the metadata provides the primary semantic signal. It’s also useful for comparing retrieval backends while keeping extraction deterministic and stable.
Installation
No additional dependencies required. This extractor is part of the core Biblicus installation.
pip install biblicus
Supported Media Types
All media types are supported. The extractor processes any catalog item that has metadata (title or tags).
Configuration
Config Schema
from pydantic import BaseModel  # assumed source of BaseModel


class MetadataTextExtractorConfig(BaseModel):
    include_title: bool = True  # Include item title
    include_tags: bool = True  # Include tags line
Configuration Options
| Option | Type | Default | Description |
|---|---|---|---|
| include_title | bool | True | Include the item title as the first line |
| include_tags | bool | True | Include a line of the form tags: tag1, tag2, tag3 |
Usage
Command Line
Basic Usage
# Extract metadata as text
biblicus extract my-corpus --extractor metadata-text
Custom Configuration
# Only include titles, skip tags
biblicus extract my-corpus --extractor metadata-text \
  --config include_tags=false
Configuration File
extractor_id: metadata-text
config:
  include_title: true
  include_tags: true
biblicus extract my-corpus --configuration configuration.yml
Python API
from biblicus import Corpus
# Load corpus
corpus = Corpus.from_directory("my-corpus")
# Extract with defaults
results = corpus.extract_text(extractor_id="metadata-text")
# Extract with custom config
results = corpus.extract_text(
    extractor_id="metadata-text",
    config={
        "include_title": True,
        "include_tags": False,
    },
)
In Pipeline
Metadata Fallback
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: metadata-text  # Fallback for non-text items
    - extractor_id: select-text
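The fallback pattern can be pictured as "first usable text wins". The sketch below is only an illustration of that idea, assuming the final selection stage picks the earliest non-empty output from the preceding stages; select_first_usable is a hypothetical helper, not a Biblicus function, and the real selection rules may differ.

```python
from typing import Optional, Sequence


def select_first_usable(stage_outputs: Sequence[Optional[str]]) -> Optional[str]:
    """Return the first stage output that contains non-whitespace text.

    Hypothetical illustration: outputs are ordered as the stages are listed,
    and None stands for a stage that produced nothing for the item.
    """
    for text in stage_outputs:
        if text is not None and text.strip():
            return text
    return None


# An image yields nothing from the text stage, so the metadata text is used.
print(select_first_usable([None, "Sunset over mountains"]))
```

Under this assumption, placing metadata-text after the content-based extractors means it only contributes when the earlier stages come up empty.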
Image Metadata Retrieval
extractor_id: select-smart-override
config:
  default_extractor: metadata-text
  overrides:
    - media_type_pattern: "image/.*"
      extractor: metadata-text
Examples
Basic Metadata Extraction
Given a catalog item with metadata:
id: photo-001
media_type: image/jpeg
title: "Sunset over mountains"
tags: ["nature", "landscape", "golden-hour"]
Extracted text:
Sunset over mountains
tags: nature, landscape, golden-hour
Title-Only Extraction
Extract just titles for minimal overhead:
biblicus extract photos-corpus --extractor metadata-text \
  --config include_tags=false
Output:
Sunset over mountains
Image Corpus Retrieval
Create searchable text for an image collection:
from biblicus import Corpus
# Corpus of photos with descriptive metadata
corpus = Corpus.from_directory("photos")
# Extract metadata as text for retrieval
results = corpus.extract_text(extractor_id="metadata-text")
Mixed Media Pipeline
Use metadata for items that can’t be processed:
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: ocr-rapidocr
    - extractor_id: metadata-text  # Catch-all for remaining items
    - extractor_id: select-text
Output Format
The output is plain text formatted as:
Title line (if include_title is true and a title exists): the item title, trimmed of surrounding whitespace but otherwise unchanged
Tags line (if include_tags is true and tags exist): tags: tag1, tag2, tag3
Both elements are optional. If neither title nor tags exist, the extractor returns None.
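The format above can be sketched in a few lines of Python. This is a minimal illustration rather than the extractor's actual code: render_metadata_text is a hypothetical name, and the function assumes the title and tags have already been cleaned.

```python
from typing import Iterable, Optional


def render_metadata_text(
    title: Optional[str],
    tags: Iterable[str],
    include_title: bool = True,
    include_tags: bool = True,
) -> Optional[str]:
    """Hypothetical helper: build the text artifact from a title and tags."""
    lines = []
    if include_title and title:
        lines.append(title)
    tag_list = list(tags)
    if include_tags and tag_list:
        lines.append("tags: " + ", ".join(tag_list))
    # No usable metadata means no artifact at all.
    return "\n".join(lines) if lines else None


print(render_metadata_text("Sunset over mountains", ["nature", "landscape", "golden-hour"]))
```

With include_tags=False the same call yields only the title line, and an item with no title and no tags yields None.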
Behavior Details
Empty Metadata
For items with neither a title nor tags (or when both options are disabled), the extractor returns None and the item is skipped.
Whitespace Handling
Titles are stripped of leading/trailing whitespace
Empty titles (only whitespace) are treated as missing
Tags are individually stripped and empty tags are filtered out
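These three rules amount to a small normalization step. The following sketch shows one way to express them; normalize_title and normalize_tags are hypothetical names chosen for illustration.

```python
from typing import List, Optional, Sequence


def normalize_title(title: Optional[str]) -> Optional[str]:
    """Strip surrounding whitespace; treat whitespace-only titles as missing."""
    if title is None:
        return None
    stripped = title.strip()
    return stripped or None


def normalize_tags(tags: Sequence[str]) -> List[str]:
    """Strip each tag and drop any that end up empty."""
    stripped = (tag.strip() for tag in tags)
    return [tag for tag in stripped if tag]


print(normalize_title("  Sunset over mountains  "))
print(normalize_tags([" nature ", "", "   ", "landscape"]))
```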
Tag Formatting
Tags are joined with commas and spaces: tags: tag1, tag2, tag3
Deterministic Output
The output depends only on catalog metadata: the same title and tags always produce the same text. Because the extractor performs no file I/O, results are stable across runs, which makes it well suited to benchmarking retrieval systems.
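One way to confirm this stability before a benchmark is to fingerprint two extraction runs and compare the digests. The sketch below assumes the extracted text is available as a mapping from item id to text; that shape is an assumption made for illustration, and fingerprint is a hypothetical helper.

```python
import hashlib
from typing import Dict


def fingerprint(extracted: Dict[str, str]) -> str:
    """Hash item ids and texts in sorted order so dict ordering cannot matter."""
    digest = hashlib.sha256()
    for item_id in sorted(extracted):
        digest.update(item_id.encode("utf-8"))
        digest.update(b"\x00")  # separator between id and text
        digest.update(extracted[item_id].encode("utf-8"))
        digest.update(b"\x00")  # separator between items
    return digest.hexdigest()


run_a = {"photo-001": "Sunset over mountains\ntags: nature, landscape, golden-hour"}
run_b = dict(run_a)
print(fingerprint(run_a) == fingerprint(run_b))
```

Matching fingerprints indicate that both retrieval backends are being fed identical text.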
Performance
Speed: Near-instant (no file I/O)
Memory: Minimal (metadata only)
Consistency: 100% deterministic
This is one of the fastest extractors as it only accesses in-memory catalog metadata.
Error Handling
Missing Metadata
Items without applicable metadata are silently skipped (the extractor returns None).
Invalid Metadata Types
Non-string titles or tags are filtered out. The extractor is defensive against malformed catalog data.
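Defensive filtering of this kind can be sketched as follows. The helper names are hypothetical, and treating a bare string in place of a tag list as malformed is an assumption of this sketch, not documented behavior.

```python
from typing import Any, List, Optional


def string_or_none(value: Any) -> Optional[str]:
    """Keep a title only if it is actually a string."""
    return value if isinstance(value, str) else None


def string_tags(value: Any) -> List[str]:
    """Keep only string entries; anything that is not a list or tuple yields no tags."""
    if not isinstance(value, (list, tuple)):
        return []
    return [item for item in value if isinstance(item, str)]


print(string_or_none(42))
print(string_tags(["nature", 3, None, "landscape"]))
```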
Use Cases
Image Retrieval
Build text indices for image collections:
biblicus extract photos --extractor metadata-text
Audio Library Search
Create searchable text for music or podcast libraries:
biblicus extract music-library --extractor metadata-text
Retrieval Benchmarking
Compare retrieval backends with stable extraction:
from biblicus import Corpus
corpus = Corpus.from_directory("benchmark-corpus")
# Stable extraction for fair comparison
results = corpus.extract_text(extractor_id="metadata-text")
Non-Text Fallback
Provide searchable text for items that can’t be processed:
extractor_id: pipeline
config:
  stages:
    - extractor_id: docling-smol
    - extractor_id: metadata-text  # Fallback
    - extractor_id: select-text
Best Practices
When to Use Metadata Extractor
Use metadata-text when:
Items have rich, descriptive metadata
You need deterministic extraction for benchmarking
File content is unavailable or unreliable
You want minimal processing overhead
Don’t use metadata-text when:
File content provides the primary signal
Metadata is missing or poor quality
You need full document understanding
Metadata Quality
The effectiveness of this extractor depends entirely on metadata quality:
Good: Descriptive titles, relevant tags
Poor: Generic titles (“IMG_001”), no tags
Catalog Preparation
Ensure your catalog has good metadata:
# Good metadata
id: research-paper-001
title: "Neural Networks for Document Understanding"
tags: ["ml", "nlp", "research", "deep-learning"]
# Poor metadata
id: doc001
title: "Document 1"
tags: []
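A quick audit can flag items like the poor example before extraction. The sketch below is a rough heuristic, not part of Biblicus: metadata_issues and the GENERIC_TITLE pattern are hypothetical, and the list of placeholder prefixes is an assumption that should be adapted to the corpus.

```python
import re
from typing import List, Sequence

# Assumed pattern for machine-generated or placeholder titles such as IMG_001.
GENERIC_TITLE = re.compile(
    r"^(img|dsc|scan|doc|document|untitled|file)[\s_-]*\d*$", re.IGNORECASE
)


def metadata_issues(title: str, tags: Sequence[str]) -> List[str]:
    """Return a list of quality problems found in one item's metadata."""
    issues = []
    cleaned = title.strip()
    if not cleaned:
        issues.append("missing title")
    elif GENERIC_TITLE.match(cleaned):
        issues.append("generic title")
    if not [tag for tag in tags if tag.strip()]:
        issues.append("no tags")
    return issues


print(metadata_issues("Document 1", []))
print(metadata_issues("Neural Networks for Document Understanding", ["ml", "nlp"]))
```

Items that report issues are candidates for better titles or tags, or for a content-based extractor instead.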
See Also
extraction.md - Extraction pipeline concepts