Pass-Through Text Extractor

Extractor ID: pass-through-text

Overview

The pass-through text extractor is the simplest extractor in Biblicus. It reads text files directly from the corpus and returns their content unchanged, with one exception: for Markdown files, it parses and strips front matter and returns only the body content.

This extractor is fundamental to Biblicus workflows as the canonical way to handle text items. It preserves the exact content of text files while providing special handling for Markdown front matter.

Installation

No additional dependencies are required. This extractor is part of the core Biblicus installation.

pip install biblicus

Supported Media Types

text/plain - Plain text files
text/markdown - Markdown files (with front matter parsing)
text/* - Any text media type

Non-text items are automatically skipped.
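
The matching rule above can be sketched in a few lines. This is a minimal illustration, not the Biblicus implementation: the helper name is_text_media_type is hypothetical, and the handling of parameters such as "; charset=utf-8" is an assumption.

```python
def is_text_media_type(media_type: str) -> bool:
    """Return True for any text/* media type.

    Hypothetical helper for illustration; not part of the Biblicus API.
    """
    # Assumption: parameters such as "; charset=utf-8" are ignored and
    # matching is case-insensitive.
    base = media_type.split(";", 1)[0].strip().lower()
    return base.startswith("text/")


for candidate in ("text/plain", "text/markdown; charset=utf-8", "application/pdf"):
    print(candidate, is_text_media_type(candidate))
```

Under this sketch, text/plain and text/markdown match, while application/pdf does not and would be skipped.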

Configuration

Config Schema

from pydantic import BaseModel


class PassThroughTextExtractorConfig(BaseModel):
    # This extractor requires no configuration
    pass

Configuration Options

This extractor is intentionally minimal and accepts no configuration options.

Usage

Command Line

Basic Usage

# Extract text files from corpus
biblicus extract my-corpus --extractor pass-through-text

Configuration File

extractor_id: pass-through-text
config: {}

biblicus extract my-corpus --configuration configuration.yml

Python API

from biblicus import Corpus

# Load corpus
corpus = Corpus.from_directory("my-corpus")

# Extract text files
results = corpus.extract_text(extractor_id="pass-through-text")

In Pipeline

Text-First Fallback Chain

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: select-text

Mixed Media Type Routing

extractor_id: select-smart-override
config:
  default_extractor: pass-through-text
  overrides:
    - media_type_pattern: "application/pdf"
      extractor: pdf-text
    - media_type_pattern: "image/.*"
      extractor: ocr-rapidocr

Examples

Extract Text Corpus

Process a corpus containing only text files:

biblicus extract notes-corpus --extractor pass-through-text

Extract Markdown with Front Matter

The extractor automatically handles front matter:

---
title: My Document
tags: [note, draft]
---

This is the body content that is extracted.

Output text:

This is the body content that is extracted.

Mixed Format Pipeline

Use it as the first stage in a multi-format pipeline:

from biblicus import Corpus

corpus = Corpus.from_directory("mixed-corpus")

# Text files pass through, other formats processed by other extractors
results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pass-through-text"},
            {"extractor_id": "markitdown"},
            {"extractor_id": "select-text"}
        ]
    }
)

Behavior Details

Front Matter Handling

For text/markdown items, the extractor:

Parses YAML front matter enclosed in --- delimiters
Strips the front matter section
Returns only the body content

Front matter metadata is preserved in the catalog but not included in extracted text.
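
The stripping steps above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the Biblicus implementation: the function name strip_front_matter is hypothetical, the sketch only locates the --- delimiters rather than parsing the YAML, and dropping the blank lines after the closing delimiter is an assumption.

```python
def strip_front_matter(markdown_text: str) -> str:
    """Return the Markdown body without a leading front matter block.

    Hypothetical helper for illustration; not part of the Biblicus API.
    """
    lines = markdown_text.splitlines(keepends=True)
    # Front matter must open on the very first line with a bare '---'.
    if not lines or lines[0].strip() != "---":
        return markdown_text
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            # Everything after the closing delimiter is the body.
            # Assumption: blank lines directly after it are dropped.
            return "".join(lines[index + 1:]).lstrip("\n")
    # No closing delimiter found: treat the whole file as body.
    return markdown_text


document = (
    "---\n"
    "title: My Document\n"
    "tags: [note, draft]\n"
    "---\n"
    "\n"
    "This is the body content that is extracted.\n"
)
body = strip_front_matter(document)
print(body)
```

A file without an opening --- line, or with an unterminated block, is returned unchanged in this sketch; how Biblicus treats those cases is not stated here.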

Character Encoding

All text files are decoded as UTF-8. Files with other encodings may produce errors or incorrect output.
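
Both failure modes can be reproduced with plain Python. The sketch below assumes strict UTF-8 decoding; the function name decode_utf8 is hypothetical and the sample bytes are illustrative only.

```python
def decode_utf8(raw: bytes) -> str:
    """Decode bytes strictly as UTF-8 (hypothetical helper for illustration)."""
    # Strict decoding raises UnicodeDecodeError on invalid bytes instead of guessing.
    return raw.decode("utf-8")


# Case 1: Latin-1 bytes that are not valid UTF-8 raise an error.
latin1_cafe = "café".encode("latin-1")  # b'caf\xe9'
try:
    decode_utf8(latin1_cafe)
    raised_error = False
except UnicodeDecodeError:
    raised_error = True

# Case 2: Latin-1 bytes that happen to be valid UTF-8 decode without error,
# but to different text than the author wrote.
latin1_text = "Ã©"
silently_wrong = decode_utf8(latin1_text.encode("latin-1"))

print(raised_error, repr(silently_wrong))
```

The second case is the harder one to spot, because nothing fails. Converting files to UTF-8 before adding them to the corpus avoids both.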

Empty Files

Empty text files produce empty extracted text (a zero-length string). These items are counted in the extracted_empty_items statistic.

Performance

Speed: Near-instant (file read only)
Memory: Minimal (one file at a time)
Accuracy: Exact (text is returned as stored, apart from stripped Markdown front matter)

This is the fastest extractor in Biblicus as it performs only file I/O and optional front matter parsing.

Error Handling

Non-Text Items

Non-text items are silently skipped: the extractor returns None for them. This allows the extractor to work safely in pipelines with mixed media types.

Encoding Errors

UTF-8 decoding errors are recorded as per-item failures in errored_items and do not halt the extraction snapshot.

Missing Files

Missing corpus files result in standard file I/O errors and are recorded as per-item failures.
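
The three cases above follow a common per-item pattern: skip what does not apply, record what fails, and keep going. The sketch below illustrates that pattern with the standard library. It is not the Biblicus implementation: RunStats, extract_item, and run are hypothetical names, and only the statistic names errored_items and extracted_empty_items come from this page.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class RunStats:
    """Hypothetical container for per-run results (not a Biblicus class)."""
    extracted: dict = field(default_factory=dict)
    extracted_empty_items: int = 0
    errored_items: dict = field(default_factory=dict)


def extract_item(path: Path, media_type: str) -> Optional[str]:
    # Non-text items are skipped by returning None.
    if not media_type.startswith("text/"):
        return None
    # Missing files raise OSError; invalid bytes raise UnicodeDecodeError.
    return path.read_bytes().decode("utf-8")


def run(items) -> RunStats:
    stats = RunStats()
    for path, media_type in items:
        try:
            text = extract_item(Path(path), media_type)
        except (OSError, UnicodeDecodeError) as error:
            # Record the failure for this item and continue with the next one.
            stats.errored_items[str(path)] = f"{type(error).__name__}: {error}"
            continue
        if text is None:
            continue
        stats.extracted[str(path)] = text
        if text == "":
            stats.extracted_empty_items += 1
    return stats


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "note.txt").write_text("hello", encoding="utf-8")
    (root / "empty.txt").write_bytes(b"")
    (root / "bad.txt").write_bytes(b"caf\xe9")  # Latin-1, invalid as UTF-8
    (root / "scan.pdf").write_bytes(b"%PDF-1.4")
    stats = run([
        (root / "note.txt", "text/plain"),
        (root / "empty.txt", "text/plain"),
        (root / "bad.txt", "text/plain"),
        (root / "scan.pdf", "application/pdf"),
        (root / "missing.txt", "text/plain"),  # never created
    ])

print(len(stats.extracted), stats.extracted_empty_items, len(stats.errored_items))
```

In this run, two items are extracted (one of them empty), the PDF is skipped without an error, and the badly encoded and missing files each produce one recorded failure.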

Use Cases

Documentation Corpora

Ideal for documentation consisting of Markdown or plain text:

biblicus extract docs-corpus --extractor pass-through-text

Note Collections

Process personal notes or knowledge bases:

biblicus extract notes-corpus --extractor pass-through-text

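Code Repository Documentation

Extract text documentation from code repositories. The extractor returns each text file as stored and does not pull comments out of source files: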

biblicus extract code-docs-corpus --extractor pass-through-text

Mixed Pipelines

Use as the fast path for text in heterogeneous corpora:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: unstructured  # Handles everything else
    - extractor_id: select-text