Pass-Through Text Extractor

Extractor ID: pass-through-text

Category: Text/Document Extractors

Overview

The pass-through text extractor is the simplest extractor in Biblicus. It reads text files directly from the corpus and returns their content without any processing. For Markdown files, it parses and strips front matter, returning only the body content.

This extractor is fundamental to Biblicus workflows as the canonical way to handle text items. It preserves the exact content of text files while providing special handling for Markdown front matter.

Installation

No additional dependencies required. This extractor is part of the core Biblicus installation.

pip install biblicus

Supported Media Types

  • text/plain - Plain text files

  • text/markdown - Markdown files (with front matter parsing)

  • text/* - Any text media type

Non-text items are automatically skipped.

Configuration

Config Schema

class PassThroughTextExtractorConfig(BaseModel):
    # This extractor requires no configuration
    pass

Configuration Options

This extractor is intentionally minimal and accepts no configuration options.

Usage

Command Line

Basic Usage

# Extract text files from corpus
biblicus extract my-corpus --extractor pass-through-text

Configuration File

extractor_id: pass-through-text
config: {}
biblicus extract my-corpus --configuration configuration.yml

Python API

from biblicus import Corpus

# Load corpus
corpus = Corpus.from_directory("my-corpus")

# Extract text files
results = corpus.extract_text(extractor_id="pass-through-text")

In Pipeline

Text-First Fallback Chain

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: select-text

Mixed Media Type Routing

extractor_id: select-smart-override
config:
  default_extractor: pass-through-text
  overrides:
    - media_type_pattern: "application/pdf"
      extractor: pdf-text
    - media_type_pattern: "image/.*"
      extractor: ocr-rapidocr

Examples

Extract Text Corpus

Process a corpus containing only text files:

biblicus extract notes-corpus --extractor pass-through-text

Extract Markdown with Front Matter

The extractor automatically handles front matter:

---
title: My Document
tags: [note, draft]
---

This is the body content that is extracted.

Output text:

This is the body content that is extracted.

Mixed Format Pipeline

Use as first stage in a multi-format pipeline:

from biblicus import Corpus

corpus = Corpus.from_directory("mixed-corpus")

# Text files pass through, other formats processed by other extractors
results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pass-through-text"},
            {"extractor_id": "markitdown"},
            {"extractor_id": "select-text"}
        ]
    }
)

Behavior Details

Front Matter Handling

For text/markdown items, the extractor:

  1. Parses YAML front matter enclosed in --- delimiters

  2. Strips the front matter section

  3. Returns only the body content

Front matter metadata is preserved in the catalog but not included in extracted text.

Character Encoding

All text files are decoded as UTF-8. Files with other encodings may produce errors or incorrect output.

Empty Files

Empty text files produce empty extracted text (zero-length string). These are counted in extracted_empty_items statistics.

Performance

  • Speed: Near-instant (file read only)

  • Memory: Minimal (one file at a time)

  • Accuracy: 100% (no processing)

This is the fastest extractor in Biblicus as it performs only file I/O and optional front matter parsing.

Error Handling

Non-Text Items

Non-text items are silently skipped (returns None). This allows the extractor to work safely in pipelines with mixed media types.

Encoding Errors

UTF-8 decoding errors cause per-item failures recorded in errored_items but do not halt the entire extraction snapshot.

Missing Files

Missing corpus files result in standard file I/O errors and are recorded as per-item failures.

Use Cases

Documentation Corpora

Ideal for documentation consisting of Markdown or plain text:

biblicus extract docs-corpus --extractor pass-through-text

Note Collections

Process personal notes or knowledge bases:

biblicus extract notes-corpus --extractor pass-through-text

Source Code Comments

Extract text documentation from code repositories:

biblicus extract code-docs-corpus --extractor pass-through-text

Mixed Pipelines

Use as the fast path for text in heterogeneous corpora:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: unstructured  # Handles everything else
    - extractor_id: select-text

See Also