# Pass-Through Text Extractor **Extractor ID:** `pass-through-text` **Category:** [Text/Document Extractors](index.md) ## Overview The pass-through text extractor is the simplest extractor in Biblicus. It reads text files directly from the corpus and returns their content without any processing. For Markdown files, it parses and strips front matter, returning only the body content. This extractor is fundamental to Biblicus workflows as the canonical way to handle text items. It preserves the exact content of text files while providing special handling for Markdown front matter. ## Installation No additional dependencies required. This extractor is part of the core Biblicus installation. ```bash pip install biblicus ``` ## Supported Media Types - `text/plain` - Plain text files - `text/markdown` - Markdown files (with front matter parsing) - `text/*` - Any text media type Non-text items are automatically skipped. ## Configuration ### Config Schema ```python class PassThroughTextExtractorConfig(BaseModel): # This extractor requires no configuration pass ``` ### Configuration Options This extractor is intentionally minimal and accepts no configuration options. ## Usage ### Command Line #### Basic Usage ```bash # Extract text files from corpus biblicus extract my-corpus --extractor pass-through-text ``` #### Configuration File ```yaml extractor_id: pass-through-text config: {} ``` ```bash biblicus extract my-corpus --configuration configuration.yml ``` ### Python API ```python from biblicus import Corpus # Load corpus corpus = Corpus.from_directory("my-corpus") # Extract text files results = corpus.extract_text(extractor_id="pass-through-text") ``` ### In Pipeline #### Text-First Fallback Chain ```yaml extractor_id: pipeline config: stages: - extractor_id: pass-through-text - extractor_id: pdf-text - extractor_id: select-text ``` #### Mixed Media Type Routing ```yaml extractor_id: select-smart-override config: default_extractor: pass-through-text overrides: - media_type_pattern: "application/pdf" extractor: pdf-text - media_type_pattern: "image/.*" extractor: ocr-rapidocr ``` ## Examples ### Extract Text Corpus Process a corpus containing only text files: ```bash biblicus extract notes-corpus --extractor pass-through-text ``` ### Extract Markdown with Front Matter The extractor automatically handles front matter: ```markdown --- title: My Document tags: [note, draft] --- This is the body content that is extracted. ``` Output text: ``` This is the body content that is extracted. ``` ### Mixed Format Pipeline Use as first stage in a multi-format pipeline: ```python from biblicus import Corpus corpus = Corpus.from_directory("mixed-corpus") # Text files pass through, other formats processed by other extractors results = corpus.extract_text( extractor_id="pipeline", config={ "stages": [ {"extractor_id": "pass-through-text"}, {"extractor_id": "markitdown"}, {"extractor_id": "select-text"} ] } ) ``` ## Behavior Details ### Front Matter Handling For `text/markdown` items, the extractor: 1. Parses YAML front matter enclosed in `---` delimiters 2. Strips the front matter section 3. Returns only the body content Front matter metadata is preserved in the catalog but not included in extracted text. ### Character Encoding All text files are decoded as UTF-8. Files with other encodings may produce errors or incorrect output. ### Empty Files Empty text files produce empty extracted text (zero-length string). These are counted in `extracted_empty_items` statistics. ## Performance - **Speed**: Near-instant (file read only) - **Memory**: Minimal (one file at a time) - **Accuracy**: 100% (no processing) This is the fastest extractor in Biblicus as it performs only file I/O and optional front matter parsing. ## Error Handling ### Non-Text Items Non-text items are silently skipped (returns `None`). This allows the extractor to work safely in pipelines with mixed media types. ### Encoding Errors UTF-8 decoding errors cause per-item failures recorded in `errored_items` but do not halt the entire extraction snapshot. ### Missing Files Missing corpus files result in standard file I/O errors and are recorded as per-item failures. ## Use Cases ### Documentation Corpora Ideal for documentation consisting of Markdown or plain text: ```bash biblicus extract docs-corpus --extractor pass-through-text ``` ### Note Collections Process personal notes or knowledge bases: ```bash biblicus extract notes-corpus --extractor pass-through-text ``` ### Source Code Comments Extract text documentation from code repositories: ```bash biblicus extract code-docs-corpus --extractor pass-through-text ``` ### Mixed Pipelines Use as the fast path for text in heterogeneous corpora: ```yaml extractor_id: pipeline config: stages: - extractor_id: pass-through-text - extractor_id: unstructured # Handles everything else - extractor_id: select-text ``` ## Related Extractors ### Same Category - [metadata-text](metadata.md) - Metadata-based text representation - [pdf-text](pdf.md) - PDF text extraction - [markitdown](markitdown.md) - Office document conversion - [unstructured](unstructured.md) - Universal document parser ### Pipeline Utilities - [select-text](../pipeline-utilities/select-text.md) - First non-empty selection - [select-longest-text](../pipeline-utilities/select-longest.md) - Longest output selection - [pipeline](../pipeline-utilities/pipeline.md) - Multi-step extraction ## See Also - [Text/Document Extractors Overview](index.md) - [Extractors Index](../index.md) - [extraction.md](../../extraction.md) - Extraction pipeline concepts - [Front Matter Documentation](../../extraction.md#front-matter-handling)