Text & Document Processing

Basic text extraction from structured documents and plain text formats.

Overview

Text and document extractors handle files with explicit text content or structured document formats. These extractors are ideal for:

  • PDF documents with text layers

  • Office documents (DOCX, XLSX, PPTX)

  • Markdown and HTML files

  • Plain text files

  • Web content

Available Extractors

pass-through-text

Returns text that has already been extracted, without running extraction again. Useful for:

  • Skipping extraction when text already exists

  • Testing and debugging pipelines

  • Preserving manually curated text

Installation: Included by default

metadata-text

Extracts text from item metadata (title, tags, keywords). Useful for:

  • Creating searchable metadata

  • Generating synthetic corpus entries

  • Testing without file content

Installation: Included by default
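
As a rough illustration of the idea, the sketch below turns a metadata mapping into a small block of searchable text. It is not the metadata-text implementation: the function name `metadata_to_text`, the dictionary layout, and the output format are all assumptions made for this example.

```python
from typing import Mapping


def metadata_to_text(metadata: Mapping[str, object]) -> str:
    """Build plain text from title, tags, and keywords (hypothetical layout)."""
    lines = []
    title = metadata.get("title")
    if title:
        lines.append(str(title))
    for field in ("tags", "keywords"):
        values = metadata.get(field) or []
        # Accept a single string as well as a list of strings.
        if isinstance(values, str):
            values = [values]
        if values:
            lines.append(f"{field}: {', '.join(str(v) for v in values)}")
    return "\n".join(lines)


example = metadata_to_text(
    {"title": "Field notes", "tags": ["draft", "2024"], "keywords": "birds"}
)
```

Text built this way gives an item something to index even when no file content is available.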

pdf-text

Extracts text from PDF documents using pypdf. Ideal for:

  • PDFs with selectable text layers

  • Fast extraction without OCR overhead

  • Simple document processing

Installation: Included by default
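
For orientation, this is roughly what text-layer extraction with pypdf looks like when the library is called directly. It is a minimal sketch, not the pdf-text extractor's code: the helper names `join_page_texts` and `extract_pdf_text` are invented here, and skipping empty pages is an assumption.

```python
from pathlib import Path
from typing import Iterable, Optional


def join_page_texts(page_texts: Iterable[Optional[str]]) -> str:
    """Join per-page text, skipping pages that yielded nothing."""
    cleaned = [t.strip() for t in page_texts if t and t.strip()]
    return "\n\n".join(cleaned)


def extract_pdf_text(path: Path) -> str:
    """Return the text layer of a PDF, or "" if no page has one."""
    # Imported lazily so join_page_texts can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(str(path))
    return join_page_texts(page.extract_text() for page in reader.pages)
```

An empty result from `extract_pdf_text` usually means the PDF is scanned and has no text layer, in which case an OCR-based extractor is the better fit.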

markitdown

Converts Office documents and web content to Markdown using Microsoft MarkItDown. Supports:

  • DOCX, XLSX, PPTX

  • HTML, MHTML

  • Images (via OCR)

  • Audio (via transcription)

  • ZIP archives

Installation: pip install "biblicus[markitdown]" (requires Python 3.10 or later; the quotes keep shells such as zsh from expanding the brackets)
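
The sketch below shows the underlying MarkItDown library used directly, which may help when debugging a conversion outside a pipeline. It does not show how the markitdown extractor wraps the library. The function name and the suffix set are assumptions for this example, and images and audio are left out for brevity.

```python
from pathlib import Path
from typing import Optional

# Suffixes drawn from the format list above; an assumption, not the
# extractor's real dispatch table.
SUPPORTED_SUFFIXES = {".docx", ".xlsx", ".pptx", ".html", ".mhtml", ".zip"}


def convert_with_markitdown(path: Path) -> Optional[str]:
    """Return Markdown text for a supported file, or None otherwise."""
    if path.suffix.lower() not in SUPPORTED_SUFFIXES:
        return None
    # Imported lazily: the package is an optional dependency.
    from markitdown import MarkItDown

    result = MarkItDown().convert(str(path))
    return result.text_content
```

Returning `None` for unsupported suffixes lets a caller move on to another extractor without catching an exception.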

unstructured

Parses complex documents using Unstructured.io. Supports:

  • DOCX, XLSX, PPTX

  • Email formats (EML, MSG)

  • Markdown, HTML, XML

  • Advanced chunking and partitioning

Installation: pip install "biblicus[unstructured]" (the quotes keep shells such as zsh from expanding the brackets)

Choosing an Extractor

Format              Recommended Extractor   Alternative
------------------  ----------------------  -------------
PDF (text layer)    pdf-text                markitdown
PDF (scanned)       See OCR or VLM          -
DOCX, XLSX, PPTX    markitdown              unstructured
Markdown, HTML      markitdown              unstructured
Plain text          pass-through-text       metadata-text
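
The recommendations above can be expressed as a lookup keyed by file extension. This is only a sketch: the mapping from extensions to formats (for example `.md` and `.htm`) and the function name `recommend_extractors` are assumptions, and an extension alone cannot tell a scanned PDF from one with a text layer.

```python
from pathlib import Path
from typing import Optional, Tuple

# (recommended, alternative) pairs copied from the recommendations above.
RECOMMENDED = {
    ".pdf": ("pdf-text", "markitdown"),  # assumes the PDF has a text layer
    ".docx": ("markitdown", "unstructured"),
    ".xlsx": ("markitdown", "unstructured"),
    ".pptx": ("markitdown", "unstructured"),
    ".md": ("markitdown", "unstructured"),
    ".html": ("markitdown", "unstructured"),
    ".htm": ("markitdown", "unstructured"),
    ".txt": ("pass-through-text", "metadata-text"),
}


def recommend_extractors(filename: str) -> Optional[Tuple[str, str]]:
    """Return (recommended, alternative) extractor IDs, or None if unknown."""
    return RECOMMENDED.get(Path(filename).suffix.lower())


choice = recommend_extractors("Report.PDF")
```

Unknown extensions return `None` so the caller can decide whether to skip the file or route it to OCR.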

Common Patterns

Fallback Chain

Use select-text to try multiple extractors:

extractor_id: select-text
config:
  extractors:
    - pdf-text
    - markitdown
    - unstructured
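
Conceptually, a fallback chain tries each extractor in turn and keeps the first usable result. The sketch below illustrates that idea with plain functions. It is not the select-text implementation: the name `first_usable_text`, the stand-in extractors, and the choice to treat exceptions and whitespace-only output as failures are all assumptions.

```python
from typing import Callable, Optional, Sequence, Tuple

Extractor = Callable[[bytes], Optional[str]]


def first_usable_text(
    data: bytes, extractors: Sequence[Tuple[str, Extractor]]
) -> Tuple[Optional[str], Optional[str]]:
    """Return (extractor name, text) for the first extractor that yields text."""
    for name, extract in extractors:
        try:
            text = extract(data)
        except Exception:
            continue  # assumption: a failing extractor is simply skipped
        if text and text.strip():
            return name, text
    return None, None


# Hypothetical stand-ins for real extractors.
def empty_result(_: bytes) -> str:
    return ""


def failing_parser(_: bytes) -> str:
    raise ValueError("unsupported format")


def working_parser(_: bytes) -> str:
    return "Quarterly report"


chain = [
    ("pdf-text", empty_result),
    ("markitdown", failing_parser),
    ("unstructured", working_parser),
]
winner = first_usable_text(b"raw bytes", chain)
```

Ordering the chain from cheapest to most expensive extractor keeps the common case fast.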

Metadata + Content

Use pipeline to combine metadata and content:

extractor_id: pipeline
config:
  extractors:
    - metadata-text
    - pdf-text

See Also