Text & Document Processing

Basic text extraction from structured documents and plain text formats.

Text & Document Extractors

Overview

Text and document extractors handle files with explicit text content or structured document formats. These extractors are ideal for:

PDF documents with text layers
Office documents (DOCX, XLSX, PPTX)
Markdown and HTML files
Plain text files
Web content

Available Extractors

pass-through-text 

Returns existing extracted text without re-extraction. Useful for:

Skipping extraction when text already exists
Testing and debugging pipelines
Preserving manually curated text

Installation: Included by default

metadata-text 

Extracts text from item metadata (title, tags, keywords). Useful for:

Creating searchable metadata
Generating synthetic corpus entries
Testing without file content

Installation: Included by default

pdf-text 

Extracts text from PDF documents using pypdf. Ideal for:

PDFs with selectable text layers
Fast extraction without OCR overhead
Simple document processing

Installation: Included by default
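
As a rough illustration of what text-layer extraction with pypdf involves, the sketch below reads each page and joins the results. It is not the pdf-text extractor's actual implementation: the function names `join_page_texts` and `extract_pdf_text` are hypothetical, and only `PdfReader`, `reader.pages`, and `page.extract_text()` come from pypdf.

```python
from typing import Iterable, Optional


def join_page_texts(page_texts: Iterable[Optional[str]]) -> str:
    """Join per-page text, skipping pages that yielded nothing."""
    cleaned = [text.strip() for text in page_texts if text and text.strip()]
    return "\n\n".join(cleaned)


def extract_pdf_text(path: str) -> str:
    """Hypothetical helper: read the text layer of a PDF with pypdf."""
    # Imported lazily so this module loads even when pypdf is not installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return join_page_texts(page.extract_text() for page in reader.pages)
```

A scanned PDF with no text layer would typically come back as an empty string from a helper like this, which is the signal to reach for OCR instead.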

markitdown 

Microsoft MarkItDown for Office documents and web content. Supports:

DOCX, XLSX, PPTX
HTML, MHTML
Images (via OCR)
Audio (via transcription)
ZIP archives

Installation: pip install "biblicus[markitdown]" (requires Python 3.10 or later)

unstructured 

Unstructured.io for complex document parsing. Supports:

DOCX, XLSX, PPTX
Email formats (EML, MSG)
Markdown, HTML, XML
Advanced chunking and partitioning

Installation: pip install "biblicus[unstructured]"

Choosing an Extractor

Format              Recommended Extractor    Alternative
PDF (text layer)    pdf-text                 markitdown
PDF (scanned)       See OCR or VLM
DOCX, XLSX, PPTX    markitdown               unstructured
Markdown, HTML      markitdown               unstructured
Plain text          pass-through-text        metadata-text
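
The recommendations in the table above could be applied programmatically with a small lookup keyed on file suffix. This is a minimal sketch, not part of Biblicus: `RECOMMENDED` and `recommend_extractor` are hypothetical names, and the mapping from suffixes such as `.md` or `.txt` to the table's format rows is an assumption.

```python
from pathlib import Path
from typing import Optional

# Recommended extractor per file suffix, transcribed from the table above.
# The suffix-to-format mapping is an assumption made for illustration.
RECOMMENDED = {
    ".pdf": "pdf-text",  # assumes a text layer; scanned PDFs need OCR or a VLM
    ".docx": "markitdown",
    ".xlsx": "markitdown",
    ".pptx": "markitdown",
    ".md": "markitdown",
    ".html": "markitdown",
    ".txt": "pass-through-text",
}


def recommend_extractor(filename: str) -> Optional[str]:
    """Return the recommended extractor id, or None for unknown suffixes."""
    return RECOMMENDED.get(Path(filename).suffix.lower())
```

Returning `None` for unknown suffixes leaves the caller free to decide whether to skip the file or route it to a fallback chain.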

Common Patterns

Fallback Chain

Use select-text to try multiple extractors in order:

extractor_id: select-text
config:
  extractors:
    - pdf-text
    - markitdown
    - unstructured
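
The fallback idea can be sketched in plain Python: try each extractor in order and keep the first non-empty result. This assumes select-text picks the first extractor that produces usable text, which this page does not spell out. `first_usable_text` and the stand-in extractors are hypothetical, not Biblicus APIs.

```python
from typing import Callable, Iterable, Optional, Tuple

Extractor = Callable[[bytes], Optional[str]]


def first_usable_text(
    data: bytes, extractors: Iterable[Tuple[str, Extractor]]
) -> Optional[Tuple[str, str]]:
    """Return (extractor_id, text) from the first extractor that yields text."""
    for extractor_id, extract in extractors:
        try:
            text = extract(data)
        except Exception:
            # Treat a failing extractor as "no result" and move on.
            continue
        if text and text.strip():
            return extractor_id, text
    return None


def _always_fails(data: bytes) -> Optional[str]:
    raise ValueError("unsupported format")


# Stand-in extractors for demonstration only.
chain = [
    ("pdf-text", _always_fails),
    ("markitdown", lambda data: ""),
    ("unstructured", lambda data: data.decode("utf-8")),
]
result = first_usable_text(b"hello", chain)
```

Here the first stand-in raises, the second returns empty text, and the third succeeds, so `result` names `unstructured` as the extractor that supplied the text.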

Metadata + Content

Use pipeline to combine metadata and content:

extractor_id: pipeline
config:
  extractors:
    - metadata-text
    - pdf-text
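
One way to picture combining metadata and content is to render the metadata fields as text and prepend them to the extracted body. How the pipeline extractor actually merges step outputs is not described on this page, so the concatenation below is an assumption. The field names come from the metadata-text description above; `metadata_to_text` and `combine` are hypothetical helpers.

```python
from typing import Mapping, Sequence, Union

MetadataValue = Union[str, Sequence[str]]


def metadata_to_text(metadata: Mapping[str, MetadataValue]) -> str:
    """Render title, tags, and keywords as one 'field: value' line each."""
    lines = []
    for field in ("title", "tags", "keywords"):
        value = metadata.get(field)
        if not value:
            continue
        rendered = value if isinstance(value, str) else ", ".join(value)
        lines.append(f"{field}: {rendered}")
    return "\n".join(lines)


def combine(metadata: Mapping[str, MetadataValue], content_text: str) -> str:
    """Assumed merge strategy: metadata block, blank line, then content."""
    parts = [metadata_to_text(metadata), content_text.strip()]
    return "\n\n".join(part for part in parts if part)


combined = combine(
    {"title": "Q3 Report", "tags": ["finance", "quarterly"]},
    "Revenue grew.\n",
)
```

Empty parts are dropped, so an item with no metadata yields just its content, and an item with no extractable content yields just the metadata block.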

See Also

pass-through-text 

metadata-text 

pdf-text 

markitdown 

unstructured 