Text & Document Processing
Basic text extraction from structured documents and plain text formats.
Text & Document Extractors
Overview
Text and document extractors handle files with explicit text content or structured document formats. These extractors are ideal for:
PDF documents with text layers
Office documents (DOCX, XLSX, PPTX)
Markdown and HTML files
Plain text files
Web content
Available Extractors
pass-through-text
Returns existing extracted text without re-extraction. Useful for:
Skipping extraction when text already exists
Testing and debugging pipelines
Preserving manually curated text
Installation: Included by default
metadata-text
Extracts text from item metadata (title, tags, keywords). Useful for:
Creating searchable metadata
Generating synthetic corpus entries
Testing without file content
Installation: Included by default
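As a rough illustration of what metadata-based extraction might produce, the sketch below joins a title, tags, and keywords into one searchable text block. The `metadata_to_text` function and the dictionary keys are assumptions made for this example, not the Biblicus implementation.

```python
from typing import Any, Mapping


def metadata_to_text(metadata: Mapping[str, Any]) -> str:
    """Build a text block from hypothetical metadata fields (illustrative only)."""
    lines = []

    # The title, when present, becomes the first line.
    title = str(metadata.get("title") or "").strip()
    if title:
        lines.append(title)

    # List-valued fields are rendered as "name: a, b, c"; empty fields are skipped.
    for field in ("tags", "keywords"):
        values = [str(v).strip() for v in metadata.get(field) or [] if str(v).strip()]
        if values:
            lines.append(f"{field}: {', '.join(values)}")

    return "\n".join(lines)


item = {"title": "Quarterly Report", "tags": ["finance", "q3"], "keywords": []}
print(metadata_to_text(item))
```

Because only metadata is read, a function like this works even when the item has no file content, which matches the testing use case listed above.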
pdf-text
Extracts text from PDF documents using pypdf. Ideal for:
PDFs with selectable text layers
Fast extraction without OCR overhead
Simple document processing
Installation: Included by default
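For orientation, text-layer extraction with pypdf can be sketched as follows. This is a minimal example of calling pypdf directly, not the code that pdf-text runs internally; the helper names `join_page_text` and `extract_pdf_text` are hypothetical.

```python
from typing import Iterable, Protocol


class PageLike(Protocol):
    """Anything with an extract_text() method, such as a pypdf page object."""

    def extract_text(self) -> str: ...


def join_page_text(pages: Iterable[PageLike]) -> str:
    """Concatenate the non-empty text of each page, separated by blank lines."""
    parts = []
    for page in pages:
        text = (page.extract_text() or "").strip()
        if text:
            parts.append(text)
    return "\n\n".join(parts)


def extract_pdf_text(path: str) -> str:
    """Read a PDF with pypdf and return its text layer (empty for scanned PDFs)."""
    # Imported lazily so join_page_text stays usable without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return join_page_text(reader.pages)
```

An empty result from `extract_pdf_text` usually means the PDF has no text layer, which is the case where an OCR extractor is the better fit.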
markitdown
Microsoft MarkItDown for Office documents and web content. Supports:
DOCX, XLSX, PPTX
HTML, MHTML
Images (via OCR)
Audio (via transcription)
ZIP archives
Installation: pip install "biblicus[markitdown]" (Python 3.10+ only; the quotes keep shells such as zsh from interpreting the brackets)
unstructured
Unstructured.io for complex document parsing. Supports:
DOCX, XLSX, PPTX
Email formats (EML, MSG)
Markdown, HTML, XML
Advanced chunking and partitioning
Installation: pip install "biblicus[unstructured]"
Choosing an Extractor
| Format | Recommended Extractor | Alternative |
|---|---|---|
| PDF (text layer) | | |
| PDF (scanned) | | |
| DOCX, XLSX, PPTX | | |
| Markdown, HTML | | |
| Plain text | | |
Common Patterns
Fallback Chain
Use select-text to try multiple extractors:
extractor_id: select-text
config:
  extractors:
    - pdf-text
    - markitdown
    - unstructured
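The idea behind a fallback chain can be sketched in plain Python. The example assumes the chain returns the first non-empty result and treats an exception as "no text"; the actual selection rule used by select-text may differ. The `first_usable_text` function and the stand-in extractors are hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple

# An extractor takes a file path and returns text, or None/"" if it found nothing.
Extractor = Callable[[str], Optional[str]]


def first_usable_text(
    path: str, extractors: Iterable[Tuple[str, Extractor]]
) -> Optional[Tuple[str, str]]:
    """Try each extractor in order; return (name, text) for the first non-empty result."""
    for name, extract in extractors:
        try:
            text = extract(path)
        except Exception:
            # Assumption: a failing extractor is skipped rather than aborting the chain.
            continue
        if text and text.strip():
            return name, text
    return None


# Stand-in extractors that mimic three possible outcomes.
def fake_pdf_text(path: str) -> str:
    return ""  # no text layer found


def fake_markitdown(path: str) -> str:
    raise RuntimeError("unsupported format")


def fake_unstructured(path: str) -> str:
    return "Recovered text"


chain = [
    ("pdf-text", fake_pdf_text),
    ("markitdown", fake_markitdown),
    ("unstructured", fake_unstructured),
]
print(first_usable_text("report.docx", chain))
```

Ordering the chain from cheapest to most expensive extractor keeps the common case fast and reserves heavier parsers for files the lighter ones cannot handle.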
Metadata + Content
Use pipeline to combine metadata and content:
extractor_id: pipeline
config:
  extractors:
    - metadata-text
    - pdf-text
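Conceptually, combining metadata and content amounts to running each step and merging the results. The sketch below assumes the outputs are joined in order with a blank line between them; how pipeline actually merges step outputs is not specified here, and `combine_outputs` and the sample steps are hypothetical.

```python
from typing import Callable, Mapping, Sequence

# A step takes an item (a plain mapping in this sketch) and returns text.
Step = Callable[[Mapping[str, str]], str]


def combine_outputs(
    item: Mapping[str, str], steps: Sequence[Step], separator: str = "\n\n"
) -> str:
    """Run every step on the item and join the non-empty outputs in order."""
    parts = [step(item).strip() for step in steps]
    return separator.join(part for part in parts if part)


# Stand-ins for a metadata step and a content step.
def title_step(item: Mapping[str, str]) -> str:
    return item.get("title", "")


def body_step(item: Mapping[str, str]) -> str:
    return item.get("body", "")


document = {"title": "Quarterly Report", "body": "Revenue grew."}
print(combine_outputs(document, [title_step, body_step]))
```

Unlike the fallback chain, every step contributes here, so the order of the list determines the order of the combined text.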
See Also
OCR Extractors - For scanned documents
VLM Extractors - For complex layouts
Pipeline Utilities - For combining strategies