Text Extractors
Biblicus provides a plugin-based text extraction system supporting diverse document types, media formats, and processing strategies.
Extractor Categories
Text & Document Processing
Basic text extraction from structured documents and plain text formats.
pass-through-text - Returns existing extracted text without re-extraction
metadata-text - Extracts text from item metadata (title, tags, etc.)
pdf-text - Extracts text from PDF documents using pypdf
markitdown - Microsoft MarkItDown for Office documents and web content
unstructured - Unstructured.io for complex document parsing
Optical Character Recognition (OCR)
Traditional OCR for extracting text from images and scanned documents.
ocr-rapidocr - RapidOCR for fast ONNX-based text recognition
ocr-paddleocr-vl - PaddleOCR vision-language model for document understanding
Vision-Language Models (VLM)
Advanced VLM-based document understanding with layout analysis and structured extraction.
docling-smol - SmolDocling-256M for fast document processing
docling-granite - Granite Docling-258M for high-accuracy extraction
Speech-to-Text (STT)
Audio transcription for spoken content in video and audio files.
stt-openai - OpenAI Whisper API for audio transcription
stt-deepgram - Deepgram Nova-2 for fast, accurate transcription
stt-aldea - Aldea Speech-to-Text API for audio transcription
deepgram-transform - Render structured Deepgram metadata into text
Pipeline Utilities
Meta-extractors for combining, selecting, and orchestrating extraction strategies.
select-text - Selects first successful extractor from a list
select-longest-text - Selects longest output from multiple extractors
select-override - Overrides extraction for specific items by ID
select-smart-override - Overrides extraction based on media type patterns
pipeline - Chains multiple extractors sequentially
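The two selection strategies described above can be illustrated with a minimal sketch. It does not use the Biblicus API: the `Extractor` type, both functions, and the three demo extractors are hypothetical stand-ins (plain callables that take bytes and return text or `None`), and treating whitespace-only output as a failure is an assumption made for the example.

```python
from typing import Callable, Optional, Sequence

# Hypothetical stand-in for an extractor: takes raw bytes, returns text or None.
Extractor = Callable[[bytes], Optional[str]]


def select_text(extractors: Sequence[Extractor], data: bytes) -> Optional[str]:
    """Fallback chain: return the first non-empty result, trying extractors in order."""
    for extract in extractors:
        text = extract(data)
        if text and text.strip():
            return text
    return None


def select_longest_text(extractors: Sequence[Extractor], data: bytes) -> Optional[str]:
    """Best-output selection: run every extractor and keep the longest non-empty result."""
    candidates = [extract(data) for extract in extractors]
    usable = [text for text in candidates if text and text.strip()]
    return max(usable, key=len) if usable else None


# Demo extractors (hypothetical): one fails, one returns little, one returns everything.
def empty_extractor(data: bytes) -> Optional[str]:
    return None


def short_extractor(data: bytes) -> Optional[str]:
    return "short"


def full_extractor(data: bytes) -> Optional[str]:
    return data.decode("utf-8")


if __name__ == "__main__":
    chain = [empty_extractor, short_extractor, full_extractor]
    payload = b"a much longer body of text"
    print(select_text(chain, payload))
    print(select_longest_text(chain, payload))
```

The fallback chain stops at the first usable result, so later extractors never run. Longest-output selection always runs every extractor, which costs more but can recover text that a faster extractor missed.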
Quick Start
Installation
Most extractors require optional dependencies:
# Basic text extraction (included by default)
pip install biblicus
# OCR extractors
pip install "biblicus[ocr]" # RapidOCR
pip install "biblicus[paddleocr]" # PaddleOCR VL
# VLM document understanding
pip install "biblicus[docling]" # Docling (Transformers backend)
pip install "biblicus[docling-mlx]" # Docling (MLX backend for Apple Silicon)
# Speech-to-text
pip install "biblicus[openai]" # OpenAI Whisper
pip install "biblicus[deepgram]" # Deepgram Nova-2
# Document processing
pip install "biblicus[markitdown]" # MarkItDown (Python 3.10+)
pip install "biblicus[unstructured]" # Unstructured.io
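As a convenience, the install commands above can be turned into a small lookup from extractor ID to install command. This is only a sketch: `EXTRAS_BY_EXTRACTOR` and `install_command` are hypothetical names, not part of Biblicus. It assumes that both Docling extractors use the `docling` extra and that `pdf-text` ships with the base package. Extractors with no install command listed above are left out.

```python
from typing import Dict, Optional

# Extra names are taken from the install commands above.
# None means the base package is assumed to be enough (an assumption for pdf-text).
EXTRAS_BY_EXTRACTOR: Dict[str, Optional[str]] = {
    "pdf-text": None,
    "ocr-rapidocr": "ocr",
    "ocr-paddleocr-vl": "paddleocr",
    "docling-smol": "docling",      # assumed to share the docling extra
    "docling-granite": "docling",   # assumed to share the docling extra
    "stt-openai": "openai",
    "stt-deepgram": "deepgram",
    "markitdown": "markitdown",
    "unstructured": "unstructured",
}


def install_command(extractor_id: str) -> str:
    """Return the pip command for an extractor, or raise KeyError if it is not listed."""
    if extractor_id not in EXTRAS_BY_EXTRACTOR:
        raise KeyError(f"no install information for extractor {extractor_id!r}")
    extra = EXTRAS_BY_EXTRACTOR[extractor_id]
    if extra is None:
        return "pip install biblicus"
    # Quoting keeps shells such as zsh from treating the brackets as a glob pattern.
    return f'pip install "biblicus[{extra}]"'


if __name__ == "__main__":
    for name in ("pdf-text", "ocr-rapidocr", "stt-deepgram"):
        print(f"{name}: {install_command(name)}")
```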
Basic Usage
Command Line
# Initialize corpus
biblicus init my-corpus
# Ingest documents
biblicus ingest my-corpus document.pdf
# Extract text with specific extractor
biblicus extract my-corpus --extractor pdf-text
Python API
from biblicus import Corpus
# Load corpus
corpus = Corpus.from_directory("my-corpus")
# Extract text using an extractor
results = corpus.extract_text(extractor_id="pdf-text")
Choosing an Extractor
For PDF Documents
Simple PDFs with text layers: Use pdf-text (fast, no optional dependencies)
Scanned PDFs or complex layouts: Use ocr-rapidocr or VLM extractors
Tables, equations, complex structure: Use docling-granite
For Office Documents
DOCX, XLSX, PPTX: Use markitdown or unstructured
Complex layouts or scanned documents: Use VLM extractors
For Images
Simple text recognition: Use ocr-rapidocr
Complex documents in images: Use ocr-paddleocr-vl or VLM extractors
For Audio/Video
High accuracy, cost-effective: Use stt-deepgram
OpenAI ecosystem integration: Use stt-openai
For Multiple Strategies
Fallback chain: Use select-text
Best output selection: Use select-longest-text
Per-item overrides: Use select-override or select-smart-override
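The guidance in this section can be condensed into a lookup keyed on media type patterns. The sketch below is illustrative only and does not reflect how select-smart-override is configured. `DEFAULT_RULES` and `suggest_extractor` are hypothetical names, and the rules pick just the first recommendation for each document family.

```python
from fnmatch import fnmatchcase
from typing import List, Optional, Tuple

# (media type pattern, suggested extractor ID), checked in order; first match wins.
# The pairings follow the guide above; the patterns themselves are assumptions.
DEFAULT_RULES: List[Tuple[str, str]] = [
    ("application/pdf", "pdf-text"),
    ("application/vnd.openxmlformats-officedocument.*", "markitdown"),  # DOCX, XLSX, PPTX
    ("image/*", "ocr-rapidocr"),
    ("audio/*", "stt-deepgram"),
    ("video/*", "stt-deepgram"),
]


def suggest_extractor(
    media_type: str,
    rules: Optional[List[Tuple[str, str]]] = None,
) -> Optional[str]:
    """Return the first extractor whose pattern matches, or None if nothing matches."""
    normalized = media_type.strip().lower()
    for pattern, extractor_id in (rules if rules is not None else DEFAULT_RULES):
        if fnmatchcase(normalized, pattern):
            return extractor_id
    return None


if __name__ == "__main__":
    for media_type in ("application/pdf", "image/png", "audio/mpeg", "text/plain"):
        print(media_type, "->", suggest_extractor(media_type))
```

Scanned PDFs share the `application/pdf` media type with text-layer PDFs, so a rule table like this cannot tell them apart. That case is where a fallback chain or a per-item override is the better fit.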
See Also
extraction.md - Extraction pipeline concepts and architecture
API Reference - Python API documentation
Repository README - Getting started guide