Text Extractors

Biblicus provides a plugin-based text extraction system supporting diverse document types, media formats, and processing strategies.

Extractor Categories

Text & Document Processing

Basic text extraction from structured documents and plain text formats.

  • pass-through-text - Returns existing extracted text without re-extraction

  • metadata-text - Extracts text from item metadata (title, tags, etc.)

  • pdf-text - Extracts text from PDF documents using pypdf

  • markitdown - Microsoft MarkItDown for Office documents and web content

  • unstructured - Unstructured.io for complex document parsing

Optical Character Recognition (OCR)

Traditional OCR for extracting text from images and scanned documents.

  • ocr-rapidocr - RapidOCR for fast ONNX-based text recognition

  • ocr-paddleocr-vl - PaddleOCR vision-language model for document understanding

Vision-Language Models (VLM)

Advanced VLM-based document understanding with layout analysis and structured extraction.

Speech-to-Text (STT)

Audio transcription for spoken content in video and audio files.

  • stt-openai - OpenAI Whisper API for audio transcription

  • stt-deepgram - Deepgram Nova-2 for fast, accurate transcription

  • stt-aldea - Aldea Speech-to-Text API for audio transcription

  • deepgram-transform - Render structured Deepgram metadata into text

Pipeline Utilities

Meta-extractors for combining, selecting, and orchestrating extraction strategies.
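As a rough illustration of the "selecting" idea, the sketch below tries several extraction callables in order and keeps the first result that contains usable text. It is plain Python and does not use the Biblicus API. The names Extractor and select_first_usable, and the min_chars threshold, are hypothetical and chosen only for this example.

```python
from typing import Callable, Iterable, Optional

# Hypothetical type: an extractor takes raw bytes and returns text, or None on failure.
Extractor = Callable[[bytes], Optional[str]]


def select_first_usable(
    extractors: Iterable[Extractor],
    data: bytes,
    min_chars: int = 1,
) -> Optional[str]:
    """Return the first extractor output with at least min_chars non-whitespace-trimmed characters."""
    for extract in extractors:
        text = extract(data)
        # Skip failed extractions and results that are empty after trimming.
        if text is not None and len(text.strip()) >= min_chars:
            return text
    return None


# Stand-in extractors for demonstration; real ones would parse the bytes.
def empty_result(data: bytes) -> Optional[str]:
    return "   "


def decoded_result(data: bytes) -> Optional[str]:
    return data.decode("utf-8")


chosen = select_first_usable([empty_result, decoded_result], b"hello corpus")
```

Ordering the callables from cheapest to most expensive means the slower strategies run only when the faster ones produce nothing usable.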

Quick Start

Installation

Most extractors require optional dependencies:

# Basic text extraction (included by default)
pip install biblicus

# OCR extractors
pip install "biblicus[ocr]"           # RapidOCR
pip install "biblicus[paddleocr]"     # PaddleOCR VL

# VLM document understanding
pip install "biblicus[docling]"       # Docling (Transformers backend)
pip install "biblicus[docling-mlx]"   # Docling (MLX backend for Apple Silicon)

# Speech-to-text
pip install "biblicus[openai]"        # OpenAI Whisper
pip install "biblicus[deepgram]"      # Deepgram Nova-2

# Document processing
pip install "biblicus[markitdown]"    # MarkItDown (Python 3.10+)
pip install "biblicus[unstructured]"  # Unstructured.io

# The quotes keep shells such as zsh from treating the square brackets as a glob pattern.
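The pairing of extractors and extras above can be captured in a small lookup. The sketch below is an assumption-laden helper, not part of Biblicus: the mapping is inferred from the comments in the install commands, and install_command is a hypothetical name. Extractors whose extra is not stated on this page (for example stt-aldea, deepgram-transform, and the Docling-based extractors) are deliberately left out.

```python
# Extractors described above as included in the default installation.
BASE_EXTRACTORS = {"pass-through-text", "metadata-text", "pdf-text"}

# Extractor ID -> pip extra, inferred from the install comments on this page (an assumption).
EXTRAS = {
    "ocr-rapidocr": "ocr",
    "ocr-paddleocr-vl": "paddleocr",
    "stt-openai": "openai",
    "stt-deepgram": "deepgram",
    "markitdown": "markitdown",
    "unstructured": "unstructured",
}


def install_command(extractor_id: str) -> str:
    """Return the pip command that this page associates with an extractor ID."""
    if extractor_id in BASE_EXTRACTORS:
        return "pip install biblicus"
    if extractor_id in EXTRAS:
        # Quote the requirement so shells do not expand the square brackets.
        return f'pip install "biblicus[{EXTRAS[extractor_id]}]"'
    raise ValueError(f"No install extra is documented here for {extractor_id!r}")
```

Raising an error for unlisted IDs avoids silently suggesting the base install for an extractor that may need additional packages.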

Basic Usage

Command Line

# Initialize corpus
biblicus init my-corpus

# Ingest documents
biblicus ingest my-corpus document.pdf

# Extract text with specific extractor
biblicus extract my-corpus --extractor pdf-text

Python API

from biblicus import Corpus

# Load corpus
corpus = Corpus.from_directory("my-corpus")

# Extract text using an extractor
results = corpus.extract_text(extractor_id="pdf-text")

Choosing an Extractor

For PDF Documents

  • Simple PDFs with text layers: Use pdf-text (fast, no optional dependencies)

  • Scanned PDFs or complex layouts: Use ocr-rapidocr or VLM extractors

  • Tables, equations, complex structure: Use docling-granite
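The PDF guidance above can be written as a small decision function. This is a sketch only: choose_pdf_extractor and its two boolean parameters are hypothetical, and working out whether a PDF has a text layer or complex structure is left to the caller. Where the guidance allows either ocr-rapidocr or a VLM extractor, the sketch picks ocr-rapidocr.

```python
def choose_pdf_extractor(has_text_layer: bool, has_complex_structure: bool = False) -> str:
    """Map coarse PDF traits to one of the extractor IDs recommended above."""
    if has_complex_structure:
        # Tables, equations, or other complex structure.
        return "docling-granite"
    if has_text_layer:
        # Simple PDF with an embedded text layer.
        return "pdf-text"
    # Scanned PDF without a text layer; a VLM extractor is the documented alternative.
    return "ocr-rapidocr"


extractor_for_scan = choose_pdf_extractor(has_text_layer=False)
```

The returned ID could then be passed to the --extractor option shown in the command-line example.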

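For Office Documents

  • Office documents and web content: Use markitdown (requires Python 3.10+)

  • Complex document parsing: Use unstructured

For Images

  • Fast ONNX-based text recognition: Use ocr-rapidocr

  • Vision-language document understanding: Use ocr-paddleocr-vl

For Audio/Video

  • OpenAI Whisper API transcription: Use stt-openai

  • Deepgram Nova-2 transcription: Use stt-deepgram

  • Aldea Speech-to-Text API transcription: Use stt-aldea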

For Multiple Strategies

See Also

  • extraction.md - Extraction pipeline concepts and architecture

  • API Reference - Python API documentation

  • Repository README - Getting started guide