# Text Extractors

Biblicus provides a plugin-based text extraction system supporting diverse document types, media formats, and processing strategies.

```{toctree}
:maxdepth: 1
:caption: Extractor Families

text-document/index
ocr/index
vlm-document/index
speech-to-text/index
pipeline-utilities/index
```

## Extractor Categories

### [Text & Document Processing](text-document/index.md)

Basic text extraction from structured documents and plain text formats.

- [**pass-through-text**](text-document/pass-through.md) - Returns existing extracted text without re-extraction
- [**metadata-text**](text-document/metadata.md) - Extracts text from item metadata (title, tags, etc.)
- [**pdf-text**](text-document/pdf.md) - Extracts text from PDF documents using pypdf
- [**markitdown**](text-document/markitdown.md) - Microsoft MarkItDown for Office documents and web content
- [**unstructured**](text-document/unstructured.md) - Unstructured.io for complex document parsing

### [Optical Character Recognition (OCR)](ocr/index.md)

Traditional OCR for extracting text from images and scanned documents.

- [**ocr-rapidocr**](ocr/rapidocr.md) - RapidOCR for fast ONNX-based text recognition
- [**ocr-paddleocr-vl**](ocr/paddleocr-vl.md) - PaddleOCR vision-language model for document understanding

### [Vision-Language Models (VLM)](vlm-document/index.md)

Advanced VLM-based document understanding with layout analysis and structured extraction.

- [**docling-smol**](vlm-document/docling-smol.md) - SmolDocling-256M for fast document processing
- [**docling-granite**](vlm-document/docling-granite.md) - Granite Docling-258M for high-accuracy extraction

### [Speech-to-Text (STT)](speech-to-text/index.md)

Audio transcription for spoken content in video and audio files.

- [**stt-openai**](speech-to-text/openai.md) - OpenAI Whisper API for audio transcription
- [**stt-deepgram**](speech-to-text/deepgram.md) - Deepgram Nova-2 for fast, accurate transcription
- [**stt-aldea**](speech-to-text/aldea.md) - Aldea Speech-to-Text API for audio transcription
- [**deepgram-transform**](speech-to-text/deepgram-transform.md) - Render structured Deepgram metadata into text

### [Pipeline Utilities](pipeline-utilities/index.md)

Meta-extractors for combining, selecting, and orchestrating extraction strategies.

- [**select-text**](pipeline-utilities/select-text.md) - Selects first successful extractor from a list
- [**select-longest-text**](pipeline-utilities/select-longest.md) - Selects longest output from multiple extractors
- [**select-override**](pipeline-utilities/select-override.md) - Overrides extraction for specific items by ID
- [**select-smart-override**](pipeline-utilities/select-smart-override.md) - Overrides extraction based on media type patterns
- [**pipeline**](pipeline-utilities/pipeline.md) - Chains multiple extractors sequentially

## Quick Start

### Installation

Most extractors require optional dependencies:

```bash
# Basic text extraction (included by default)
pip install biblicus

# OCR extractors
pip install biblicus[ocr]           # RapidOCR
pip install biblicus[paddleocr]     # PaddleOCR VL

# VLM document understanding
pip install biblicus[docling]       # Docling (Transformers backend)
pip install biblicus[docling-mlx]   # Docling (MLX backend for Apple Silicon)

# Speech-to-text
pip install biblicus[openai]        # OpenAI Whisper
pip install biblicus[deepgram]      # Deepgram Nova-2

# Document processing
pip install biblicus[markitdown]    # MarkItDown (Python 3.10+)
pip install biblicus[unstructured]  # Unstructured.io
```

### Basic Usage

#### Command Line

```bash
# Initialize corpus
biblicus init my-corpus

# Ingest documents
biblicus ingest my-corpus document.pdf

# Extract text with specific extractor
biblicus extract my-corpus --extractor pdf-text
```

#### Python API

```python
from biblicus import Corpus

# Load corpus
corpus = Corpus.from_directory("my-corpus")

# Extract text using an extractor
results = corpus.extract_text(extractor_id="pdf-text")
```

## Choosing an Extractor

### For PDF Documents

- **Simple PDFs with text layers**: Use [pdf-text](text-document/pdf.md) (fast, no dependencies)
- **Scanned PDFs or complex layouts**: Use [ocr-rapidocr](ocr/rapidocr.md) or VLM extractors
- **Tables, equations, complex structure**: Use [docling-granite](vlm-document/docling-granite.md)

### For Office Documents

- **DOCX, XLSX, PPTX**: Use [markitdown](text-document/markitdown.md) or [unstructured](text-document/unstructured.md)
- **Complex layouts or scanned documents**: Use VLM extractors

### For Images

- **Simple text recognition**: Use [ocr-rapidocr](ocr/rapidocr.md)
- **Complex documents in images**: Use [ocr-paddleocr-vl](ocr/paddleocr-vl.md) or VLM extractors

### For Audio/Video

- **High accuracy, cost-effective**: Use [stt-deepgram](speech-to-text/deepgram.md)
- **OpenAI ecosystem integration**: Use [stt-openai](speech-to-text/openai.md)

### For Multiple Strategies

- **Fallback chain**: Use [select-text](pipeline-utilities/select-text.md)
- **Best output selection**: Use [select-longest-text](pipeline-utilities/select-longest.md)
- **Per-item overrides**: Use [select-override](pipeline-utilities/select-override.md) or [select-smart-override](pipeline-utilities/select-smart-override.md)

## See Also

- [extraction.md](../extraction.md) - Extraction pipeline concepts and architecture
- [API Reference](../api.rst) - Python API documentation
- Repository README - Getting started guide