# Text & Document Processing

Extractors for pulling text out of structured documents and plain-text formats.

```{toctree}
:maxdepth: 1
:caption: Text & Document Extractors

pass-through
metadata
pdf
markitdown
unstructured
```

## Overview

Text and document extractors handle files with explicit text content or structured document formats. These extractors are ideal for:

- PDF documents with text layers
- Office documents (DOCX, XLSX, PPTX)
- Markdown and HTML files
- Plain text files
- Web content

## Available Extractors

### [pass-through-text](pass-through.md)

Returns existing extracted text without re-extraction. Useful for:
- Skipping extraction when text already exists
- Testing and debugging pipelines
- Preserving manually curated text

**Installation**: Included by default

### [metadata-text](metadata.md)

Extracts text from item metadata (title, tags, keywords). Useful for:
- Creating searchable metadata
- Generating synthetic corpus entries
- Testing without file content

**Installation**: Included by default

### [pdf-text](pdf.md)

Extracts text from PDF documents using pypdf. Ideal for:
- PDFs with selectable text layers
- Fast extraction without OCR overhead
- Simple document processing

**Installation**: Included by default

### [markitdown](markitdown.md)

Uses Microsoft MarkItDown to extract text from Office documents and web content. Supports:
- PDF (text layer)
- DOCX, XLSX, PPTX
- HTML, MHTML
- Images (via OCR)
- Audio (via transcription)
- ZIP archives

**Installation**: `pip install "biblicus[markitdown]"` (requires Python 3.10 or later)

### [unstructured](unstructured.md)

Uses Unstructured.io for complex document parsing. Supports:
- DOCX, XLSX, PPTX
- Email formats (EML, MSG)
- Markdown, HTML, XML
- Advanced chunking and partitioning

**Installation**: `pip install "biblicus[unstructured]"`

## Choosing an Extractor

| Format | Recommended Extractor | Alternative |
|--------|----------------------|-------------|
| PDF (text layer) | [pdf-text](pdf.md) | [markitdown](markitdown.md) |
| PDF (scanned) | See [OCR](../ocr/index.md) or [VLM](../vlm-document/index.md) | |
| DOCX, XLSX, PPTX | [markitdown](markitdown.md) | [unstructured](unstructured.md) |
| Markdown, HTML | [markitdown](markitdown.md) | [unstructured](unstructured.md) |
| Plain text | [pass-through-text](pass-through.md) | [metadata-text](metadata.md) |

## Common Patterns

### Fallback Chain

Use [select-text](../pipeline-utilities/select-text.md) to try extractors in the listed order and keep the first usable result:

```yaml
extractor_id: select-text
config:
  extractors:
    - pdf-text
    - markitdown
    - unstructured
```
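
The same shape can be adapted to other formats. As a sketch, assuming `select-text` accepts any of the extractor IDs listed on this page and that both optional extras are installed, a chain for Office documents that follows the table in Choosing an Extractor might look like this:

```yaml
# Hypothetical variation of the example above, for DOCX, XLSX, and PPTX files.
extractor_id: select-text
config:
  extractors:
    - markitdown    # recommended extractor for Office formats
    - unstructured  # listed alternative for Office formats
```

Check the [select-text](../pipeline-utilities/select-text.md) page for the exact rules that decide when the next extractor is used.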

### Metadata + Content

Use [pipeline](../pipeline-utilities/pipeline.md) to combine metadata and content:

```yaml
extractor_id: pipeline
config:
  extractors:
    - metadata-text
    - pdf-text
```
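
For Office documents, a similar sketch swaps `pdf-text` for `markitdown`. This assumes `pipeline` accepts the same list form shown above and that the `markitdown` extra is installed:

```yaml
# Hypothetical variation: metadata plus Office document content.
extractor_id: pipeline
config:
  extractors:
    - metadata-text
    - markitdown
```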

## See Also

- [Extractors Overview](../index.md)
- [OCR Extractors](../ocr/index.md) - For scanned documents
- [VLM Extractors](../vlm-document/index.md) - For complex layouts
- [Pipeline Utilities](../pipeline-utilities/index.md) - For combining strategies