# Text Extraction Pipeline

Text extraction is a separate pipeline stage that produces derived text artifacts under a corpus.

This separation matters because it lets you combine extraction choices and retrieval backends independently.

For detailed documentation on specific extractors, see [Extractor Reference](extractors/index.md).

## What extraction produces

An extraction snapshot produces:

- A snapshot manifest
- Per-item extracted text files for the final output
- Per-stage extracted text artifacts for all pipeline stages
- Per-item result status, including extracted, skipped, and errored outcomes

Extraction artifacts are stored under the corpus:

```
corpus/
  extracted/
    pipeline/
      <snapshot id>/
        manifest.json
        text/
          <item id>.txt
        stages/
          01-pass-through-text/
            text/
              <item id>.txt
```

### Output structure

Extraction output is structured and inspectable. The manifest captures the configuration and per-item status:

```json
{
  "snapshot_id": "RUN_ID",
  "extractor_id": "pipeline",
  "configuration": {
    "name": "default",
    "stages": ["pass-through-text", "metadata-text"]
  },
  "stats": {
    "total_items": 3,
    "extracted_items": 2,
    "skipped_items": 1,
    "errored_items": 0
  }
}
```

The `text/` folder contains the final extracted text for each item, while `stages/` preserves all intermediate outputs.
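Because the output is plain files, a snapshot can be inspected with a few lines of Python. The following is a minimal sketch, not part of the Biblicus API: `summarize_snapshot` is a hypothetical helper that assumes the directory layout and manifest fields shown above, and it assumes that the three per-status counts add up to `total_items`, as they do in the example manifest.

```python
import json
from pathlib import Path


def summarize_snapshot(snapshot_dir: Path) -> dict:
    """Summarize one extraction snapshot directory (hypothetical helper).

    Assumes the layout shown above: manifest.json plus a text/ folder
    holding one <item id>.txt file per item with final extracted text.
    """
    manifest_path = snapshot_dir / "manifest.json"
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    stats = manifest.get("stats", {})

    # Item identifiers are recovered from the file names in text/.
    text_dir = snapshot_dir / "text"
    item_ids = sorted(p.stem for p in text_dir.glob("*.txt")) if text_dir.is_dir() else []

    # Assumption: extracted + skipped + errored should equal the total.
    status_sum = sum(
        stats.get(key, 0)
        for key in ("extracted_items", "skipped_items", "errored_items")
    )

    return {
        "reference": f"{manifest['extractor_id']}:{manifest['snapshot_id']}",
        "stages": manifest.get("configuration", {}).get("stages", []),
        "final_text_item_ids": item_ids,
        "counts_consistent": stats.get("total_items") == status_sum,
    }
```

Calling `summarize_snapshot(Path("corpus/extracted/pipeline") / snapshot_id)` returns the snapshot reference, the configured stage order, and the identifiers of items that have final text.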

## Reproducibility checklist

- Record the extraction snapshot identifier (`extractor_id:snapshot_id`).
- Keep the pipeline stages and stage order in source control (configuration files are preferred).
- Capture the catalog timestamp when comparing extraction snapshots.
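
When recording snapshot identifiers in scripts or experiment logs, it can help to validate the `extractor_id:snapshot_id` form before storing it. The sketch below is illustrative only: `SnapshotReference` and `parse_snapshot_reference` are hypothetical names, and splitting on the first colon is an assumption about the reference format.

```python
from typing import NamedTuple


class SnapshotReference(NamedTuple):
    """Hypothetical container for an extractor_id:snapshot_id reference."""

    extractor_id: str
    snapshot_id: str

    def __str__(self) -> str:
        return f"{self.extractor_id}:{self.snapshot_id}"


def parse_snapshot_reference(value: str) -> SnapshotReference:
    """Split a reference on the first colon and reject incomplete values."""
    extractor_id, separator, snapshot_id = value.partition(":")
    if not separator or not extractor_id or not snapshot_id:
        raise ValueError(
            f"expected 'extractor_id:snapshot_id', got {value!r}"
        )
    return SnapshotReference(extractor_id, snapshot_id)
```

Storing the parsed reference alongside the pipeline configuration keeps the extraction snapshot traceable when retrieval snapshots are built later.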

## Available Extractors

Biblicus provides built-in extractors organized by category:

### Text & Document Processing

- [`pass-through-text`](extractors/text-document/pass-through.md) - Direct text file reading
- [`metadata-text`](extractors/text-document/metadata.md) - Text from item metadata
- [`pdf-text`](extractors/text-document/pdf.md) - PDF text extraction using pypdf
- [`markitdown`](extractors/text-document/markitdown.md) - Office documents via MarkItDown
- [`unstructured`](extractors/text-document/unstructured.md) - Universal document parsing

### Optical Character Recognition

- [`ocr-rapidocr`](extractors/ocr/rapidocr.md) - Fast ONNX-based OCR
- [`ocr-paddleocr-vl`](extractors/ocr/paddleocr-vl.md) - Advanced OCR with VL model

### Vision-Language Models

- [`docling-smol`](extractors/vlm-document/docling-smol.md) - SmolDocling-256M for fast document processing
- [`docling-granite`](extractors/vlm-document/docling-granite.md) - Granite Docling-258M for high-accuracy extraction

### Speech-to-Text

- [`stt-openai`](extractors/speech-to-text/openai.md) - OpenAI Whisper API
- [`stt-deepgram`](extractors/speech-to-text/deepgram.md) - Deepgram Nova-3 API
- [`stt-aldea`](extractors/speech-to-text/aldea.md) - Aldea Speech-to-Text API
- [`deepgram-transform`](extractors/speech-to-text/deepgram-transform.md) - Render structured Deepgram metadata

### Pipeline Utilities

- [`select-text`](extractors/pipeline-utilities/select-text.md) - First successful extractor
- [`select-longest-text`](extractors/pipeline-utilities/select-longest.md) - Longest output selection
- [`select-override`](extractors/pipeline-utilities/select-override.md) - Per-item override by ID
- [`select-smart-override`](extractors/pipeline-utilities/select-smart-override.md) - Media type-based routing
- [`pipeline`](extractors/pipeline-utilities/pipeline.md) - Multi-stage extraction workflow

For detailed documentation including configuration options, usage examples, and best practices, see the [Extractor Reference](extractors/index.md).

## How selection chooses text

The `select-text` extractor does not attempt to judge extraction quality. It chooses the first usable text from prior pipeline outputs in pipeline order.

Usable means non-empty after stripping whitespace.

This means selection does not automatically choose the longest extracted text or the extraction with the most content. If you want a scoring rule, such as choosing the longest extracted text, use the [`select-longest-text`](extractors/pipeline-utilities/select-longest.md) extractor instead.

Other selection strategies include:

- [`select-override`](extractors/pipeline-utilities/select-override.md) - Override extraction for specific items by ID
- [`select-smart-override`](extractors/pipeline-utilities/select-smart-override.md) - Route items based on media type patterns
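
The difference between the "first usable" rule and a "longest" rule can be sketched in a few lines. This is an illustration of the rules described above, not the Biblicus implementation: the function names are hypothetical, and measuring length with `len` on the unstripped text is an assumption.

```python
from typing import Optional, Sequence


def is_usable(text: Optional[str]) -> bool:
    """Usable means non-empty after stripping whitespace."""
    return text is not None and text.strip() != ""


def select_first_usable(outputs: Sequence[Optional[str]]) -> Optional[str]:
    """Return the first usable text in pipeline order, or None."""
    for text in outputs:
        if is_usable(text):
            return text
    return None


def select_longest(outputs: Sequence[Optional[str]]) -> Optional[str]:
    """Return the longest usable text, or None (length rule is an assumption)."""
    usable = [text for text in outputs if is_usable(text)]
    return max(usable, key=len) if usable else None
```

Given prior outputs `[None, "   ", "short", "a much longer text"]`, the first rule returns `"short"` while the second returns `"a much longer text"`, which is why stage order matters for `select-text`.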

## Pipeline extractor

The `pipeline` extractor composes multiple extractors into an explicit pipeline.

The pipeline runs every stage in order and records all stage outputs. Each stage receives the raw item and the outputs of all prior stages. The final extracted text is the last extracted output in pipeline order.

This lets you build explicit extraction policies while keeping every stage outcome available for comparison and metrics.
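
The run semantics described above can be sketched as a small loop. This is a simplified model under stated assumptions, not the Biblicus implementation: `run_pipeline` and the two stand-in stages are hypothetical, items are modeled as plain dictionaries, and a stage signals "nothing extracted" by returning `None`.

```python
from typing import Callable, List, Optional, Tuple

# (stage name, extracted text or None when the stage produced nothing)
StageOutput = Tuple[str, Optional[str]]
Stage = Callable[[dict, List[StageOutput]], Optional[str]]


def run_pipeline(
    item: dict, stages: List[Tuple[str, Stage]]
) -> Tuple[Optional[str], List[StageOutput]]:
    """Run every stage in order and keep all stage outputs."""
    outputs: List[StageOutput] = []
    for name, stage in stages:
        # Each stage sees the raw item plus a copy of all prior outputs.
        text = stage(item, list(outputs))
        outputs.append((name, text))
    # The final text is the last output that actually extracted something.
    extracted = [text for _, text in outputs if text is not None]
    final_text = extracted[-1] if extracted else None
    return final_text, outputs


def pass_through_text(item: dict, prior: List[StageOutput]) -> Optional[str]:
    # Hypothetical stand-in: only applies to text items.
    if item["media_type"].startswith("text/"):
        return item["content"]
    return None


def metadata_text(item: dict, prior: List[StageOutput]) -> Optional[str]:
    # Hypothetical stand-in: always applies, renders tags as fallback text.
    return "tags: " + ", ".join(item.get("tags", []))
```

For an image item, `pass_through_text` returns `None` and the final text comes from `metadata_text`, while the returned list still records both stage outcomes for comparison.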

For details, see the [`pipeline` extractor documentation](extractors/pipeline-utilities/pipeline.md).

## Complementary versus competing extractors

The pipeline is designed for complementary stages that do not overlap much in what they handle.

Examples of complementary stages:

- A text extractor that only applies to text items
- A Portable Document Format text extractor that only applies to `application/pdf`
- An optical character recognition extractor that applies to images and scanned Portable Document Format files
- A speech-to-text extractor that applies to audio items
- A metadata extractor that always applies but produces low fidelity fallback text

Competing extractors are different: they claim to handle the same item type but may produce output of different quality. When you want to compare or switch between competing extractors, make that decision explicit with a selection extractor stage such as `select-text` or a custom selection extractor.

## Example: extract from a corpus

```
rm -rf corpora/extraction-demo
python -m biblicus init corpora/extraction-demo

printf 'x' > /tmp/image.png
python -m biblicus ingest --corpus corpora/extraction-demo /tmp/image.png --tag extracted

python -m biblicus extract build --corpus corpora/extraction-demo \
  --stage pass-through-text \
  --stage pdf-text \
  --stage metadata-text
```

The extracted text for the image comes from the `metadata-text` stage because the image is neither a text item nor a PDF, so the earlier stages produce no text for it.

## Example: selection within a pipeline

Selection is just another extractor in the pipeline: it runs as a stage and decides which prior stage's extracted text to carry forward.

```
python -m biblicus extract build --corpus corpora/extraction-demo \
  --stage pass-through-text \
  --stage metadata-text \
  --stage select-text
```

The pipeline run produces one extraction snapshot under `pipeline`. You can point retrieval backends at that snapshot.

## Example: PDF with OCR fallback

Try text extraction first, fall back to OCR for scanned documents:

```
python -m biblicus extract build --corpus corpora/extraction-demo \
  --stage pdf-text \
  --stage ocr-rapidocr \
  --stage select-text
```

This pipeline tries `pdf-text` first for PDFs with text layers, falls back to `ocr-rapidocr` for scanned PDFs, and uses `select-text` to pick the first usable (non-empty) result in pipeline order.

## Example: VLM for complex documents

Use vision-language models for documents with complex layouts:

```
python -m biblicus extract build --corpus corpora/extraction-demo \
  --stage docling-granite
```

The `docling-granite` extractor uses IBM Research's Granite Docling-258M VLM for high-accuracy extraction of tables, code blocks, and equations.

## Inspecting and deleting extraction snapshots

Extraction snapshots are stored under the corpus and can be listed and inspected.

```
python -m biblicus extract list --corpus corpora/extraction-demo
python -m biblicus extract show --corpus corpora/extraction-demo --run pipeline:EXTRACTION_SNAPSHOT_ID
```

Deletion is explicit and requires typing the exact run reference as confirmation:

```
python -m biblicus extract delete --corpus corpora/extraction-demo \
  --run pipeline:EXTRACTION_SNAPSHOT_ID \
  --confirm pipeline:EXTRACTION_SNAPSHOT_ID
```

## Common pitfalls

- Comparing extraction snapshots built from different pipeline stage orders.
- Forgetting to capture the extraction snapshot reference before building retrieval snapshots.
- Assuming selection stages choose the “best” output rather than the first usable output.

## Use extracted text in retrieval

Retrieval backends can build and query using a selected extraction snapshot. This is configured by passing `extraction_snapshot=extractor_id:snapshot_id` to the backend build command.

```
python -m biblicus build --corpus corpora/extraction-demo --backend sqlite-full-text-search \
  --config extraction_snapshot=pipeline:EXTRACTION_SNAPSHOT_ID
python -m biblicus query --corpus corpora/extraction-demo --query extracted
```

## Evaluate extraction quality

Extraction evaluation measures coverage and accuracy for a given extractor configuration. See `docs/extraction-evaluation.md` for the dataset format, command-line interface usage, and report interpretation.

## What extraction is not

Text extraction does not mutate the raw corpus. It is derived output that can be regenerated and compared across implementations.

Optical character recognition and speech-to-text are implemented as extractors so you can compare providers and configurations while keeping raw items immutable.