Text Extraction Pipeline

Text extraction is a separate pipeline stage that produces derived text artifacts under a corpus.

This separation matters because it lets you combine extraction choices and retrieval backends independently.

For detailed documentation on specific extractors, see Extractor Reference.

What extraction produces

An extraction snapshot produces:

A snapshot manifest
Per-item extracted text files for the final output
Per-stage extracted text artifacts for all pipeline stages
Per-item result status, including extracted, skipped, and errored outcomes

Extraction artifacts are stored under the corpus:

corpus/
  extracted/
    pipeline/
      <snapshot id>/
        manifest.json
        text/
          <item id>.txt
        stages/
          01-pass-through-text/
            text/
              <item id>.txt
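
To make the layout concrete, here is a minimal Python sketch that walks a snapshot's text/ folder and maps each item identifier to its final extracted text file. The helper name final_text_paths is hypothetical and not part of Biblicus; it only assumes the directory layout shown above, and the throwaway corpus is built purely for illustration.

```python
import tempfile
from pathlib import Path


def final_text_paths(corpus_root: Path, snapshot_id: str) -> dict[str, Path]:
    """Map item id -> final extracted text file for one pipeline snapshot.

    Hypothetical helper: it relies only on the directory layout shown above.
    """
    text_dir = corpus_root / "extracted" / "pipeline" / snapshot_id / "text"
    return {path.stem: path for path in sorted(text_dir.glob("*.txt"))}


# Build a throwaway corpus with the same layout to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "corpus"
    text_dir = corpus / "extracted" / "pipeline" / "SNAPSHOT_ID" / "text"
    text_dir.mkdir(parents=True)
    (text_dir / "item-a.txt").write_text("hello", encoding="utf-8")
    (text_dir / "item-b.txt").write_text("world", encoding="utf-8")

    found = final_text_paths(corpus, "SNAPSHOT_ID")
    # Read the files while the temporary directory still exists.
    texts = {item_id: path.read_text(encoding="utf-8") for item_id, path in found.items()}

print(texts)
```

The same approach applies to stages/<stage directory>/text/ if you want to compare intermediate outputs for a single item.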

Output structure

Extraction output is structured and inspectable. The manifest captures the configuration and per-item status:

{
  "snapshot_id": "RUN_ID",
  "extractor_id": "pipeline",
  "configuration": {
    "name": "default",
    "stages": ["pass-through-text", "metadata-text"]
  },
  "stats": {
    "total_items": 3,
    "extracted_items": 2,
    "skipped_items": 1,
    "errored_items": 0
  }
}

The text/ folder contains the final extracted text for each item, while stages/ preserves all intermediate outputs.
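
As an illustration of inspecting the manifest, the following sketch loads the example above and derives the snapshot reference and a simple consistency check. The function summarize_manifest is hypothetical, and the check that extracted, skipped, and errored counts add up to total_items is an assumption drawn from the example numbers rather than a documented guarantee.

```python
import json

# Copy of the example manifest shown above.
MANIFEST_JSON = """
{
  "snapshot_id": "RUN_ID",
  "extractor_id": "pipeline",
  "configuration": {
    "name": "default",
    "stages": ["pass-through-text", "metadata-text"]
  },
  "stats": {
    "total_items": 3,
    "extracted_items": 2,
    "skipped_items": 1,
    "errored_items": 0
  }
}
"""


def summarize_manifest(manifest: dict) -> dict:
    """Hypothetical helper: derive a reference and a sanity check from a manifest."""
    stats = manifest["stats"]
    accounted = (
        stats["extracted_items"] + stats["skipped_items"] + stats["errored_items"]
    )
    return {
        # extractor_id:snapshot_id, the reference format used elsewhere in this document
        "reference": f'{manifest["extractor_id"]}:{manifest["snapshot_id"]}',
        "stages": list(manifest["configuration"]["stages"]),
        # Assumption: every item ends in exactly one of the three outcomes.
        "unaccounted_items": stats["total_items"] - accounted,
    }


summary = summarize_manifest(json.loads(MANIFEST_JSON))
print(summary)
```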

Reproducibility checklist

Record the extraction snapshot identifier (extractor_id:snapshot_id).
Keep the pipeline stages and stage order in source control (configuration files are preferred).
Capture the catalog timestamp when comparing extraction snapshots.
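
One way to keep these three facts together is a small record that you store next to your experiment notes. The sketch below is illustrative only: parse_snapshot_reference and reproducibility_record are hypothetical names, the record fields are assumptions, and splitting at the first colon assumes extractor identifiers never contain one.

```python
def parse_snapshot_reference(reference: str) -> tuple[str, str]:
    """Split an extractor_id:snapshot_id reference into its two parts."""
    # Assumption: the extractor id contains no colon, so split at the first one.
    extractor_id, separator, snapshot_id = reference.partition(":")
    if not separator or not extractor_id or not snapshot_id:
        raise ValueError(f"expected extractor_id:snapshot_id, got {reference!r}")
    return extractor_id, snapshot_id


def reproducibility_record(
    reference: str, stages: list[str], catalog_timestamp: str
) -> dict:
    """Bundle the checklist items into one dictionary (hypothetical format)."""
    extractor_id, snapshot_id = parse_snapshot_reference(reference)
    return {
        "extractor_id": extractor_id,
        "snapshot_id": snapshot_id,
        "stages": list(stages),  # stage order matters, so keep it as a list
        "catalog_timestamp": catalog_timestamp,
    }


record = reproducibility_record(
    "pipeline:RUN_ID",
    ["pass-through-text", "metadata-text"],
    "CATALOG_TIMESTAMP",  # placeholder; use the value from your own corpus
)
print(record)
```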

Available Extractors

Biblicus provides built-in extractors organized by category:

Text & Document Processing

pass-through-text - Direct text file reading
metadata-text - Text from item metadata
pdf-text - PDF text extraction using pypdf
markitdown - Office documents via MarkItDown
unstructured - Universal document parsing

Optical Character Recognition

ocr-rapidocr - Fast ONNX-based OCR
ocr-paddleocr-vl - Advanced OCR with VL model

Vision-Language Models

docling-smol - SmolDocling-256M for fast document processing
docling-granite - Granite Docling-258M for high-accuracy extraction

Speech-to-Text

stt-openai - OpenAI Whisper API
stt-deepgram - Deepgram Nova-3 API
stt-aldea - Aldea Speech-to-Text API
deepgram-transform - Render structured Deepgram metadata

Pipeline Utilities

select-text - First successful extractor
select-longest-text - Longest output selection
select-override - Per-item override by ID
select-smart-override - Media type-based routing
pipeline - Multi-stage extraction workflow

For detailed documentation including configuration options, usage examples, and best practices, see the Extractor Reference.

How selection chooses text

The select-text extractor does not attempt to judge extraction quality. It chooses the first usable text from prior pipeline outputs in pipeline order.

Usable means non-empty after stripping whitespace.

This means selection does not automatically choose the longest extracted text or the extraction with the most content. If you want a scoring rule, such as choosing the longest extracted text, use the select-longest-text extractor instead.
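
The difference between the two rules can be sketched in a few lines of Python. This is not the Biblicus implementation: the function names are hypothetical, prior stage outputs are modelled as a list of optional strings in pipeline order, and comparing by raw length (with ties going to the earlier stage) is an assumption about how a longest-text rule might behave.

```python
from typing import Optional, Sequence


def is_usable(text: Optional[str]) -> bool:
    """Usable means non-empty after stripping whitespace."""
    return text is not None and text.strip() != ""


def select_first_usable(outputs: Sequence[Optional[str]]) -> Optional[str]:
    """First-usable rule: return the earliest usable output in pipeline order."""
    for text in outputs:
        if is_usable(text):
            return text
    return None


def select_longest(outputs: Sequence[Optional[str]]) -> Optional[str]:
    """Longest-text rule (assumed scoring): the usable output with the most characters."""
    usable = [text for text in outputs if is_usable(text)]
    # max() keeps the first maximal element, so ties go to the earlier stage.
    return max(usable, key=len) if usable else None


# Outputs of four earlier stages, in pipeline order.
prior_outputs = [None, "   ", "short", "a much longer extraction"]

first = select_first_usable(prior_outputs)
longest = select_longest(prior_outputs)
print(first)
print(longest)
```

Here the first-usable rule returns "short" even though a later stage produced more text, which is exactly the behavior to keep in mind when ordering stages.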

Other selection strategies include:

select-override - Override extraction for specific items by ID
select-smart-override - Route items based on media type patterns

Pipeline extractor

The pipeline extractor composes multiple extractors into an explicit pipeline.

The pipeline runs every stage in order and records all stage outputs. Each stage receives the raw item and the outputs of all prior stages. The final extracted text is the last extracted output in pipeline order.

This lets you build explicit extraction policies while keeping every stage outcome available for comparison and metrics.
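
The control flow described above can be sketched as follows. This is a simplified model, not the Biblicus code: StageOutput, run_pipeline, the item dictionary shape, and the two toy stages are all hypothetical, and a stage that produces no text is modelled as returning None.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StageOutput:
    stage_name: str
    text: Optional[str]  # None models a stage that produced no text for the item


# A stage sees the raw item and the outputs of all prior stages.
Stage = Callable[[dict, list[StageOutput]], Optional[str]]


def run_pipeline(
    item: dict, stages: list[tuple[str, Stage]]
) -> tuple[list[StageOutput], Optional[str]]:
    """Run every stage in order, record each output, and pick the final text."""
    outputs: list[StageOutput] = []
    for stage_name, stage in stages:
        text = stage(item, list(outputs))  # pass a copy of the prior outputs
        outputs.append(StageOutput(stage_name, text))
    # The final text is the last extracted output in pipeline order.
    extracted = [output.text for output in outputs if output.text is not None]
    final_text = extracted[-1] if extracted else None
    return outputs, final_text


def pass_through_text(item: dict, prior: list[StageOutput]) -> Optional[str]:
    """Toy stage: only applies to text items."""
    if item["media_type"].startswith("text/"):
        return item["content"]
    return None


def metadata_text(item: dict, prior: list[StageOutput]) -> Optional[str]:
    """Toy stage: always applies; the output format here is invented."""
    return "tags: " + ", ".join(item["tags"])


image_item = {"media_type": "image/png", "content": None, "tags": ["extracted"]}
outputs, final_text = run_pipeline(
    image_item,
    [("pass-through-text", pass_through_text), ("metadata-text", metadata_text)],
)
print([(output.stage_name, output.text) for output in outputs])
print(final_text)
```

Because every stage output is kept, a later selection stage or an offline comparison can look at all of them rather than only the final text.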

For details, see the pipeline extractor documentation.

Complementary versus competing extractors

The pipeline is designed for complementary stages that do not overlap much in what they handle.

Examples of complementary stages:

A text extractor that only applies to text items
A Portable Document Format text extractor that only applies to application/pdf
An optical character recognition extractor that applies to images and scanned Portable Document Format files
A speech-to-text extractor that applies to audio items
A metadata extractor that always applies but produces low fidelity fallback text

Competing extractors are different. Competing extractors both claim they can handle the same item type, but they might produce different output quality. When you want to compare or switch between competing extractors, make that decision explicit with a selection extractor stage such as select-text or a custom selection extractor.

Example: extract from a corpus

rm -rf corpora/extraction-demo
python -m biblicus init corpora/extraction-demo

printf 'x' > /tmp/image.png
python -m biblicus ingest --corpus corpora/extraction-demo /tmp/image.png --tag extracted

python -m biblicus extract build --corpus corpora/extraction-demo \
  --stage pass-through-text \
  --stage pdf-text \
  --stage metadata-text

The image is not a text item or a PDF, so the pass-through-text and pdf-text stages do not apply to it. The extracted text for the image comes from the metadata-text stage.

Example: selection within a pipeline

Selection is just another extractor in the pipeline: a stage that reads the extracted text from previous pipeline stages and decides which prior output to carry forward.

python -m biblicus extract build --corpus corpora/extraction-demo \
  --stage pass-through-text \
  --stage metadata-text \
  --stage select-text

The pipeline run produces one extraction snapshot under the pipeline extractor identifier. You can point retrieval backends at that snapshot.

Example: PDF with OCR fallback

Try text extraction first, fall back to OCR for scanned documents:

python -m biblicus extract build --corpus corpora/extraction-demo \
  --stage pdf-text \
  --stage ocr-rapidocr \
  --stage select-text

This pipeline runs pdf-text and then ocr-rapidocr, and select-text carries forward the first usable result. PDFs with a text layer resolve to the pdf-text output; scanned PDFs, where pdf-text yields no usable text, fall back to the ocr-rapidocr output.

Example: VLM for complex documents

Use vision-language models for documents with complex layouts:

python -m biblicus extract build --corpus corpora/extraction-demo \
  --stage docling-granite

The docling-granite extractor uses IBM Research’s Granite Docling-258M VLM for high-accuracy extraction of tables, code blocks, and equations.

Inspecting and deleting extraction snapshots

Extraction snapshots are stored under the corpus and can be listed and inspected.

python -m biblicus extract list --corpus corpora/extraction-demo
python -m biblicus extract show --corpus corpora/extraction-demo --run pipeline:EXTRACTION_SNAPSHOT_ID

Deletion is explicit and requires typing the exact run reference as confirmation:

python -m biblicus extract delete --corpus corpora/extraction-demo \
  --run pipeline:EXTRACTION_SNAPSHOT_ID \
  --confirm pipeline:EXTRACTION_SNAPSHOT_ID

Common pitfalls

Comparing extraction snapshots built from different pipeline stage orders.
Forgetting to capture the extraction snapshot reference before building retrieval snapshots.
Assuming selection stages choose the “best” output rather than the first usable output.

Use extracted text in retrieval

Retrieval backends can build and query using a selected extraction snapshot. This is configured by passing extraction_snapshot=extractor_id:snapshot_id to the backend build command.

python -m biblicus build --corpus corpora/extraction-demo --backend sqlite-full-text-search \
  --config extraction_snapshot=pipeline:EXTRACTION_SNAPSHOT_ID
python -m biblicus query --corpus corpora/extraction-demo --query extracted
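
If you script these steps, it can help to assemble the build command from the snapshot reference instead of pasting it by hand. The sketch below only constructs the argument list for the build command shown above and does not run it; retrieval_build_command is a hypothetical helper, and the flags are exactly those used in the example.

```python
import shlex


def retrieval_build_command(
    corpus: str, backend: str, snapshot_reference: str
) -> list[str]:
    """Assemble the build command shown above as an argument list."""
    return [
        "python", "-m", "biblicus", "build",
        "--corpus", corpus,
        "--backend", backend,
        # extraction_snapshot takes an extractor_id:snapshot_id reference
        "--config", f"extraction_snapshot={snapshot_reference}",
    ]


command = retrieval_build_command(
    "corpora/extraction-demo",
    "sqlite-full-text-search",
    "pipeline:EXTRACTION_SNAPSHOT_ID",
)
# Print a shell-ready form; pass `command` to subprocess.run to execute it.
print(shlex.join(command))
```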

Evaluate extraction quality

Extraction evaluation measures coverage and accuracy for a given extractor configuration. See docs/extraction-evaluation.md for the dataset format, command-line interface usage, and report interpretation.

What extraction is not

Text extraction does not mutate the raw corpus. It is derived output that can be regenerated and compared across implementations.

Optical character recognition and speech-to-text are implemented as extractors so you can compare providers and configurations while keeping raw items immutable.