Text Extraction Pipeline
Text extraction is a separate pipeline stage that produces derived text artifacts under a corpus.
This separation matters because it lets you combine extraction choices and retrieval backends independently.
For detailed documentation on specific extractors, see Extractor Reference.
What extraction produces
An extraction snapshot produces:
A snapshot manifest
Per-item extracted text files for the final output
Per-stage extracted text artifacts for all pipeline stages
Per-item result status, including extracted, skipped, and errored outcomes
Extraction artifacts are stored under the corpus:
corpus/
  extracted/
    pipeline/
      <snapshot id>/
        manifest.json
        text/
          <item id>.txt
        stages/
          01-pass-through-text/
            text/
              <item id>.txt
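The layout above can be turned into path helpers. This is a minimal sketch, not part of Biblicus: the function names are hypothetical, and the stage directory naming (a two-digit, 1-based index followed by the stage name) is an assumption inferred from the single 01-pass-through-text example.

```python
from pathlib import Path


def snapshot_dir(corpus: Path, snapshot_id: str, extractor_id: str = "pipeline") -> Path:
    """Directory that holds one extraction snapshot (hypothetical helper)."""
    return corpus / "extracted" / extractor_id / snapshot_id


def final_text_path(corpus: Path, snapshot_id: str, item_id: str) -> Path:
    """Path of the final extracted text for one item."""
    return snapshot_dir(corpus, snapshot_id) / "text" / f"{item_id}.txt"


def stage_text_path(
    corpus: Path, snapshot_id: str, stage_index: int, stage_name: str, item_id: str
) -> Path:
    """Path of one stage's intermediate text for one item."""
    # Assumption: stage directories are named "<two-digit index>-<stage name>".
    stage_dir = f"{stage_index:02d}-{stage_name}"
    return snapshot_dir(corpus, snapshot_id) / "stages" / stage_dir / "text" / f"{item_id}.txt"
```

Keeping the path logic in one place makes it easier to compare the final text of an item with what each stage produced for it.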
Output structure
Extraction output is structured and inspectable. The manifest captures the configuration and per-item status:
{
  "snapshot_id": "RUN_ID",
  "extractor_id": "pipeline",
  "configuration": {
    "name": "default",
    "stages": ["pass-through-text", "metadata-text"]
  },
  "stats": {
    "total_items": 3,
    "extracted_items": 2,
    "skipped_items": 1,
    "errored_items": 0
  }
}
The text/ folder contains the final extracted text for each item, while stages/ preserves all intermediate outputs.
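Because the manifest is plain JSON, it can be inspected with a few lines of Python. The sketch below is illustrative only: summarize_manifest is a hypothetical helper, it reads just the fields shown in the example manifest, and it assumes the three outcome counts add up to total_items, which holds for the example but is not stated as a guarantee.

```python
import json


def summarize_manifest(manifest_text: str) -> dict:
    """Summarize the stats block of a snapshot manifest given as JSON text."""
    manifest = json.loads(manifest_text)
    stats = manifest["stats"]
    total = stats["total_items"]
    accounted = (
        stats["extracted_items"] + stats["skipped_items"] + stats["errored_items"]
    )
    return {
        "stages": manifest["configuration"]["stages"],
        # Fraction of items that produced text; 0.0 when there are no items.
        "extracted_fraction": stats["extracted_items"] / total if total else 0.0,
        # Assumption: extracted, skipped, and errored counts partition the items.
        "all_accounted_for": accounted == total,
    }
```

To use it, pass the contents of manifest.json, for example the result of Path(...).read_text(encoding="utf-8").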
Reproducibility checklist
Record the extraction snapshot identifier (extractor_id:snapshot_id).
Keep the pipeline stages and stage order in source control (configuration files are preferred).
Capture the catalog timestamp when comparing extraction snapshots.
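The checklist can be backed by small helpers that build and validate the extractor_id:snapshot_id reference and check that two snapshots used the same stages in the same order. This is a sketch under assumptions: the function names are hypothetical, the manifests are dictionaries shaped like the example above, and extractor identifiers are assumed not to contain a colon.

```python
from typing import Tuple


def format_reference(extractor_id: str, snapshot_id: str) -> str:
    """Build the snapshot reference recorded for reproducibility."""
    return f"{extractor_id}:{snapshot_id}"


def parse_reference(reference: str) -> Tuple[str, str]:
    """Split a reference at the first colon and reject incomplete values."""
    extractor_id, separator, snapshot_id = reference.partition(":")
    if not (separator and extractor_id and snapshot_id):
        raise ValueError(f"expected extractor_id:snapshot_id, got {reference!r}")
    return extractor_id, snapshot_id


def same_stage_order(manifest_a: dict, manifest_b: dict) -> bool:
    """True when both manifests list identical stages in identical order."""
    return manifest_a["configuration"]["stages"] == manifest_b["configuration"]["stages"]
```

Checking same_stage_order before comparing two snapshots guards against attributing a difference in output to the data when the stage order changed.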
Available Extractors
Biblicus provides built-in extractors organized by category:
Text & Document Processing
pass-through-text - Direct text file reading
metadata-text - Text from item metadata
pdf-text - PDF text extraction using pypdf
markitdown - Office documents via MarkItDown
unstructured - Universal document parsing
Optical Character Recognition
ocr-rapidocr - Fast ONNX-based OCR
ocr-paddleocr-vl - Advanced OCR with VL model
Vision-Language Models
docling-smol - SmolDocling-256M for fast document processing
docling-granite - Granite Docling-258M for high-accuracy extraction
Speech-to-Text
stt-openai - OpenAI Whisper API
stt-deepgram - Deepgram Nova-3 API
stt-aldea - Aldea Speech-to-Text API
deepgram-transform - Render structured Deepgram metadata
Pipeline Utilities
select-text - First successful extractor
select-longest-text - Longest output selection
select-override - Per-item override by ID
select-smart-override - Media type-based routing
pipeline - Multi-stage extraction workflow
For detailed documentation including configuration options, usage examples, and best practices, see the Extractor Reference.
How selection chooses text
The select-text extractor does not attempt to judge extraction quality. It chooses the first usable text from prior pipeline outputs in pipeline order.
Usable means non-empty after stripping whitespace.
This means selection does not automatically choose the longest extracted text or the extraction with the most content. If you want a scoring rule such as “choose the longest extracted text”, use the select-longest-text extractor instead.
Other selection strategies include:
select-override - Override extraction for specific items by ID
select-smart-override - Route items based on media type patterns
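The difference between first-usable and longest-text selection can be sketched in a few lines. This is not the Biblicus implementation: the function names are hypothetical, prior stage outputs are modeled as a list of optional strings in pipeline order, and measuring length on the stripped text is an assumption.

```python
from typing import Optional, Sequence


def is_usable(text: Optional[str]) -> bool:
    """Usable means non-empty after stripping whitespace."""
    return text is not None and text.strip() != ""


def select_first_usable(outputs: Sequence[Optional[str]]) -> Optional[str]:
    """Return the first usable output in pipeline order, without judging quality."""
    for text in outputs:
        if is_usable(text):
            return text
    return None


def select_longest(outputs: Sequence[Optional[str]]) -> Optional[str]:
    """Return the usable output with the most characters after stripping."""
    usable = [text for text in outputs if is_usable(text)]
    if not usable:
        return None
    # max() keeps the earliest candidate when lengths tie.
    return max(usable, key=lambda text: len(text.strip()))
```

With the same inputs the two rules can disagree: a short but usable early output wins under first-usable selection even when a later stage produced far more text.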
Pipeline extractor
The pipeline extractor composes multiple extractors into an explicit pipeline.
The pipeline runs every stage in order and records all stage outputs. Each stage receives the raw item and the outputs of all prior stages. The final extracted text is the last extracted output in pipeline order.
This lets you build explicit extraction policies while keeping every stage outcome available for comparison and metrics.
For details, see the pipeline extractor documentation.
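The pipeline semantics described above can be sketched as follows. This is an illustration, not the Biblicus API: StageResult, Stage, and run_pipeline are hypothetical names, a stage is modeled as a callable that returns text or None when it skips the item, and treating any non-None result as "extracted" is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class StageResult:
    stage: str
    text: Optional[str]  # None means the stage skipped the item


# A stage receives the raw item and the results of all prior stages.
Stage = Callable[[bytes, List[StageResult]], Optional[str]]


def run_pipeline(
    raw_item: bytes, stages: List[Tuple[str, Stage]]
) -> Tuple[Optional[str], List[StageResult]]:
    """Run every stage in order and keep every stage outcome."""
    results: List[StageResult] = []
    for name, stage in stages:
        # Pass a copy so a stage cannot alter the recorded history.
        text = stage(raw_item, list(results))
        results.append(StageResult(name, text))
    # The final text is the last extracted output in pipeline order.
    final = None
    for result in results:
        if result.text is not None:
            final = result.text
    return final, results
```

Under this model a selection stage placed last simply re-emits one of the prior outputs, which then becomes the final text because it is the last extracted output.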
Complementary versus competing extractors
The pipeline is designed for complementary stages that do not overlap much in what they handle.
Examples of complementary stages:
A text extractor that only applies to text items
A Portable Document Format text extractor that only applies to application/pdf
An optical character recognition extractor that applies to images and scanned Portable Document Format files
A speech-to-text extractor that applies to audio items
A metadata extractor that always applies but produces low fidelity fallback text
Competing extractors are different: they claim to handle the same item type but may produce output of different quality. When you want to compare or switch between competing extractors, make that decision explicit with a selection extractor stage such as select-text or a custom selection extractor.
Example: extract from a corpus
rm -rf corpora/extraction-demo
python -m biblicus init corpora/extraction-demo
printf 'x' > /tmp/image.png
python -m biblicus ingest --corpus corpora/extraction-demo /tmp/image.png --tag extracted
python -m biblicus extract build --corpus corpora/extraction-demo \
  --stage pass-through-text \
  --stage pdf-text \
  --stage metadata-text
The extracted text for the image comes from the metadata-text stage because the image is neither a text item nor a PDF, so the pass-through-text and pdf-text stages do not apply to it.
Example: selection within a pipeline
Selection is just another extractor in the pipeline: a stage that chooses among the extracted text from previous stages and decides which prior output to carry forward.
python -m biblicus extract build --corpus corpora/extraction-demo \
  --stage pass-through-text \
  --stage metadata-text \
  --stage select-text
The pipeline run produces one extraction snapshot under extracted/pipeline/ in the corpus. You can point retrieval backends at that snapshot.
Example: PDF with OCR fallback
Try text extraction first, fall back to OCR for scanned documents:
python -m biblicus extract build --corpus corpora/extraction-demo \
  --stage pdf-text \
  --stage ocr-rapidocr \
  --stage select-text
This pipeline tries pdf-text first for PDFs with text layers, falls back to ocr-rapidocr for scanned PDFs, and uses select-text to pick the first successful result.
Example: VLM for complex documents
Use vision-language models for documents with complex layouts:
python -m biblicus extract build --corpus corpora/extraction-demo \
  --stage docling-granite
The docling-granite extractor uses IBM Research’s Granite Docling-258M VLM for high-accuracy extraction of tables, code blocks, and equations.
Inspecting and deleting extraction snapshots
Extraction snapshots are stored under the corpus and can be listed and inspected.
python -m biblicus extract list --corpus corpora/extraction-demo
python -m biblicus extract show --corpus corpora/extraction-demo --run pipeline:EXTRACTION_SNAPSHOT_ID
Deletion is explicit and requires typing the exact run reference as confirmation:
python -m biblicus extract delete --corpus corpora/extraction-demo \
  --run pipeline:EXTRACTION_SNAPSHOT_ID \
  --confirm pipeline:EXTRACTION_SNAPSHOT_ID
Common pitfalls
Comparing extraction snapshots built from different pipeline stage orders.
Forgetting to capture the extraction snapshot reference before building retrieval snapshots.
Assuming selection stages choose the “best” output rather than the first usable output.
Use extracted text in retrieval
Retrieval backends can build and query using a selected extraction snapshot. This is configured by passing extraction_snapshot=extractor_id:snapshot_id to the backend build command.
python -m biblicus build --corpus corpora/extraction-demo --backend sqlite-full-text-search \
  --config extraction_snapshot=pipeline:EXTRACTION_SNAPSHOT_ID
python -m biblicus query --corpus corpora/extraction-demo --query extracted
Evaluate extraction quality
Extraction evaluation measures coverage and accuracy for a given extractor configuration. See docs/extraction-evaluation.md for the dataset format, command-line interface usage, and report interpretation.
What extraction is not
Text extraction does not mutate the raw corpus. It is derived output that can be regenerated and compared across implementations.
Optical character recognition and speech-to-text are implemented as extractors so you can compare providers and configurations while keeping raw items immutable.