Demos

This document is a set of runnable examples you can use to see the current system working end to end. Each section links to a textbook chapter so you can read the concept and then run the code.

For the ordered plan of what to build next, see docs/roadmap.md.

Working examples you can run now

Use the examples in order if you are new to the system. They build from ingestion to extraction, retrieval, evaluation, and analysis.

Install for local development

From the repository root:

poetry install --with dev

Create a corpus and ingest a few items

rm -rf corpora/demo
python -m biblicus init corpora/demo

python -m biblicus ingest --corpus corpora/demo --note "Hello from a note" --title "First note" --tags "demo,notes"

printf "A tiny text file\n" > /tmp/biblicus-demo.txt
python -m biblicus ingest --corpus corpora/demo /tmp/biblicus-demo.txt

python -m biblicus ingest --corpus corpora/demo https://example.com

python -m biblicus list --corpus corpora/demo

Show an item

Copy an item identifier from the list output, then run:

python -m biblicus show --corpus corpora/demo ITEM_ID

Edit raw files and reindex

The catalog is rebuildable. You can edit raw files or sidecar metadata, then refresh the catalog.

python -m biblicus reindex --corpus corpora/demo

Crawl a website prefix

To turn a website section into corpus items, crawl a root page and restrict the crawl to an allowed prefix.

In one terminal, create a tiny local website and serve it:

rm -rf /tmp/biblicus-site
mkdir -p /tmp/biblicus-site/site/subdir
cat > /tmp/biblicus-site/site/index.html <<'HTML'
<html>
  <body>
    <a href="page.html">Page</a>
    <a href="subdir/">Subdir</a>
  </body>
</html>
HTML
cat > /tmp/biblicus-site/site/page.html <<'HTML'
<html><body>hello</body></html>
HTML
cat > /tmp/biblicus-site/site/subdir/index.html <<'HTML'
<html><body>subdir</body></html>
HTML

python -m http.server 8000 --directory /tmp/biblicus-site

In another terminal:

rm -rf corpora/crawl-demo
python -m biblicus init corpora/crawl-demo
python -m biblicus crawl --corpus corpora/crawl-demo \
  --root-url http://127.0.0.1:8000/site/index.html \
  --allowed-prefix http://127.0.0.1:8000/site/ \
  --max-items 50 \
  --tag crawled
python -m biblicus list --corpus corpora/crawl-demo

Build an extraction snapshot

Text extraction is a separate pipeline stage from retrieval. An extraction snapshot produces derived text artifacts under the corpus.

This extractor reads text items and skips non-text items.

python -m biblicus extract build --corpus corpora/demo --stage pass-through-text

The output includes a snapshot_id you can reuse when building a retrieval backend.

Text extraction details: docs/extraction.md

Graph extraction demo

Graph extraction runs after text extraction and writes to a Neo4j backend. The demo script will reuse the latest extraction snapshot or build a minimal one if needed.

python scripts/graph_extraction_demo.py --corpus corpora/demo --build-extraction --prepare-demo --verify

Graph extraction details: docs/graph-extraction.md

Graph extraction integration run

Use the integration script to download a small Wikipedia corpus, run extraction, and build a Neo4j graph snapshot with the simple-entities extractor.

python -m pip install neo4j

python scripts/graph_extraction_integration.py \
  --corpus corpora/wiki_graph_demo \
  --force \
  --verify \
  --report-path reports/graph_extraction_story.md

The report written to reports/graph_extraction_story.md summarizes the run in a shareable format.

Graph extraction baselines

Once the baseline extractors are enabled, you can compare different graph extractors by switching the extractor id and configuration:

python -m biblicus graph extract \
  --corpus corpora/example \
  --extractor ner-entities \
  --extraction-snapshot pipeline:RUN_ID \
  --configuration configurations/graph/ner-entities.yml

python -m biblicus graph extract \
  --corpus corpora/example \
  --extractor dependency-relations \
  --extraction-snapshot pipeline:RUN_ID \
  --configuration configurations/graph/dependency-relations.yml

Graph extractor narrative demos

Use the narrative demo script to run the full pipeline for a single extractor and print inputs + outputs. Each command downloads a small Wikipedia corpus, builds extraction and graph snapshots, and prints sample entities/terms and edges.

python scripts/graph_extraction_extractor_demo.py \
  --corpus corpora/wiki_graph_demo \
  --force \
  --extractor simple-entities

python scripts/graph_extraction_extractor_demo.py \
  --corpus corpora/wiki_graph_demo \
  --force \
  --extractor cooccurrence

python scripts/graph_extraction_extractor_demo.py \
  --corpus corpora/wiki_graph_demo \
  --force \
  --extractor ner-entities

python scripts/graph_extraction_extractor_demo.py \
  --corpus corpora/wiki_graph_demo \
  --force \
  --extractor dependency-relations

Graph extractor narrative demos (all extractors)

Use the multi-extractor demo script to run the narrative demo for every extractor in sequence. The first extractor run initializes the corpus; the remaining runs reuse it.

python scripts/graph_extraction_demo_all.py \
  --corpus corpora/wiki_graph_demo \
  --force \
  --limit 5 \
  --report-dir reports

Graph extraction details: docs/graph-extraction.md

Topic modeling integration run

Use the integration script to download AG News, run extraction, and run topic modeling with a single command. Install optional dependencies first:

python -m pip install "biblicus[datasets,topic-modeling]"

python scripts/topic_modeling_integration.py --corpus corpora/ag_news_demo --force

Topic modeling details: docs/topic-modeling.md

Extraction evaluation demo run

Use the extraction evaluation demo to build an extraction snapshot, write a labeled dataset from AG News items, and evaluate coverage and accuracy.

Install optional dependencies first:

python -m pip install "biblicus[datasets]"

python scripts/extraction_evaluation_demo.py --corpus corpora/ag_news_extraction_eval --force

The script prints the dataset path, extraction snapshot reference, and evaluation output path so you can inspect the results.

Extraction evaluation details: docs/extraction-evaluation.md

Extraction evaluation lab run

Use the lab script for a fast, fully local walkthrough with bundled files and labels:

python scripts/extraction_evaluation_lab.py --corpus corpora/extraction_eval_lab --force

The lab writes a generated dataset file and evaluation output path and prints both in the command output.

Extraction evaluation lab details: docs/extraction-evaluation.md

Retrieval evaluation lab run

Use the retrieval evaluation lab to build a tiny corpus, run extraction, build a retrieval backend, and evaluate it against bundled labels:

python scripts/retrieval_evaluation_lab.py --corpus corpora/retrieval_eval_lab --force

The script prints the dataset path, retrieval snapshot identifier, and evaluation output location.

Retrieval evaluation details: docs/retrieval-evaluation.md

Run with a larger corpus and a higher topic count:

python scripts/topic_modeling_integration.py \
  --corpus corpora/ag_news_demo \
  --force \
  --limit 10000 \
  --vectorizer-ngram-min 1 \
  --vectorizer-ngram-max 2 \
  --bertopic-param nr_topics=8 \
  --bertopic-param min_topic_size=2

The command prints the analysis snapshot identifier and the output path. Open the output.json file to inspect per-topic labels, keywords, and document examples.

Profiling analysis demo

The profiling demo downloads AG News, runs extraction, and produces a profiling report.

python scripts/profiling_demo.py --corpus corpora/profiling_demo --force

Profiling details: docs/profiling.md

Select extracted text within a pipeline

When you want an explicit choice among multiple extraction outputs, add a selection extractor stage at the end of the pipeline.

python -m biblicus extract build --corpus corpora/demo \
  --stage pass-through-text \
  --stage metadata-text \
  --stage select-text

Copy the snapshot_id from the JavaScript Object Notation output. Use it as EXTRACTION_SNAPSHOT_ID in the next command.

python -m biblicus build --corpus corpora/demo --backend sqlite-full-text-search \
  --config extraction_snapshot=pipeline:EXTRACTION_SNAPSHOT_ID

Extraction pipeline details: docs/extraction.md

Portable Document Format extraction and retrieval

This example downloads a small set of public Portable Document Format files, extracts text, builds a local full text index, and runs a query.

rm -rf corpora/pdf_samples
python scripts/download_pdf_samples.py --corpus corpora/pdf_samples --force

python -m biblicus extract build --corpus corpora/pdf_samples --stage pdf-text

Copy the snapshot_id from the JavaScript Object Notation output. Use it as PDF_EXTRACTION_SNAPSHOT_ID in the next command.

python -m biblicus build --corpus corpora/pdf_samples --backend sqlite-full-text-search --config extraction_snapshot=pipeline:PDF_EXTRACTION_SNAPSHOT_ID --config chunk_size=200 --config chunk_overlap=50 --config snippet_characters=120
python -m biblicus query --corpus corpora/pdf_samples --query "Dummy PDF file"

Retrieval details: docs/retrieval.md

MarkItDown extraction demo (Python 3.10+)

MarkItDown requires Python 3.10 or higher. This example uses the py311 conda environment to run the extractor over the mixed sample corpus.

conda run -n py311 python -m pip install -e . "markitdown[all]"
conda run -n py311 python scripts/download_mixed_samples.py --corpus corpora/markitdown_demo_py311 --force
conda run -n py311 python -m biblicus extract build --corpus corpora/markitdown_demo_py311 --stage markitdown

Mixed modality integration corpus

This example assembles a tiny mixed corpus with a Markdown note, a Hypertext Markup Language page, an image, a Portable Document Format file with extractable text, and a generated Portable Document Format file with no extractable text. It also includes a downloaded Office Open Extensible Markup Language document to support catchall extraction experiments.

rm -rf corpora/mixed_samples
python scripts/download_mixed_samples.py --corpus corpora/mixed_samples --force
python -m biblicus list --corpus corpora/mixed_samples

Image samples (for optical character recognition experiments)

This example downloads a tiny image corpus intended for optical character recognition experiments: one image that contains text and one that should not.

rm -rf corpora/image_samples
python scripts/download_image_samples.py --corpus corpora/image_samples --force
python -m biblicus list --corpus corpora/image_samples

To perform optical character recognition on the image items, install the optional dependency:

python -m pip install "biblicus[ocr]"

Then build an extraction snapshot:

python -m biblicus extract build --corpus corpora/image_samples --stage ocr-rapidocr

Optional: Unstructured as a last-resort extractor

The unstructured extractor is an optional dependency. It is intended as a last-resort extractor for non-text items.

Install the optional dependency:

python -m pip install "biblicus[unstructured]"

Then build an extraction snapshot:

python -m biblicus extract build --corpus corpora/pdf_samples --stage unstructured

To see Unstructured handle a non-Portable-Document-Format format, use the mixed corpus demo, which includes a .docx sample:

rm -rf corpora/mixed_samples
python scripts/download_mixed_samples.py --corpus corpora/mixed_samples --force
python -m biblicus extract build --corpus corpora/mixed_samples --stage unstructured

When you want to prefer one extractor over another for the same item types, order the stages and end with select-text:

python -m biblicus extract build --corpus corpora/pdf_samples \
  --stage unstructured \
  --stage pdf-text \
  --stage select-text

Optional: Speech to text for audio items

This example downloads a small set of public speech samples from Wikimedia Commons and uses extraction to derive text artifacts. It also includes a generated Waveform Audio File Format silence clip for repeatable non-speech cases.

Download the integration corpus:

rm -rf corpora/audio_samples
python scripts/download_audio_samples.py --corpus corpora/audio_samples --force
python -m biblicus list --corpus corpora/audio_samples

If you only want a metadata-only baseline, extract metadata-text:

python -m biblicus extract build --corpus corpora/audio_samples --stage metadata-text

For real speech to text transcription with the OpenAI backend, install the optional dependency and set an API key:

python -m pip install "biblicus[openai]"
mkdir -p .biblicus
printf "openai:\n  api_key: ...\n" > .biblicus/config.yml
python -m biblicus extract build --corpus corpora/audio_samples --stage stt-openai

Build and query the minimal backend

The scan backend is a minimal baseline that reads raw items directly.

python -m biblicus build --corpus corpora/demo --backend scan
python -m biblicus query --corpus corpora/demo --query "Hello"

Backend details: docs/backends.md

Build and query the practical backend

The sqlite full text search backend builds a local index under the snapshot directory.

python -m biblicus build --corpus corpora/demo --backend sqlite-full-text-search --config extraction_snapshot=pipeline:EXTRACTION_SNAPSHOT_ID
python -m biblicus query --corpus corpora/demo --query "tiny"

Backend details: docs/backends.md

Run the test suite and view coverage

python scripts/test.py
open reports/htmlcov/index.html

To include integration scenarios that download public test data at runtime:

python scripts/test.py --integration

Testing details: docs/testing.md

Documentation map

Corpus: docs/corpus.md
Text extraction: docs/extraction.md
Backends: docs/backends.md
Testing: docs/testing.md
Roadmap: docs/roadmap.md

For what to build next, see docs/roadmap.md.