Demos
This document is a set of runnable examples you can use to see the current system working end to end. Each section links to a textbook chapter so you can read the concept and then run the code.
For the ordered plan of what to build next, see docs/roadmap.md.
Working examples you can run now
Use the examples in order if you are new to the system. They build from ingestion to extraction, retrieval, evaluation, and analysis.
Install for local development
From the repository root:
python -m pip install -e ".[dev]"
Create a corpus and ingest a few items
rm -rf corpora/demo
python -m biblicus init corpora/demo
python -m biblicus ingest --corpus corpora/demo --note "Hello from a note" --title "First note" --tags "demo,notes"
printf "A tiny text file\n" > /tmp/biblicus-demo.txt
python -m biblicus ingest --corpus corpora/demo /tmp/biblicus-demo.txt
python -m biblicus ingest --corpus corpora/demo https://example.com
python -m biblicus list --corpus corpora/demo
Show an item
Copy an item identifier from the list output, then run:
python -m biblicus show --corpus corpora/demo ITEM_ID
Edit raw files and reindex
The catalog is rebuildable. You can edit raw files or sidecar metadata, then refresh the catalog.
python -m biblicus reindex --corpus corpora/demo
Crawl a website prefix
To turn a website section into corpus items, crawl a root page and restrict the crawl to an allowed prefix.
In one terminal, create a tiny local website and serve it:
rm -rf /tmp/biblicus-site
mkdir -p /tmp/biblicus-site/site/subdir
cat > /tmp/biblicus-site/site/index.html <<'HTML'
<html>
<body>
<a href="page.html">Page</a>
<a href="subdir/">Subdir</a>
</body>
</html>
HTML
cat > /tmp/biblicus-site/site/page.html <<'HTML'
<html><body>hello</body></html>
HTML
cat > /tmp/biblicus-site/site/subdir/index.html <<'HTML'
<html><body>subdir</body></html>
HTML
python -m http.server 8000 --directory /tmp/biblicus-site
In another terminal:
rm -rf corpora/crawl-demo
python -m biblicus init corpora/crawl-demo
python -m biblicus crawl --corpus corpora/crawl-demo \
--root-url http://127.0.0.1:8000/site/index.html \
--allowed-prefix http://127.0.0.1:8000/site/ \
--max-items 50 \
--tag crawled
python -m biblicus list --corpus corpora/crawl-demo
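The prefix restriction can be pictured as a filter over the links found on each page. The sketch below is an illustration only, not the Biblicus crawler. The names LinkCollector and links_within_prefix are hypothetical, and it assumes each link is resolved against the page URL before it is compared with the allowed prefix.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_within_prefix(page_url, html, allowed_prefix):
    """Return absolute links from the page that start with allowed_prefix."""
    collector = LinkCollector()
    collector.feed(html)
    kept = []
    for href in collector.hrefs:
        # Resolve relative links against the page URL and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if absolute.startswith(allowed_prefix) and absolute not in kept:
            kept.append(absolute)
    return kept
```

With the demo site above, page.html and subdir/ resolve under http://127.0.0.1:8000/site/ and would be kept, while a link that resolves outside that prefix would be dropped.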
Build an extraction snapshot
Text extraction is a separate pipeline stage from retrieval. An extraction snapshot produces derived text artifacts under the corpus.
The pass-through-text extractor used here reads text items and skips non-text items.
python -m biblicus extract build --corpus corpora/demo --stage pass-through-text
The output includes a snapshot_id you can reuse when building a retrieval backend.
Text extraction details: docs/extraction.md
Graph extraction demo
Graph extraction runs after text extraction and writes to a Neo4j backend. The demo script will reuse the latest extraction snapshot or build a minimal one if needed.
python scripts/graph_extraction_demo.py --corpus corpora/demo --build-extraction --prepare-demo --verify
Graph extraction details: docs/graph-extraction.md
Graph extraction integration run
Use the integration script to download a small Wikipedia corpus, run extraction, and build a Neo4j graph snapshot with the simple-entities extractor:
python -m pip install neo4j
python scripts/graph_extraction_integration.py \
--corpus corpora/wiki_graph_demo \
--force \
--verify \
--report-path reports/graph_extraction_story.md
The report written to reports/graph_extraction_story.md summarizes the run in a shareable format.
Graph extraction baselines
Once the baseline extractors are enabled, you can compare different graph extractors by switching the extractor id and configuration:
python -m biblicus graph extract \
--corpus corpora/example \
--extractor ner-entities \
--extraction-snapshot pipeline:RUN_ID \
--configuration configurations/graph/ner-entities.yml
python -m biblicus graph extract \
--corpus corpora/example \
--extractor dependency-relations \
--extraction-snapshot pipeline:RUN_ID \
--configuration configurations/graph/dependency-relations.yml
Graph extractor narrative demos
Use the narrative demo script to run the full pipeline for a single extractor and print its inputs and outputs. Each command downloads a small Wikipedia corpus, builds extraction and graph snapshots, and prints sample entities or terms and edges.
python scripts/graph_extraction_extractor_demo.py \
--corpus corpora/wiki_graph_demo \
--force \
--extractor simple-entities
python scripts/graph_extraction_extractor_demo.py \
--corpus corpora/wiki_graph_demo \
--force \
--extractor cooccurrence
python scripts/graph_extraction_extractor_demo.py \
--corpus corpora/wiki_graph_demo \
--force \
--extractor ner-entities
python scripts/graph_extraction_extractor_demo.py \
--corpus corpora/wiki_graph_demo \
--force \
--extractor dependency-relations
Graph extractor narrative demos (all extractors)
Use the multi-extractor demo script to run the narrative demo for every extractor in sequence. The first extractor run initializes the corpus; the remaining runs reuse it.
python scripts/graph_extraction_demo_all.py \
--corpus corpora/wiki_graph_demo \
--force \
--limit 5 \
--report-dir reports
Graph extraction details: docs/graph-extraction.md
Topic modeling integration run
Use the integration script to download AG News, run extraction, and run topic modeling with a single command. Install optional dependencies first:
python -m pip install "biblicus[datasets,topic-modeling]"
python scripts/topic_modeling_integration.py --corpus corpora/ag_news_demo --force
Topic modeling details: docs/topic-modeling.md
Extraction evaluation demo run
Use the extraction evaluation demo to build an extraction snapshot, write a labeled dataset from AG News items, and evaluate coverage and accuracy.
Install optional dependencies first:
python -m pip install "biblicus[datasets]"
python scripts/extraction_evaluation_demo.py --corpus corpora/ag_news_extraction_eval --force
The script prints the dataset path, extraction snapshot reference, and evaluation output path so you can inspect the results.
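As a rough mental model of the coverage figure, the sketch below computes the share of labeled items that ended up with non-empty extracted text. This definition is an assumption made for illustration; the actual metrics are described in docs/extraction-evaluation.md and may differ. The function name extraction_coverage is hypothetical.

```python
def extraction_coverage(extracted_text_by_item, labeled_item_ids):
    """Fraction of labeled items with non-empty extracted text.

    Assumption: "coverage" means an item produced some non-whitespace text.
    """
    labeled = list(labeled_item_ids)
    if not labeled:
        return 0.0
    covered = sum(
        1
        for item_id in labeled
        if (extracted_text_by_item.get(item_id) or "").strip()
    )
    return covered / len(labeled)
```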
Extraction evaluation details: docs/extraction-evaluation.md
Extraction evaluation lab run
Use the lab script for a fast, fully local walkthrough with bundled files and labels:
python scripts/extraction_evaluation_lab.py --corpus corpora/extraction_eval_lab --force
The lab writes a generated dataset file and an evaluation output file, and prints both paths in the command output.
Extraction evaluation lab details: docs/extraction-evaluation.md
Retrieval evaluation lab run
Use the retrieval evaluation lab to build a tiny corpus, run extraction, build a retrieval backend, and evaluate it against bundled labels:
python scripts/retrieval_evaluation_lab.py --corpus corpora/retrieval_eval_lab --force
The script prints the dataset path, retrieval snapshot identifier, and evaluation output location.
Retrieval evaluation details: docs/retrieval-evaluation.md
To rerun the topic modeling integration from "Topic modeling integration run" with a larger corpus and a higher topic count:
python scripts/topic_modeling_integration.py \
--corpus corpora/ag_news_demo \
--force \
--limit 10000 \
--vectorizer-ngram-min 1 \
--vectorizer-ngram-max 2 \
--bertopic-param nr_topics=8 \
--bertopic-param min_topic_size=2
The command prints the analysis snapshot identifier and the output path. Open the output.json file to inspect per-topic labels, keywords, and document examples.
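The two vectorizer flags bound the n-gram sizes that feed the topic model. As a rough illustration, assuming they behave like the n-gram range of a typical bag-of-words vectorizer (an assumption, not confirmed by this document), a range of 1 to 2 yields single tokens plus adjacent token pairs. The function name word_ngrams is hypothetical.

```python
def word_ngrams(tokens, ngram_min, ngram_max):
    """Return all contiguous n-grams with ngram_min <= n <= ngram_max."""
    if ngram_min < 1 or ngram_max < ngram_min:
        raise ValueError("expected 1 <= ngram_min <= ngram_max")
    grams = []
    for size in range(ngram_min, ngram_max + 1):
        # Slide a window of `size` tokens across the token list.
        for start in range(len(tokens) - size + 1):
            grams.append(" ".join(tokens[start:start + size]))
    return grams
```

Under this reading, the tokens stock, market, news would contribute the three single words plus "stock market" and "market news".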
Profiling analysis demo
The profiling demo downloads AG News, runs extraction, and produces a profiling report.
python scripts/profiling_demo.py --corpus corpora/profiling_demo --force
Profiling details: docs/profiling.md
Select extracted text within a pipeline
When you want an explicit choice among multiple extraction outputs, add a selection extractor stage at the end of the pipeline.
python -m biblicus extract build --corpus corpora/demo \
--stage pass-through-text \
--stage metadata-text \
--stage select-text
Copy the snapshot_id from the JavaScript Object Notation output. Use it as EXTRACTION_SNAPSHOT_ID in the next command.
python -m biblicus build --corpus corpora/demo --backend sqlite-full-text-search \
--config extraction_snapshot=pipeline:EXTRACTION_SNAPSHOT_ID
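If you would rather script the copy step, a sketch along these lines may help. It assumes the extract build command prints a single JSON object with a top-level snapshot_id key, which this document implies but does not spell out. The helper names snapshot_reference, build_command, and extract_then_build are hypothetical; the command-line flags are the ones shown above.

```python
import json
import subprocess
import sys


def snapshot_reference(build_output):
    """Turn `extract build` output into a pipeline:<snapshot_id> reference.

    Assumption: the output is one JSON object with a top-level "snapshot_id".
    """
    payload = json.loads(build_output)
    return f"pipeline:{payload['snapshot_id']}"


def build_command(corpus, reference):
    """Assemble the documented retrieval build command as an argument list."""
    return [
        sys.executable, "-m", "biblicus", "build",
        "--corpus", corpus,
        "--backend", "sqlite-full-text-search",
        "--config", f"extraction_snapshot={reference}",
    ]


def extract_then_build(corpus):
    """Run both documented commands in sequence (nothing runs on import)."""
    extract = subprocess.run(
        [
            sys.executable, "-m", "biblicus", "extract", "build",
            "--corpus", corpus,
            "--stage", "pass-through-text",
            "--stage", "metadata-text",
            "--stage", "select-text",
        ],
        check=True,
        capture_output=True,
        text=True,
    )
    reference = snapshot_reference(extract.stdout)
    subprocess.run(build_command(corpus, reference), check=True)
```

Calling extract_then_build("corpora/demo") would run the two commands above back to back, provided the output format matches the stated assumption.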
Extraction pipeline details: docs/extraction.md
Portable Document Format extraction and retrieval
This example downloads a small set of public Portable Document Format files, extracts text, builds a local full text index, and runs a query.
rm -rf corpora/pdf_samples
python scripts/download_pdf_samples.py --corpus corpora/pdf_samples --force
python -m biblicus extract build --corpus corpora/pdf_samples --stage pdf-text
Copy the snapshot_id from the JavaScript Object Notation output. Use it as PDF_EXTRACTION_SNAPSHOT_ID in the next command.
python -m biblicus build --corpus corpora/pdf_samples --backend sqlite-full-text-search \
--config extraction_snapshot=pipeline:PDF_EXTRACTION_SNAPSHOT_ID \
--config chunk_size=200 \
--config chunk_overlap=50 \
--config snippet_characters=120
python -m biblicus query --corpus corpora/pdf_samples --query "Dummy PDF file"
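To get a feel for the chunk_size and chunk_overlap settings, the sketch below computes overlapping windows over a text of a given length. It assumes character-based chunks where each chunk starts chunk_size minus chunk_overlap characters after the previous one. That is a common scheme, but it is an assumption here and not a description of how the backend chunks text. The function name chunk_spans is hypothetical.

```python
def chunk_spans(text_length, chunk_size, chunk_overlap):
    """Return (start, end) character spans for overlapping chunks."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("expected chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    spans = []
    start = 0
    while start < text_length:
        end = min(start + chunk_size, text_length)
        spans.append((start, end))
        if end == text_length:
            break  # the last chunk reached the end of the text
        start += step
    return spans
```

Under these assumptions, a 500-character text with chunk_size=200 and chunk_overlap=50 would produce three chunks, each sharing 50 characters with its neighbor.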
Retrieval details: docs/retrieval.md
MarkItDown extraction demo (Python 3.10+)
MarkItDown requires Python 3.10 or higher. This example assumes a conda environment named py311 and uses it to run the extractor over the mixed sample corpus.
conda run -n py311 python -m pip install -e . "markitdown[all]"
conda run -n py311 python scripts/download_mixed_samples.py --corpus corpora/markitdown_demo_py311 --force
conda run -n py311 python -m biblicus extract build --corpus corpora/markitdown_demo_py311 --stage markitdown
Mixed modality integration corpus
This example assembles a tiny mixed corpus with a Markdown note, a Hypertext Markup Language page, an image, a Portable Document Format file with extractable text, and a generated Portable Document Format file with no extractable text. It also includes a downloaded Office Open Extensible Markup Language document to support catchall extraction experiments.
rm -rf corpora/mixed_samples
python scripts/download_mixed_samples.py --corpus corpora/mixed_samples --force
python -m biblicus list --corpus corpora/mixed_samples
Image samples (for optical character recognition experiments)
This example downloads a tiny image corpus intended for optical character recognition experiments: one image that contains text and one that should not.
rm -rf corpora/image_samples
python scripts/download_image_samples.py --corpus corpora/image_samples --force
python -m biblicus list --corpus corpora/image_samples
To perform optical character recognition on the image items, install the optional dependency:
python -m pip install "biblicus[ocr]"
Then build an extraction snapshot:
python -m biblicus extract build --corpus corpora/image_samples --stage ocr-rapidocr
Optional: Unstructured as a last-resort extractor
The unstructured extractor is an optional dependency. It is intended as a last-resort extractor for non-text items.
Install the optional dependency:
python -m pip install "biblicus[unstructured]"
Then build an extraction snapshot:
python -m biblicus extract build --corpus corpora/pdf_samples --stage unstructured
To see Unstructured handle a format other than Portable Document Format, use the mixed corpus demo, which includes a .docx sample:
rm -rf corpora/mixed_samples
python scripts/download_mixed_samples.py --corpus corpora/mixed_samples --force
python -m biblicus extract build --corpus corpora/mixed_samples --stage unstructured
When you want to prefer one extractor over another for the same item types, order the stages and end with select-text:
python -m biblicus extract build --corpus corpora/pdf_samples \
--stage unstructured \
--stage pdf-text \
--stage select-text
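One way to picture the effect of stage order is a "first usable text wins" rule. The sketch below assumes that the selection stage keeps the first non-empty text among the earlier stages, in the order they were listed. Treat that rule as an assumption and check docs/extraction.md for the actual behavior. The function name select_first_usable is hypothetical.

```python
def select_first_usable(stage_outputs):
    """Pick the first stage whose text is non-empty after stripping.

    stage_outputs is a list of (stage_name, text_or_None) in pipeline order.
    Returns (stage_name, text), or None when no stage produced usable text.
    """
    for stage_name, text in stage_outputs:
        if text and text.strip():
            return stage_name, text
    return None
```

Under this assumption, listing unstructured before pdf-text means the pdf-text output is only used for items where unstructured produced nothing.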
Optional: Speech to text for audio items
This example downloads a small set of public speech samples from Wikimedia Commons and uses extraction to derive text artifacts. It also includes a generated Waveform Audio File Format silence clip for repeatable non-speech cases.
Download the integration corpus:
rm -rf corpora/audio_samples
python scripts/download_audio_samples.py --corpus corpora/audio_samples --force
python -m biblicus list --corpus corpora/audio_samples
If you want a metadata-only baseline, extract metadata-text:
python -m biblicus extract build --corpus corpora/audio_samples --stage metadata-text
For real speech to text transcription with the OpenAI backend, install the optional dependency and set an API key:
python -m pip install "biblicus[openai]"
mkdir -p .biblicus
printf "openai:\n api_key: ...\n" > .biblicus/config.yml
python -m biblicus extract build --corpus corpora/audio_samples --stage stt-openai
Build and query the minimal backend
The scan backend is a minimal baseline that reads raw items directly.
python -m biblicus build --corpus corpora/demo --backend scan
python -m biblicus query --corpus corpora/demo --query "Hello"
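Conceptually, a scan baseline needs no index: every query walks the raw files and checks each one. The sketch below shows that idea with a case-insensitive substring match over UTF-8 text files. It is a simplified illustration, not the scan backend's implementation, and the function name scan_for is hypothetical.

```python
from pathlib import Path


def scan_for(root, query):
    """Return names of UTF-8 text files under root that contain the query."""
    needle = query.lower()
    matches = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            continue  # skip files that are not UTF-8 text
        if needle in text.lower():
            matches.append(path.name)
    return matches
```

The cost of this approach grows with corpus size on every query, which is why it works as a baseline rather than a practical backend.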
Backend details: docs/backends.md
Build and query the practical backend
The sqlite full text search backend builds a local index under the snapshot directory. Replace EXTRACTION_SNAPSHOT_ID with the snapshot_id printed by an earlier extract build command.
python -m biblicus build --corpus corpora/demo --backend sqlite-full-text-search --config extraction_snapshot=pipeline:EXTRACTION_SNAPSHOT_ID
python -m biblicus query --corpus corpora/demo --query "tiny"
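For intuition about what a SQLite full text index does, the sketch below builds a small in-memory index with the FTS5 extension from Python's standard sqlite3 module and queries it. It assumes your SQLite build includes FTS5, and it does not reproduce the backend's actual schema or on-disk layout. The table name items and the helper names build_index and search are hypothetical.

```python
import sqlite3


def build_index(documents):
    """Create an in-memory FTS5 table from (title, body) pairs."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE VIRTUAL TABLE items USING fts5(title, body)")
    connection.executemany(
        "INSERT INTO items (title, body) VALUES (?, ?)", documents
    )
    return connection


def search(connection, query):
    """Return titles of matching rows, best match first."""
    # Quote the query so it is treated as a phrase, not as FTS5 query syntax.
    phrase = '"' + query.replace('"', '""') + '"'
    rows = connection.execute(
        "SELECT title FROM items WHERE items MATCH ? ORDER BY rank", (phrase,)
    )
    return [title for (title,) in rows]
```

Unlike a scan, the index is built once and each query consults the index instead of rereading every file.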
Backend details: docs/backends.md
Run the test suite and view coverage
python scripts/test.py
open reports/htmlcov/index.html  # macOS; on Linux, use xdg-open instead
To include integration scenarios that download public test data at runtime:
python scripts/test.py --integration
Testing details: docs/testing.md
Documentation map
Corpus: docs/corpus.md
Text extraction: docs/extraction.md
Backends: docs/backends.md
Testing: docs/testing.md
Roadmap: docs/roadmap.md
For what to build next, see docs/roadmap.md.