# Demos

This document is a set of runnable examples you can use to see the current system working end to end. Each section links to a textbook chapter so you can read the concept and then run the code.

For the ordered plan of what to build next, see `docs/roadmap.md`.

## Working examples you can run now

Use the examples in order if you are new to the system. They build from ingestion to extraction, retrieval, evaluation, and analysis.

### Install for local development

From the repository root:

```
python -m pip install -e ".[dev]"
```

### Create a corpus and ingest a few items

```
rm -rf corpora/demo
python -m biblicus init corpora/demo
python -m biblicus ingest --corpus corpora/demo --note "Hello from a note" --title "First note" --tags "demo,notes"
printf "A tiny text file\n" > /tmp/biblicus-demo.txt
python -m biblicus ingest --corpus corpora/demo /tmp/biblicus-demo.txt
python -m biblicus ingest --corpus corpora/demo https://example.com
python -m biblicus list --corpus corpora/demo
```

### Show an item

Copy an item identifier from the list output, then run:

```
python -m biblicus show --corpus corpora/demo ITEM_ID
```

### Edit raw files and reindex

The catalog is rebuildable. You can edit raw files or sidecar metadata, then refresh the catalog.

```
python -m biblicus reindex --corpus corpora/demo
```

### Crawl a website prefix

To turn a website section into corpus items, crawl a root page and restrict the crawl to an allowed prefix.

In one terminal, create a tiny local website and serve it:

```
rm -rf /tmp/biblicus-site
mkdir -p /tmp/biblicus-site/site/subdir
cat > /tmp/biblicus-site/site/index.html <<'HTML'
<a href="page.html">Page</a>
<a href="subdir/index.html">Subdir</a>
HTML
cat > /tmp/biblicus-site/site/page.html <<'HTML'
hello
HTML
cat > /tmp/biblicus-site/site/subdir/index.html <<'HTML'
subdir
HTML
python -m http.server 8000 --directory /tmp/biblicus-site
```

In another terminal:

```
rm -rf corpora/crawl-demo
python -m biblicus init corpora/crawl-demo
python -m biblicus crawl --corpus corpora/crawl-demo \
  --root-url http://127.0.0.1:8000/site/index.html \
  --allowed-prefix http://127.0.0.1:8000/site/ \
  --max-items 50 \
  --tag crawled
python -m biblicus list --corpus corpora/crawl-demo
```

### Build an extraction snapshot

Text extraction is a separate pipeline stage from retrieval. An extraction snapshot produces derived text artifacts under the corpus. This extractor reads text items and skips non-text items.

```
python -m biblicus extract build --corpus corpora/demo --stage pass-through-text
```

The output includes a `snapshot_id` you can reuse when building a retrieval backend.

Text extraction details: `docs/extraction.md`

### Graph extraction demo

Graph extraction runs after text extraction and writes to a Neo4j backend. The demo script will reuse the latest extraction snapshot or build a minimal one if needed.

```
python scripts/graph_extraction_demo.py --corpus corpora/demo --build-extraction --prepare-demo --verify
```

Graph extraction details: `docs/graph-extraction.md`

### Graph extraction integration run

Use the integration script to download a small Wikipedia corpus, run extraction, and build a Neo4j graph snapshot with the `simple-entities` extractor.

```
python -m pip install neo4j
```

```
python scripts/graph_extraction_integration.py \
  --corpus corpora/wiki_graph_demo \
  --force \
  --verify \
  --report-path reports/graph_extraction_story.md
```

The report written to `reports/graph_extraction_story.md` summarizes the run in a shareable format.
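The baseline commands in the next section reference an extraction snapshot as `pipeline:RUN_ID`. If you are scripting these steps, here is a minimal sketch for building that reference from the JSON that `biblicus extract build` prints; it assumes a top-level `snapshot_id` key, so confirm the exact output shape against your version:

```python
import json


def snapshot_reference(cli_output: str) -> str:
    """Turn the JSON printed by `biblicus extract build` into a
    pipeline:RUN_ID reference. Assumes a top-level "snapshot_id" key."""
    payload = json.loads(cli_output)
    return "pipeline:" + payload["snapshot_id"]


# Stand-in payload to show the shape; in practice, capture the command's stdout.
print(snapshot_reference('{"snapshot_id": "abc123"}'))  # pipeline:abc123
```

In practice you would capture the command's standard output (for example with `subprocess.run(..., capture_output=True)`) and pass it to this helper instead of a literal string.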
### Graph extraction baselines

Once the baseline extractors are enabled, you can compare different graph extractors by switching the extractor id and configuration:

```
python -m biblicus graph extract \
  --corpus corpora/example \
  --extractor ner-entities \
  --extraction-snapshot pipeline:RUN_ID \
  --configuration configurations/graph/ner-entities.yml
```

```
python -m biblicus graph extract \
  --corpus corpora/example \
  --extractor dependency-relations \
  --extraction-snapshot pipeline:RUN_ID \
  --configuration configurations/graph/dependency-relations.yml
```

### Graph extractor narrative demos

Use the narrative demo script to run the full pipeline for a single extractor and print inputs and outputs. Each command downloads a small Wikipedia corpus, builds extraction and graph snapshots, and prints sample entities/terms and edges.

```
python scripts/graph_extraction_extractor_demo.py \
  --corpus corpora/wiki_graph_demo \
  --force \
  --extractor simple-entities
```

```
python scripts/graph_extraction_extractor_demo.py \
  --corpus corpora/wiki_graph_demo \
  --force \
  --extractor cooccurrence
```

```
python scripts/graph_extraction_extractor_demo.py \
  --corpus corpora/wiki_graph_demo \
  --force \
  --extractor ner-entities
```

```
python scripts/graph_extraction_extractor_demo.py \
  --corpus corpora/wiki_graph_demo \
  --force \
  --extractor dependency-relations
```

### Graph extractor narrative demos (all extractors)

Use the multi-extractor demo script to run the narrative demo for every extractor in sequence. The first extractor run initializes the corpus; the remaining runs reuse it.

```
python scripts/graph_extraction_demo_all.py \
  --corpus corpora/wiki_graph_demo \
  --force \
  --limit 5 \
  --report-dir reports
```

Graph extraction details: `docs/graph-extraction.md`

### Topic modeling integration run

Use the integration script to download AG News, run extraction, and run topic modeling with a single command.
Install optional dependencies first:

```
python -m pip install "biblicus[datasets,topic-modeling]"
```

```
python scripts/topic_modeling_integration.py --corpus corpora/ag_news_demo --force
```

Topic modeling details: `docs/topic-modeling.md`

### Extraction evaluation demo run

Use the extraction evaluation demo to build an extraction snapshot, write a labeled dataset from AG News items, and evaluate coverage and accuracy.

Install optional dependencies first:

```
python -m pip install "biblicus[datasets]"
```

```
python scripts/extraction_evaluation_demo.py --corpus corpora/ag_news_extraction_eval --force
```

The script prints the dataset path, extraction snapshot reference, and evaluation output path so you can inspect the results.

Extraction evaluation details: `docs/extraction-evaluation.md`

### Extraction evaluation lab run

Use the lab script for a fast, fully local walkthrough with bundled files and labels:

```
python scripts/extraction_evaluation_lab.py --corpus corpora/extraction_eval_lab --force
```

The lab writes a generated dataset file and an evaluation output file, and prints both paths in the command output.

Extraction evaluation lab details: `docs/extraction-evaluation.md`

### Retrieval evaluation lab run

Use the retrieval evaluation lab to build a tiny corpus, run extraction, build a retrieval backend, and evaluate it against bundled labels:

```
python scripts/retrieval_evaluation_lab.py --corpus corpora/retrieval_eval_lab --force
```

The script prints the dataset path, retrieval snapshot identifier, and evaluation output location.

Retrieval evaluation details: `docs/retrieval-evaluation.md`

Returning to the topic modeling integration run, you can rerun it with a larger corpus and a higher topic count:

```
python scripts/topic_modeling_integration.py \
  --corpus corpora/ag_news_demo \
  --force \
  --limit 10000 \
  --vectorizer-ngram-min 1 \
  --vectorizer-ngram-max 2 \
  --bertopic-param nr_topics=8 \
  --bertopic-param min_topic_size=2
```

The command prints the analysis snapshot identifier and the output path.
Open the `output.json` file to inspect per-topic labels, keywords, and document examples.

### Profiling analysis demo

The profiling demo downloads AG News, runs extraction, and produces a profiling report.

```
python scripts/profiling_demo.py --corpus corpora/profiling_demo --force
```

Profiling details: `docs/profiling.md`

### Select extracted text within a pipeline

When you want an explicit choice among multiple extraction outputs, add a selection extractor stage at the end of the pipeline.

```
python -m biblicus extract build --corpus corpora/demo \
  --stage pass-through-text \
  --stage metadata-text \
  --stage select-text
```

Copy the `snapshot_id` from the JavaScript Object Notation output. Use it as `EXTRACTION_SNAPSHOT_ID` in the next command.

```
python -m biblicus build --corpus corpora/demo --backend sqlite-full-text-search \
  --config extraction_snapshot=pipeline:EXTRACTION_SNAPSHOT_ID
```

Extraction pipeline details: `docs/extraction.md`

### Portable Document Format extraction and retrieval

This example downloads a small set of public Portable Document Format files, extracts text, builds a local full text index, and runs a query.

```
rm -rf corpora/pdf_samples
python scripts/download_pdf_samples.py --corpus corpora/pdf_samples --force
python -m biblicus extract build --corpus corpora/pdf_samples --stage pdf-text
```

Copy the `snapshot_id` from the JavaScript Object Notation output. Use it as `PDF_EXTRACTION_SNAPSHOT_ID` in the next command.

```
python -m biblicus build --corpus corpora/pdf_samples --backend sqlite-full-text-search --config extraction_snapshot=pipeline:PDF_EXTRACTION_SNAPSHOT_ID --config chunk_size=200 --config chunk_overlap=50 --config snippet_characters=120
python -m biblicus query --corpus corpora/pdf_samples --query "Dummy PDF file"
```

Retrieval details: `docs/retrieval.md`

### MarkItDown extraction demo (Python 3.10+)

MarkItDown requires Python 3.10 or higher.
This example uses the `py311` conda environment to run the extractor over the mixed sample corpus.

```
conda run -n py311 python -m pip install -e . "markitdown[all]"
conda run -n py311 python scripts/download_mixed_samples.py --corpus corpora/markitdown_demo_py311 --force
conda run -n py311 python -m biblicus extract build --corpus corpora/markitdown_demo_py311 --stage markitdown
```

### Mixed modality integration corpus

This example assembles a tiny mixed corpus with a Markdown note, a Hypertext Markup Language page, an image, a Portable Document Format file with extractable text, and a generated Portable Document Format file with no extractable text. It also includes a downloaded Office Open Extensible Markup Language document to support catchall extraction experiments.

```
rm -rf corpora/mixed_samples
python scripts/download_mixed_samples.py --corpus corpora/mixed_samples --force
python -m biblicus list --corpus corpora/mixed_samples
```

### Image samples (for optical character recognition experiments)

This example downloads a tiny image corpus intended for optical character recognition experiments: one image that contains text and one that should not.

```
rm -rf corpora/image_samples
python scripts/download_image_samples.py --corpus corpora/image_samples --force
python -m biblicus list --corpus corpora/image_samples
```

To perform optical character recognition on the image items, install the optional dependency:

```
python -m pip install "biblicus[ocr]"
```

Then build an extraction snapshot:

```
python -m biblicus extract build --corpus corpora/image_samples --stage ocr-rapidocr
```

### Optional: Unstructured as a last-resort extractor

The `unstructured` extractor is an optional dependency. It is intended as a last-resort extractor for non-text items.
Install the optional dependency:

```
python -m pip install "biblicus[unstructured]"
```

Then build an extraction snapshot:

```
python -m biblicus extract build --corpus corpora/pdf_samples --stage unstructured
```

To see Unstructured handle a format other than Portable Document Format, use the mixed corpus demo, which includes a `.docx` sample:

```
rm -rf corpora/mixed_samples
python scripts/download_mixed_samples.py --corpus corpora/mixed_samples --force
python -m biblicus extract build --corpus corpora/mixed_samples --stage unstructured
```

When you want to prefer one extractor over another for the same item types, order the stages and end with `select-text`:

```
python -m biblicus extract build --corpus corpora/pdf_samples \
  --stage unstructured \
  --stage pdf-text \
  --stage select-text
```

### Optional: Speech to text for audio items

This example downloads a small set of public speech samples from Wikimedia Commons and uses extraction to derive text artifacts. It also includes a generated Waveform Audio File Format silence clip for repeatable non-speech cases.

Download the integration corpus:

```
rm -rf corpora/audio_samples
python scripts/download_audio_samples.py --corpus corpora/audio_samples --force
python -m biblicus list --corpus corpora/audio_samples
```

If you only want a metadata-only baseline, extract `metadata-text`:

```
python -m biblicus extract build --corpus corpora/audio_samples --stage metadata-text
```

For real speech to text transcription with the OpenAI backend, install the optional dependency and set an API key:

```
python -m pip install "biblicus[openai]"
mkdir -p .biblicus
printf "openai:\n api_key: ...\n" > .biblicus/config.yml
python -m biblicus extract build --corpus corpora/audio_samples --stage stt-openai
```

### Build and query the minimal backend

The scan backend is a minimal baseline that reads raw items directly.
```
python -m biblicus build --corpus corpora/demo --backend scan
python -m biblicus query --corpus corpora/demo --query "Hello"
```

Backend details: `docs/backends.md`

### Build and query the practical backend

The sqlite full text search backend builds a local index under the snapshot directory.

```
python -m biblicus build --corpus corpora/demo --backend sqlite-full-text-search --config extraction_snapshot=pipeline:EXTRACTION_SNAPSHOT_ID
python -m biblicus query --corpus corpora/demo --query "tiny"
```

Backend details: `docs/backends.md`

### Run the test suite and view coverage

```
python scripts/test.py
open reports/htmlcov/index.html
```

To include integration scenarios that download public test data at runtime:

```
python scripts/test.py --integration
```

Testing details: `docs/testing.md`

## Documentation map

- Corpus: `docs/corpus.md`
- Text extraction: `docs/extraction.md`
- Backends: `docs/backends.md`
- Testing: `docs/testing.md`
- Roadmap: `docs/roadmap.md`

For what to build next, see `docs/roadmap.md`.