Corpus analysis
Biblicus supports analysis backends that run on extracted text artifacts without changing the raw corpus. Analysis is a pluggable phase that reads an extraction snapshot, produces structured output, and stores artifacts under the corpus runs folder. Each analysis backend declares its own configuration schema and output contract, and all schemas are validated strictly.
How analysis snapshots work
Analysis runs are tied to a corpus state via the extraction snapshot reference.
The analysis output is written under analysis/<analysis-id>/<snapshot_id>/.
Analysis is reproducible when you supply the same extraction snapshot and corpus catalog state.
Analysis configuration is stored as a configuration manifest in the run metadata.
If you omit the extraction snapshot, Biblicus uses the most recent extraction snapshot and emits a reproducibility warning. For repeatable analysis snapshots, always pass the extraction snapshot reference explicitly.
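The fallback described above can be pictured with a small sketch. The function name `resolve_extraction_snapshot` and the list of snapshot references are hypothetical and are not part of the Biblicus API; the sketch only illustrates why an explicit reference is the reproducible choice.

```python
import warnings
from typing import Optional, Sequence


def resolve_extraction_snapshot(
    explicit: Optional[str], available: Sequence[str]
) -> str:
    """Pick the snapshot to analyze (hypothetical helper, not Biblicus code).

    `available` is assumed to be ordered from oldest to newest.
    """
    if explicit is not None:
        # An explicit reference pins the analysis to a known corpus state.
        return explicit
    if not available:
        raise ValueError("no extraction snapshots available")
    # Falling back to the newest snapshot works, but the result then depends
    # on whichever snapshot happens to be newest when the command runs.
    warnings.warn(
        "no extraction snapshot given; using the most recent one, "
        "so this analysis may not be reproducible",
        UserWarning,
        stacklevel=2,
    )
    return available[-1]
```

Passing the reference explicitly takes the first branch and never triggers the warning.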
Analysis snapshot artifacts
Every analysis snapshot records a manifest alongside the output:
analysis/<analysis-id>/<snapshot_id>/
  manifest.json
  output.json
The manifest captures the configuration, extraction snapshot reference, and catalog timestamp so results can be reproduced and compared later.
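If you prefer to read a snapshot from Python rather than the shell, a minimal sketch follows. It relies only on the two file names shown above; the helper name `load_snapshot` is an illustration and not a Biblicus function, and it assumes both files contain a JSON object.

```python
import json
from pathlib import Path
from typing import Any, Dict, Tuple


def load_snapshot(snapshot_dir: Path) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Read manifest.json and output.json from one analysis snapshot directory."""
    manifest_path = snapshot_dir / "manifest.json"
    output_path = snapshot_dir / "output.json"
    # Both files are parsed as plain JSON; no Biblicus models are involved here.
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    output = json.loads(output_path.read_text(encoding="utf-8"))
    return manifest, output
```

Call it with the snapshot directory, for example `load_snapshot(Path("corpora/example/analysis/profiling/RUN_ID"))`, where `RUN_ID` stands for the identifier printed by the analysis command.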
Inspecting output
Analysis outputs are JSON documents. You can view them directly:
cat corpora/example/analysis/profiling/RUN_ID/output.json
Each analysis backend defines its own report payload. The run metadata is consistent across backends.
Comparing analysis snapshots
When you compare analysis results, record:
Corpus path and catalog timestamp.
Extraction snapshot reference.
Analysis configuration name and configuration.
Analysis snapshot identifier and output path.
These make it possible to rerun the analysis and explain differences.
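One way to explain differences between two snapshots is to diff their manifests field by field. The sketch below is an illustration under assumptions: the manifest field names used in the usage note are hypothetical, and `diff_manifests` is not part of Biblicus.

```python
from typing import Any, Dict, Tuple

_MISSING = "<missing>"  # marker for a field present on only one side


def diff_manifests(
    left: Dict[str, Any], right: Dict[str, Any], prefix: str = ""
) -> Dict[str, Tuple[Any, Any]]:
    """Return {dotted_key: (left_value, right_value)} for every differing field."""
    differences: Dict[str, Tuple[Any, Any]] = {}
    for key in sorted(set(left) | set(right)):
        path = f"{prefix}{key}"
        a = left.get(key, _MISSING)
        b = right.get(key, _MISSING)
        if isinstance(a, dict) and isinstance(b, dict):
            # Recurse so nested configuration changes are reported precisely.
            differences.update(diff_manifests(a, b, prefix=f"{path}."))
        elif a != b:
            differences[path] = (a, b)
    return differences
```

For two manifests that differ only in a nested `n_states` value under a hypothetical `configuration` key, the result contains the single entry `configuration.n_states` with both values, which is usually enough to explain why the outputs diverge.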
Pluggable analysis backends
Analysis backends implement the CorpusAnalysisBackend interface and are registered under biblicus.analysis.
A backend receives the corpus, a configuration name, a configuration mapping, and an extraction snapshot reference. It returns a
Pydantic model that is serialized to JavaScript Object Notation for storage.
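The shape of that contract can be sketched without importing Biblicus. Everything below is a stand-in: the real interface is CorpusAnalysisBackend, whose method names and signature are not shown in this document, so `AnalysisBackendSketch`, `run_analysis`, and `WordCountReport` are hypothetical. The corpus is simplified to a list of strings, and a dataclass takes the place of the Pydantic model so the example runs with the standard library alone.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Mapping, Protocol, Sequence


@dataclass
class WordCountReport:
    """Stand-in for the Pydantic output model a real backend would return."""

    configuration_name: str
    extraction_snapshot: str
    total_words: int


class AnalysisBackendSketch(Protocol):
    """Hypothetical backend shape; not the real CorpusAnalysisBackend."""

    def run_analysis(
        self,
        texts: Sequence[str],
        configuration_name: str,
        configuration: Mapping[str, Any],
        extraction_snapshot: str,
    ) -> WordCountReport: ...


class WordCountBackend:
    """Toy backend that counts whitespace-separated words."""

    def run_analysis(self, texts, configuration_name, configuration, extraction_snapshot):
        # "minimum_word_length" is an invented option for this toy backend.
        minimum_length = int(configuration.get("minimum_word_length", 1))
        total = sum(
            1 for text in texts for word in text.split() if len(word) >= minimum_length
        )
        return WordCountReport(configuration_name, extraction_snapshot, total)


backend: AnalysisBackendSketch = WordCountBackend()
report = backend.run_analysis(
    ["alpha beta", "gamma"], "default", {"minimum_word_length": 1}, "pipeline:RUN_ID"
)
# Serializing the returned model is what produces the stored JSON document.
serialized = json.dumps(asdict(report), indent=2)
```

The point of the sketch is the data flow: four inputs go in, one structured model comes out, and serialization is a separate final step.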
Choosing an analysis backend
Start with profiling when you need fast, deterministic baselines. Use topic modeling when you want thematic clustering and exploratory labels. Use Markov analysis when you want state-transition structure over sequences of segments. Combine multiple backends for a clear view of corpus composition, themes, and state dynamics.
Configuration files
Analysis configurations are optional JavaScript Object Notation or YAML files that capture configuration in a repeatable way. They are useful for sharing experiments and keeping runs reproducible.
Configurations support cascading composition. When a command accepts --configuration, you can pass multiple configuration files. Biblicus
merges them in order with a deep merge, so later configurations override earlier ones. You can then apply --config
overrides on top of the composed view.
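The merge order can be illustrated with a short deep-merge sketch. This is not Biblicus's implementation; `deep_merge` and `compose` are illustrative names, and the choice to replace lists and scalars wholesale (rather than merge them) is an assumption, since the behavior for non-mapping values is not specified here.

```python
from copy import deepcopy
from functools import reduce
from typing import Any, Dict, Iterable, Mapping


def deep_merge(base: Mapping[str, Any], override: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a new dict where values in `override` win over values in `base`."""
    merged = deepcopy(dict(base))
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            # Both sides are mappings: merge key by key instead of replacing.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale (assumption).
            merged[key] = deepcopy(value)
    return merged


def compose(configurations: Iterable[Mapping[str, Any]]) -> Dict[str, Any]:
    """Merge configurations in order; later entries override earlier ones."""
    return reduce(deep_merge, configurations, {})


base = {
    "schema_version": 1,
    "text_source": {"sample_size": 500},
    "bertopic_analysis": {"parameters": {"nr_topics": 8}},
}
experiment = {"bertopic_analysis": {"parameters": {"nr_topics": 12}}}
composed = compose([base, experiment])
```

Here `composed` keeps `sample_size: 500` from the base file while `nr_topics` takes the later value of 12, which is the behavior a deep merge gives and a shallow merge would not.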
Minimal profiling configuration:
schema_version: 1
Minimal topic modeling configuration:
schema_version: 1
text_source:
  sample_size: 500
bertopic_analysis:
  parameters:
    nr_topics: 8
Minimal Markov analysis configuration:
schema_version: 1
model:
  family: gaussian
  n_states: 8
segmentation:
  method: sentence
observations:
  encoder: tfidf
Topic modeling
Topic modeling is the first analysis backend. It uses BERTopic to cluster extracted text, produces per-topic evidence,
and optionally labels topics using an LLM. See docs/topic-modeling.md for detailed configuration and examples.
The integration demo script is a working reference you can use as a starting point:
python scripts/topic_modeling_integration.py --corpus corpora/ag_news_demo --force
The command prints the analysis snapshot identifier and the output path. Open the resulting output.json to inspect per-topic
labels, keywords, and document examples.
Markov analysis
Markov analysis learns a directed, weighted state transition graph over sequences of text segments. The output includes
per-state exemplars, per-item decoded paths, and optional GraphViz exports. See docs/markov-analysis.md for detailed
configuration and examples.
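To make "directed, weighted state transition graph" concrete, the sketch below derives edge weights from decoded state paths by counting transitions and normalizing per source state. It is a simplified illustration, not the estimator Biblicus uses to fit the model, and `transition_weights` is a hypothetical name.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Sequence


def transition_weights(paths: Iterable[Sequence[int]]) -> Dict[int, Dict[int, float]]:
    """Estimate P(next state | current state) from decoded state paths."""
    counts: Dict[int, Counter] = defaultdict(Counter)
    for path in paths:
        # Each adjacent pair (source, target) is one observed directed edge.
        for source, target in zip(path, path[1:]):
            counts[source][target] += 1
    graph: Dict[int, Dict[int, float]] = {}
    for source, targets in counts.items():
        total = sum(targets.values())
        # Normalize so the outgoing weights of each state sum to 1.
        graph[source] = {target: count / total for target, count in targets.items()}
    return graph
```

Each key of the result is a source state, and each inner mapping lists its outgoing edges with their weights. States that only ever appear at the end of a path have no outgoing edges and are therefore absent as keys.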
Text extract is available as a segmentation strategy for long texts. It inserts XML tags in-place using a virtual file editing loop, then extracts spans without requiring the model to re-emit the full transcript.
Profiling analysis
Profiling is the baseline analysis backend. It summarizes corpus composition and extraction coverage using
deterministic counts and distribution metrics. See docs/profiling.md for the full reference and working demo.
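As a rough picture of what "deterministic counts and distribution metrics" means, the sketch below computes extraction coverage and a few length statistics over a list of extracted texts. The report keys and the function name `profile_extracted_text` are hypothetical; the actual profiling report fields are defined in docs/profiling.md.

```python
import statistics
from typing import Any, Dict, Optional, Sequence


def profile_extracted_text(texts: Sequence[Optional[str]]) -> Dict[str, Any]:
    """Summarize coverage and text length; None means no text was extracted."""
    # Only items with non-blank text count as covered.
    lengths = [len(text) for text in texts if text is not None and text.strip()]
    total = len(texts)
    report: Dict[str, Any] = {
        "item_count": total,
        "items_with_text": len(lengths),
        "coverage": len(lengths) / total if total else 0.0,
    }
    if lengths:
        report["length_characters"] = {
            "min": min(lengths),
            "median": statistics.median(lengths),
            "max": max(lengths),
        }
    return report
```

Because the function uses only counts and order statistics, running it twice on the same input always yields the same report, which is the property that makes profiling a useful baseline.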
Minimal profiling run
python -m biblicus analyze profile --corpus corpora/example --extraction-run pipeline:RUN_ID
The command writes an analysis snapshot directory and prints the snapshot identifier.
Run profiling from the CLI:
biblicus analyze profile --corpus corpora/example --extraction-run pipeline:RUN_ID