# Corpus analysis

Biblicus supports analysis backends that run on extracted text artifacts without changing the raw corpus. Analysis is a
pluggable phase that reads an extraction snapshot, produces structured output, and stores artifacts under the corpus runs
folder. Each analysis backend declares its own configuration schema and output contract, and all schemas are validated
strictly.

## How analysis snapshots work

- Analysis runs are tied to a corpus state via the extraction snapshot reference.
- The analysis output is written under `analysis/<analysis-id>/<snapshot_id>/`.
- Analysis is reproducible when you supply the same extraction snapshot and corpus catalog state.
- Analysis configuration is stored as a configuration manifest in the run metadata.

If you omit the extraction snapshot, Biblicus uses the most recent extraction snapshot and emits a reproducibility warning. For
repeatable analysis snapshots, always pass the extraction snapshot reference explicitly.

## Analysis snapshot artifacts

Every analysis snapshot records a manifest alongside the output:

```
analysis/<analysis-id>/<snapshot_id>/
  manifest.json
  output.json
```

The manifest captures the configuration, extraction snapshot reference, and catalog timestamp so results can be reproduced and
compared later.

## Inspecting output

Analysis outputs are JSON documents. You can view them directly:

```
cat corpora/example/analysis/profiling/RUN_ID/output.json
```

Each analysis backend defines its own `report` payload. The run metadata is consistent across backends.
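
For programmatic inspection, the same file can be loaded from Python. The sketch below is illustrative only: `load_report` is a hypothetical helper, not part of the Biblicus API, and it assumes the backend payload sits under a top-level `report` key in `output.json`.

```python
import json
from pathlib import Path
from typing import Any


def load_report(output_path: Path) -> Any:
    """Return the backend-specific payload from an analysis output file."""
    document = json.loads(output_path.read_text(encoding="utf-8"))
    # Assumption: the backend payload lives under a top-level "report" key.
    if "report" not in document:
        raise KeyError(f"no 'report' key in {output_path}")
    return document["report"]
```

Call it with the path to a snapshot's `output.json`, for example `load_report(Path("corpora/example/analysis/profiling/RUN_ID/output.json"))`.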

## Comparing analysis snapshots

When you compare analysis results, record:

- Corpus path and catalog timestamp.
- Extraction snapshot reference.
- Analysis configuration name and configuration.
- Analysis snapshot identifier and output path.

These make it possible to rerun the analysis and explain differences.
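
One way to explain differences between two snapshots is to diff their manifests after loading them with `json.load`. The following is a minimal sketch rather than a Biblicus feature: `diff_manifests` and the `MISSING` marker are hypothetical names, and the manifest keys in the usage example are invented for illustration.

```python
from typing import Any, Dict, Tuple

# Marker used when a key is present in only one of the two manifests.
MISSING = "<missing>"


def diff_manifests(
    left: Dict[str, Any], right: Dict[str, Any], prefix: str = ""
) -> Dict[str, Tuple[Any, Any]]:
    """Map dotted key paths to (left, right) pairs wherever the manifests differ."""
    differences: Dict[str, Tuple[Any, Any]] = {}
    for key in sorted(set(left) | set(right)):
        path = f"{prefix}{key}"
        a = left.get(key, MISSING)
        b = right.get(key, MISSING)
        if isinstance(a, dict) and isinstance(b, dict):
            # Recurse so nested configuration changes are reported by full path.
            differences.update(diff_manifests(a, b, prefix=f"{path}."))
        elif a != b:
            differences[path] = (a, b)
    return differences


# Hypothetical manifest contents; real key names may differ.
first = {"configuration": {"nr_topics": 8}, "extraction_snapshot": "a"}
second = {"configuration": {"nr_topics": 12}, "extraction_snapshot": "a"}
print(diff_manifests(first, second))
```

An empty result means the two runs were recorded with the same manifest values, so any change in output has to come from somewhere the manifest does not capture.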

## Pluggable analysis backends

Analysis backends implement the `CorpusAnalysisBackend` interface and are registered under `biblicus.analysis`.
A backend receives the corpus, a configuration name, a configuration mapping, and an extraction snapshot reference. It returns a
Pydantic model that is serialized to JavaScript Object Notation for storage.

## Choosing an analysis backend

Start with profiling when you need fast, deterministic baselines. Use topic modeling when you want thematic clustering
and exploratory labels. Use Markov analysis when you want state-transition structure over sequences of segments.
Combine multiple backends for a clear view of corpus composition, themes, and state dynamics.

## Configuration files

Analysis configurations are optional JavaScript Object Notation or YAML files that capture a run's settings in a repeatable form.
They are useful for sharing experiments and keeping runs reproducible.

Configurations support cascading composition. When a command accepts `--configuration`, you can pass multiple configuration files. Biblicus
merges them in order; later files override earlier ones via a deep merge. You can then apply `--config`
overrides on top of the composed result.
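
The merge order described above can be sketched in a few lines of Python. This illustrates the general deep-merge technique, not the Biblicus implementation: `deep_merge` and `compose` are hypothetical names, and the sketch assumes that nested mappings merge key by key while scalars and lists are replaced outright.

```python
import copy
from typing import Any, Dict, Iterable


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new mapping in which values from override win over base."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Assumption: scalars and lists are replaced, not combined.
            merged[key] = copy.deepcopy(value)
    return merged


def compose(configurations: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge configurations in order, so later entries override earlier ones."""
    composed: Dict[str, Any] = {}
    for configuration in configurations:
        composed = deep_merge(composed, configuration)
    return composed


base = {
    "schema_version": 1,
    "text_source": {"sample_size": 500},
    "bertopic_analysis": {"parameters": {"nr_topics": 8}},
}
override = {"bertopic_analysis": {"parameters": {"nr_topics": 12}}}
print(compose([base, override]))
```

Here the second configuration changes only `nr_topics`; `schema_version` and `text_source.sample_size` carry over from the first because the merge descends into nested mappings instead of replacing them.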

Minimal profiling configuration:

```
schema_version: 1
```

Minimal topic modeling configuration:

```
schema_version: 1
text_source:
  sample_size: 500
bertopic_analysis:
  parameters:
    nr_topics: 8
```

Minimal Markov analysis configuration:

```
schema_version: 1
model:
  family: gaussian
  n_states: 8
segmentation:
  method: sentence
observations:
  encoder: tfidf
```

## Topic modeling

Topic modeling is the first analysis backend. It uses BERTopic to cluster extracted text, produces per-topic evidence,
and optionally labels topics using an LLM. See `docs/topic-modeling.md` for detailed configuration and examples.

The integration demo script is a working reference you can use as a starting point:

```
python scripts/topic_modeling_integration.py --corpus corpora/ag_news_demo --force
```

The command prints the analysis snapshot identifier and the output path. Open the resulting `output.json` to inspect per-topic
labels, keywords, and document examples.

## Markov analysis

Markov analysis learns a directed, weighted state transition graph over sequences of text segments. The output includes
per-state exemplars, per-item decoded paths, and optional GraphViz exports. See `docs/markov-analysis.md` for detailed
configuration and examples.

Text extract is available as a segmentation strategy for long texts. It inserts XML tags in-place using a virtual file
editing loop, then extracts spans without requiring the model to re-emit the full transcript.

## Profiling analysis

Profiling is the baseline analysis backend. It summarizes corpus composition and extraction coverage using
deterministic counts and distribution metrics. See `docs/profiling.md` for the full reference and working demo.

### Minimal profiling run

```
python -m biblicus analyze profile --corpus corpora/example --extraction-run pipeline:RUN_ID
```

The command writes an analysis snapshot directory and prints the snapshot identifier.

The same run using the `biblicus` command-line entry point:

```
biblicus analyze profile --corpus corpora/example --extraction-run pipeline:RUN_ID
```