Markov analysis

Biblicus provides a Markov analysis backend that learns a directed, weighted state transition graph from sequences of text segments in a corpus. It is an exploratory analysis tool that produces structured, inspectable artifacts:

A set of inferred states with per-state exemplars.
A directed, weighted transition graph between states.
A per-item decoded path that shows how each item traversed the state graph.
Optional GraphViz exports for visualization.

Markov analysis is configured using YAML configurations, validated strictly, and stored as versioned snapshot artifacts under the corpus. It is designed for experimentation across segmentation strategies and observation encodings.

Observation encoder configurability

The observation encoder configuration is fully configurable. For hybrid encoders, you can control:

observations.categorical_source: which observation field supplies categorical labels
observations.numeric_source: which observation field supplies the numeric scalar

This allows hybrid encodings to use fields other than the defaults (for example, llm_summary or segment_index) without changing the pipeline code.

Topic-driven observations

Markov analysis can run topic modeling over segments and use the resulting topic labels as categorical observations. This is useful when you want topic buckets to act as the observation symbols.

The topic modeling configuration is embedded inside the Markov configuration:

schema_version: 1
segmentation:
  method: sentence
topic_modeling:
  enabled: true
  configuration:
    schema_version: 1
    llm_extraction:
      enabled: false
    lexical_processing:
      enabled: false
    bertopic_analysis:
      parameters: {}
observations:
  categorical_source: topic_label
model:
  family: categorical
  n_states: 6

When enabled, the Markov observations include topic_id and topic_label. Setting observations.categorical_source to topic_label makes the topic labels the categorical symbols used by the Markov model.

Topic-driven runs emit extra debugging artifacts:

topic_modeling.json: the full topic modeling report used for the run.
topic_assignments.jsonl: per-segment topic assignments with the original segment text.
entity_removal.jsonl: redacted segment text used for topic modeling (when enabled).

What Markov analysis does

Markov analysis treats each item as a sequence:

Start from extracted text artifacts for a corpus item.
Segment the text into an ordered sequence (sentences, fixed windows, or provider-backed segmentation).
Convert each segment into an observation vector (categorical labels, numeric features, embeddings, or combinations).
Fit a hidden-state Markov model to learn:
- latent states,
- transition probabilities between states,
- and per-state emission distributions.
Decode a most-likely state sequence for each item and emit the state transition graph.

Text extract segmentation

Text extract is a provider-backed segmentation strategy designed for long documents. Instead of asking the model to re-emit the entire text with labels, Biblicus gives the model a virtual file and asks it to insert XML tags in-place. The model uses str_replace tool calls (old_str/new_str pairs), which Biblicus applies in memory.

This pattern has two benefits:

The model only emits a small edit script, which is cheaper and faster than reprinting the full transcript.
The original text remains intact; validation can prove that only tags were inserted.

Biblicus uses XML-style tags (<span>...</span>) so the edits are well-formed and easy to validate deterministically.

After applying the tags, Biblicus parses spans (the tagged text ranges). Interstitial content remains available in the marked-up string without forcing the model to cover every token.

Example snippet:

System prompt excerpt:

System prompt (excerpt):

You are a virtual file editor. Use the available tools to edit the text.
Interpret the word "return" in the user's request as: wrap the returned text with
<span>...</span> in-place in the current text.
Current text:
---
Greeting. Verification. Resolution.
---

User prompt:

User prompt:

Return the segments that represent contiguous phases in the text.

Input text:

Input text:

Greeting. Verification. Resolution.

Marked-up text:

Marked-up text:

<span>Greeting.</span> <span>Verification.</span> <span>Resolution.</span>

Structured data:

Structured data (result):

{
  "marked_up_text": "<span>Greeting.</span> <span>Verification.</span> <span>Resolution.</span>",
  "spans": [
    {"index": 1, "start_char": 0, "end_char": 10, "text": "Greeting."},
    {"index": 2, "start_char": 11, "end_char": 24, "text": "Verification."},
    {"index": 3, "start_char": 25, "end_char": 37, "text": "Resolution."}
  ],
  "warnings": []
}

Run Markov analysis from the CLI

biblicus analyze markov --corpus corpora/ag_news_demo_2k --configuration configurations/markov/local-discovery.yml

Example span markup configuration (text extract provider-backed):

schema_version: 1
segmentation:
  method: span_markup
  span_markup:
    client:
      provider: openai
      model: gpt-4o-mini
      api_key: null
      response_format: json_object
    system_prompt: |
      You are a virtual file editor. Use the available tools to edit the text.
      Interpret the word “return” in the user’s request as: wrap the returned text
      with <span>...</span> in-place in the current text.

      Use the str_replace tool to insert <span>...</span> tags and the done tool when finished.
      When finished, call done. Do NOT return JSON in the assistant message.

      Rules:
      - Use str_replace only.
      - old_str must match exactly once in the current text.
      - old_str and new_str must be non-empty strings.
      - new_str must be identical to old_str with only <span> and </span> inserted.
      - Do not include <span> or </span> inside old_str or new_str.
      - Do not insert nested spans.
      - If a tool call fails due to non-unique old_str, retry with a longer unique old_str.
      - If a tool call fails, read the error and keep editing. Do not call done until spans are inserted.
      - Do not delete, reorder, paraphrase, or label text.
      Current text:
      ---
      {text}
      ---
    prompt_template: |
      Return the segments that represent contiguous phases in the text.

      Rules:
      - Preserve original order.
      - Do not add labels, summaries, or commentary.
      - Prefer natural boundaries like greeting/opening, identity verification, reason for call,
        clarification, resolution steps, handoff/escalation, closing.
      - Use speaker turn changes as possible boundaries, but keep multi-turn exchanges together if they
        form a single phase.
      - Avoid extremely short fragments; merge tiny leftovers into a neighboring span.
model:
  family: gaussian
  n_states: 4
observations:
  encoder: tfidf

When no extraction snapshot is provided, Markov analysis looks for a default recipe at corpora/<Corpus>/recipes/extraction/default.yml and builds or reuses the matching snapshot. Recipes can include an optional max_workers field to control extraction concurrency. To keep runs reproducible, pass an extraction snapshot explicitly:

biblicus analyze markov \
  --corpus corpora/ag_news_demo_2k \
  --configuration configurations/markov/local-discovery.yml \
  --extraction-snapshot pipeline:RUN_ID

Cascading configurations and CLI overrides

Markov analysis configurations support cascading composition. You can pass multiple --configuration files; later configurations override earlier configurations via a deep merge:

biblicus analyze markov \
  --corpus corpora/ag_news_demo_2k \
  --configuration configurations/markov/base.yml \
  --configuration configurations/markov/guided.yml

To override the composed configuration view from the command line, use --config key=value with dotted keys:

biblicus analyze markov \
  --corpus corpora/ag_news_demo_2k \
  --configuration configurations/markov/base.yml \
  --configuration configurations/markov/guided.yml \
  --config model.n_states=14

Omitted fields use the default values from the Markov analysis schema. Missing required fields remain hard errors.

LLM observation cache

When llm_observations.enabled is true, Biblicus caches per-segment labels and summaries so you can rerun Markov analysis without re-labeling the same text. The cache is keyed by the LLM client configuration (minus the API key), the prompt templates, and an optional cache_name, so changing prompts or models creates a fresh cache automatically. Use llm_observations.cache.cache_name to version caches for experiments.

Cache location:

.biblicus/cache/markov/llm-observations/<cache_id>/<extractor_id>/<snapshot_id>/

The cache is updated incrementally. If a prior run stopped partway through, rerunning the analysis will only label the missing segments and continue.

To disable caching, set:

llm_observations:
  cache:
    enabled: false

Output location and artifacts

Markov analysis output is stored under:

analysis/markov/<snapshot_id>/

The snapshot directory contains a manifest and structured artifacts. The canonical output is output.json. Additional files provide intermediate visibility, such as segments and observations used to fit the model. When enabled, GraphViz output is written as transitions.dot.

Working demo

The integration demo script is a working reference you can use as a starting point:

python scripts/markov_analysis_demo.py --corpus corpora/ag_news_demo_2k

If you want the script to download and ingest AG News into a fresh corpus directory, pass --download (this requires the optional datasets dependency).

Markov analysis requires an optional dependency:

python -m pip install "biblicus[markov-analysis]"

The demo builds or reuses an extraction snapshot, executes Markov analysis with example configurations, and prints the resulting run paths. Inspect the emitted output.json and graph artifacts to understand states and transitions.