Markov analysis
Biblicus provides a Markov analysis backend that learns a directed, weighted state transition graph from sequences of text segments in a corpus. It is an exploratory analysis tool that produces structured, inspectable artifacts:
A set of inferred states with per-state exemplars.
A directed, weighted transition graph between states.
A per-item decoded path that shows how each item traversed the state graph.
Optional GraphViz exports for visualization.
Markov analysis is configured with YAML files, validated strictly, and stored as versioned snapshot artifacts under the corpus. It is designed for experimentation across segmentation strategies and observation encodings.
Observation encoder configurability
The observation encoder is fully configurable. For hybrid encoders, you can control:
observations.categorical_source: which observation field supplies the categorical labels
observations.numeric_source: which observation field supplies the numeric scalar
This allows hybrid encodings to use fields other than the defaults (for example, llm_summary or segment_index)
without changing the pipeline code.
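As a rough illustration of what these two settings select, the sketch below picks the categorical label and the numeric scalar from a single observation record. The function name hybrid_inputs and the shape of the observation dictionary are assumptions made for this example, not the Biblicus implementation.

```python
from typing import Any, Dict, Tuple


def hybrid_inputs(observation: Dict[str, Any], observations_config: Dict[str, str]) -> Tuple[str, float]:
    """Pick the categorical label and numeric scalar named by the configuration."""
    categorical_field = observations_config["categorical_source"]
    numeric_field = observations_config["numeric_source"]
    # A missing field raises KeyError instead of silently falling back.
    label = str(observation[categorical_field])
    scalar = float(observation[numeric_field])
    return label, scalar


# Hypothetical observation record for one segment.
observation = {
    "llm_summary": "caller states the problem",
    "segment_index": 3,
    "topic_label": "billing",
}
config = {"categorical_source": "llm_summary", "numeric_source": "segment_index"}
pair = hybrid_inputs(observation, config)
print(pair)
```

Switching categorical_source to topic_label in this sketch changes the symbol without touching the function, which is the point of making the sources configurable.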
Topic-driven observations
Markov analysis can run topic modeling over segments and use the resulting topic labels as categorical observations. This is useful when you want topic buckets to act as the observation symbols.
The topic modeling configuration is embedded inside the Markov configuration:
schema_version: 1
segmentation:
  method: sentence
topic_modeling:
  enabled: true
  configuration:
    schema_version: 1
    llm_extraction:
      enabled: false
    lexical_processing:
      enabled: false
    bertopic_analysis:
      parameters: {}
observations:
  categorical_source: topic_label
model:
  family: categorical
  n_states: 6
When enabled, the Markov observations include topic_id and topic_label. Setting
observations.categorical_source to topic_label makes the topic labels the categorical symbols
used by the Markov model.
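A categorical model works on integer symbols rather than strings, so the topic labels have to be mapped to codes at some point. The sketch below shows one simple way to do that; the helper encode_symbols and the sample segment records are hypothetical, and Biblicus may assign codes differently.

```python
from typing import Dict, List, Tuple


def encode_symbols(labels: List[str]) -> Tuple[List[int], Dict[str, int]]:
    """Map each distinct label to an integer code in order of first appearance."""
    vocabulary: Dict[str, int] = {}
    codes: List[int] = []
    for label in labels:
        if label not in vocabulary:
            vocabulary[label] = len(vocabulary)
        codes.append(vocabulary[label])
    return codes, vocabulary


# Hypothetical per-segment observations for one item.
segments = [
    {"topic_id": 4, "topic_label": "greeting"},
    {"topic_id": 7, "topic_label": "billing"},
    {"topic_id": 4, "topic_label": "greeting"},
    {"topic_id": 2, "topic_label": "closing"},
]
codes, vocabulary = encode_symbols([segment["topic_label"] for segment in segments])
print(codes, vocabulary)
```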
Topic-driven runs emit extra debugging artifacts:
topic_modeling.json: the full topic modeling report used for the run.
topic_assignments.jsonl: per-segment topic assignments with the original segment text.
entity_removal.jsonl: redacted segment text used for topic modeling (when enabled).
What Markov analysis does
Markov analysis treats each item as a sequence:
Start from extracted text artifacts for a corpus item.
Segment the text into an ordered sequence (sentences, fixed windows, or provider-backed segmentation).
Convert each segment into an observation vector (categorical labels, numeric features, embeddings, or combinations).
Fit a hidden-state Markov model to learn:
latent states,
transition probabilities between states,
and per-state emission distributions.
Decode a most-likely state sequence for each item and emit the state transition graph.
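The decoding step can be sketched with the standard Viterbi algorithm for a categorical hidden Markov model. This is a minimal pure-Python version under stated assumptions: the model parameters below are invented for the example, the fitting step is omitted, and the document does not say which library or decoder Biblicus uses.

```python
import math
from typing import List, Sequence


def viterbi(
    observations: Sequence[int],
    start: Sequence[float],
    transition: Sequence[Sequence[float]],
    emission: Sequence[Sequence[float]],
) -> List[int]:
    """Return the most likely hidden-state path for a categorical HMM."""

    def log(probability: float) -> float:
        return math.log(probability) if probability > 0 else float("-inf")

    n_states = len(start)
    # scores[s] is the best log-probability of any path that ends in state s.
    scores = [log(start[s]) + log(emission[s][observations[0]]) for s in range(n_states)]
    backpointers: List[List[int]] = []
    for symbol in observations[1:]:
        new_scores: List[float] = []
        pointers: List[int] = []
        for s in range(n_states):
            best_prev = max(range(n_states), key=lambda p: scores[p] + log(transition[p][s]))
            new_scores.append(scores[best_prev] + log(transition[best_prev][s]) + log(emission[s][symbol]))
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    # Walk the backpointers from the best final state to recover the path.
    state = max(range(n_states), key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Invented two-state model over two observation symbols.
start = [0.9, 0.1]
transition = [[0.8, 0.2], [0.1, 0.9]]
emission = [[0.9, 0.1], [0.2, 0.8]]
decoded = viterbi([0, 0, 1, 1], start, transition, emission)
print(decoded)
```

The decoded path is what a per-item path artifact records: one state per segment, in segment order.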
Text extract segmentation
Text extract is a provider-backed segmentation strategy designed for long documents. Instead of asking the model to re-emit the entire text with labels, Biblicus gives the model a virtual file and asks it to insert XML tags in-place. The model uses str_replace tool calls (old_str/new_str pairs), which Biblicus applies in memory.
This pattern has two benefits:
The model only emits a small edit script, which is cheaper and faster than reprinting the full transcript.
The original text remains intact; validation can prove that only tags were inserted.
Biblicus uses XML-style tags (<span>...</span>) so the edits are well-formed and easy to validate deterministically.
After applying the tags, Biblicus parses spans (the tagged text ranges). Interstitial content remains available in the marked-up string without forcing the model to cover every token.
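The edit-and-validate loop described above can be sketched in a few lines. The function names here (apply_str_replace, only_tags_inserted, parse_spans) are illustrative assumptions, not the Biblicus API; the sketch only demonstrates the exactly-once match rule and the check that nothing but tags was inserted.

```python
import re
from typing import List

TAG_PATTERN = re.compile(r"</?span>")


def apply_str_replace(text: str, old_str: str, new_str: str) -> str:
    """Apply one edit, enforcing the exactly-once match rule."""
    if not old_str or not new_str:
        raise ValueError("old_str and new_str must be non-empty")
    occurrences = text.count(old_str)
    if occurrences != 1:
        raise ValueError(f"old_str matched {occurrences} times; expected exactly 1")
    return text.replace(old_str, new_str, 1)


def only_tags_inserted(original: str, marked_up: str) -> bool:
    """Stripping every span tag must give back the original text unchanged."""
    return TAG_PATTERN.sub("", marked_up) == original


def parse_spans(marked_up: str) -> List[str]:
    """Return the text of each tagged range, in document order."""
    return re.findall(r"<span>(.*?)</span>", marked_up, flags=re.DOTALL)


original = "Greeting. Verification. Resolution."
marked_up = original
for phrase in ["Greeting.", "Verification.", "Resolution."]:
    marked_up = apply_str_replace(marked_up, phrase, f"<span>{phrase}</span>")

print(marked_up)
print(only_tags_inserted(original, marked_up), parse_spans(marked_up))
```

Because validation reduces to a string comparison after removing the tags, a model that paraphrases or deletes text is detected deterministically.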
Example snippet:
System prompt excerpt:
You are a virtual file editor. Use the available tools to edit the text.
Interpret the word "return" in the user's request as: wrap the returned text with
<span>...</span> in-place in the current text.
Current text:
---
Greeting. Verification. Resolution.
---
User prompt:
Return the segments that represent contiguous phases in the text.
Input text:
Greeting. Verification. Resolution.
Marked-up text:
<span>Greeting.</span> <span>Verification.</span> <span>Resolution.</span>
Structured data:
{
  "marked_up_text": "<span>Greeting.</span> <span>Verification.</span> <span>Resolution.</span>",
  "spans": [
    {"index": 1, "start_char": 0, "end_char": 10, "text": "Greeting."},
    {"index": 2, "start_char": 11, "end_char": 24, "text": "Verification."},
    {"index": 3, "start_char": 25, "end_char": 37, "text": "Resolution."}
  ],
  "warnings": []
}
Run Markov analysis from the CLI
biblicus analyze markov --corpus corpora/ag_news_demo_2k --configuration configurations/markov/local-discovery.yml
Example span markup configuration (provider-backed text extract segmentation):
schema_version: 1
segmentation:
  method: span_markup
  span_markup:
    client:
      provider: openai
      model: gpt-4o-mini
      api_key: null
      response_format: json_object
    system_prompt: |
      You are a virtual file editor. Use the available tools to edit the text.
      Interpret the word "return" in the user's request as: wrap the returned text
      with <span>...</span> in-place in the current text.
      Use the str_replace tool to insert <span>...</span> tags and the done tool when finished.
      When finished, call done. Do NOT return JSON in the assistant message.
      Rules:
      - Use str_replace only.
      - old_str must match exactly once in the current text.
      - old_str and new_str must be non-empty strings.
      - new_str must be identical to old_str with only <span> and </span> inserted.
      - Do not include <span> or </span> inside old_str or new_str.
      - Do not insert nested spans.
      - If a tool call fails due to non-unique old_str, retry with a longer unique old_str.
      - If a tool call fails, read the error and keep editing. Do not call done until spans are inserted.
      - Do not delete, reorder, paraphrase, or label text.
      Current text:
      ---
      {text}
      ---
    prompt_template: |
      Return the segments that represent contiguous phases in the text.
      Rules:
      - Preserve original order.
      - Do not add labels, summaries, or commentary.
      - Prefer natural boundaries like greeting/opening, identity verification, reason for call,
        clarification, resolution steps, handoff/escalation, closing.
      - Use speaker turn changes as possible boundaries, but keep multi-turn exchanges together if they
        form a single phase.
      - Avoid extremely short fragments; merge tiny leftovers into a neighboring span.
model:
  family: gaussian
  n_states: 4
observations:
  encoder: tfidf
When no extraction snapshot is provided, Markov analysis looks for a default recipe at
corpora/<Corpus>/recipes/extraction/default.yml and builds or reuses the matching snapshot.
Recipes can include an optional max_workers field to control extraction concurrency. To keep
runs reproducible, pass an extraction snapshot explicitly:
biblicus analyze markov \
--corpus corpora/ag_news_demo_2k \
--configuration configurations/markov/local-discovery.yml \
--extraction-snapshot pipeline:RUN_ID
Cascading configurations and CLI overrides
Markov analysis configurations support cascading composition. You can pass multiple --configuration files; later configurations override
earlier configurations via a deep merge:
biblicus analyze markov \
--corpus corpora/ag_news_demo_2k \
--configuration configurations/markov/base.yml \
--configuration configurations/markov/guided.yml
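A deep merge of this kind can be sketched as follows. This is an illustrative version, not the Biblicus implementation: it merges nested mappings key by key and lets any other value (including lists) from the later configuration replace the earlier one, which is an assumption about behavior the text does not spell out.

```python
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new mapping where override wins, merging nested mappings recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists from the later configuration replace earlier values.
            merged[key] = value
    return merged


# Stand-ins for the parsed contents of base.yml and guided.yml.
base = {
    "segmentation": {"method": "sentence"},
    "model": {"family": "categorical", "n_states": 6},
}
guided = {"model": {"n_states": 8}}
composed = deep_merge(base, guided)
print(composed)
```

Only model.n_states changes in the composed view; model.family and the segmentation block carry over from the earlier file.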
To override the composed configuration view from the command line, use --config key=value with dotted keys:
biblicus analyze markov \
--corpus corpora/ag_news_demo_2k \
--configuration configurations/markov/base.yml \
--configuration configurations/markov/guided.yml \
--config model.n_states=14
Omitted fields use the default values from the Markov analysis schema. Missing required fields remain hard errors.
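A dotted-key override such as model.n_states=14 amounts to walking the nested configuration and setting one leaf. The sketch below is an assumption-laden illustration: the helper apply_override is hypothetical, and parsing the value as JSON with a fallback to a plain string is a guess at the coercion rules, which are not documented here.

```python
import json
from typing import Any, Dict


def apply_override(config: Dict[str, Any], assignment: str) -> None:
    """Set one nested value in place from a 'dotted.key=value' string."""
    dotted_key, _, raw_value = assignment.partition("=")
    try:
        # Assumed coercion: "14" becomes 14, "true" becomes True.
        value: Any = json.loads(raw_value)
    except json.JSONDecodeError:
        # Anything that is not valid JSON is kept as a plain string.
        value = raw_value
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value


config = {"model": {"family": "categorical", "n_states": 6}}
apply_override(config, "model.n_states=14")
apply_override(config, "model.family=gaussian")
print(config)
```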
LLM observation cache
When llm_observations.enabled is true, Biblicus caches per-segment labels and summaries so you can rerun Markov analysis without
re-labeling the same text. The cache is keyed by the LLM client configuration (minus the API key), the prompt templates, and an
optional cache_name, so changing prompts or models creates a fresh cache automatically. Use llm_observations.cache.cache_name
to version caches for experiments.
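One way to derive such a key is to hash a canonical serialization of the inputs with the API key left out. The sketch below assumes SHA-256 over sorted-key JSON and a hypothetical function name cache_id; the actual hashing scheme Biblicus uses is not described in this document.

```python
import hashlib
import json
from typing import Any, Dict, Optional


def cache_id(
    client_config: Dict[str, Any],
    prompt_templates: Dict[str, str],
    cache_name: Optional[str] = None,
) -> str:
    """Hash everything that affects labels, excluding the API key."""
    key_material = {
        "client": {k: v for k, v in client_config.items() if k != "api_key"},
        "prompts": prompt_templates,
        "cache_name": cache_name,
    }
    # Sorted keys and fixed separators make the serialization deterministic.
    canonical = json.dumps(key_material, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


client = {"provider": "openai", "model": "gpt-4o-mini", "api_key": "secret-a"}
prompts = {"label": "Label this segment: {text}"}
first = cache_id(client, prompts)
rotated_key = cache_id({**client, "api_key": "secret-b"}, prompts)
renamed = cache_id(client, prompts, cache_name="experiment-2")
print(first == rotated_key, first == renamed)
```

Rotating the API key leaves the identifier unchanged, while changing the model, a prompt, or cache_name produces a different one.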
Cache location:
.biblicus/cache/markov/llm-observations/<cache_id>/<extractor_id>/<snapshot_id>/
The cache is updated incrementally. If a prior run stopped partway through, rerunning the analysis will only label the missing segments and continue.
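The resume behavior can be illustrated with a small file-backed cache. Everything concrete here is an assumption for the example: one JSON file per segment, the helper label_with_cache, and a fake labeler standing in for the LLM call. The on-disk layout inside the real cache directory is not documented in this section.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def label_with_cache(
    segments: Dict[str, str],
    cache_dir: Path,
    label_fn: Callable[[str], str],
) -> Dict[str, str]:
    """Label segments, calling label_fn only for segments without a cache entry."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    labels: Dict[str, str] = {}
    for segment_id, text in segments.items():
        entry = cache_dir / f"{segment_id}.json"
        if entry.exists():
            labels[segment_id] = json.loads(entry.read_text(encoding="utf-8"))["label"]
            continue
        label = label_fn(text)
        # Write each entry as soon as it is produced so an interrupted run keeps its progress.
        entry.write_text(json.dumps({"label": label}), encoding="utf-8")
        labels[segment_id] = label
    return labels


calls: List[str] = []


def fake_labeler(text: str) -> str:
    """Stand-in for an LLM call; records each invocation."""
    calls.append(text)
    return text.split()[0].lower()


segments = {
    "seg-0": "Greeting the caller",
    "seg-1": "Verification of identity",
    "seg-2": "Resolution steps",
}
with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp) / "llm-observations"
    # Simulate a first run that stopped after two segments.
    label_with_cache(dict(list(segments.items())[:2]), cache_dir, fake_labeler)
    # The rerun labels only the segment that is still missing.
    labels = label_with_cache(segments, cache_dir, fake_labeler)

print(len(calls), labels)
```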
To disable caching, set:
llm_observations:
  cache:
    enabled: false
Output location and artifacts
Markov analysis output is stored under:
analysis/markov/<snapshot_id>/
The snapshot directory contains a manifest and structured artifacts. The canonical output is output.json. Additional files
provide intermediate visibility, such as segments and observations used to fit the model. When enabled, GraphViz output
is written as transitions.dot.
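To give a sense of what a GraphViz export of a transition graph looks like, the sketch below renders a transition matrix as DOT text. The node names, label format, and probability threshold are assumptions for illustration; the exact contents of the transitions.dot file that Biblicus writes are not specified here.

```python
from typing import List, Sequence


def transitions_to_dot(transition: Sequence[Sequence[float]], min_probability: float = 0.05) -> str:
    """Render a transition matrix as a GraphViz digraph, skipping weak edges."""
    lines: List[str] = ["digraph markov {"]
    for source, row in enumerate(transition):
        for target, probability in enumerate(row):
            if probability >= min_probability:
                lines.append(f'  s{source} -> s{target} [label="{probability:.2f}"];')
    lines.append("}")
    return "\n".join(lines)


# Invented two-state transition matrix.
dot = transitions_to_dot([[0.8, 0.2], [0.0, 1.0]])
print(dot)
```

The resulting text can be rendered with the GraphViz dot command, for example dot -Tpng on a saved file.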
Working demo
The integration demo script is a working reference you can use as a starting point:
python scripts/markov_analysis_demo.py --corpus corpora/ag_news_demo_2k
If you want the script to download and ingest AG News into a fresh corpus directory, pass --download (this requires the
optional datasets dependency).
Markov analysis requires an optional dependency:
python -m pip install "biblicus[markov-analysis]"
The demo builds or reuses an extraction snapshot, executes Markov analysis with example configurations, and prints the resulting run
paths. Inspect the emitted output.json and graph artifacts to understand states and transitions.