# Topic modeling

Biblicus provides a topic modeling analysis backend that reads extracted text artifacts, optionally applies an LLM extraction pass, optionally removes named entities, applies lexical processing, runs BERTopic, and optionally applies an LLM fine-tuning pass for labels. The output is structured JSON with explicit per-topic evidence.

## What topic modeling does

Topic modeling groups documents into clusters based on shared terms or phrases, then surfaces representative keywords for each cluster. It is a fast way to summarize large corpora, identify dominant themes, and spot outliers without manual labeling. Topic modeling is not a classifier; it is an exploratory tool that produces evidence a human can inspect and review.

## About BERTopic

BERTopic combines document embeddings with clustering and a class-based term frequency approach to extract topic keywords. Biblicus supports BERTopic as an optional dependency and forwards its configuration parameters directly to the BERTopic constructor. This lets you tune clustering behavior while keeping the output in a consistent schema.

## Pipeline stages

- Text collection reads extracted text artifacts from an extraction snapshot.
- LLM extraction optionally transforms each document into one or more analysis documents.
- Entity removal optionally deletes named entities before modeling.
- Lexical processing optionally normalizes text before BERTopic.
- BERTopic produces topic assignments and keyword weights.
- LLM fine-tuning optionally replaces topic labels based on sampled documents.

## Run topic modeling from the CLI

```
biblicus analyze topics --corpus corpora/example --configuration configurations/topic-modeling.yml --extraction-run pipeline:RUN_ID
```

Topic modeling configurations support cascading composition.
Pass multiple `--configuration` files; later configurations override earlier configurations via a deep merge:

```
biblicus analyze topics \
  --corpus corpora/example \
  --configuration configurations/topic-modeling/base.yml \
  --configuration configurations/topic-modeling/ag-news.yml \
  --extraction-run pipeline:RUN_ID
```

To override the composed configuration view from the command line, use `--config key=value` with dotted keys:

```
biblicus analyze topics \
  --corpus corpora/example \
  --configuration configurations/topic-modeling/base.yml \
  --configuration configurations/topic-modeling/ag-news.yml \
  --config bertopic_analysis.parameters.nr_topics=12 \
  --extraction-run pipeline:RUN_ID
```

If you omit `--extraction-run`, Biblicus uses the latest extraction snapshot and emits a reproducibility warning.

## Output structure

Topic modeling writes a single `output.json` file under the analysis snapshot directory. The output contains:

- `run.snapshot_id` and `run.stats` for reproducible tracking.
- `report.topics` with the modeled topics.
- `report.text_collection`, `report.llm_extraction`, `report.entity_removal`, `report.lexical_processing`, `report.bertopic_analysis`, and `report.llm_fine_tuning` describing each pipeline stage.

When entity removal is enabled, Biblicus also writes `entity_removal.jsonl` alongside `output.json`. This artifact contains the redacted documents that feed BERTopic and is reused on subsequent runs for the same snapshot.

Each topic record includes:

- `topic_id`: The BERTopic topic identifier. The outlier topic uses `-1`.
- `label`: The human-readable label.
- `label_source`: `bertopic` or `llm` depending on the stage that set the label.
- `keywords`: Keyword list with weights.
- `document_count`: Number of documents assigned to the topic.
- `document_ids`: Item identifiers for the assigned documents.
- `document_examples`: Sampled document text used for inspection.
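The topic records above can be summarized programmatically. A minimal sketch that reads `output.json` and prints one line per topic, assuming only the documented fields (`report.topics`, `topic_id`, `label`, `keywords`, `document_count`):

```python
import json


def summarize_topics(output_path: str) -> list[str]:
    """Return one summary line per topic from a topic modeling output.json."""
    with open(output_path, encoding="utf-8") as handle:
        output = json.load(handle)
    lines = []
    for topic in output["report"]["topics"]:
        # Top keywords, in the order BERTopic ranked them.
        top_keywords = ", ".join(k["keyword"] for k in topic["keywords"][:5])
        # Topic -1 is the outlier bucket documented above.
        marker = " (outliers)" if topic["topic_id"] == -1 else ""
        lines.append(
            f'{topic["topic_id"]}{marker}: {topic["label"]} '
            f'[{topic["document_count"]} docs] {top_keywords}'
        )
    return lines
```

This is an illustrative helper, not part of the Biblicus API; it only depends on the output schema documented in this section.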
Per-topic behavior is determined by the BERTopic assignments and the optional fine-tuning stage. The lexical processing flags can substantially change tokenization and therefore the resulting topic labels. The outlier `topic_id` `-1` indicates documents that BERTopic could not confidently assign to a cluster.

### Reading a topic record

Each topic includes evidence you can inspect. A shortened example:

```json
{
  "topic_id": 2,
  "label": "global markets and stocks",
  "label_source": "bertopic",
  "keywords": [
    {"keyword": "stocks", "weight": 0.42},
    {"keyword": "market", "weight": 0.37}
  ],
  "document_count": 124,
  "document_examples": [
    "Stocks climbed after the earnings report ...",
    "Markets opened higher as investors ..."
  ]
}
```

Use `document_examples` as a sanity check, then trace `document_ids` back to the corpus for deeper inspection.

## Configuration reference

Topic modeling configurations use a strict schema. Unknown fields or type mismatches are errors.

### Text source

- `text_source.sample_size`: Limit the number of documents used for analysis.
- `text_source.min_text_characters`: Drop documents shorter than this count.

### LLM extraction

- `llm_extraction.enabled`: Enable the LLM extraction stage.
- `llm_extraction.method`: `single` or `itemize` to control whether an input maps to one or many documents.
- `llm_extraction.client`: LLM client configuration (requires `biblicus[openai]`).
- `llm_extraction.prompt_template`: Prompt template for the extraction stage.
- `llm_extraction.system_prompt`: Optional system prompt.

### Entity removal

- `entity_removal.enabled`: Enable local named-entity removal.
- `entity_removal.provider`: `spacy` (required).
- `entity_removal.model`: spaCy model name (for example `en_core_web_sm`).
- `entity_removal.entity_types`: Entity labels to remove. Empty uses defaults.
- `entity_removal.replace_with`: Replacement text inserted for removed entities.
- `entity_removal.collapse_whitespace`: Normalize whitespace after removals.
- `entity_removal.regex_patterns`: Optional regex patterns applied after NER.
- `entity_removal.regex_replace_with`: Replacement text for regex removals.

### Lexical processing

- `lexical_processing.enabled`: Enable normalization.
- `lexical_processing.lowercase`: Lowercase text before tokenization.
- `lexical_processing.strip_punctuation`: Remove punctuation before tokenization.
- `lexical_processing.collapse_whitespace`: Normalize repeated whitespace.

### BERTopic configuration

- `bertopic_analysis.parameters`: Mapping of BERTopic constructor parameters.
- `bertopic_analysis.vectorizer.ngram_range`: Inclusive n-gram range (for example `[1, 2]`).
- `bertopic_analysis.vectorizer.stop_words`: `english` or a list of stop words. Set to `null` to disable.

### LLM fine-tuning

- `llm_fine_tuning.enabled`: Enable LLM topic labeling.
- `llm_fine_tuning.client`: LLM client configuration.
- `llm_fine_tuning.prompt_template`: Prompt template containing `{keywords}` and `{documents}`.
- `llm_fine_tuning.system_prompt`: Optional system prompt.
- `llm_fine_tuning.max_keywords`: Maximum keywords included per prompt.
- `llm_fine_tuning.max_documents`: Maximum documents included per prompt.

## Vectorizer configuration

Biblicus forwards BERTopic configuration through `bertopic_analysis.parameters` and exposes vectorizer settings through `bertopic_analysis.vectorizer`. To include bigrams, set `ngram_range` to `[1, 2]`. To remove stop words, set `stop_words` to `english` or a list.

```yaml
bertopic_analysis:
  parameters:
    min_topic_size: 10
    nr_topics: 12
  vectorizer:
    ngram_range: [1, 2]
    stop_words: english
```

## Repeatable integration script

The integration script downloads AG News, runs extraction, and then runs topic modeling with the selected parameters. It prints a summary with the analysis snapshot identifier and the output path.
```
python scripts/topic_modeling_integration.py --corpus corpora/ag_news_demo --force
```

### Example: raise topic count

```
python scripts/topic_modeling_integration.py \
  --corpus corpora/ag_news_demo \
  --force \
  --limit 10000 \
  --vectorizer-ngram-min 1 \
  --vectorizer-ngram-max 2 \
  --bertopic-param nr_topics=8 \
  --bertopic-param min_topic_size=2
```

### Example: disable lexical processing and restrict inputs

```
python scripts/topic_modeling_integration.py \
  --corpus corpora/ag_news_demo \
  --force \
  --sample-size 200 \
  --min-text-characters 200 \
  --no-lexical-enabled
```

### Example: keep lexical processing but preserve punctuation

```
python scripts/topic_modeling_integration.py \
  --corpus corpora/ag_news_demo \
  --force \
  --no-lexical-strip-punctuation
```

BERTopic parameters are passed directly to the constructor. Use repeated `--bertopic-param key=value` pairs for multiple parameters. Values that look like JSON objects or arrays are parsed as JSON.

The integration script requires at least 16 documents to avoid errors from BERTopic's default UMAP settings. Increase `--limit` or use a larger corpus if you receive a small-corpus error.

AG News downloads require the `datasets` dependency. Install with:

```
python -m pip install "biblicus[datasets,topic-modeling]"
```

## Tuning workflow

Start with a small sample to validate the pipeline, then scale up:

1) Run with `--limit 500` to validate extraction and output structure.
2) Add bigrams and stop words to reduce noise in keyword lists.
3) Increase `--limit` or `--sample-size` once topics look stable.
4) Experiment with `nr_topics` and `min_topic_size` to control granularity.

## Interpreting results

When a topic looks off, inspect the `document_examples` and compare them to the keyword list. If the documents do not match the keywords, adjust lexical processing or increase `min_topic_size` to reduce noise.

## Common pitfalls

- Using too few documents for BERTopic defaults (aim for at least 16).
- Forgetting to enable stop words and ending up with filler topics.
- Comparing runs that used different extraction prompts or lexical settings.
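The "Interpreting results" check — do the example documents actually contain the topic keywords? — can be scripted against a topic record. A minimal sketch, not part of Biblicus itself; field names follow the output structure documented above:

```python
def keyword_coverage(topic: dict) -> float:
    """Fraction of a topic's keywords that appear in at least one example document."""
    examples = " ".join(topic["document_examples"]).lower()
    keywords = [entry["keyword"].lower() for entry in topic["keywords"]]
    if not keywords:
        return 0.0
    hits = sum(1 for keyword in keywords if keyword in examples)
    return hits / len(keywords)


topic = {
    "keywords": [
        {"keyword": "stocks", "weight": 0.42},
        {"keyword": "market", "weight": 0.37},
    ],
    "document_examples": [
        "Stocks climbed after the earnings report ...",
        "Markets opened higher as investors ...",
    ],
}
print(keyword_coverage(topic))  # 1.0: both keywords appear in the examples
```

Low coverage is the mismatch signal described above: adjust lexical processing or increase `min_topic_size` before trusting the topic.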