Topic modeling

Biblicus provides a topic modeling analysis backend that reads extracted text artifacts, optionally applies an LLM extraction pass, optionally removes named entities, optionally applies lexical processing, runs BERTopic, and optionally applies an LLM fine-tuning pass for labels. The output is structured JSON with explicit per-topic evidence.

What topic modeling does

Topic modeling groups documents into clusters based on shared terms or phrases, then surfaces representative keywords for each cluster. It is a fast way to summarize large corpora, identify dominant themes, and spot outliers without manual labeling. Topic modeling is not a classifier; it is an exploratory tool that produces evidence humans can inspect and review.

About BERTopic

BERTopic combines document embeddings with clustering and a class-based term frequency approach to extract topic keywords. Biblicus supports BERTopic as an optional dependency and forwards its configuration parameters directly to the BERTopic constructor. This allows you to tune clustering behavior while keeping the output in a consistent schema.

Pipeline stages

  • Text collection reads extracted text artifacts from an extraction snapshot.

  • LLM extraction optionally transforms each document into one or more analysis documents.

  • Entity removal optionally deletes named entities before modeling.

  • Lexical processing optionally normalizes text before BERTopic.

  • BERTopic produces topic assignments and keyword weights.

  • LLM fine-tuning optionally replaces topic labels based on sampled documents.
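The stages above compose as a chain of optional transforms over the document list. A minimal sketch of that control flow (the function and stage names here are illustrative, not the Biblicus API):

```python
def run_pipeline(documents, config, stages):
    """Sketch: each enabled stage transforms the document list in order
    before BERTopic consumes the result. `stages` maps stage names to
    transform functions (hypothetical stand-ins for the real stages)."""
    order = ["llm_extraction", "entity_removal", "lexical_processing"]
    for name in order:
        if config.get(name, {}).get("enabled"):
            documents = stages[name](documents)
    return documents

# Toy usage: only lexical processing is enabled.
config = {"lexical_processing": {"enabled": True}}
stages = {"lexical_processing": lambda docs: [d.lower() for d in docs]}
print(run_pipeline(["Stocks Climb", "Markets Open"], config, stages))
# → ['stocks climb', 'markets open']
```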

Run topic modeling from the CLI

biblicus analyze topics --corpus corpora/example --configuration configurations/topic-modeling.yml --extraction-run pipeline:RUN_ID

Topic modeling configurations support cascading composition. Pass multiple --configuration files; later configurations override earlier configurations via a deep merge:

biblicus analyze topics \
  --corpus corpora/example \
  --configuration configurations/topic-modeling/base.yml \
  --configuration configurations/topic-modeling/ag-news.yml \
  --extraction-run pipeline:RUN_ID
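The cascade semantics can be pictured as a recursive deep merge, with later files winning on conflicting keys. A minimal sketch of that behavior (an illustration of the semantics described above, not the Biblicus implementation):

```python
def deep_merge(base, override):
    """Recursively merge two configuration mappings; values from
    `override` win, and nested dicts are merged key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

base = {"bertopic_analysis": {"parameters": {"min_topic_size": 10}}}
ag_news = {"bertopic_analysis": {"parameters": {"nr_topics": 12}}}
print(deep_merge(base, ag_news))
# → {'bertopic_analysis': {'parameters': {'min_topic_size': 10, 'nr_topics': 12}}}
```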

To override the composed configuration view from the command line, use --config key=value with dotted keys:

biblicus analyze topics \
  --corpus corpora/example \
  --configuration configurations/topic-modeling/base.yml \
  --configuration configurations/topic-modeling/ag-news.yml \
  --config bertopic_analysis.parameters.nr_topics=12 \
  --extraction-run pipeline:RUN_ID
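A dotted key addresses a path into the composed configuration. One way to picture how such an override lands (a sketch of the semantics only; the real CLI also handles value types):

```python
def apply_override(config, dotted_key, value):
    """Set `value` at the path named by a dotted key, creating
    intermediate dicts as needed."""
    *path, last = dotted_key.split(".")
    node = config
    for part in path:
        node = node.setdefault(part, {})
    node[last] = value
    return config

config = {"bertopic_analysis": {"parameters": {"nr_topics": 8}}}
apply_override(config, "bertopic_analysis.parameters.nr_topics", 12)
print(config["bertopic_analysis"]["parameters"]["nr_topics"])  # → 12
```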

If you omit --extraction-run, Biblicus uses the latest extraction snapshot and emits a reproducibility warning.

Output structure

Topic modeling writes a single output.json file under the analysis snapshot directory. The output contains:

  • run.snapshot_id and run.stats for reproducible tracking.

  • report.topics with the modeled topics.

  • report.text_collection, report.llm_extraction, report.entity_removal, report.lexical_processing, report.bertopic_analysis, and report.llm_fine_tuning describing each pipeline stage.

When entity removal is enabled, Biblicus also writes entity_removal.jsonl alongside output.json. This artifact contains the redacted documents that feed BERTopic and is reused on subsequent runs for the same snapshot.

Each topic record includes:

  • topic_id: The BERTopic topic identifier. The outlier topic uses -1.

  • label: The human-readable label.

  • label_source: bertopic or llm depending on the stage that set the label.

  • keywords: Keyword list with weights.

  • document_count: Number of documents assigned to the topic.

  • document_ids: Item identifiers for the assigned documents.

  • document_examples: Sampled document text used for inspection.

Per-topic behavior is determined by the BERTopic assignments and the optional fine-tuning stage. The lexical processing flags can substantially change tokenization and therefore the resulting topic labels. The outlier topic_id -1 indicates documents that BERTopic could not confidently assign to a cluster.

Reading a topic record

Each topic includes evidence you can inspect. A shortened example:

{
  "topic_id": 2,
  "label": "global markets and stocks",
  "label_source": "bertopic",
  "keywords": [
    {"keyword": "stocks", "weight": 0.42},
    {"keyword": "market", "weight": 0.37}
  ],
  "document_count": 124,
  "document_examples": [
    "Stocks climbed after the earnings report ...",
    "Markets opened higher as investors ..."
  ]
}

Use document_examples as a sanity check, then trace document_ids back to the corpus for deeper inspection.
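A short triage pass over the topic records helps here, for example separating the outlier bucket from real topics and ranking by size. A sketch (the inlined report stands in for `json.load`-ing output.json from your analysis snapshot directory):

```python
# In practice: report = json.load(open(...))["report"] on output.json.
report = {"topics": [
    {"topic_id": -1, "label": "outliers", "document_count": 37},
    {"topic_id": 2, "label": "global markets and stocks", "document_count": 124},
]}

# Separate the outlier bucket (topic_id -1) from real topics,
# then rank real topics by size for review.
real = [t for t in report["topics"] if t["topic_id"] != -1]
real.sort(key=lambda t: t["document_count"], reverse=True)
for topic in real:
    print(f'{topic["topic_id"]:>3}  {topic["document_count"]:>5}  {topic["label"]}')
```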

Configuration reference

Topic modeling configurations use a strict schema. Unknown fields or type mismatches are errors.

Text source

  • text_source.sample_size: Limit the number of documents used for analysis.

  • text_source.min_text_characters: Drop documents shorter than this count.

LLM extraction

  • llm_extraction.enabled: Enable the LLM extraction stage.

  • llm_extraction.method: single or itemize to control whether an input maps to one or many documents.

  • llm_extraction.client: LLM client configuration (requires biblicus[openai]).

  • llm_extraction.prompt_template: Prompt template for the extraction stage.

  • llm_extraction.system_prompt: Optional system prompt.

Entity removal

  • entity_removal.enabled: Enable local named-entity removal.

  • entity_removal.provider: spacy (required).

  • entity_removal.model: spaCy model name (for example en_core_web_sm).

  • entity_removal.entity_types: Entity labels to remove. Empty uses defaults.

  • entity_removal.replace_with: Replacement text inserted for removed entities.

  • entity_removal.collapse_whitespace: Normalize whitespace after removals.

  • entity_removal.regex_patterns: Optional regex patterns applied after NER.

  • entity_removal.regex_replace_with: Replacement text for regex removals.
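The regex pass runs after NER, so its effect can be sketched with Python's re module alone. The pattern below is illustrative, not a Biblicus default:

```python
import re

def apply_regex_removals(text, patterns, replace_with=""):
    """Apply each removal pattern in order, substituting replace_with,
    then collapse leftover whitespace."""
    for pattern in patterns:
        text = re.sub(pattern, replace_with, text)
    return re.sub(r"\s+", " ", text).strip()

patterns = [r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"]  # e.g. strip email addresses
print(apply_regex_removals("Contact alice@example.com for details.", patterns))
# → "Contact for details."
```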

Lexical processing

  • lexical_processing.enabled: Enable normalization.

  • lexical_processing.lowercase: Lowercase text before tokenization.

  • lexical_processing.strip_punctuation: Remove punctuation before tokenization.

  • lexical_processing.collapse_whitespace: Normalize repeated whitespace.
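The three flags compose into a simple normalization pass. A sketch of the equivalent transform (an illustration of the flag semantics, not the Biblicus code):

```python
import re
import string

def normalize(text, lowercase=True, strip_punctuation=True,
              collapse_whitespace=True):
    """Apply the lexical processing flags in a fixed order."""
    if lowercase:
        text = text.lower()
    if strip_punctuation:
        text = text.translate(str.maketrans("", "", string.punctuation))
    if collapse_whitespace:
        text = re.sub(r"\s+", " ", text).strip()
    return text

print(normalize("Stocks climbed,  sharply!"))  # → "stocks climbed sharply"
```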

BERTopic configuration

  • bertopic_analysis.parameters: Mapping of BERTopic constructor parameters.

  • bertopic_analysis.vectorizer.ngram_range: Inclusive n-gram range (for example [1, 2]).

  • bertopic_analysis.vectorizer.stop_words: english or a list of stop words. Set to null to disable.

LLM fine-tuning

  • llm_fine_tuning.enabled: Enable LLM topic labeling.

  • llm_fine_tuning.client: LLM client configuration.

  • llm_fine_tuning.prompt_template: Prompt template containing {keywords} and {documents}.

  • llm_fine_tuning.system_prompt: Optional system prompt.

  • llm_fine_tuning.max_keywords: Maximum keywords included per prompt.

  • llm_fine_tuning.max_documents: Maximum documents included per prompt.

Vectorizer configuration

Biblicus forwards BERTopic configuration through bertopic_analysis.parameters and exposes vectorizer settings through bertopic_analysis.vectorizer. To include bigrams, set ngram_range to [1, 2]. To remove stop words, set stop_words to english or a list.

bertopic_analysis:
  parameters:
    min_topic_size: 10
    nr_topics: 12
  vectorizer:
    ngram_range: [1, 2]
    stop_words: english
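An inclusive ngram_range of [1, 2] means the vectorizer counts both unigrams and bigrams. A pure-Python illustration of what that token stream looks like (not the actual vectorizer implementation):

```python
def ngrams(tokens, lo, hi):
    """All n-grams for n in the inclusive range [lo, hi]."""
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]

print(ngrams(["stocks", "climbed", "sharply"], 1, 2))
# → ['stocks', 'climbed', 'sharply', 'stocks climbed', 'climbed sharply']
```

Bigrams let phrases like "stocks climbed" surface as single keywords, which often makes topic labels more readable.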

Repeatable integration script

The integration script downloads AG News, runs extraction, and then runs topic modeling with the selected parameters. It prints a summary with the analysis snapshot identifier and the output path.

python scripts/topic_modeling_integration.py --corpus corpora/ag_news_demo --force

Example: raise topic count

python scripts/topic_modeling_integration.py \
  --corpus corpora/ag_news_demo \
  --force \
  --limit 10000 \
  --vectorizer-ngram-min 1 \
  --vectorizer-ngram-max 2 \
  --bertopic-param nr_topics=8 \
  --bertopic-param min_topic_size=2

Example: disable lexical processing and restrict inputs

python scripts/topic_modeling_integration.py \
  --corpus corpora/ag_news_demo \
  --force \
  --sample-size 200 \
  --min-text-characters 200 \
  --no-lexical-enabled

Example: keep lexical processing but preserve punctuation

python scripts/topic_modeling_integration.py \
  --corpus corpora/ag_news_demo \
  --force \
  --no-lexical-strip-punctuation

BERTopic parameters are passed directly to the constructor. Use repeated --bertopic-param key=value pairs for multiple parameters. Values that look like JSON objects or arrays are parsed as JSON.
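The JSON-detection rule can be sketched as a check for object or array syntax before parsing (an illustration of the described behavior, assuming other values pass through as strings; the real script may coerce scalars differently):

```python
import json

def parse_param(raw):
    """Split a key=value pair; values that look like JSON objects or
    arrays are parsed as JSON, everything else stays a string."""
    key, _, value = raw.partition("=")
    if value[:1] in ("{", "["):
        return key, json.loads(value)
    return key, value

print(parse_param('seed_topic_list=[["stocks"], ["sports"]]'))
# → ('seed_topic_list', [['stocks'], ['sports']])
print(parse_param("nr_topics=8"))  # → ('nr_topics', '8')
```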

The integration script requires at least 16 documents to avoid BERTopic default UMAP errors. Increase --limit or use a larger corpus if you receive a small-corpus error.

AG News downloads require the datasets dependency. Install with:

python -m pip install "biblicus[datasets,topic-modeling]"

Tuning workflow

Start with a small sample to validate the pipeline, then scale up:

  1. Run with --limit 500 to validate extraction and output structure.

  2. Add bigrams and stop words to reduce noise in keyword lists.

  3. Increase --limit or --sample-size once topics look stable.

  4. Experiment with nr_topics and min_topic_size to control granularity.

Interpreting results

When a topic looks off, inspect the document_examples and compare them to the keyword list. If the documents do not match the keywords, adjust lexical processing or increase min_topic_size to reduce noise.

Common pitfalls

  • Using too few documents for BERTopic defaults (aim for at least 16).

  • Forgetting to enable stop words and ending up with filler topics.

  • Comparing runs that used different extraction prompts or lexical settings.