Topic modeling
Biblicus provides a topic modeling analysis backend that reads extracted text artifacts, optionally applies an LLM extraction pass, optionally removes named entities, applies lexical processing, runs BERTopic, and optionally applies an LLM fine-tuning pass for labels. The output is structured JSON with explicit per-topic evidence.
What topic modeling does
Topic modeling groups documents into clusters based on shared terms or phrases, then surfaces representative keywords for each cluster. It is a fast way to summarize large corpora, identify dominant themes, and spot outliers without manual labeling. Topic modeling is not a classifier; it is an exploratory tool whose output is evidence that humans can inspect and review.
About BERTopic
BERTopic combines document embeddings with clustering and a class-based term frequency approach to extract topic keywords. Biblicus supports BERTopic as an optional dependency and forwards its configuration parameters directly to the BERTopic constructor. This allows you to tune clustering behavior while keeping the output in a consistent schema.
Pipeline stages
Text collection reads extracted text artifacts from an extraction snapshot.
LLM extraction optionally transforms each document into one or more analysis documents.
Entity removal optionally deletes named entities before modeling.
Lexical processing optionally normalizes text before BERTopic.
BERTopic produces topic assignments and keyword weights.
LLM fine-tuning optionally replaces topic labels based on sampled documents.
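The stage ordering above can be sketched as a simple composition, where each enabled stage maps a list of documents to a list of documents. This is a stdlib-only sketch; the function names are illustrative stand-ins, not the Biblicus API, and the real stages call an LLM, spaCy, and BERTopic.

```python
def run_pipeline(documents, stages):
    """Apply each enabled stage in order; every stage maps a doc list to a doc list."""
    for stage in stages:
        documents = stage(documents)
    return documents

# Illustrative stand-in for the lexical processing stage.
def lexical_processing(docs):
    return [" ".join(d.lower().split()) for d in docs]

docs = run_pipeline(["Stocks  Climbed", "Markets OPENED higher"], [lexical_processing])
print(docs)  # ['stocks climbed', 'markets opened higher']
```

Optional stages simply drop out of the list when disabled, which keeps the ordering fixed.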
Run topic modeling from the CLI
biblicus analyze topics --corpus corpora/example --configuration configurations/topic-modeling.yml --extraction-run pipeline:RUN_ID
Topic modeling configurations support cascading composition. Pass multiple --configuration files; later configurations override earlier ones via a deep merge:
biblicus analyze topics \
--corpus corpora/example \
--configuration configurations/topic-modeling/base.yml \
--configuration configurations/topic-modeling/ag-news.yml \
--extraction-run pipeline:RUN_ID
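The deep-merge semantics can be sketched in a few lines of stdlib Python: nested mappings merge key by key, and any other value from a later configuration wins outright. This is a sketch of the behavior described above, not the Biblicus implementation.

```python
def deep_merge(base, override):
    """Recursively merge override into base; later values win, nested dicts merge key-wise."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(merged.get(key), dict) and isinstance(value, dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

base = {"bertopic_analysis": {"parameters": {"min_topic_size": 10, "nr_topics": 20}}}
override = {"bertopic_analysis": {"parameters": {"nr_topics": 12}}}
print(deep_merge(base, override))
# {'bertopic_analysis': {'parameters': {'min_topic_size': 10, 'nr_topics': 12}}}
```

Note that only mappings merge; a list or scalar in a later file replaces the earlier value entirely.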
To override the composed configuration view from the command line, use --config key=value with dotted keys:
biblicus analyze topics \
--corpus corpora/example \
--configuration configurations/topic-modeling/base.yml \
--configuration configurations/topic-modeling/ag-news.yml \
--config bertopic_analysis.parameters.nr_topics=12 \
--extraction-run pipeline:RUN_ID
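A dotted-key override like the one above walks into the nested configuration and sets the leaf value, parsing the value as JSON where possible so numbers and arrays arrive typed. The sketch below illustrates that rule with stdlib Python; the function name is hypothetical, not the actual CLI code.

```python
import json

def apply_override(config, dotted_key, raw_value):
    """Set a dotted key like 'a.b.c' in a nested dict, JSON-parsing the value when possible."""
    try:
        value = json.loads(raw_value)   # "12" -> 12, "[1, 2]" -> [1, 2]
    except json.JSONDecodeError:
        value = raw_value               # plain strings stay strings
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value
    return config

config = {}
apply_override(config, "bertopic_analysis.parameters.nr_topics", "12")
print(config)  # {'bertopic_analysis': {'parameters': {'nr_topics': 12}}}
```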
If you omit --extraction-run, Biblicus uses the latest extraction snapshot and emits a reproducibility warning.
Output structure
Topic modeling writes a single output.json file under the analysis snapshot directory. The output contains:
run.snapshot_id and run.stats for reproducible tracking.
report.topics with the modeled topics.
report.text_collection, report.llm_extraction, report.entity_removal, report.lexical_processing, report.bertopic_analysis, and report.llm_fine_tuning describing each pipeline stage.
When entity removal is enabled, Biblicus also writes entity_removal.jsonl alongside output.json. This artifact
contains the redacted documents that feed BERTopic and is reused on subsequent runs for the same snapshot.
Each topic record includes:
topic_id: The BERTopic topic identifier. The outlier topic uses -1.
label: The human-readable label.
label_source: bertopic or llm, depending on the stage that set the label.
keywords: Keyword list with weights.
document_count: Number of documents assigned to the topic.
document_ids: Item identifiers for the assigned documents.
document_examples: Sampled document text used for inspection.
Per-topic behavior is determined by the BERTopic assignments and the optional fine-tuning stage. The lexical
processing flags can substantially change tokenization and therefore the resulting topic labels. The outlier
topic_id -1 indicates documents that BERTopic could not confidently assign to a cluster.
Reading a topic record
Each topic includes evidence you can inspect. A shortened example:
{
"topic_id": 2,
"label": "global markets and stocks",
"label_source": "bertopic",
"keywords": [
{"keyword": "stocks", "weight": 0.42},
{"keyword": "market", "weight": 0.37}
],
"document_count": 124,
"document_examples": [
"Stocks climbed after the earnings report ...",
"Markets opened higher as investors ..."
]
}
Use document_examples as a sanity check, then trace document_ids back to the corpus for deeper inspection.
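That sanity check can be scripted: load a topic record and confirm its top keywords actually appear in the sampled documents. The record below is sample data matching the schema above, not real output.

```python
import json

# A shortened topic record like the one shown earlier (sample data).
record = json.loads("""
{
  "topic_id": 2,
  "label": "global markets and stocks",
  "label_source": "bertopic",
  "keywords": [{"keyword": "stocks", "weight": 0.42}, {"keyword": "market", "weight": 0.37}],
  "document_count": 124,
  "document_examples": ["Stocks climbed after the earnings report ..."]
}
""")

# Which of the top keywords show up in each sampled document?
top_keywords = [k["keyword"] for k in record["keywords"]]
for example in record["document_examples"]:
    hits = [kw for kw in top_keywords if kw in example.lower()]
    print(record["topic_id"], hits)  # 2 ['stocks']
```

If most examples match few or no keywords, the topic label is probably unreliable and worth revisiting.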
Configuration reference
Topic modeling configurations use a strict schema. Unknown fields or type mismatches are errors.
Text source
text_source.sample_size: Limit the number of documents used for analysis.
text_source.min_text_characters: Drop documents shorter than this count.
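The interaction between these two settings can be sketched as a filter followed by a cap. The function name and exact semantics here are illustrative, not the Biblicus implementation.

```python
def collect_text(documents, sample_size=None, min_text_characters=0):
    """Drop documents below the length floor, then cap the sample (illustrative sketch)."""
    kept = [d for d in documents if len(d) >= min_text_characters]
    return kept[:sample_size] if sample_size is not None else kept

docs = ["tiny", "a document long enough to keep", "another sufficiently long document"]
print(collect_text(docs, sample_size=1, min_text_characters=10))
# ['a document long enough to keep']
```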
LLM extraction
llm_extraction.enabled: Enable the LLM extraction stage.
llm_extraction.method: single or itemize to control whether an input maps to one or many documents.
llm_extraction.client: LLM client configuration (requires biblicus[openai]).
llm_extraction.prompt_template: Prompt template for the extraction stage.
llm_extraction.system_prompt: Optional system prompt.
Entity removal
entity_removal.enabled: Enable local named-entity removal.
entity_removal.provider: spacy (required).
entity_removal.model: spaCy model name (for example en_core_web_sm).
entity_removal.entity_types: Entity labels to remove. Empty uses defaults.
entity_removal.replace_with: Replacement text inserted for removed entities.
entity_removal.collapse_whitespace: Normalize whitespace after removals.
entity_removal.regex_patterns: Optional regex patterns applied after NER.
entity_removal.regex_replace_with: Replacement text for regex removals.
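The regex pass that runs after NER can be sketched with the stdlib re module; each pattern's matches are replaced, and whitespace is then collapsed. This sketches only the regex stage, not the spaCy NER pass, and the function name is illustrative.

```python
import re

def apply_regex_removal(text, regex_patterns, regex_replace_with=""):
    """Replace each pattern's matches, then collapse leftover whitespace (illustrative)."""
    for pattern in regex_patterns:
        text = re.sub(pattern, regex_replace_with, text)
    return " ".join(text.split())  # collapse_whitespace behavior

# A hypothetical pattern that strips email addresses after NER has run.
print(apply_regex_removal("Contact alice@example.com for details", [r"\S+@\S+"]))
# 'Contact for details'
```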
Lexical processing
lexical_processing.enabled: Enable normalization.
lexical_processing.lowercase: Lowercase text before tokenization.
lexical_processing.strip_punctuation: Remove punctuation before tokenization.
lexical_processing.collapse_whitespace: Normalize repeated whitespace.
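These flags apply in sequence, and their order matters for tokenization. A stdlib sketch of the behavior (illustrative, not the Biblicus implementation):

```python
import string

def lexical_process(text, lowercase=True, strip_punctuation=True, collapse_whitespace=True):
    """Apply the lexical_processing flags in order (an illustrative sketch)."""
    if lowercase:
        text = text.lower()
    if strip_punctuation:
        text = text.translate(str.maketrans("", "", string.punctuation))
    if collapse_whitespace:
        text = " ".join(text.split())
    return text

print(lexical_process("Stocks climbed, after  the Earnings-Report!"))
# 'stocks climbed after the earningsreport'
```

Note how stripping punctuation fuses hyphenated words ("earningsreport"), which is one concrete way these flags change tokenization and therefore the topic labels BERTopic produces.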
BERTopic configuration
bertopic_analysis.parameters: Mapping of BERTopic constructor parameters.
bertopic_analysis.vectorizer.ngram_range: Inclusive n-gram range (for example [1, 2]).
bertopic_analysis.vectorizer.stop_words: english or a list of stop words. Set to null to disable.
LLM fine-tuning
llm_fine_tuning.enabled: Enable LLM topic labeling.
llm_fine_tuning.client: LLM client configuration.
llm_fine_tuning.prompt_template: Prompt template containing {keywords} and {documents}.
llm_fine_tuning.system_prompt: Optional system prompt.
llm_fine_tuning.max_keywords: Maximum keywords included per prompt.
llm_fine_tuning.max_documents: Maximum documents included per prompt.
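How the template, keyword cap, and document cap fit together can be sketched with plain string formatting. The function name is hypothetical; only the {keywords} and {documents} placeholders and the two maxima come from the configuration above.

```python
def build_label_prompt(prompt_template, keywords, documents, max_keywords=10, max_documents=5):
    """Fill {keywords} and {documents}, truncating to the configured maxima (illustrative)."""
    return prompt_template.format(
        keywords=", ".join(k["keyword"] for k in keywords[:max_keywords]),
        documents="\n".join(documents[:max_documents]),
    )

template = "Label a topic with keywords: {keywords}\nSample documents:\n{documents}"
prompt = build_label_prompt(
    template,
    keywords=[{"keyword": "stocks", "weight": 0.42}, {"keyword": "market", "weight": 0.37}],
    documents=["Stocks climbed ...", "Markets opened higher ..."],
    max_keywords=1,
)
print(prompt)
```

Lower maxima keep prompts short and cheap; higher maxima give the LLM more context for ambiguous topics.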
Vectorizer configuration
Biblicus forwards BERTopic configuration through bertopic_analysis.parameters and exposes vectorizer settings
through bertopic_analysis.vectorizer. To include bigrams, set ngram_range to [1, 2]. To remove stop words,
set stop_words to english or a list.
bertopic_analysis:
parameters:
min_topic_size: 10
nr_topics: 12
vectorizer:
ngram_range: [1, 2]
stop_words: english
Repeatable integration script
The integration script downloads AG News, runs extraction, and then runs topic modeling with the selected parameters. It prints a summary with the analysis snapshot identifier and the output path.
python scripts/topic_modeling_integration.py --corpus corpora/ag_news_demo --force
Example: raise topic count
python scripts/topic_modeling_integration.py \
--corpus corpora/ag_news_demo \
--force \
--limit 10000 \
--vectorizer-ngram-min 1 \
--vectorizer-ngram-max 2 \
--bertopic-param nr_topics=8 \
--bertopic-param min_topic_size=2
Example: disable lexical processing and restrict inputs
python scripts/topic_modeling_integration.py \
--corpus corpora/ag_news_demo \
--force \
--sample-size 200 \
--min-text-characters 200 \
--no-lexical-enabled
Example: keep lexical processing but preserve punctuation
python scripts/topic_modeling_integration.py \
--corpus corpora/ag_news_demo \
--force \
--no-lexical-strip-punctuation
BERTopic parameters are passed directly to the constructor. Use repeated --bertopic-param key=value pairs for
multiple parameters. Values that look like JSON objects or arrays are parsed as JSON.
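That parsing rule, a key=value split with JSON detection on the value, can be sketched as follows. This is an illustration of the rule, not the script's actual code.

```python
import json

def parse_bertopic_params(pairs):
    """Turn repeated key=value strings into a parameter mapping, JSON-parsing values."""
    params = {}
    for pair in pairs:
        key, _, raw = pair.partition("=")
        try:
            params[key] = json.loads(raw)   # numbers, booleans, arrays arrive typed
        except json.JSONDecodeError:
            params[key] = raw               # anything else stays a string
    return params

print(parse_bertopic_params(["nr_topics=8", "min_topic_size=2"]))
# {'nr_topics': 8, 'min_topic_size': 2}
```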
The integration script requires at least 16 documents to avoid errors from BERTopic's default UMAP settings. Increase --limit or
use a larger corpus if you receive a small-corpus error.
AG News downloads require the datasets dependency. Install with:
python -m pip install "biblicus[datasets,topic-modeling]"
Tuning workflow
Start with a small sample to validate the pipeline, then scale up:
Run with --limit 500 to validate extraction and output structure.
Add bigrams and stop words to reduce noise in keyword lists.
Increase --limit or --sample-size once topics look stable.
Experiment with nr_topics and min_topic_size to control granularity.
Interpreting results
When a topic looks off, inspect the document_examples and compare them to the keyword list. If the documents do not
match the keywords, adjust lexical processing or increase min_topic_size to reduce noise.
Common pitfalls
Using too few documents for BERTopic defaults (aim for at least 16).
Forgetting to enable stop words and ending up with filler topics.
Comparing runs that used different extraction prompts or lexical settings.