plexus.analysis.topics.analyzer module

Module for performing BERTopic analysis on transformed transcripts.

plexus.analysis.topics.analyzer.analyze_topics(text_file_path: str, output_dir: str, nr_topics: int | None = None, n_gram_range: Tuple[int, int] = (1, 2), min_topic_size: int = 10, top_n_words: int = 10, use_representation_model: bool = True, openai_api_key: str | None = None, representation_model_provider: str = 'openai', representation_model_name: str = 'gpt-4o-mini', transformed_df: DataFrame | None = None, prompt: str | None = None, system_prompt: str | None = None, force_single_representation: bool = True, nr_docs: int = 100, diversity: float = 0.1, doc_length: int = 500, tokenizer: str = 'whitespace', remove_stop_words: bool = False, stop_words_languages: str | List[str] | None = None, custom_stop_words: List[str] | None = None, min_df: int = 1, max_ngrams_per_topic: int = 100, compute_stability: bool = False, stability_n_runs: int = 10, stability_sample_fraction: float = 0.8, compute_hierarchical: bool = False, hierarchical_linkage: str = 'average', hierarchical_orientation: str = 'left') → Dict[str, Any] | None

Perform BERTopic analysis on transformed transcripts.

Args:

text_file_path: Path to text file containing speaking turns
output_dir: Directory to save analysis results
nr_topics: Target number of topics after reduction (default: None, no reduction)
n_gram_range: The lower and upper boundary of the n-gram range (default: (1, 2))
min_topic_size: Minimum size of topics (default: 10)
top_n_words: Number of words per topic (default: 10)
use_representation_model: Whether to use LLM for better topic naming (default: True)
openai_api_key: OpenAI API key for representation model (default: None, uses env var)
representation_model_provider: LLM provider for topic naming (default: “openai”)
representation_model_name: Specific model name for topic naming (default: “gpt-4o-mini”)
transformed_df: DataFrame with transformed data including ids column
prompt: Custom prompt for topic naming (user prompt)
system_prompt: Custom system prompt for topic naming context
force_single_representation: Use only one representation model to avoid duplicate titles
nr_docs: Number of representative documents to select per topic (default: 100)
diversity: Diversity factor for document selection, 0-1 (default: 0.1)
doc_length: Maximum characters per document (default: 500)
tokenizer: Tokenization method for documents (default: “whitespace”)
remove_stop_words: Whether to remove stop words from topics (default: False)
stop_words_languages: Language(s) for stop words - supports any NLTK language(s)
    Can be a string (“english”) or list ([“english”, “spanish”])
    Default: None (uses [“english”, “spanish”] when remove_stop_words=True)
    Available: english, spanish, french, german, portuguese, italian, etc.
custom_stop_words: Optional list of additional stop words to remove (default: None)
min_df: Minimum document frequency for terms (default: 1)
max_ngrams_per_topic: Maximum n-grams to export per topic (default: 100)
compute_stability: Whether to assess topic stability (default: False)
stability_n_runs: Number of bootstrap runs for stability (default: 10)
stability_sample_fraction: Fraction of data to sample per run (default: 0.8)
compute_hierarchical: Whether to generate hierarchical topic structure (default: False)
hierarchical_linkage: Linkage method for hierarchical clustering (default: “average”)
hierarchical_orientation: Orientation for hierarchy visualization (default: “left”)
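The stop_words_languages argument accepts a string, a list, or None. The sketch below shows one way that normalization could work. The helper name resolve_stop_word_languages and the constant are hypothetical and are not part of the module. Only the fallback to English and Spanish comes from the documented default. It is an illustration of the documented behavior, not the module's actual code.

```python
from typing import List, Optional, Union

# Fallback taken from the documented default for stop_words_languages.
DEFAULT_STOP_WORD_LANGUAGES = ["english", "spanish"]


def resolve_stop_word_languages(
    remove_stop_words: bool,
    stop_words_languages: Optional[Union[str, List[str]]] = None,
) -> List[str]:
    """Hypothetical helper: normalize the language argument to a list."""
    if not remove_stop_words:
        return []  # stop-word removal is off, so no languages are needed
    if stop_words_languages is None:
        return list(DEFAULT_STOP_WORD_LANGUAGES)
    if isinstance(stop_words_languages, str):
        return [stop_words_languages]  # a single language given as a string
    return list(stop_words_languages)
```

Under these assumptions, a single string such as "french" becomes a one-element list, and None only expands to the two default languages when remove_stop_words is True.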

Returns:

Dict with keys: ‘topic_model’, ‘topic_info’, ‘topics’, ‘docs’, ‘topic_similarity_metrics’ (optional), ‘topic_stability’ (optional).

Returns None if analysis fails.
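Because the function returns None on failure and two of the result keys are optional, callers may want to guard both cases. The following is a minimal sketch of a call site. The helpers build_analysis_kwargs, summarize_result, and run are hypothetical names introduced for illustration, and the parameter values are arbitrary choices drawn from the documented signature. Only analyze_topics and its keyword names come from the module.

```python
from typing import Any, Dict, Optional


def build_analysis_kwargs(text_file_path: str, output_dir: str) -> Dict[str, Any]:
    """Hypothetical helper: collect a few documented parameters for one run."""
    return {
        "text_file_path": text_file_path,
        "output_dir": output_dir,
        "min_topic_size": 10,
        "use_representation_model": False,  # turn off LLM topic naming
        "remove_stop_words": True,
        "stop_words_languages": ["english"],
        "compute_stability": False,
    }


def summarize_result(result: Optional[Dict[str, Any]]) -> str:
    """Handle the documented None-on-failure return and the optional keys."""
    if result is None:
        return "analysis failed"
    parts = [f"{len(result['docs'])} docs"]
    # The optional keys may be missing (or empty), so use .get() rather than indexing.
    if result.get("topic_stability") is not None:
        parts.append("stability computed")
    if result.get("topic_similarity_metrics") is not None:
        parts.append("similarity metrics computed")
    return ", ".join(parts)


def run(text_file_path: str, output_dir: str) -> str:
    # Imported lazily so this sketch can be loaded without plexus installed.
    from plexus.analysis.topics.analyzer import analyze_topics

    kwargs = build_analysis_kwargs(text_file_path, output_dir)
    return summarize_result(analyze_topics(**kwargs))
```

A call such as run("transcripts.txt", "out") (both paths are placeholders) would then yield a short status string whether or not the analysis succeeded.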

plexus.analysis.topics.analyzer.create_topics_per_class_visualization(topic_model, topics, docs, output_dir=None)

Create a visualization showing topic distribution across different classes.

Args:

topic_model: Fitted BERTopic model (Any)
topics: List of topic assignments
docs: List of documents
output_dir: Directory to save the visualization

Returns:

Path to saved visualization or None if generation fails

plexus.analysis.topics.analyzer.ensure_directory(path: str) → None

Create directory with appropriate permissions if it doesn’t exist.
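The idea of creating a directory only when it is missing can be sketched with the standard library. The function name ensure_directory_sketch is hypothetical, and the 0o755 mode is an assumption, since the documentation does not say which permissions count as appropriate. This is not the module's implementation.

```python
import os


def ensure_directory_sketch(path: str) -> None:
    """Create path (and any missing parents) if it does not already exist."""
    # mode=0o755 is an assumption; the documentation does not state the
    # permissions used. The process umask may further restrict the result.
    os.makedirs(path, mode=0o755, exist_ok=True)
```

With exist_ok=True the call is idempotent, so repeated calls on the same path do not raise.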

plexus.analysis.topics.analyzer.save_complete_topic_ngrams(topic_model, output_dir: str, max_ngrams_per_topic: int = 100) → str

Save complete n-gram lists with c-TF-IDF scores for all topics.

Args:

topic_model: Fitted BERTopic model
output_dir: Directory to save the CSV file
max_ngrams_per_topic: Maximum number of n-grams to save per topic (default: 100)

Returns:

Path to the saved CSV file
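To illustrate what a per-topic n-gram export with a size limit might look like, here is a sketch that works on a plain dictionary instead of a fitted model. The input uses the same shape that BERTopic's get_topics() returns (topic id mapped to a list of (term, score) pairs). The function name write_topic_ngrams, the file name, and the column names are all hypothetical, because the actual CSV layout is not documented here.

```python
import csv
import os
from typing import Dict, List, Tuple


def write_topic_ngrams(
    topic_ngrams: Dict[int, List[Tuple[str, float]]],
    output_dir: str,
    max_ngrams_per_topic: int = 100,
) -> str:
    """Hypothetical stand-in: write (n-gram, score) pairs per topic to a CSV."""
    os.makedirs(output_dir, exist_ok=True)
    csv_path = os.path.join(output_dir, "topic_ngrams.csv")  # assumed file name
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["topic_id", "rank", "ngram", "score"])  # assumed columns
        for topic_id in sorted(topic_ngrams):
            # Highest-scoring n-grams first, truncated to the per-topic limit.
            ranked = sorted(topic_ngrams[topic_id], key=lambda pair: pair[1], reverse=True)
            for rank, (ngram, score) in enumerate(ranked[:max_ngrams_per_topic], start=1):
                writer.writerow([topic_id, rank, ngram, score])
    return csv_path
```

The limit is applied after sorting, so each topic keeps its highest-scoring n-grams when it has more than max_ngrams_per_topic entries.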

plexus.analysis.topics.analyzer.save_topic_info(topic_model, output_dir: str, docs: List[str], topics: List[int]) → None

Save topic information to JSON files.

plexus.analysis.topics.analyzer.save_visualization(fig, filepath: str) → None

Save visualization with error handling.