plexus.analysis.topics package

Topic analysis module for Plexus.

This module contains tools for analyzing topics in text data, including a BERTopic implementation for topic modeling on call transcripts and other text sources.

plexus.analysis.topics.analyze_topics(*args, **kwargs)

Lazy wrapper for analyze_topics to avoid loading PyTorch unless needed.
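A lazy wrapper of this kind typically defers the heavy import to call time; the sketch below illustrates the pattern (the internal module path .analyzer is an assumption for illustration, not the actual package layout):

def analyze_topics(*args, **kwargs):
    # Deferred import: PyTorch (pulled in by BERTopic) is loaded only when
    # topic analysis is actually invoked, keeping module import cheap.
    from .analyzer import analyze_topics as _impl  # hypothetical internal module
    return _impl(*args, **kwargs)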

plexus.analysis.topics.inspect_data(df: DataFrame, content_column: str, num_samples: int = 5) → None

Print sample content from the DataFrame for inspection.

Args:

df: DataFrame containing transcript data
content_column: Name of column containing transcript content
num_samples: Number of samples to print
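Example (a minimal sketch; the DataFrame contents are hypothetical):

import pandas as pd
from plexus.analysis.topics import inspect_data

# Any DataFrame with a text column works; this one is made up for illustration.
df = pd.DataFrame({"text": ["Hello, how can I help you today?",
                            "I need to reset my password."]})
inspect_data(df, content_column="text", num_samples=2)  # prints both samples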

plexus.analysis.topics.test_ollama_chat(model: str = 'gemma3:27b', prompt: str = 'Why is the sky blue?', additional_params: Dict[str, Any] | None = None) → str

Test Ollama LLM chat functionality.

Args:

model: The model to use, defaults to 'gemma3:27b'
prompt: The prompt to send to the model, defaults to 'Why is the sky blue?'
additional_params: Additional parameters to pass to the Ollama API

Returns:

The response content from the model
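Example (assumes a local Ollama server is running and the gemma3:27b model has been pulled):

from plexus.analysis.topics import test_ollama_chat

# Defaults: model='gemma3:27b', prompt='Why is the sky blue?'
reply = test_ollama_chat()
print(reply)

# Override the model and pass extra Ollama options such as temperature.
reply = test_ollama_chat(model="llama3", additional_params={"temperature": 0.2})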

plexus.analysis.topics.transform_transcripts(input_file: str, content_column: str = 'content', customer_only: bool = False, fresh: bool = False, inspect: bool = True, sample_size: int | None = None) → Tuple[str, str, Dict[str, Any], DataFrame | None]

Transform transcript data into BERTopic-compatible format.

Args:

input_file: Path to input Parquet file
content_column: Name of column containing transcript content
customer_only: Whether to filter for customer utterances only
fresh: Whether to force regeneration of cached files
inspect: Whether to print sample data for inspection
sample_size: Number of transcripts to sample from the dataset

Returns:

Tuple of (cached_parquet_path, text_file_path, preprocessing_info, transformed_df)
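Example (a sketch; calls.parquet is a hypothetical input path):

from plexus.analysis.topics import transform_transcripts

cached_parquet, text_file, info, df = transform_transcripts(
    input_file="calls.parquet",  # hypothetical path
    content_column="content",
    customer_only=True,          # keep only customer utterances
    sample_size=1000,            # sample 1,000 transcripts before transforming
)
print(cached_parquet, text_file)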

async plexus.analysis.topics.transform_transcripts_itemize(input_file: str, content_column: str = 'content', prompt_template_file: str = None, prompt_template: str = None, model: str = 'gemma3:27b', provider: str = 'ollama', customer_only: bool = False, fresh: bool = False, inspect: bool = True, max_retries: int = 2, simple_format: bool = True, retry_delay: float = 1.0, openai_api_key: str = None, sample_size: int | None = None, max_workers: int | None = None) → Tuple[str, str, Dict[str, Any], DataFrame | None]

Transform transcript data using a language model with itemization.

This function processes each transcript through a language model to extract structured items, creating multiple rows per transcript.

Args:

input_file: Path to input Parquet file
content_column: Name of column containing transcript content
prompt_template_file: Path to LangChain prompt template file (JSON)
prompt_template: Prompt template string, used instead of prompt_template_file
model: Model to use for transformation (depends on provider)
provider: LLM provider to use ('ollama' or 'openai')
customer_only: Whether to filter for customer utterances only
fresh: Whether to force regeneration of cached files
inspect: Whether to print sample data for inspection
max_retries: Maximum number of retries for parsing failures
retry_delay: Delay between retries in seconds
openai_api_key: OpenAI API key (if provider is 'openai')
sample_size: Number of transcripts to sample from the dataset
max_workers: Maximum number of concurrent workers for LLM calls

Returns:

Tuple of (cached_parquet_path, text_file_path, preprocessing_info, transformed_df)
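Because this function is a coroutine, it must be awaited. A usage sketch (the input path, prompt text, and the {content} placeholder name are assumptions for illustration):

import asyncio
from plexus.analysis.topics import transform_transcripts_itemize

async def main():
    cached_parquet, text_file, info, df = await transform_transcripts_itemize(
        input_file="calls.parquet",  # hypothetical path
        # Inline template (hypothetical); prompt_template_file could instead
        # point to a LangChain JSON template on disk.
        prompt_template="List each distinct customer request:\n{content}",
        model="gemma3:27b",
        provider="ollama",
        max_workers=4,               # bound concurrent LLM calls
    )
    print(info)

asyncio.run(main())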

async plexus.analysis.topics.transform_transcripts_llm(input_file: str, content_column: str = 'content', prompt_template_file: str = None, prompt_template: str = None, model: str = 'gemma3:27b', provider: str = 'ollama', customer_only: bool = False, fresh: bool = False, inspect: bool = True, openai_api_key: str = None, sample_size: int | None = None) → Tuple[str, str, Dict[str, Any], DataFrame | None]

Transform transcript data using a language model.

This function processes each transcript through a language model to extract key information or summarize content before topic analysis.

Args:

input_file: Path to input Parquet file
content_column: Name of column containing transcript content
prompt_template_file: Path to LangChain prompt template file (JSON)
prompt_template: Prompt template string, used instead of prompt_template_file
model: Model to use for transformation (depends on provider)
provider: LLM provider to use ('ollama' or 'openai')
customer_only: Whether to filter for customer utterances only
fresh: Whether to force regeneration of cached files
inspect: Whether to print sample data for inspection
openai_api_key: OpenAI API key (if provider is 'openai')
sample_size: Number of transcripts to sample from the dataset

Returns:

Tuple of (cached_parquet_path, text_file_path, preprocessing_info, transformed_df)
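Like transform_transcripts_itemize, this is a coroutine. A sketch using the OpenAI provider (the input path, template file, and model name are illustrative assumptions):

import asyncio
import os
from plexus.analysis.topics import transform_transcripts_llm

async def main():
    cached_parquet, text_file, info, df = await transform_transcripts_llm(
        input_file="calls.parquet",             # hypothetical path
        prompt_template_file="summarize.json",  # hypothetical LangChain template
        provider="openai",
        model="gpt-4o-mini",
        openai_api_key=os.environ["OPENAI_API_KEY"],
    )
    print(text_file)

asyncio.run(main())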

Submodules