plexus.analysis.topics package

Topic analysis module for Plexus.

This module contains tools for analyzing topics in text data, including a BERTopic implementation for topic modeling on call transcripts and other text sources.

plexus.analysis.topics.analyze_topics(*args, **kwargs)

Lazy wrapper for analyze_topics to avoid loading PyTorch unless needed.
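A lazy wrapper of this kind typically defers the heavy import to call time; the sketch below illustrates the pattern (the internal module path .analyzer is an assumption for illustration, not the actual package layout):

def analyze_topics(*args, **kwargs):
    # Deferred import: PyTorch (pulled in by BERTopic) is loaded only when
    # topic analysis is actually invoked, keeping module import cheap.
    from .analyzer import analyze_topics as _impl  # hypothetical internal module
    return _impl(*args, **kwargs)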

plexus.analysis.topics.inspect_data(df: DataFrame, content_column: str, num_samples: int = 5) → None

Print sample content from the DataFrame for inspection.

Args:

df: DataFrame containing transcript data
content_column: Name of column containing transcript content
num_samples: Number of samples to print
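Example (a minimal sketch; the DataFrame contents are hypothetical):

import pandas as pd
from plexus.analysis.topics import inspect_data

# Any DataFrame with a text column works; this one is made up for illustration.
df = pd.DataFrame({"text": ["Hello, how can I help you today?",
                            "I need to reset my password."]})
inspect_data(df, content_column="text", num_samples=2)  # prints both samples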

plexus.analysis.topics.test_ollama_chat(model: str = 'gemma3:27b', prompt: str = 'Why is the sky blue?', additional_params: Dict[str, Any] | None = None) → str

Test Ollama LLM chat functionality.

Args:

model: The model to use, defaults to 'gemma3:27b'
prompt: The prompt to send to the model, defaults to 'Why is the sky blue?'
additional_params: Additional parameters to pass to the Ollama API

Returns:

The response content from the model
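Example (assumes a local Ollama server is running and the gemma3:27b model has been pulled):

from plexus.analysis.topics import test_ollama_chat

# Defaults: model='gemma3:27b', prompt='Why is the sky blue?'
reply = test_ollama_chat()
print(reply)

# Override the model and pass extra Ollama options such as temperature.
reply = test_ollama_chat(model="llama3", additional_params={"temperature": 0.2})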

plexus.analysis.topics.transform_transcripts(input_file: str, content_column: str = 'content', customer_only: bool = False, fresh: bool = False, inspect: bool = True, sample_size: int | None = None) → Tuple[str, str, Dict[str, Any], DataFrame | None]

Transform transcript data into BERTopic-compatible format.

Args:

input_file: Path to input Parquet file
content_column: Name of column containing transcript content
customer_only: Whether to filter for customer utterances only
fresh: Whether to force regeneration of cached files
inspect: Whether to print sample data for inspection
sample_size: Number of transcripts to sample from the dataset

Returns:

Tuple of (cached_parquet_path, text_file_path, preprocessing_info, transformed_df)
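Example (a sketch; calls.parquet is a hypothetical input path):

from plexus.analysis.topics import transform_transcripts

cached_parquet, text_file, info, df = transform_transcripts(
    input_file="calls.parquet",  # hypothetical path
    content_column="content",
    customer_only=True,          # keep only customer utterances
    sample_size=1000,            # sample 1,000 transcripts before transforming
)
print(cached_parquet, text_file)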

async plexus.analysis.topics.transform_transcripts_itemize(input_file: str, content_column: str = 'content', prompt_template_file: str = None, prompt_template: str = None, model: str = 'gemma3:27b', provider: str = 'ollama', customer_only: bool = False, fresh: bool = False, inspect: bool = True, max_retries: int = 2, simple_format: bool = True, retry_delay: float = 1.0, openai_api_key: str = None, sample_size: int | None = None, max_workers: int | None = None) → Tuple[str, str, Dict[str, Any], DataFrame | None]

Transform transcript data using a language model with itemization.

This function processes each transcript through a language model to extract structured items, creating multiple rows per transcript.

Args:

input_file: Path to input Parquet file
content_column: Name of column containing transcript content
prompt_template_file: Path to LangChain prompt template file (JSON)
prompt_template: Prompt template string, used instead of prompt_template_file
model: Model to use for transformation (depends on provider)
provider: LLM provider to use ('ollama' or 'openai')
customer_only: Whether to filter for customer utterances only
fresh: Whether to force regeneration of cached files
inspect: Whether to print sample data for inspection
max_retries: Maximum number of retries for parsing failures
retry_delay: Delay between retries in seconds
openai_api_key: OpenAI API key (if provider is 'openai')
sample_size: Number of transcripts to sample from the dataset
max_workers: Maximum number of concurrent workers for LLM calls

Returns:

Tuple of (cached_parquet_path, text_file_path, preprocessing_info, transformed_df)
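Because this function is a coroutine, it must be awaited. A usage sketch (the input path, prompt text, and the {content} placeholder name are assumptions for illustration):

import asyncio
from plexus.analysis.topics import transform_transcripts_itemize

async def main():
    cached_parquet, text_file, info, df = await transform_transcripts_itemize(
        input_file="calls.parquet",  # hypothetical path
        # Inline template (hypothetical); prompt_template_file could instead
        # point to a LangChain JSON template on disk.
        prompt_template="List each distinct customer request:\n{content}",
        model="gemma3:27b",
        provider="ollama",
        max_workers=4,               # bound concurrent LLM calls
    )
    print(info)

asyncio.run(main())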

async plexus.analysis.topics.transform_transcripts_llm(input_file: str, content_column: str = 'content', prompt_template_file: str = None, prompt_template: str = None, model: str = 'gemma3:27b', provider: str = 'ollama', customer_only: bool = False, fresh: bool = False, inspect: bool = True, openai_api_key: str = None, sample_size: int | None = None) → Tuple[str, str, Dict[str, Any], DataFrame | None]

Transform transcript data using a language model.

This function processes each transcript through a language model to extract key information or summarize content before topic analysis.

Args:

input_file: Path to input Parquet file
content_column: Name of column containing transcript content
prompt_template_file: Path to LangChain prompt template file (JSON)
prompt_template: Prompt template string, used instead of prompt_template_file
model: Model to use for transformation (depends on provider)
provider: LLM provider to use ('ollama' or 'openai')
customer_only: Whether to filter for customer utterances only
fresh: Whether to force regeneration of cached files
inspect: Whether to print sample data for inspection
openai_api_key: OpenAI API key (if provider is 'openai')
sample_size: Number of transcripts to sample from the dataset

Returns:

Tuple of (cached_parquet_path, text_file_path, preprocessing_info, transformed_df)
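Like transform_transcripts_itemize, this is a coroutine. A sketch using the OpenAI provider (the input path, template file, and model name are illustrative assumptions):

import asyncio
import os
from plexus.analysis.topics import transform_transcripts_llm

async def main():
    cached_parquet, text_file, info, df = await transform_transcripts_llm(
        input_file="calls.parquet",             # hypothetical path
        prompt_template_file="summarize.json",  # hypothetical LangChain template
        provider="openai",
        model="gpt-4o-mini",
        openai_api_key=os.environ["OPENAI_API_KEY"],
    )
    print(text_file)

asyncio.run(main())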

Submodules