plexus.analysis.topics.transformer module
Module for transforming call transcripts into BERTopic-compatible format.
- class plexus.analysis.topics.transformer.SimpleTranscriptItems(*, items: List[str])
Bases: BaseModel
Model for a simple list of strings extracted from a transcript.
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- items: List[str]
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class plexus.analysis.topics.transformer.TranscriptItem(*, question: str, category: str = 'OTHER')
Bases: BaseModel
Model for a single item extracted from a transcript.
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- category: str
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- question: str
- class plexus.analysis.topics.transformer.TranscriptItems(*, items: List[TranscriptItem])
Bases: BaseModel
Model for a list of items extracted from a transcript.
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- items: List[TranscriptItem]
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
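As a rough illustration of the shapes these three models describe (a sketch using stdlib dataclasses, not the actual plexus code; the real classes are pydantic BaseModel subclasses that validate input and raise ValidationError on bad data):

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical stdlib sketch of the models above; names suffixed with
# "Sketch" to make clear these are not the real pydantic classes.

@dataclass
class TranscriptItemSketch:
    question: str
    category: str = "OTHER"  # default category, as in TranscriptItem

@dataclass
class TranscriptItemsSketch:
    items: List[TranscriptItemSketch] = field(default_factory=list)

@dataclass
class SimpleTranscriptItemsSketch:
    items: List[str] = field(default_factory=list)

item = TranscriptItemSketch(question="Why was the order delayed?")
print(item.category)  # the default applies when no category is given
```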
- plexus.analysis.topics.transformer.apply_customer_only_filter(df: DataFrame, content_column: str, customer_only: bool) DataFrame
Apply customer-only filter to a DataFrame of transcripts if requested.
- Args:
df: DataFrame containing transcript data
content_column: Name of column containing transcript content
customer_only: Whether to filter for customer utterances only
- Returns:
DataFrame with filtered content if customer_only is True, otherwise original DataFrame
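To illustrate the pass-through-versus-filter behavior without depending on pandas or the plexus internals, here is a minimal sketch over a list of dict rows. It assumes transcript lines carry speaker labels like "Customer:" and "Agent:", which is an assumption for illustration; the actual label format is not documented here:

```python
import re
from typing import Dict, List

def apply_customer_only_filter_sketch(rows: List[Dict[str, str]],
                                      content_column: str,
                                      customer_only: bool) -> List[Dict[str, str]]:
    """Sketch: when customer_only is True, keep only customer utterances
    in each row's content; otherwise return the rows unchanged."""
    if not customer_only:
        return rows
    out = []
    for row in rows:
        lines = row[content_column].splitlines()
        kept = [re.sub(r"^Customer:\s*", "", ln)
                for ln in lines if ln.startswith("Customer:")]
        out.append({**row, content_column: " ".join(kept)})
    return out

rows = [{"content": "Agent: Hello!\nCustomer: My order is late.\nAgent: Sorry."}]
print(apply_customer_only_filter_sketch(rows, "content", True))
```

With customer_only=False the input rows are returned untouched, matching the documented contract.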
- plexus.analysis.topics.transformer.extract_customer_only(text: str) str
Extract only the customer utterances from transcript text.
- Args:
text: Raw transcript text
- Returns:
String containing only customer utterances concatenated together
- plexus.analysis.topics.transformer.extract_speaking_turns(text: str) List[str]
Extract customer speaking turns from transcript text.
- Args:
text: Raw transcript text
- Returns:
List of customer speaking turns
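A sketch of how speaking-turn extraction might work, assuming "Speaker:" labels mark turn boundaries and that a turn may span multiple lines until the next label (both assumptions; the real parser may differ):

```python
import re
from typing import List

def extract_speaking_turns_sketch(text: str) -> List[str]:
    """Sketch: collect customer speaking turns, where a turn starts at a
    'Customer:' label and runs until the next speaker label."""
    turns: List[str] = []
    current = None  # lines of the in-progress customer turn, if any
    for line in text.splitlines():
        m = re.match(r"^(\w+):\s*(.*)$", line)
        if m:
            speaker, utterance = m.group(1), m.group(2)
            if current is not None:
                turns.append(" ".join(current))
            current = [utterance] if speaker == "Customer" else None
        elif current is not None:
            current.append(line.strip())  # continuation of the same turn
    if current is not None:
        turns.append(" ".join(current))
    return turns

transcript = "Agent: Hi there.\nCustomer: My package\nnever arrived.\nAgent: Let me check."
print(extract_speaking_turns_sketch(transcript))
```

Joining the returned turns with a separator would give the concatenated string that extract_customer_only describes.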
- plexus.analysis.topics.transformer.inspect_data(df: DataFrame, content_column: str, num_samples: int = 5) None
Print sample content from the DataFrame for inspection.
- Args:
df: DataFrame containing transcript data
content_column: Name of column containing transcript content
num_samples: Number of samples to print
- plexus.analysis.topics.transformer.transform_transcripts(input_file: str, content_column: str = 'content', customer_only: bool = False, fresh: bool = False, inspect: bool = True, sample_size: int | None = None) Tuple[str, str, Dict[str, Any], DataFrame | None]
Transform transcript data into BERTopic-compatible format.
- Args:
input_file: Path to input Parquet file
content_column: Name of column containing transcript content
customer_only: Whether to filter for customer utterances only
fresh: Whether to force regeneration of cached files
inspect: Whether to print sample data for inspection
sample_size: Number of transcripts to sample from the dataset
- Returns:
Tuple of (cached_parquet_path, text_file_path, preprocessing_info, transformed_df)
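Because the function returns a cached_parquet_path and takes a fresh flag to force regeneration, the cache key must depend on the inputs that affect the output. The following is an illustrative sketch of such a scheme, not plexus's actual one:

```python
import hashlib
import os

def cached_path_sketch(input_file: str, content_column: str,
                       customer_only: bool,
                       cache_dir: str = "topic_cache") -> str:
    """Hypothetical cache-key scheme: hash the inputs that affect the
    transformed output so each option combination gets its own file."""
    key = f"{os.path.abspath(input_file)}|{content_column}|{customer_only}"
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return os.path.join(cache_dir, f"transformed_{digest}.parquet")

p1 = cached_path_sketch("calls.parquet", "content", False)
p2 = cached_path_sketch("calls.parquet", "content", True)
print(p1 != p2)  # different options map to different cache files
```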
- async plexus.analysis.topics.transformer.transform_transcripts_itemize(input_file: str, content_column: str = 'content', prompt_template_file: str = None, prompt_template: str = None, model: str = 'gemma3:27b', provider: str = 'ollama', customer_only: bool = False, fresh: bool = False, inspect: bool = True, max_retries: int = 2, simple_format: bool = True, retry_delay: float = 1.0, openai_api_key: str = None, sample_size: int | None = None, max_workers: int | None = None) Tuple[str, str, Dict[str, Any], DataFrame | None]
Transform transcript data using a language model with itemization.
This function processes each transcript through a language model to extract structured items, creating multiple rows per transcript.
- Args:
input_file: Path to input Parquet file
content_column: Name of column containing transcript content
prompt_template_file: Path to LangChain prompt template file (JSON)
prompt_template: Prompt template string (used instead of prompt_template_file)
model: Model to use for transformation (depends on provider)
provider: LLM provider to use ('ollama' or 'openai')
customer_only: Whether to filter for customer utterances only
fresh: Whether to force regeneration of cached files
inspect: Whether to print sample data for inspection
max_retries: Maximum number of retries for parsing failures
simple_format: Whether to extract items as a simple list of strings (SimpleTranscriptItems) rather than structured items
retry_delay: Delay between retries in seconds
openai_api_key: OpenAI API key (if provider is 'openai')
sample_size: Number of transcripts to sample from the dataset
max_workers: Maximum number of concurrent workers for processing transcripts
- Returns:
Tuple of (cached_parquet_path, text_file_path, preprocessing_info, transformed_df)
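The max_retries and retry_delay parameters imply a retry loop around parsing the model's output. A minimal, self-contained sketch of that pattern (illustrative only; the helper names and the ValueError stand-in for a parsing failure are assumptions, not plexus APIs):

```python
import time
from typing import Callable, List

def parse_with_retries_sketch(call_llm: Callable[[], str],
                              parse: Callable[[str], List[str]],
                              max_retries: int = 2,
                              retry_delay: float = 1.0) -> List[str]:
    """Re-invoke the model when its output cannot be parsed, up to
    max_retries extra attempts, sleeping retry_delay between them."""
    last_error: Exception = ValueError("no attempts made")
    for attempt in range(max_retries + 1):
        try:
            return parse(call_llm())
        except ValueError as exc:  # stand-in for a parsing failure
            last_error = exc
            if attempt < max_retries:
                time.sleep(retry_delay)
    raise last_error

def parse_items(s: str) -> List[str]:
    # Toy parser: expects semicolon-separated items.
    if ";" not in s:
        raise ValueError("unparseable model output")
    return s.split(";")

# Simulated model that fails once, then returns parseable output.
responses = iter(["not parseable", "a;b;c"])
result = parse_with_retries_sketch(lambda: next(responses), parse_items,
                                   retry_delay=0.0)
print(result)  # ['a', 'b', 'c']
```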
- async plexus.analysis.topics.transformer.transform_transcripts_llm(input_file: str, content_column: str = 'content', prompt_template_file: str = None, prompt_template: str = None, model: str = 'gemma3:27b', provider: str = 'ollama', customer_only: bool = False, fresh: bool = False, inspect: bool = True, openai_api_key: str = None, sample_size: int | None = None) Tuple[str, str, Dict[str, Any], DataFrame | None]
Transform transcript data using a language model.
This function processes each transcript through a language model to extract key information or summarize content before topic analysis.
- Args:
input_file: Path to input Parquet file
content_column: Name of column containing transcript content
prompt_template_file: Path to LangChain prompt template file (JSON)
prompt_template: Prompt template string (used instead of prompt_template_file)
model: Model to use for transformation (depends on provider)
provider: LLM provider to use ('ollama' or 'openai')
customer_only: Whether to filter for customer utterances only
fresh: Whether to force regeneration of cached files
inspect: Whether to print sample data for inspection
openai_api_key: OpenAI API key (if provider is 'openai')
sample_size: Number of transcripts to sample from the dataset
- Returns:
Tuple of (cached_parquet_path, text_file_path, preprocessing_info, transformed_df)