plexus.analysis.topics.transformer module

Module for transforming call transcripts into BERTopic-compatible format.

class plexus.analysis.topics.transformer.SimpleTranscriptItems(*, items: List[str])

Bases: BaseModel

Model for a simple list of strings extracted from a transcript.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

items: List[str]
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class plexus.analysis.topics.transformer.TranscriptItem(*, question: str, category: str = 'OTHER')

Bases: BaseModel

Model for a single item extracted from a transcript.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

category: str
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

question: str

class plexus.analysis.topics.transformer.TranscriptItems(*, items: List[TranscriptItem])

Bases: BaseModel

Model for a list of items extracted from a transcript.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

items: List[TranscriptItem]
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
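
Example: a minimal sketch of constructing these models directly, based on the signatures above. The 'BILLING' category value is illustrative only; the module does not document a fixed category vocabulary.

    from pydantic import ValidationError

    from plexus.analysis.topics.transformer import (
        SimpleTranscriptItems,
        TranscriptItem,
        TranscriptItems,
    )

    # Structured format: each item carries a question plus a category
    # (which defaults to 'OTHER').
    structured = TranscriptItems(items=[
        TranscriptItem(question="Why was I double-billed?", category="BILLING"),
        TranscriptItem(question="How do I reset my password?"),
    ])

    # Simple format: a bare list of strings.
    simple = SimpleTranscriptItems(items=["double billing", "password reset"])

    # Invalid input raises pydantic's ValidationError.
    try:
        SimpleTranscriptItems(items="not a list")
    except ValidationError as exc:
        print(exc)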

plexus.analysis.topics.transformer.apply_customer_only_filter(df: DataFrame, content_column: str, customer_only: bool) → DataFrame

Apply customer-only filter to a DataFrame of transcripts if requested.

Args:

df: DataFrame containing transcript data
content_column: Name of column containing transcript content
customer_only: Whether to filter for customer utterances only

Returns:

DataFrame with filtered content if customer_only is True, otherwise original DataFrame
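
Example: a minimal sketch, assuming an "Agent:"/"Customer:" speaker-label layout in the content column; the module does not document the exact transcript format it recognizes.

    import pandas as pd

    from plexus.analysis.topics.transformer import apply_customer_only_filter

    df = pd.DataFrame({
        "content": [
            "Agent: How can I help?\nCustomer: I need to cancel my order.",
        ]
    })

    # customer_only=False returns the DataFrame unchanged.
    same_df = apply_customer_only_filter(df, "content", customer_only=False)

    # customer_only=True reduces each row's content to customer utterances.
    filtered = apply_customer_only_filter(df, "content", customer_only=True)
    print(filtered["content"].iloc[0])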

plexus.analysis.topics.transformer.extract_customer_only(text: str) → str

Extract only the customer utterances from transcript text.

Args:

text: Raw transcript text

Returns:

String containing only customer utterances concatenated together
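
Example: a minimal sketch; the "Agent:"/"Customer:" labels are an assumed transcript layout.

    from plexus.analysis.topics.transformer import extract_customer_only

    transcript = (
        "Agent: Thanks for calling. How can I help?\n"
        "Customer: I was double-billed last month.\n"
        "Agent: Let me check that for you.\n"
        "Customer: Thank you."
    )

    # Expected: a single string containing only the customer's lines.
    print(extract_customer_only(transcript))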

plexus.analysis.topics.transformer.extract_speaking_turns(text: str) → List[str]

Extract customer speaking turns from transcript text.

Args:

text: Raw transcript text

Returns:

List of customer speaking turns
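
Example: like extract_customer_only, but returning one element per customer turn (transcript layout assumed as above).

    from plexus.analysis.topics.transformer import extract_speaking_turns

    transcript = (
        "Agent: How can I help?\n"
        "Customer: I was double-billed last month.\n"
        "Customer: And I'd like a refund."
    )

    # Expected: one list element per customer speaking turn.
    for turn in extract_speaking_turns(transcript):
        print(turn)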

plexus.analysis.topics.transformer.inspect_data(df: DataFrame, content_column: str, num_samples: int = 5) → None

Print sample content from the DataFrame for inspection.

Args:

df: DataFrame containing transcript data
content_column: Name of column containing transcript content
num_samples: Number of samples to print
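
Example: a quick sanity check before running a transformation; "transcripts.parquet" is a hypothetical path.

    import pandas as pd

    from plexus.analysis.topics.transformer import inspect_data

    df = pd.read_parquet("transcripts.parquet")  # hypothetical input file
    inspect_data(df, content_column="content", num_samples=3)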

plexus.analysis.topics.transformer.transform_transcripts(input_file: str, content_column: str = 'content', customer_only: bool = False, fresh: bool = False, inspect: bool = True, sample_size: int | None = None) → Tuple[str, str, Dict[str, Any], DataFrame | None]

Transform transcript data into BERTopic-compatible format.

Args:

input_file: Path to input Parquet file
content_column: Name of column containing transcript content
customer_only: Whether to filter for customer utterances only
fresh: Whether to force regeneration of cached files
inspect: Whether to print sample data for inspection
sample_size: Number of transcripts to sample from the dataset

Returns:

Tuple of (cached_parquet_path, text_file_path, preprocessing_info, transformed_df)
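
Example: a minimal sketch of the non-LLM transformation; "transcripts.parquet" is a hypothetical input path.

    from plexus.analysis.topics.transformer import transform_transcripts

    parquet_path, text_path, info, df = transform_transcripts(
        "transcripts.parquet",      # hypothetical input file
        content_column="content",
        customer_only=True,         # keep only customer utterances
        fresh=True,                 # ignore any previously cached output
        sample_size=1000,           # work on a 1000-transcript sample
    )
    print(parquet_path)   # cached Parquet output
    print(text_path)      # BERTopic-ready text file
    print(info)           # preprocessing metadata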

async plexus.analysis.topics.transformer.transform_transcripts_itemize(input_file: str, content_column: str = 'content', prompt_template_file: str = None, prompt_template: str = None, model: str = 'gemma3:27b', provider: str = 'ollama', customer_only: bool = False, fresh: bool = False, inspect: bool = True, max_retries: int = 2, simple_format: bool = True, retry_delay: float = 1.0, openai_api_key: str = None, sample_size: int | None = None, max_workers: int | None = None) → Tuple[str, str, Dict[str, Any], DataFrame | None]

Transform transcript data using a language model with itemization.

This function processes each transcript through a language model to extract structured items, creating multiple rows per transcript.

Args:

input_file: Path to input Parquet file
content_column: Name of column containing transcript content
prompt_template_file: Path to LangChain prompt template file (JSON)
prompt_template: Inline prompt template string, used instead of prompt_template_file
model: Model to use for transformation (depends on provider)
provider: LLM provider to use ('ollama' or 'openai')
customer_only: Whether to filter for customer utterances only
fresh: Whether to force regeneration of cached files
inspect: Whether to print sample data for inspection
max_retries: Maximum number of retries for parsing failures
simple_format: Whether to parse responses as a simple list of strings (SimpleTranscriptItems) rather than structured items
retry_delay: Delay between retries in seconds
openai_api_key: OpenAI API key (if provider is 'openai')
sample_size: Number of transcripts to sample from the dataset
max_workers: Maximum number of concurrent transcript transformations

Returns:

Tuple of (cached_parquet_path, text_file_path, preprocessing_info, transformed_df)
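
Example: a minimal async sketch against a local Ollama model. The file paths are hypothetical, and the prompt template contents are not documented here.

    import asyncio

    from plexus.analysis.topics.transformer import transform_transcripts_itemize

    async def main():
        parquet_path, text_path, info, df = await transform_transcripts_itemize(
            "transcripts.parquet",                       # hypothetical input file
            content_column="content",
            prompt_template_file="itemize_prompt.json",  # hypothetical LangChain template
            model="gemma3:27b",
            provider="ollama",
            simple_format=True,   # parse responses as a simple list of strings
            max_workers=4,        # bound concurrent LLM calls
        )
        # Each extracted item becomes its own row in the returned DataFrame.
        if df is not None:
            print(df.head())

    asyncio.run(main())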

async plexus.analysis.topics.transformer.transform_transcripts_llm(input_file: str, content_column: str = 'content', prompt_template_file: str = None, prompt_template: str = None, model: str = 'gemma3:27b', provider: str = 'ollama', customer_only: bool = False, fresh: bool = False, inspect: bool = True, openai_api_key: str = None, sample_size: int | None = None) → Tuple[str, str, Dict[str, Any], DataFrame | None]

Transform transcript data using a language model.

This function processes each transcript through a language model to extract key information or summarize content before topic analysis.

Args:

input_file: Path to input Parquet file
content_column: Name of column containing transcript content
prompt_template_file: Path to LangChain prompt template file (JSON)
prompt_template: Inline prompt template string, used instead of prompt_template_file
model: Model to use for transformation (depends on provider)
provider: LLM provider to use ('ollama' or 'openai')
customer_only: Whether to filter for customer utterances only
fresh: Whether to force regeneration of cached files
inspect: Whether to print sample data for inspection
openai_api_key: OpenAI API key (if provider is 'openai')
sample_size: Number of transcripts to sample from the dataset

Returns:

Tuple of (cached_parquet_path, text_file_path, preprocessing_info, transformed_df)
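
Example: a minimal async sketch using the 'openai' provider. The model name and the {text} placeholder in the inline prompt are assumptions, not documented by this module.

    import asyncio
    import os

    from plexus.analysis.topics.transformer import transform_transcripts_llm

    async def main():
        parquet_path, text_path, info, df = await transform_transcripts_llm(
            "transcripts.parquet",   # hypothetical input file
            prompt_template=(
                "Summarize the key customer issues in this call:\n\n{text}"
            ),
            model="gpt-4o-mini",     # illustrative OpenAI model name
            provider="openai",
            openai_api_key=os.environ["OPENAI_API_KEY"],
        )
        print(text_path)  # summarized, BERTopic-ready text file

    asyncio.run(main())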