plexus.analysis.topic_clusterer module

TopicClusterer: clusters pre-computed embeddings via BERTopic (UMAP + HDBSCAN).

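Embedding is decoupled from clustering: the clusterer operates on embeddings that have already been computed. It computes per-cluster centroids, 95th-percentile (p95) distance boundaries, and LLM-generated labels. Used by the VectorTopicMemory ReportBlock.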

class plexus.analysis.topic_clusterer.TopicClusterer(min_topic_size: int = 10, umap_n_components: int = 5, umap_min_dist: float = 0.0, umap_metric: str = 'cosine', label_generator: Callable[[List[str]], str] | None = None)

Bases: object

Clusters pre-computed embeddings via BERTopic (UMAP + HDBSCAN). Computes centroids, p95 distance boundaries, and labels.

__init__(min_topic_size: int = 10, umap_n_components: int = 5, umap_min_dist: float = 0.0, umap_metric: str = 'cosine', label_generator: Callable[[List[str]], str] | None = None)

cluster(embeddings: ndarray, documents: List[str], min_topic_size: int | None = None, min_samples: int | None = None, cluster_selection_method: str = 'leaf', cluster_selection_epsilon: float = 0.5) → Tuple[ndarray, str]

Cluster embeddings via BERTopic. Returns a (topic_ids, cluster_version) tuple. In topic_ids, a value of -1 marks an outlier.
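Because topic_ids is aligned with the documents passed to cluster(), callers often need to group original positions by topic while skipping outliers. The following is a minimal sketch of that step, not part of the module: the helper name group_indices_by_topic and the sample topic_ids are hypothetical, and a plain list stands in for the returned ndarray.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def group_indices_by_topic(topic_ids: Sequence[int]) -> Dict[int, List[int]]:
    """Map each non-outlier topic id to the original indices of its members."""
    groups: Dict[int, List[int]] = defaultdict(list)
    for index, topic_id in enumerate(topic_ids):
        if topic_id == -1:  # -1 marks an outlier, so it belongs to no cluster
            continue
        groups[int(topic_id)].append(index)
    return dict(groups)


# Hypothetical topic_ids, standing in for the first element returned by cluster().
example_topic_ids = [0, 1, -1, 0, 1, 1]
members_by_topic = group_indices_by_topic(example_topic_ids)
```

The resulting indices can be used to look up documents, doc_ids, or other per-document metadata held by the caller.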

cluster_boundaries() → Dict[int, float]

Return p95 cosine distance from members to centroid per cluster.
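The boundary computation can be sketched as follows for a single cluster. This is an illustration under stated assumptions, not the module's implementation: the function names are hypothetical, plain lists stand in for ndarrays, vectors are assumed to be non-zero, and the percentile uses linear interpolation between ranks, which the source does not specify.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity. Assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def percentile(values: Sequence[float], q: float) -> float:
    """Return the q-th percentile (0-100) using linear interpolation (an assumption)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile() requires at least one value")
    rank = (len(ordered) - 1) * q / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def p95_boundary(members: Sequence[Sequence[float]], centroid: Sequence[float]) -> float:
    """Return the 95th-percentile cosine distance from members to the centroid."""
    distances: List[float] = [cosine_distance(m, centroid) for m in members]
    return percentile(distances, 95.0)


# Hypothetical cluster: two members aligned with the centroid, one orthogonal to it.
example_members = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
example_boundary = p95_boundary(example_members, [1.0, 0.0])
```

A boundary of this kind gives a per-cluster radius: a new embedding whose cosine distance to the centroid exceeds it lies farther out than 95% of the existing members.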

cluster_centroids() → Dict[int, ndarray]

Return centroid (mean of member embeddings) per non-outlier cluster.
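As a minimal sketch of the centroid definition above (the mean of member embeddings, with outliers excluded), assuming plain lists in place of ndarrays and a hypothetical helper name mean_centroids:

```python
from typing import Dict, List, Sequence


def mean_centroids(
    embeddings: Sequence[Sequence[float]],
    topic_ids: Sequence[int],
) -> Dict[int, List[float]]:
    """Return the element-wise mean embedding for each non-outlier topic."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for vector, topic_id in zip(embeddings, topic_ids):
        if topic_id == -1:  # outliers do not contribute to any centroid
            continue
        if topic_id not in sums:
            sums[topic_id] = [0.0] * len(vector)
            counts[topic_id] = 0
        sums[topic_id] = [s + x for s, x in zip(sums[topic_id], vector)]
        counts[topic_id] += 1
    return {t: [s / counts[t] for s in total] for t, total in sums.items()}


# Hypothetical 2-D embeddings; the third row is an outlier and is ignored.
example_embeddings = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [3.0, 0.0]]
example_topic_ids = [0, 1, -1, 0]
example_centroids = mean_centroids(example_embeddings, example_topic_ids)
```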

generate_labels(clusters: Dict[int, Dict[str, Any]] | None = None) → Dict[int, str]

Generate human-readable labels per cluster via LLM (or mock).
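For testing without an LLM, a mock can be supplied through the label_generator constructor argument, whose type is Callable[[List[str]], str]. The sketch below is one hypothetical mock matching that signature. What the list of strings contains (keywords, exemplar texts, or something else) is not stated here, so the mock only assumes it receives short text snippets.

```python
from typing import List


def mock_label_generator(texts: List[str]) -> str:
    """Build a deterministic label from the first few non-empty inputs."""
    cleaned = [text.strip() for text in texts if text.strip()]
    if not cleaned:
        return "Unlabeled topic"
    return " / ".join(cleaned[:3])


# Hypothetical inputs; the real contents depend on what the clusterer passes in.
example_label = mock_label_generator(["refund", "delivery", "late", "order"])

# Passing the mock to the constructor would look like this (not executed here):
#   clusterer = TopicClusterer(label_generator=mock_label_generator)
```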

get_cluster_records() → List[Dict[str, Any]]

Build cluster records for vector store persistence.

get_keywords(topic_id: int, n: int = 8) → List[str]

Extract top keywords for a cluster via TF-IDF on cluster documents.
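The exact TF-IDF variant is not specified here. As a rough illustration of the idea, the sketch below treats each cluster's documents as one combined document, scores terms by relative frequency times a smoothed inverse document frequency across clusters, and returns the top n. The tokenizer, the smoothing formula, the tie-breaking rule, and the function names are all assumptions.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_keywords(cluster_docs: Dict[int, List[str]], topic_id: int, n: int = 8) -> List[str]:
    """Rank the terms of one cluster by a simple cluster-level TF-IDF score."""
    # One term-count table per cluster, pooling all of its documents.
    term_counts = {
        t: Counter(token for doc in docs for token in tokenize(doc))
        for t, docs in cluster_docs.items()
    }
    num_clusters = len(term_counts)
    target = term_counts[topic_id]
    total_terms = sum(target.values())

    scores: Dict[str, float] = {}
    for term, count in target.items():
        clusters_with_term = sum(1 for counts in term_counts.values() if term in counts)
        # Smoothed IDF (an assumption): terms shared by every cluster get weight 1.0.
        idf = math.log((1 + num_clusters) / (1 + clusters_with_term)) + 1.0
        scores[term] = (count / total_terms) * idf

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:n]


# Hypothetical documents grouped by topic id.
example_docs = {
    0: ["refund for late delivery", "refund not received"],
    1: ["login password reset", "password reset email not received"],
}
example_keywords = top_keywords(example_docs, topic_id=0, n=3)
```

Terms that occur in every cluster, such as "not" and "received" in the sample data, receive the lowest IDF weight and so rank below cluster-specific terms.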

get_representative_exemplars(topic_id: int, n: int = 5) → List[Tuple[int, str]]

Return (original_index, text) pairs for the n docs closest to the centroid.

The original_index is the position in the list passed to cluster(), allowing callers to map back to doc_ids, item IDs, or other per-document metadata.
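The selection and the index mapping can be sketched as follows. This is an illustration only: the function names are hypothetical, plain lists stand in for ndarrays, and "closest" is assumed to mean smallest cosine distance, which is not stated here.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity. Assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def closest_to_centroid(
    embeddings: Sequence[Sequence[float]],
    documents: Sequence[str],
    topic_ids: Sequence[int],
    topic_id: int,
    centroid: Sequence[float],
    n: int = 5,
) -> List[Tuple[int, str]]:
    """Return (original_index, text) for the n members nearest the centroid."""
    scored = [
        (cosine_distance(vector, centroid), index)
        for index, (vector, tid) in enumerate(zip(embeddings, topic_ids))
        if tid == topic_id
    ]
    scored.sort()  # ascending distance; ties fall back to the original index
    return [(index, documents[index]) for _, index in scored[:n]]


# Hypothetical data: four documents, three of them in topic 0.
example_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]
example_documents = ["a", "b", "c", "d"]
example_topic_ids = [0, 1, 0, 0]
example_doc_ids = ["id-a", "id-b", "id-c", "id-d"]  # caller-side metadata

exemplars = closest_to_centroid(
    example_embeddings, example_documents, example_topic_ids,
    topic_id=0, centroid=[1.0, 0.0], n=2,
)
# Map the exemplars back to caller-side identifiers via original_index.
exemplar_doc_ids = [example_doc_ids[index] for index, _ in exemplars]
```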