plexus.analysis.topic_clusterer module
TopicClusterer: clusters pre-computed embeddings via BERTopic (UMAP + HDBSCAN).
Decouples embedding from clustering: embeddings are computed elsewhere and passed in. Computes per-cluster centroids, p95 distance boundaries, and LLM-generated labels. Used by the VectorTopicMemory ReportBlock.
- class plexus.analysis.topic_clusterer.TopicClusterer(min_topic_size: int = 10, umap_n_components: int = 5, umap_min_dist: float = 0.0, umap_metric: str = 'cosine', label_generator: Callable[[List[str]], str] | None = None)
Bases: object
Clusters pre-computed embeddings via BERTopic (UMAP + HDBSCAN). Computes centroids, p95 distance boundaries, and labels.
- __init__(min_topic_size: int = 10, umap_n_components: int = 5, umap_min_dist: float = 0.0, umap_metric: str = 'cosine', label_generator: Callable[[List[str]], str] | None = None)
- cluster(embeddings: ndarray, documents: List[str], min_topic_size: int | None = None, min_samples: int | None = None, cluster_selection_method: str = 'leaf', cluster_selection_epsilon: float = 0.5) → Tuple[ndarray, str]
Cluster embeddings via BERTopic. Returns (topic_ids, cluster_version). In topic_ids, -1 marks an outlier.
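Because topic_ids carries one entry per input document and uses -1 for outliers, callers typically need to group document positions by topic before doing anything else. The following is a minimal sketch of that step, not part of the module: the helper name group_by_topic and the sample data are hypothetical, and a plain list stands in for the ndarray that cluster() returns.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def group_by_topic(topic_ids: Sequence[int], documents: Sequence[str]) -> Dict[int, List[int]]:
    """Map each non-outlier topic id to the positions of its documents."""
    if len(topic_ids) != len(documents):
        raise ValueError("topic_ids and documents must have the same length")
    groups: Dict[int, List[int]] = defaultdict(list)
    for index, topic_id in enumerate(topic_ids):
        if topic_id == -1:
            # -1 marks an outlier, so it belongs to no cluster.
            continue
        groups[int(topic_id)].append(index)
    return dict(groups)


# Hypothetical stand-in for the first element of the tuple returned by cluster().
topic_ids = [0, 1, -1, 0, 1]
documents = ["late refund", "login error", "hello", "refund denied", "password reset"]
groups = group_by_topic(topic_ids, documents)
```

The values are positions in the documents list, so they can be used to look up any other per-document metadata kept in the same order.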
- cluster_boundaries() → Dict[int, float]
Return, for each cluster, the 95th-percentile (p95) cosine distance from its members to its centroid.
- cluster_centroids() → Dict[int, ndarray]
Return centroid (mean of member embeddings) per non-outlier cluster.
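The centroid and p95 boundary described for cluster_centroids() and cluster_boundaries() can be illustrated with a short dependency-free sketch. This is an assumption-laden illustration rather than the module's implementation: the function names are hypothetical, plain lists stand in for ndarray, the percentile uses linear interpolation (the interpolation rule the module uses is not stated in the source), and all vectors are assumed to be non-zero.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def percentile(values: Sequence[float], q: float) -> float:
    """Percentile with linear interpolation between ranks, q in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def centroids_and_boundaries(
    embeddings: Sequence[Vector], topic_ids: Sequence[int]
) -> Tuple[Dict[int, List[float]], Dict[int, float]]:
    """Compute the mean vector and p95 member-to-centroid distance per cluster."""
    members: Dict[int, List[Vector]] = {}
    for vector, topic_id in zip(embeddings, topic_ids):
        if topic_id != -1:  # outliers contribute to no cluster
            members.setdefault(topic_id, []).append(vector)

    centroids: Dict[int, List[float]] = {}
    boundaries: Dict[int, float] = {}
    for topic_id, vectors in members.items():
        dim = len(vectors[0])
        center = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
        centroids[topic_id] = center
        distances = [cosine_distance(v, center) for v in vectors]
        boundaries[topic_id] = percentile(distances, 95)
    return centroids, boundaries


# Hypothetical 2-D embeddings; the last one is an outlier.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]]
topic_ids = [0, 0, 0, -1]
centroids, boundaries = centroids_and_boundaries(embeddings, topic_ids)
```

A boundary of this kind can serve as a membership threshold: a new embedding whose cosine distance to a centroid falls within that cluster's p95 value lies as close as 95% of the existing members.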
- generate_labels(clusters: Dict[int, Dict[str, Any]] | None = None) → Dict[int, str]
Generate human-readable labels per cluster via LLM (or mock).
- get_cluster_records() → List[Dict[str, Any]]
Build cluster records for vector store persistence.
- get_keywords(topic_id: int, n: int = 8) → List[str]
Extract top keywords for a cluster via TF-IDF on cluster documents.
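As a rough illustration of TF-IDF keyword extraction over one cluster's documents, the sketch below scores each term per document and sums the scores across the cluster. Everything here is an assumption made for illustration: the function name top_keywords, the regex tokenizer, the smoothed IDF formula, and the summing step are not taken from the module, which may tokenize and weight terms differently.

```python
import math
import re
from collections import Counter
from typing import List


def top_keywords(cluster_docs: List[str], n: int = 8) -> List[str]:
    """Rank terms by summed TF-IDF score across the documents of one cluster."""
    # Assumed tokenizer: lowercase alphanumeric runs.
    tokenized = [re.findall(r"[a-z0-9]+", doc.lower()) for doc in cluster_docs]
    num_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores: Counter = Counter()
    for tokens in tokenized:
        if not tokens:
            continue
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            # Smoothed IDF (an assumption) so terms present in every document keep a positive weight.
            idf = math.log((1 + num_docs) / (1 + doc_freq[term])) + 1.0
            scores[term] += tf * idf

    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:n]]


# Hypothetical documents belonging to a single cluster.
docs = ["refund refund policy", "refund delay", "shipping delay"]
keywords = top_keywords(docs, n=2)
```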
- get_representative_exemplars(topic_id: int, n: int = 5) → List[Tuple[int, str]]
Return (original_index, text) pairs for the n documents closest to the centroid.
The original_index is the position in the list passed to cluster(), allowing callers to map back to doc_ids, item IDs, or other per-document metadata.
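The index mapping described above can be sketched as follows. This is a hypothetical stand-alone function, not the method itself: it takes the embeddings, documents, and topic ids explicitly, uses cosine distance to rank members (the source does not state which distance the method uses), and assumes non-zero vectors. The doc_ids list is likewise invented for the example.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def representative_exemplars(
    embeddings: Sequence[Vector],
    documents: Sequence[str],
    topic_ids: Sequence[int],
    topic_id: int,
    n: int = 5,
) -> List[Tuple[int, str]]:
    """Return (original_index, text) for the n members nearest the cluster centroid."""
    member_indices = [i for i, tid in enumerate(topic_ids) if tid == topic_id]
    if not member_indices:
        return []
    dim = len(embeddings[0])
    centroid = [
        sum(embeddings[i][d] for i in member_indices) / len(member_indices)
        for d in range(dim)
    ]
    # Sort member positions by distance to the centroid, nearest first.
    ranked = sorted(member_indices, key=lambda i: cosine_distance(embeddings[i], centroid))
    return [(i, documents[i]) for i in ranked[:n]]


# Hypothetical inputs, all in the same order as the list passed to cluster().
embeddings = [[1.0, 0.0], [9.0, 9.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.8]]
documents = ["doc zero", "doc one", "doc two", "doc three", "doc four"]
doc_ids = ["a", "b", "c", "d", "e"]
topic_ids = [0, -1, 0, 0, 0]

exemplars = representative_exemplars(embeddings, documents, topic_ids, topic_id=0, n=2)
# Map the returned positions back to external identifiers.
exemplar_doc_ids = [doc_ids[index] for index, _ in exemplars]
```

Returning positions rather than texts alone keeps the result joinable with any parallel list, which is the point of original_index.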