plexus.confidence_calibration module
Confidence calibration utilities for Plexus evaluations.
Implements isotonic regression calibration as described in: https://github.com/AnthusAI/Classification-with-Confidence
- plexus.confidence_calibration.apply_calibration_from_serialized(raw_confidence: float, calibration_data: Dict[str, Any]) → float
Apply calibration using serialized calibration data.
- Args:
raw_confidence: Raw confidence score (0.0 to 1.0)
calibration_data: Serialized calibration data
- Returns:
Calibrated confidence score
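A minimal usage sketch: load previously stored calibration data and apply it to a new raw score. The calibration.json file name and JSON storage are assumptions for illustration, not something the module prescribes.

```python
import json
from plexus.confidence_calibration import apply_calibration_from_serialized

# calibration.json is assumed to hold the dict produced by serialize_calibration_model
with open("calibration.json") as f:
    calibration_data = json.load(f)

calibrated = apply_calibration_from_serialized(0.92, calibration_data)
print(f"raw=0.92 -> calibrated={calibrated:.3f}")
```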
- plexus.confidence_calibration.apply_temperature_scaling(confidence_scores: List[float], temperature: float) → List[float]
Apply temperature scaling to confidence scores.
Temperature scaling applies a single parameter T to logits before applying softmax: P_calibrated = softmax(logit / T)
For confidence scores (already probabilities), we convert back to logits, apply temperature scaling, then convert back to probabilities.
- Args:
confidence_scores: List of raw confidence scores (0.0 to 1.0)
temperature: Temperature parameter (T > 1 decreases confidence, T < 1 increases it)
- Returns:
List of temperature-scaled confidence scores
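For a single confidence value the two-class softmax reduces to the sigmoid, so the transformation described above can be sketched independently of the module. This illustrates the math only; it is not the module's internal implementation.

```python
import numpy as np

def temperature_scale(probs, T):
    """Probability -> logit -> divide by T -> back to probability (sigmoid)."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-6, 1 - 1e-6)  # avoid log(0)
    logits = np.log(p / (1 - p))            # inverse sigmoid
    return 1 / (1 + np.exp(-logits / T))    # scaled logits back to probabilities

print(temperature_scale([0.9, 0.99], T=2.0))  # T > 1 pulls scores toward 0.5
```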
- plexus.confidence_calibration.compute_isotonic_regression_calibration(confidence_scores: List[float], accuracy_labels: List[int]) → IsotonicRegression | None
Compute isotonic regression calibration curve.
- Args:
confidence_scores: List of raw confidence scores (0.0 to 1.0)
accuracy_labels: List of binary accuracy labels (1 for correct, 0 for incorrect)
- Returns:
Fitted IsotonicRegression model, or None if insufficient data
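The returned model is scikit-learn's IsotonicRegression, so the underlying fit can be sketched as below. The module's function adds its own handling (for example returning None on insufficient data); this is an illustration of the technique, not its source.

```python
from sklearn.isotonic import IsotonicRegression

confidences = [0.55, 0.65, 0.75, 0.85, 0.95]
correct     = [0,    1,    0,    1,    1]

iso = IsotonicRegression(out_of_bounds="clip")  # clamp predictions outside the fitted range
iso.fit(confidences, correct)
print(iso.predict([0.6, 0.9]))                  # monotone mapping: raw -> calibrated
```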
- plexus.confidence_calibration.compute_two_stage_calibration(confidence_scores: List[float], accuracy_labels: List[int]) → Tuple[float, IsotonicRegression | None, List[float]]
Compute two-stage calibration: temperature scaling followed by isotonic regression.
- Args:
confidence_scores: List of raw confidence scores (0.0 to 1.0)
accuracy_labels: List of binary accuracy labels (1 for correct, 0 for incorrect)
- Returns:
Tuple of (optimal_temperature, isotonic_model, temperature_scaled_scores)
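A minimal usage sketch with made-up inputs, assuming the returned isotonic model is meant to be applied to the temperature-scaled scores:

```python
from plexus.confidence_calibration import compute_two_stage_calibration

scores = [0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
labels = [0,    1,    1,    0,    1,    1]

temperature, iso_model, scaled = compute_two_stage_calibration(scores, labels)
print(f"optimal temperature: {temperature:.2f}")
if iso_model is not None:              # None when there is too little data to fit
    print(iso_model.predict(scaled))   # assumed: isotonic stage applied to scaled scores
```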
- plexus.confidence_calibration.detect_confidence_feature_enabled(evaluation_results: List[Dict[str, Any]]) → bool
Detect if the confidence feature is enabled in evaluation results.
- Args:
evaluation_results: List of evaluation results
- Returns:
True if confidence values are present, False otherwise
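A hypothetical gating sketch; evaluation_results is a placeholder, since the schema of Plexus result dictionaries is not documented here.

```python
from plexus.confidence_calibration import detect_confidence_feature_enabled

evaluation_results = ...  # placeholder: list of result dicts from a Plexus evaluation run

if detect_confidence_feature_enabled(evaluation_results):
    print("confidence values present; calibration can proceed")
else:
    print("no confidence values found; skipping calibration")
```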
- plexus.confidence_calibration.expected_calibration_error(confidences: ndarray, accuracies: ndarray, n_bins: int = 10) → float
Calculate Expected Calibration Error (ECE).
ECE measures the average difference between confidence and accuracy across different confidence bins. Lower is better (0 = perfect calibration).
- Args:
confidences: Array of confidence scores in [0, 1]
accuracies: Array of binary correctness labels (0 or 1)
n_bins: Number of confidence bins to use
- Returns:
ECE score (lower is better, 0 = perfect calibration)
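A reference sketch of the binning computation described above, using equal-width bins; the module's own implementation may differ in details such as bin-edge handling.

```python
import numpy as np

def ece(confidences, accuracies, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    accuracies = np.asarray(accuracies, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total, error = len(confidences), 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(accuracies[in_bin].mean() - confidences[in_bin].mean())
            error += (in_bin.sum() / total) * gap   # weight gap by bin population
    return error

print(ece([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0]))
```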
- plexus.confidence_calibration.extract_confidence_accuracy_pairs(evaluation_results: List[Dict[str, Any]]) → Tuple[List[float], List[int]]
Extract confidence scores and accuracy labels from evaluation results.
- Args:
evaluation_results: List of evaluation results from Plexus
- Returns:
Tuple of (confidence_scores, accuracy_labels) where:
- confidence_scores: List of confidence values (0.0 to 1.0)
- accuracy_labels: List of binary accuracy labels (1 for correct, 0 for incorrect)
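A hypothetical end-to-end sketch: extract the pairs and measure the raw ECE. evaluation_results is again a placeholder for a Plexus run's output.

```python
import numpy as np
from plexus.confidence_calibration import (
    expected_calibration_error,
    extract_confidence_accuracy_pairs,
)

evaluation_results = ...  # placeholder: list of result dicts from a Plexus evaluation run

scores, labels = extract_confidence_accuracy_pairs(evaluation_results)
raw_ece = expected_calibration_error(np.array(scores), np.array(labels), n_bins=10)
print(f"raw ECE: {raw_ece:.3f}")
```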
- plexus.confidence_calibration.find_optimal_temperature(confidence_scores: List[float], accuracy_labels: List[int], method: str = 'minimize_scalar') → float
Find optimal temperature parameter to minimize Expected Calibration Error (ECE).
- Args:
confidence_scores: List of raw confidence scores (0.0 to 1.0)
accuracy_labels: List of binary accuracy labels (1 for correct, 0 for incorrect)
method: Optimization method, either "minimize_scalar" or "grid_search"
- Returns:
Optimal temperature parameter
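A sketch of the "minimize_scalar" strategy: search for the temperature whose scaled scores minimize ECE. It reuses the ece() and temperature_scale() sketches from earlier on this page, and the search bounds are assumptions; the module's actual search is not documented here.

```python
from scipy.optimize import minimize_scalar

def optimal_temperature(scores, labels):
    # assumed search range of 0.1..10.0; the module's real bounds may differ
    objective = lambda T: ece(temperature_scale(scores, T), labels)
    result = minimize_scalar(objective, bounds=(0.1, 10.0), method="bounded")
    return result.x

scores = [0.90, 0.95, 0.99, 0.85, 0.80, 0.70]
labels = [1,    0,    1,    1,    0,    1]
print(f"T* = {optimal_temperature(scores, labels):.2f}")
```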
- plexus.confidence_calibration.generate_calibration_report(confidence_scores: List[float], accuracy_labels: List[int], calibration_model: IsotonicRegression | None = None) → Dict[str, Any]
Generate a calibration analysis report.
- Args:
confidence_scores: Raw confidence scores
accuracy_labels: Binary accuracy labels
calibration_model: Optional fitted calibration model
- Returns:
Dictionary containing calibration metrics and analysis
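A minimal usage sketch; the exact keys of the returned report dictionary are defined by the module and not assumed here.

```python
from plexus.confidence_calibration import (
    compute_isotonic_regression_calibration,
    generate_calibration_report,
)

scores = [0.6, 0.7, 0.8, 0.9, 0.95]
labels = [0,   1,   1,   1,   1]

model = compute_isotonic_regression_calibration(scores, labels)
report = generate_calibration_report(scores, labels, calibration_model=model)
print(report)   # dictionary of calibration metrics
```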
- plexus.confidence_calibration.plot_reliability_diagram(confidence_scores: List[float], accuracy_labels: List[int], save_path: str, title: str = 'Confidence Calibration - Reliability Diagram', n_bins: int = 20, temperature_scaled_scores: List[float] = None, calibrated_confidence_scores: List[float] = None) → None
Plot a reliability diagram showing calibration quality with confidence buckets. The plot can overlay raw, temperature-scaled, and final calibrated confidence scores for comparison.
Perfect calibration appears as points on the diagonal line. Points above the line = overconfident, below = underconfident.
- Args:
confidence_scores: List of raw confidence scores (0.0 to 1.0)
accuracy_labels: List of binary accuracy labels (1 for correct, 0 for incorrect)
save_path: Path to save the plot image
title: Plot title
n_bins: Number of confidence bins (default 20 for 5% buckets)
temperature_scaled_scores: Optional list of temperature-scaled confidence scores
calibrated_confidence_scores: Optional list of final calibrated confidence scores
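A hypothetical call writing the diagram to disk; the output path and example data are assumptions.

```python
from plexus.confidence_calibration import plot_reliability_diagram

scores = [0.55, 0.65, 0.75, 0.85, 0.95]
labels = [0,    1,    0,    1,    1]

plot_reliability_diagram(
    confidence_scores=scores,
    accuracy_labels=labels,
    save_path="reliability_diagram.png",  # assumed output path
    n_bins=20,                            # 5% buckets, matching the default
)
```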
- plexus.confidence_calibration.serialize_calibration_model(calibration_model: IsotonicRegression) → Dict[str, Any]
Serialize calibration model for storage in YAML/JSON.
- Args:
calibration_model: Fitted isotonic regression model
- Returns:
Dictionary containing serialized calibration data
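A sketch of fitting, serializing, and persisting a calibration model; JSON is one of the storage formats the docstring mentions, and the file name is an assumption.

```python
import json
from plexus.confidence_calibration import (
    compute_isotonic_regression_calibration,
    serialize_calibration_model,
)

scores = [0.6, 0.7, 0.8, 0.9, 0.95]
labels = [0,   1,   1,   1,   1]

model = compute_isotonic_regression_calibration(scores, labels)
if model is not None:                     # None when data is insufficient
    with open("calibration.json", "w") as f:
        json.dump(serialize_calibration_model(model), f)
```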