plexus.Evaluation module
- class plexus.Evaluation.AccuracyEvaluation(*, override_folder: str | None = None, labeled_samples: list = None, labeled_samples_filename: str = None, score_id: str = None, score_version_id: str = None, visualize: bool = False, task_id: str = None, evaluation_id: str = None, account_id: str = None, account_key: str = None, scorecard_id: str = None, **kwargs)
Bases: Evaluation

- __init__(*, override_folder: str | None = None, labeled_samples: list = None, labeled_samples_filename: str = None, score_id: str = None, score_version_id: str = None, visualize: bool = False, task_id: str = None, evaluation_id: str = None, account_id: str = None, account_key: str = None, scorecard_id: str = None, **kwargs)
- get_score_instance(score_name: str)
Get a Score instance using the standardized Score.load() method.
This method uses the DRY, tested Score.load() approach, which handles:

- API loading with local caching
- YAML-only loading from local files
- Proper error handling and dependency resolution
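A one-line usage sketch, assuming an existing AccuracyEvaluation instance named evaluation; the score name is a hypothetical placeholder:

    # "Grammar Check" is a placeholder score name for illustration only.
    score = evaluation.get_score_instance("Grammar Check")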
- async run(tracker=None, progress_callback=None, dry_run=False)
Run the accuracy evaluation.
The tracker argument is optional to maintain backward compatibility with callers that do not pass one; when it is None, stage/progress updates are skipped.
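A minimal sketch of calling the method, assuming an already constructed AccuracyEvaluation named evaluation; dry_run=True is shown only to exercise the optional flag:

    # No tracker is passed, so stage/progress updates are skipped.
    await evaluation.run(dry_run=True)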
- class plexus.Evaluation.ConsistencyEvaluation(*, number_of_times_to_sample_each_text, **kwargs)
Bases: Evaluation

- __init__(*, number_of_times_to_sample_each_text, **kwargs)
- log_parameters()
- class plexus.Evaluation.Evaluation(*, scorecard_name: str, scorecard: Scorecard, labeled_samples_filename: str = None, labeled_samples: list = None, number_of_texts_to_sample=100, sampling_method='random', random_seed=None, session_ids_to_sample=None, subset_of_score_names=None, experiment_label=None, max_mismatches_to_report=5, account_key: str = None, score_id: str = None, visualize: bool = False, task_id: str = None, allow_no_labels: bool = False)
Bases: object

Base class for evaluating Scorecard performance through accuracy testing and consistency checking.
Evaluation is used to measure how well a Scorecard performs against labeled data or to check consistency of results. It integrates with the Plexus dashboard for monitoring. The class supports:
Accuracy testing against labeled data
Consistency checking through repeated scoring
Automatic metrics calculation and visualization
Cost tracking and reporting
Real-time progress tracking in the dashboard
There are two main subclasses:

1. AccuracyEvaluation: Tests scorecard results against labeled data
2. ConsistencyEvaluation: Checks if scores are consistent when run multiple times
Common usage patterns:

1. Running an accuracy evaluation:

        evaluation = AccuracyEvaluation(
            scorecard_name="qa", scorecard=scorecard, labeled_samples_filename="labeled_data.csv"
        )
        await evaluation.run()

2. Running a consistency check:

        evaluation = ConsistencyEvaluation(
            scorecard_name="qa", scorecard=scorecard, number_of_texts_to_sample=100, number_of_times_to_sample_each_text=3
        )
        await evaluation.run()

3. Using with context management:

        with evaluation:
            await evaluation.run()

4. Monitoring in dashboard:

        evaluation = AccuracyEvaluation(
            scorecard_name="qa", scorecard=scorecard, account_key="my-account", score_id="score-123"
        )
        await evaluation.run()  # Progress visible in dashboard
The Evaluation class is commonly used during model development to measure performance and during production to monitor for accuracy drift.
- __init__(*, scorecard_name: str, scorecard: Scorecard, labeled_samples_filename: str = None, labeled_samples: list = None, number_of_texts_to_sample=100, sampling_method='random', random_seed=None, session_ids_to_sample=None, subset_of_score_names=None, experiment_label=None, max_mismatches_to_report=5, account_key: str = None, score_id: str = None, visualize: bool = False, task_id: str = None, allow_no_labels: bool = False)
- calculate_metrics(results)
- async cleanup()
Clean up all resources
- async continuous_metrics_computation(score_name: str)
Background task that continuously computes and posts metrics for a specific score
- create_performance_visualization(results, question, report_folder_path)
- generate_and_log_confusion_matrix(results, report_folder_path)
- generate_csv_scorecard_report(*, results)
- generate_excel_report(report_folder_path, results, selected_sample_rows)
- generate_metrics_json(report_folder_path, sample_size, expenses, calibration_report=None)
- generate_report(score_instance, overall_accuracy, expenses, sample_size, final_metrics=None)
- static get_evaluation_info(evaluation_id: str, include_score_results: bool = False) → dict
Get detailed information about an evaluation by its ID.
- Args:
evaluation_id: The ID of the evaluation to look up
include_score_results: Whether to include score results in the response
- Returns:
dict: Evaluation information including scorecard name, score name, and metrics
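A short sketch of calling this static helper; the evaluation ID is a placeholder:

    # "eval-123" is a placeholder evaluation ID for illustration only.
    info = Evaluation.get_evaluation_info("eval-123", include_score_results=True)
    # info is a dict that includes the scorecard name, score name, and metrics.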
- static get_latest_evaluation(account_key: str = None, evaluation_type: str = None) → dict
Get information about the most recent evaluation.
- Args:
account_key: Account key to filter by (default: from PLEXUS_ACCOUNT_KEY env var)
evaluation_type: Optional filter by evaluation type (e.g., 'accuracy')
- Returns:
dict: Latest evaluation information
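A minimal sketch, assuming PLEXUS_ACCOUNT_KEY is set in the environment so account_key can be omitted:

    # Fetch the most recent accuracy evaluation for the default account.
    latest = Evaluation.get_latest_evaluation(evaluation_type="accuracy")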
- log_parameters()
- async log_to_dashboard(metrics, status='RUNNING')
Log metrics to the Plexus Dashboard with retry logic
- async maybe_start_metrics_task(score_name: str, is_final_result: bool = False)
Start a metrics computation task if one isn’t running, or if this is the final result
- async run(*args, **kwargs)
- async score_all_texts(selected_sample_rows)
Score all texts concurrently
- async score_all_texts_for_score(selected_sample_rows, score_name: str, tracker)
Score all texts for a specific score with controlled concurrency
- score_names()
- score_names_to_process()
- async score_text(row, score_name: str = None)
Score text with retry logic for handling timeouts and request exceptions
- start_mlflow_run()
- time_execution()
- class plexus.Evaluation.FeedbackEvaluation(*, days: int = 7, scorecard_id: str | None = None, score_id: str | None = None, evaluation_id: str | None = None, account_id: str | None = None, task_id: str | None = None, api_client=None, max_samples: int | None = None, **kwargs)
Bases: Evaluation

Evaluation that analyzes feedback items to measure agreement between AI predictions and human corrections over a time period.
This evaluation type:

- Fetches feedback items for a scorecard/score over a specified time period
- Calculates Gwet’s AC1 agreement coefficient (primary metric)
- Calculates accuracy, precision, recall
- Generates confusion matrix
- Creates an evaluation record with all metrics
Unlike AccuracyEvaluation which runs predictions on a dataset, FeedbackEvaluation analyzes existing feedback corrections to measure real-world performance.
- __init__(*, days: int = 7, scorecard_id: str | None = None, score_id: str | None = None, evaluation_id: str | None = None, account_id: str | None = None, task_id: str | None = None, api_client=None, max_samples: int | None = None, **kwargs)
Initialize a FeedbackEvaluation.
- Args:
days: Number of days to look back for feedback items (default: 7)
scorecard_id: ID of the scorecard to evaluate
score_id: Optional ID of specific score to evaluate (if None, evaluates all scores)
evaluation_id: ID of the evaluation record
account_id: Account ID
task_id: Optional task ID for progress tracking
api_client: Optional API client (if not provided, will be created from kwargs)
max_samples: Optional maximum number of feedback items to process (default: None = all)
**kwargs: Additional arguments passed to parent Evaluation class
- async run(tracker=None)
Run the feedback evaluation.
- Args:
tracker: Optional TaskProgressTracker for progress updates
- Returns:
Dictionary with evaluation results
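A hedged end-to-end sketch; the IDs below are placeholders, and any arguments required by the parent Evaluation class would be forwarded through **kwargs:

    # All IDs are placeholders for illustration only.
    evaluation = FeedbackEvaluation(
        days=30,
        scorecard_id="scorecard-abc",
        score_id="score-123",
        evaluation_id="evaluation-456",
        account_id="account-789",
    )
    results = await evaluation.run()
    # results is a dictionary with the evaluation results.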