plexus.Evaluation module

class plexus.Evaluation.AccuracyEvaluation(*, override_folder: str | None = None, labeled_samples: list = None, labeled_samples_filename: str = None, score_id: str = None, score_version_id: str = None, visualize: bool = False, task_id: str = None, evaluation_id: str = None, account_id: str = None, account_key: str = None, scorecard_id: str = None, **kwargs)

Bases: Evaluation

__init__(*, override_folder: str | None = None, labeled_samples: list = None, labeled_samples_filename: str = None, score_id: str = None, score_version_id: str = None, visualize: bool = False, task_id: str = None, evaluation_id: str = None, account_id: str = None, account_key: str = None, scorecard_id: str = None, **kwargs)
get_score_instance(score_name: str)

Get a Score instance using the standardized Score.load() method.

This method now uses the DRY, tested Score.load() approach, which handles:

  • API loading with local caching

  • YAML-only loading from local files

  • Proper error handling and dependency resolution
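
A minimal sketch of resolving a Score by name (assumes an existing AccuracyEvaluation instance; the score name is a placeholder):

    # Resolve a Score instance by name; Score.load() handles caching and dependency resolution
    score_instance = evaluation.get_score_instance("Call Resolution")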

async run(tracker=None, progress_callback=None, dry_run=False)

Run the accuracy evaluation.

The tracker argument is optional to preserve backward compatibility with callers that do not pass one. When it is None, stage and progress updates are skipped.
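
A minimal sketch of running an accuracy evaluation without a tracker, assuming scorecard is a Scorecard instance constructed elsewhere and that the labeled-data file exists:

    import asyncio

    from plexus.Evaluation import AccuracyEvaluation

    async def main():
        evaluation = AccuracyEvaluation(
            scorecard_name="qa",
            scorecard=scorecard,  # assumed: a Scorecard loaded elsewhere
            labeled_samples_filename="labeled_data.csv",
        )
        # No tracker is passed, so stage/progress updates are skipped
        await evaluation.run()

    asyncio.run(main())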

class plexus.Evaluation.ConsistencyEvaluation(*, number_of_times_to_sample_each_text, **kwargs)

Bases: Evaluation

__init__(*, number_of_times_to_sample_each_text, **kwargs)
log_parameters()
class plexus.Evaluation.Evaluation(*, scorecard_name: str, scorecard: Scorecard, labeled_samples_filename: str = None, labeled_samples: list = None, number_of_texts_to_sample=100, sampling_method='random', random_seed=None, session_ids_to_sample=None, subset_of_score_names=None, experiment_label=None, max_mismatches_to_report=5, account_key: str = None, score_id: str = None, visualize: bool = False, task_id: str = None, allow_no_labels: bool = False)

Bases: object

Base class for evaluating Scorecard performance through accuracy testing and consistency checking.

Evaluation is used to measure how well a Scorecard performs against labeled data or to check consistency of results. It integrates with the Plexus dashboard for monitoring. The class supports:

  • Accuracy testing against labeled data

  • Consistency checking through repeated scoring

  • Automatic metrics calculation and visualization

  • Cost tracking and reporting

  • Real-time progress tracking in the dashboard

There are two main subclasses:

  1. AccuracyEvaluation: Tests scorecard results against labeled data

  2. ConsistencyEvaluation: Checks if scores are consistent when run multiple times

Common usage patterns:

  1. Running an accuracy evaluation:

    evaluation = AccuracyEvaluation(
        scorecard_name="qa",
        scorecard=scorecard,
        labeled_samples_filename="labeled_data.csv"
    )
    await evaluation.run()

  2. Running a consistency check:

    evaluation = ConsistencyEvaluation(
        scorecard_name="qa",
        scorecard=scorecard,
        number_of_texts_to_sample=100,
        number_of_times_to_sample_each_text=3
    )
    await evaluation.run()

  3. Using with context management:

    with evaluation:
        await evaluation.run()

  4. Monitoring in dashboard:

    evaluation = AccuracyEvaluation(
        scorecard_name="qa",
        scorecard=scorecard,
        account_key="my-account",
        score_id="score-123"
    )
    await evaluation.run()  # Progress visible in dashboard

The Evaluation class is commonly used during model development to measure performance and during production to monitor for accuracy drift.

__init__(*, scorecard_name: str, scorecard: Scorecard, labeled_samples_filename: str = None, labeled_samples: list = None, number_of_texts_to_sample=100, sampling_method='random', random_seed=None, session_ids_to_sample=None, subset_of_score_names=None, experiment_label=None, max_mismatches_to_report=5, account_key: str = None, score_id: str = None, visualize: bool = False, task_id: str = None, allow_no_labels: bool = False)
calculate_metrics(results)
async cleanup()

Clean up all resources

async continuous_metrics_computation(score_name: str)

Background task that continuously computes and posts metrics for a specific score

create_performance_visualization(results, question, report_folder_path)
generate_and_log_confusion_matrix(results, report_folder_path)
generate_csv_scorecard_report(*, results)
generate_excel_report(report_folder_path, results, selected_sample_rows)
generate_metrics_json(report_folder_path, sample_size, expenses, calibration_report=None)
generate_report(score_instance, overall_accuracy, expenses, sample_size, final_metrics=None)
static get_evaluation_info(evaluation_id: str, include_score_results: bool = False) → dict

Get detailed information about an evaluation by its ID.

Args:

  evaluation_id: The ID of the evaluation to look up
  include_score_results: Whether to include score results in the response

Returns:

  dict: Evaluation information including scorecard name, score name, and metrics
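
For example (a sketch; the evaluation ID shown is a placeholder):

    from plexus.Evaluation import Evaluation

    # Look up an evaluation by ID; the ID here is a placeholder
    info = Evaluation.get_evaluation_info("evaluation-123", include_score_results=True)
    print(info)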

static get_latest_evaluation(account_key: str = None, evaluation_type: str = None) → dict

Get information about the most recent evaluation.

Args:

  account_key: Account key to filter by (default: from the PLEXUS_ACCOUNT_KEY environment variable)
  evaluation_type: Optional filter by evaluation type (e.g., 'accuracy')

Returns:

  dict: Latest evaluation information
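
A quick sketch, assuming PLEXUS_ACCOUNT_KEY is set in the environment:

    from plexus.Evaluation import Evaluation

    # account_key is omitted, so the PLEXUS_ACCOUNT_KEY environment variable is used
    latest = Evaluation.get_latest_evaluation(evaluation_type="accuracy")
    print(latest)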

log_parameters()
async log_to_dashboard(metrics, status='RUNNING')

Log metrics to the Plexus Dashboard with retry logic

async maybe_start_metrics_task(score_name: str, is_final_result: bool = False)

Start a metrics computation task if one isn’t running, or if this is the final result

async run(*args, **kwargs)
async score_all_texts(selected_sample_rows)

Score all texts concurrently

async score_all_texts_for_score(selected_sample_rows, score_name: str, tracker)

Score all texts for a specific score with controlled concurrency

score_names()
score_names_to_process()
async score_text(row, score_name: str = None)

Score text with retry logic for handling timeouts and request exceptions
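
A hedged sketch of scoring a single sample (the row shape below is an illustrative assumption; in practice rows come from the loaded labeled samples):

    # Hypothetical row; real rows are produced when labeled samples are loaded
    row = {"text": "Hi, I'm calling about my invoice...", "Resolved": "Yes"}
    result = await evaluation.score_text(row, score_name="Resolved")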

start_mlflow_run()
time_execution()
class plexus.Evaluation.FeedbackEvaluation(*, days: int = 7, scorecard_id: str | None = None, score_id: str | None = None, evaluation_id: str | None = None, account_id: str | None = None, task_id: str | None = None, api_client=None, max_samples: int | None = None, **kwargs)

Bases: Evaluation

Evaluation that analyzes feedback items to measure agreement between AI predictions and human corrections over a time period.

This evaluation type:

  • Fetches feedback items for a scorecard/score over a specified time period

  • Calculates Gwet's AC1 agreement coefficient (primary metric)

  • Calculates accuracy, precision, and recall

  • Generates a confusion matrix

  • Creates an evaluation record with all metrics

Unlike AccuracyEvaluation which runs predictions on a dataset, FeedbackEvaluation analyzes existing feedback corrections to measure real-world performance.

__init__(*, days: int = 7, scorecard_id: str | None = None, score_id: str | None = None, evaluation_id: str | None = None, account_id: str | None = None, task_id: str | None = None, api_client=None, max_samples: int | None = None, **kwargs)

Initialize a FeedbackEvaluation.

Args:

  days: Number of days to look back for feedback items (default: 7)
  scorecard_id: ID of the scorecard to evaluate
  score_id: Optional ID of a specific score to evaluate (if None, evaluates all scores)
  evaluation_id: ID of the evaluation record
  account_id: Account ID
  task_id: Optional task ID for progress tracking
  api_client: Optional API client (if not provided, will be created from kwargs)
  max_samples: Optional maximum number of feedback items to process (default: None = all)
  **kwargs: Additional arguments passed to the parent Evaluation class

async run(tracker=None)

Run the feedback evaluation.

Args:

  tracker: Optional TaskProgressTracker for progress updates

Returns:

  Dictionary with evaluation results
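
A minimal sketch of running a feedback evaluation (IDs are placeholders; depending on your setup, parent Evaluation parameters such as scorecard_name and scorecard may also need to be supplied via **kwargs):

    import asyncio

    from plexus.Evaluation import FeedbackEvaluation

    async def main():
        evaluation = FeedbackEvaluation(
            days=30,
            scorecard_id="scorecard-123",  # placeholder ID
            score_id="score-456",          # placeholder ID; omit to evaluate all scores
            account_id="account-789",      # placeholder ID
            max_samples=500,
        )
        results = await evaluation.run()
        print(results)

    asyncio.run(main())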