plexus.Evaluation module

class plexus.Evaluation.AccuracyEvaluation(*, override_folder: str | None = None, labeled_samples: list = None, labeled_samples_filename: str = None, score_id: str = None, score_version_id: str = None, visualize: bool = False, task_id: str = None, evaluation_id: str = None, account_id: str = None, account_key: str = None, scorecard_id: str = None, **kwargs)

Bases: Evaluation

__init__(*, override_folder: str | None = None, labeled_samples: list = None, labeled_samples_filename: str = None, score_id: str = None, score_version_id: str = None, visualize: bool = False, task_id: str = None, evaluation_id: str = None, account_id: str = None, account_key: str = None, scorecard_id: str = None, **kwargs)
get_score_instance(score_name: str)

Get a Score instance using the standardized Score.load() method.

This method now uses the DRY, tested Score.load() approach, which handles:

  • API loading with local caching

  • YAML-only loading from local files

  • Proper error handling and dependency resolution
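
A minimal sketch of resolving a Score by name (assumes an existing AccuracyEvaluation instance; the score name is a placeholder):

    # Resolve a Score instance by name; Score.load() handles caching and dependency resolution
    score_instance = evaluation.get_score_instance("Call Resolution")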

async run(tracker=None, progress_callback=None, dry_run=False)

Run the accuracy evaluation.

The tracker argument is optional to preserve backward compatibility with callers that do not pass one. When it is None, stage and progress updates are skipped.
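
A minimal sketch of running an accuracy evaluation without a tracker, assuming scorecard is a Scorecard instance constructed elsewhere and that the labeled-data file exists:

    import asyncio

    from plexus.Evaluation import AccuracyEvaluation

    async def main():
        evaluation = AccuracyEvaluation(
            scorecard_name="qa",
            scorecard=scorecard,  # assumed: a Scorecard loaded elsewhere
            labeled_samples_filename="labeled_data.csv",
        )
        # No tracker is passed, so stage/progress updates are skipped
        await evaluation.run()

    asyncio.run(main())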

class plexus.Evaluation.ConsistencyEvaluation(*, number_of_times_to_sample_each_text, **kwargs)

Bases: Evaluation

__init__(*, number_of_times_to_sample_each_text, **kwargs)
log_parameters()
class plexus.Evaluation.Evaluation(*, scorecard_name: str, scorecard: Scorecard, labeled_samples_filename: str = None, labeled_samples: list = None, number_of_texts_to_sample=100, sampling_method='random', random_seed=None, session_ids_to_sample=None, subset_of_score_names=None, experiment_label=None, max_mismatches_to_report=5, account_key: str = None, score_id: str = None, visualize: bool = False, task_id: str = None, allow_no_labels: bool = False)

Bases: object

Base class for evaluating Scorecard performance through accuracy testing and consistency checking.

Evaluation is used to measure how well a Scorecard performs against labeled data or to check consistency of results. It integrates with the Plexus dashboard for monitoring. The class supports:

  • Accuracy testing against labeled data

  • Consistency checking through repeated scoring

  • Automatic metrics calculation and visualization

  • Cost tracking and reporting

  • Real-time progress tracking in the dashboard

There are two main subclasses:

  1. AccuracyEvaluation: Tests scorecard results against labeled data

  2. ConsistencyEvaluation: Checks if scores are consistent when run multiple times

Common usage patterns:

  1. Running an accuracy evaluation:

    evaluation = AccuracyEvaluation(
        scorecard_name="qa",
        scorecard=scorecard,
        labeled_samples_filename="labeled_data.csv"
    )
    await evaluation.run()

  2. Running a consistency check:

    evaluation = ConsistencyEvaluation(
        scorecard_name="qa",
        scorecard=scorecard,
        number_of_texts_to_sample=100,
        number_of_times_to_sample_each_text=3
    )
    await evaluation.run()

  3. Using with context management:

    with evaluation:
        await evaluation.run()

  4. Monitoring in dashboard:

    evaluation = AccuracyEvaluation(
        scorecard_name="qa",
        scorecard=scorecard,
        account_key="my-account",
        score_id="score-123"
    )
    await evaluation.run()  # Progress visible in dashboard

The Evaluation class is commonly used during model development to measure performance and during production to monitor for accuracy drift.

__init__(*, scorecard_name: str, scorecard: Scorecard, labeled_samples_filename: str = None, labeled_samples: list = None, number_of_texts_to_sample=100, sampling_method='random', random_seed=None, session_ids_to_sample=None, subset_of_score_names=None, experiment_label=None, max_mismatches_to_report=5, account_key: str = None, score_id: str = None, visualize: bool = False, task_id: str = None, allow_no_labels: bool = False)
calculate_metrics(results)
async cleanup()

Clean up all resources

async continuous_metrics_computation(score_name: str)

Background task that continuously computes and posts metrics for a specific score

create_performance_visualization(results, question, report_folder_path)
generate_and_log_confusion_matrix(results, report_folder_path)
generate_csv_scorecard_report(*, results)
generate_excel_report(report_folder_path, results, selected_sample_rows)
generate_metrics_json(report_folder_path, sample_size, expenses, calibration_report=None)
generate_report(score_instance, overall_accuracy, expenses, sample_size, final_metrics=None)
static get_evaluation_info(evaluation_id: str, include_score_results: bool = False) → dict

Get detailed information about an evaluation by its ID.

Args:

  evaluation_id: The ID of the evaluation to look up
  include_score_results: Whether to include score results in the response

Returns:

  dict: Evaluation information including scorecard name, score name, and metrics
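
For example (a sketch; the evaluation ID shown is a placeholder):

    from plexus.Evaluation import Evaluation

    # Look up an evaluation by ID; the ID here is a placeholder
    info = Evaluation.get_evaluation_info("evaluation-123", include_score_results=True)
    print(info)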

static get_latest_evaluation(account_key: str = None, evaluation_type: str = None) → dict

Get information about the most recent evaluation.

Args:

  account_key: Account key to filter by (default: from the PLEXUS_ACCOUNT_KEY environment variable)
  evaluation_type: Optional filter by evaluation type (e.g., 'accuracy')

Returns:

  dict: Latest evaluation information
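
A quick sketch, assuming PLEXUS_ACCOUNT_KEY is set in the environment:

    from plexus.Evaluation import Evaluation

    # account_key is omitted, so the PLEXUS_ACCOUNT_KEY environment variable is used
    latest = Evaluation.get_latest_evaluation(evaluation_type="accuracy")
    print(latest)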

log_parameters()
async log_to_dashboard(metrics, status='RUNNING')

Log metrics to the Plexus Dashboard with retry logic

async maybe_start_metrics_task(score_name: str, is_final_result: bool = False)

Start a metrics computation task if one isn’t running, or if this is the final result

async run(*args, **kwargs)
async score_all_texts(selected_sample_rows)

Score all texts concurrently

async score_all_texts_for_score(selected_sample_rows, score_name: str, tracker)

Score all texts for a specific score with controlled concurrency

score_names()
score_names_to_process()
async score_text(row, score_name: str = None)

Score text with retry logic for handling timeouts and request exceptions
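
A hedged sketch of scoring a single sample (the row shape below is an illustrative assumption; in practice rows come from the loaded labeled samples):

    # Hypothetical row; real rows are produced when labeled samples are loaded
    row = {"text": "Hi, I'm calling about my invoice...", "Resolved": "Yes"}
    result = await evaluation.score_text(row, score_name="Resolved")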

start_mlflow_run()
time_execution()
class plexus.Evaluation.FeedbackEvaluation(*, days: int = 7, scorecard_id: str | None = None, score_id: str | None = None, evaluation_id: str | None = None, account_id: str | None = None, task_id: str | None = None, api_client=None, max_samples: int | None = None, **kwargs)

Bases: Evaluation

Evaluation that analyzes feedback items to measure agreement between AI predictions and human corrections over a time period.

This evaluation type:

  • Fetches feedback items for a scorecard/score over a specified time period

  • Calculates Gwet's AC1 agreement coefficient (primary metric)

  • Calculates accuracy, precision, and recall

  • Generates a confusion matrix

  • Creates an evaluation record with all metrics

Unlike AccuracyEvaluation which runs predictions on a dataset, FeedbackEvaluation analyzes existing feedback corrections to measure real-world performance.

__init__(*, days: int = 7, scorecard_id: str | None = None, score_id: str | None = None, evaluation_id: str | None = None, account_id: str | None = None, task_id: str | None = None, api_client=None, max_samples: int | None = None, **kwargs)

Initialize a FeedbackEvaluation.

Args:

  days: Number of days to look back for feedback items (default: 7)
  scorecard_id: ID of the scorecard to evaluate
  score_id: Optional ID of a specific score to evaluate (if None, evaluates all scores)
  evaluation_id: ID of the evaluation record
  account_id: Account ID
  task_id: Optional task ID for progress tracking
  api_client: Optional API client (if not provided, will be created from kwargs)
  max_samples: Optional maximum number of feedback items to process (default: None = all)
  **kwargs: Additional arguments passed to the parent Evaluation class

async run(tracker=None)

Run the feedback evaluation.

Args:

  tracker: Optional TaskProgressTracker for progress updates

Returns:

  Dictionary with evaluation results
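
A minimal sketch of running a feedback evaluation (IDs are placeholders; depending on your setup, parent Evaluation parameters such as scorecard_name and scorecard may also need to be supplied via **kwargs):

    import asyncio

    from plexus.Evaluation import FeedbackEvaluation

    async def main():
        evaluation = FeedbackEvaluation(
            days=30,
            scorecard_id="scorecard-123",  # placeholder ID
            score_id="score-456",          # placeholder ID; omit to evaluate all scores
            account_id="account-789",      # placeholder ID
            max_samples=500,
        )
        results = await evaluation.run()
        print(results)

    asyncio.run(main())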