plexus.cli.evaluation.evaluations module

plexus.cli.evaluation.evaluations.assert_dataset_materialized_for_accuracy(dataset: dict) → Dict[str, Any]

Fail fast when a dataset-backed accuracy run points to a non-materialized dataset.

plexus.cli.evaluation.evaluations.build_dataset_materialization_failure_message(*, dataset_id: str, reason: str, dataset_file: str | None, next_step_hint: str) → str
plexus.cli.evaluation.evaluations.check_dict_serializability(d, path='')
plexus.cli.evaluation.evaluations.create_client() → PlexusDashboardClient

Create a client and log its configuration

plexus.cli.evaluation.evaluations.format_confusion_matrix_summary(final_metrics)

Format confusion matrix and detailed metrics for the evaluation summary.
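The shape of `final_metrics` is defined by the evaluation pipeline and is not documented here; a minimal sketch of this kind of summary formatting, assuming a hypothetical `{"confusion_matrix": {"labels": [...], "matrix": [[...]]}, "accuracy": ...}` structure, might look like:

```python
# Illustrative sketch only: the real final_metrics structure is defined by
# the Plexus evaluation pipeline; the keys below are assumptions.
def format_confusion_matrix_summary_sketch(final_metrics: dict) -> str:
    matrix = final_metrics["confusion_matrix"]
    labels = matrix["labels"]
    rows = matrix["matrix"]
    lines = ["Confusion matrix (rows = actual, columns = predicted):"]
    lines.append("            " + "  ".join(f"{label:>10}" for label in labels))
    for label, row in zip(labels, rows):
        lines.append(f"{label:>10}  " + "  ".join(f"{n:>10}" for n in row))
    accuracy = final_metrics.get("accuracy")
    if accuracy is not None:
        lines.append(f"Accuracy: {accuracy:.1%}")
    return "\n".join(lines)
```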

plexus.cli.evaluation.evaluations.get_amplify_bucket()

Get the S3 bucket name from environment variables or fall back to reading amplify_outputs.json.

plexus.cli.evaluation.evaluations.get_csv_samples(csv_filename)
plexus.cli.evaluation.evaluations.get_data_driven_samples(scorecard_instance, scorecard_name, score_name, score_config, fresh, reload, content_ids_to_sample_set, progress_callback=None, number_of_samples=None, random_seed=None)
plexus.cli.evaluation.evaluations.get_dataset_by_id(client: PlexusDashboardClient, dataset_id: str) → dict

Get a specific DataSet by ID

plexus.cli.evaluation.evaluations.get_latest_accuracy_evaluation_for_score_since(client: PlexusDashboardClient, score_id: str, created_after_iso: str) → dict | None
plexus.cli.evaluation.evaluations.get_latest_associated_dataset_for_score(client: PlexusDashboardClient, score_id: str) → dict
plexus.cli.evaluation.evaluations.get_latest_dataset_for_data_source(client: PlexusDashboardClient, data_source_id: str) → dict

Get the most recent DataSet for a DataSource by finding its current version

plexus.cli.evaluation.evaluations.get_latest_score_version(client, score_id: str) → str | None

Get the most recent ScoreVersion ID for a given score using the scoreId index sorted by createdAt.

Args:

client: GraphQL API client
score_id: The score ID to get the latest version for

Returns:

The latest ScoreVersion ID, or None if no versions found

plexus.cli.evaluation.evaluations.is_json_serializable(obj)
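`is_json_serializable` and `check_dict_serializability` (documented above) are commonly implemented along these lines; this is a hedged sketch, not the actual Plexus implementation, and the list-of-bad-paths return shape is an assumption.

```python
import json

# Probe serializability by attempting a real json.dumps call.
def is_json_serializable_sketch(obj) -> bool:
    try:
        json.dumps(obj)
        return True
    except (TypeError, ValueError):
        return False

# Walk a nested dict, collecting dotted paths of unserializable values.
def check_dict_serializability_sketch(d: dict, path: str = "") -> list[str]:
    bad = []
    for key, value in d.items():
        key_path = f"{path}.{key}" if path else str(key)
        if isinstance(value, dict):
            bad.extend(check_dict_serializability_sketch(value, key_path))
        elif not is_json_serializable_sketch(value):
            bad.append(key_path)
    return bad
```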
plexus.cli.evaluation.evaluations.list_associated_datasets_for_score(client: PlexusDashboardClient, score_id: str, limit: int = 200) → list[dict]

List datasets associated with a score, ordered newest-first by createdAt, then id.
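The newest-first ordering can be sketched as a sort on (createdAt, id) descending; the dict keys below assume the GraphQL field names used elsewhere in this module, and ISO-8601 timestamps sort correctly as strings.

```python
# Sort datasets newest-first by createdAt, breaking ties by id descending.
# ISO-8601 createdAt strings compare correctly lexicographically.
def sort_datasets_newest_first(datasets: list[dict]) -> list[dict]:
    return sorted(
        datasets,
        key=lambda d: (d.get("createdAt", ""), d.get("id", "")),
        reverse=True,
    )
```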

plexus.cli.evaluation.evaluations.load_configuration_from_yaml_file(configuration_file_path)

Load configuration from a YAML file.

plexus.cli.evaluation.evaluations.load_samples_from_cloud_dataset(dataset: dict, score_name: str, score_config: dict, number_of_samples: int | None = None, random_seed: int | None = None, progress_callback=None) → list

Load samples from a cloud dataset (Parquet file) and convert to evaluation format

plexus.cli.evaluation.evaluations.load_scorecard_from_api(scorecard_identifier: str, score_names=None, use_cache=False, specific_version=None)

Load a scorecard from the Plexus Dashboard API.

Args:

scorecard_identifier: A string that can identify the scorecard (id, key, name, etc.)
score_names: Optional list of specific score names to load
use_cache: Whether to prefer local cache files over API (default: False).
    When False, always fetch from API but still write cache files.
    When True, check local cache first and only fetch missing configs.

specific_version: Optional specific score version ID to use instead of champion version

Returns:

Scorecard: An initialized Scorecard instance with required scores loaded

Raises:

ValueError: If the scorecard cannot be found
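The `use_cache` contract above can be sketched with stand-in callables for the cache and the API (all four callables here are hypothetical, not Plexus internals):

```python
# Sketch of the documented use_cache behavior. read_cache, fetch_from_api,
# and write_cache are hypothetical stand-ins for the real internals.
def fetch_score_config(score_name, use_cache, read_cache, fetch_from_api, write_cache):
    if use_cache:
        cached = read_cache(score_name)
        if cached is not None:
            return cached  # cache hit: skip the API entirely
    config = fetch_from_api(score_name)  # cache miss, or use_cache=False
    write_cache(score_name, config)      # cache files are written either way
    return config
```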

plexus.cli.evaluation.evaluations.load_scorecard_from_yaml_files(scorecard_identifier: str, score_names=None, specific_version=None)

Load a scorecard from individual YAML configuration files saved by fetch_score_configurations.

Args:

scorecard_identifier: A string that identifies the scorecard (ID, name, key, or external ID)
score_names: Optional list of specific score names to load
specific_version: Optional specific score version ID (Note: YAML files contain champion versions only)

Returns:

Scorecard: An initialized Scorecard instance with required scores loaded from YAML files

Raises:

ValueError: If the scorecard cannot be constructed from YAML files

plexus.cli.evaluation.evaluations.log_scorecard_configurations(scorecard_instance, context='')

Log the actual configurations being used by the scorecard instance.

plexus.cli.evaluation.evaluations.lookup_data_source(client: PlexusDashboardClient, name: str | None = None, key: str | None = None, id: str | None = None) → dict

Look up a DataSource by name, key, or ID

plexus.cli.evaluation.evaluations.resolve_cloud_dataset_sample_limit(*, number_of_samples: int | None, number_of_samples_explicit: bool) → int | None

Determine dataset-backed sample cap.

For cloud/associated datasets, the default CLI sample size should not silently cap the dataset. A cap is applied only when the operator explicitly sets --number-of-samples.

plexus.cli.evaluation.evaluations.resolve_primary_score_id_for_accuracy(client: PlexusDashboardClient, scorecard_identifier: str, score_identifier: str, use_yaml: bool, specific_version: str | None) → str
plexus.cli.evaluation.evaluations.resolve_score_external_id_to_uuid(client: PlexusDashboardClient, external_id: str, scorecard_id: str = None) → str

Resolve a score external ID to its DynamoDB UUID using GraphQL API.

Args:

client: PlexusDashboardClient instance
external_id: The external ID to resolve (e.g., "45925")
scorecard_id: Optional scorecard ID to narrow the search

Returns:

str: DynamoDB UUID for the score, or None if not found

plexus.cli.evaluation.evaluations.score_text_wrapper(scorecard_instance, text, score_name, scorecard_name=None, executor=None)

Wrapper to handle the scoring of text with proper error handling and logging.

This function is called from within an async context (_run_accuracy), so we expect an event loop to be running. We use ThreadPoolExecutor to run the async score_entire_text method in a separate thread to avoid nested loop issues.
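The run-async-in-a-worker-thread pattern the docstring describes can be sketched as follows; this is the general technique only, with a stand-in coroutine in place of the real `score_entire_text` and none of the wrapper's error handling or logging.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Run a coroutine from code that is already inside a running event loop
# by handing it to a worker thread that starts its own loop.
def run_coroutine_in_thread(coro_factory):
    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(asyncio.run, coro_factory())
        return future.result()

async def caller():
    async def score_entire_text():  # stand-in for the real async scorer
        return {"value": "Yes"}
    # Calling asyncio.run() directly here would raise RuntimeError because
    # a loop is already running; the worker thread avoids the nested loop.
    return run_coroutine_in_thread(score_entire_text)
```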

plexus.cli.evaluation.evaluations.truncate_dict_strings(d, max_length=100)

Recursively truncate long string values in a dictionary.
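A plausible implementation of the recursive truncation is sketched below; the real helper may differ in how it marks truncated values or whether it recurses into lists.

```python
# Recursively shorten long strings in nested dicts/lists; the "..."
# truncation marker and list handling here are assumptions.
def truncate_dict_strings_sketch(d, max_length=100):
    if isinstance(d, dict):
        return {k: truncate_dict_strings_sketch(v, max_length) for k, v in d.items()}
    if isinstance(d, list):
        return [truncate_dict_strings_sketch(v, max_length) for v in d]
    if isinstance(d, str) and len(d) > max_length:
        return d[:max_length] + "..."
    return d
```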

plexus.cli.evaluation.evaluations.validate_dataset_materialization(dataset: dict) → Dict[str, Any]

Validate dataset-backed accuracy readiness from canonical DataSet.file.