plexus.cli.evaluation.evaluations module

plexus.cli.evaluation.evaluations.assert_dataset_materialized_for_accuracy(dataset: dict) → Dict[str, Any]

Fail fast when a dataset-backed accuracy run points to a non-materialized dataset.

plexus.cli.evaluation.evaluations.build_dataset_materialization_failure_message(*, dataset_id: str, reason: str, dataset_file: str | None, next_step_hint: str) → str
plexus.cli.evaluation.evaluations.check_dict_serializability(d, path='')
plexus.cli.evaluation.evaluations.create_client() → PlexusDashboardClient

Create a client and log its configuration

plexus.cli.evaluation.evaluations.format_confusion_matrix_summary(final_metrics)

Format confusion matrix and detailed metrics for the evaluation summary.
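The shape of `final_metrics` is defined by the evaluation pipeline and is not documented here; a minimal sketch of this kind of summary formatting, assuming a hypothetical `{"confusion_matrix": {"labels": [...], "matrix": [[...]]}, "accuracy": ...}` structure, might look like:

```python
# Illustrative sketch only: the real final_metrics structure is defined by
# the Plexus evaluation pipeline; the keys below are assumptions.
def format_confusion_matrix_summary_sketch(final_metrics: dict) -> str:
    matrix = final_metrics["confusion_matrix"]
    labels = matrix["labels"]
    rows = matrix["matrix"]
    lines = ["Confusion matrix (rows = actual, columns = predicted):"]
    lines.append("            " + "  ".join(f"{label:>10}" for label in labels))
    for label, row in zip(labels, rows):
        lines.append(f"{label:>10}  " + "  ".join(f"{n:>10}" for n in row))
    accuracy = final_metrics.get("accuracy")
    if accuracy is not None:
        lines.append(f"Accuracy: {accuracy:.1%}")
    return "\n".join(lines)
```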

plexus.cli.evaluation.evaluations.get_amplify_bucket()

Get the S3 bucket name from environment variables or fall back to reading amplify_outputs.json.

plexus.cli.evaluation.evaluations.get_csv_samples(csv_filename)
plexus.cli.evaluation.evaluations.get_data_driven_samples(scorecard_instance, scorecard_name, score_name, score_config, fresh, reload, content_ids_to_sample_set, progress_callback=None, number_of_samples=None, random_seed=None)
plexus.cli.evaluation.evaluations.get_dataset_by_id(client: PlexusDashboardClient, dataset_id: str) → dict

Get a specific DataSet by ID

plexus.cli.evaluation.evaluations.get_latest_accuracy_evaluation_for_score_since(client: PlexusDashboardClient, score_id: str, created_after_iso: str) → dict | None
plexus.cli.evaluation.evaluations.get_latest_associated_dataset_for_score(client: PlexusDashboardClient, score_id: str) → dict
plexus.cli.evaluation.evaluations.get_latest_dataset_for_data_source(client: PlexusDashboardClient, data_source_id: str) → dict

Get the most recent DataSet for a DataSource by finding its current version

plexus.cli.evaluation.evaluations.get_latest_score_version(client, score_id: str) → str | None

Get the most recent ScoreVersion ID for a given score using the scoreId index sorted by createdAt.

Args:

client: GraphQL API client
score_id: The score ID to get the latest version for

Returns:

The latest ScoreVersion ID, or None if no versions found

plexus.cli.evaluation.evaluations.is_json_serializable(obj)
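`is_json_serializable` and `check_dict_serializability` (documented above) are commonly implemented along these lines; this is a hedged sketch, not the actual Plexus implementation, and the list-of-bad-paths return shape is an assumption.

```python
import json

# Probe serializability by attempting a real json.dumps call.
def is_json_serializable_sketch(obj) -> bool:
    try:
        json.dumps(obj)
        return True
    except (TypeError, ValueError):
        return False

# Walk a nested dict, collecting dotted paths of unserializable values.
def check_dict_serializability_sketch(d: dict, path: str = "") -> list[str]:
    bad = []
    for key, value in d.items():
        key_path = f"{path}.{key}" if path else str(key)
        if isinstance(value, dict):
            bad.extend(check_dict_serializability_sketch(value, key_path))
        elif not is_json_serializable_sketch(value):
            bad.append(key_path)
    return bad
```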
plexus.cli.evaluation.evaluations.list_associated_datasets_for_score(client: PlexusDashboardClient, score_id: str, limit: int = 200) → list[dict]

List datasets associated with a score, ordered newest-first by createdAt, then id.
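The newest-first ordering can be sketched as a sort on (createdAt, id) descending; the dict keys below assume the GraphQL field names used elsewhere in this module, and ISO-8601 timestamps sort correctly as strings.

```python
# Sort datasets newest-first by createdAt, breaking ties by id descending.
# ISO-8601 createdAt strings compare correctly lexicographically.
def sort_datasets_newest_first(datasets: list[dict]) -> list[dict]:
    return sorted(
        datasets,
        key=lambda d: (d.get("createdAt", ""), d.get("id", "")),
        reverse=True,
    )
```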

plexus.cli.evaluation.evaluations.load_configuration_from_yaml_file(configuration_file_path)

Load configuration from a YAML file.

plexus.cli.evaluation.evaluations.load_samples_from_cloud_dataset(dataset: dict, score_name: str, score_config: dict, number_of_samples: int | None = None, random_seed: int | None = None, progress_callback=None) → list

Load samples from a cloud dataset (Parquet file) and convert to evaluation format

plexus.cli.evaluation.evaluations.load_scorecard_from_api(scorecard_identifier: str, score_names=None, use_cache=False, specific_version=None)

Load a scorecard from the Plexus Dashboard API.

Args:

scorecard_identifier: A string that can identify the scorecard (id, key, name, etc.)
score_names: Optional list of specific score names to load
use_cache: Whether to prefer local cache files over API (default: False).
    When False, always fetch from API but still write cache files.
    When True, check local cache first and only fetch missing configs.

specific_version: Optional specific score version ID to use instead of champion version

Returns:

Scorecard: An initialized Scorecard instance with required scores loaded

Raises:

ValueError: If the scorecard cannot be found
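The `use_cache` contract above can be sketched with stand-in callables for the cache and the API (all four callables here are hypothetical, not Plexus internals):

```python
# Sketch of the documented use_cache behavior. read_cache, fetch_from_api,
# and write_cache are hypothetical stand-ins for the real internals.
def fetch_score_config(score_name, use_cache, read_cache, fetch_from_api, write_cache):
    if use_cache:
        cached = read_cache(score_name)
        if cached is not None:
            return cached  # cache hit: skip the API entirely
    config = fetch_from_api(score_name)  # cache miss, or use_cache=False
    write_cache(score_name, config)      # cache files are written either way
    return config
```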

plexus.cli.evaluation.evaluations.load_scorecard_from_yaml_files(scorecard_identifier: str, score_names=None, specific_version=None)

Load a scorecard from individual YAML configuration files saved by fetch_score_configurations.

Args:

scorecard_identifier: A string that identifies the scorecard (ID, name, key, or external ID)
score_names: Optional list of specific score names to load
specific_version: Optional specific score version ID (Note: YAML files contain champion versions only)

Returns:

Scorecard: An initialized Scorecard instance with required scores loaded from YAML files

Raises:

ValueError: If the scorecard cannot be constructed from YAML files

plexus.cli.evaluation.evaluations.log_scorecard_configurations(scorecard_instance, context='')

Log the actual configurations being used by the scorecard instance.

plexus.cli.evaluation.evaluations.lookup_data_source(client: PlexusDashboardClient, name: str | None = None, key: str | None = None, id: str | None = None) → dict

Look up a DataSource by name, key, or ID

plexus.cli.evaluation.evaluations.resolve_cloud_dataset_sample_limit(*, number_of_samples: int | None, number_of_samples_explicit: bool) → int | None

Determine dataset-backed sample cap.

For cloud/associated datasets, the default CLI sample size should not silently cap the dataset. A cap is applied only when the operator explicitly sets --number-of-samples.

plexus.cli.evaluation.evaluations.resolve_primary_score_id_for_accuracy(client: PlexusDashboardClient, scorecard_identifier: str, score_identifier: str, use_yaml: bool, specific_version: str | None) → str
plexus.cli.evaluation.evaluations.resolve_score_external_id_to_uuid(client: PlexusDashboardClient, external_id: str, scorecard_id: str = None) → str

Resolve a score external ID to its DynamoDB UUID using GraphQL API.

Args:

client: PlexusDashboardClient instance
external_id: The external ID to resolve (e.g., "45925")
scorecard_id: Optional scorecard ID to narrow the search

Returns:

str: DynamoDB UUID for the score, or None if not found

plexus.cli.evaluation.evaluations.score_text_wrapper(scorecard_instance, text, score_name, scorecard_name=None, executor=None)

Wrapper to handle the scoring of text with proper error handling and logging.

This function is called from within an async context (_run_accuracy), so we expect an event loop to be running. We use ThreadPoolExecutor to run the async score_entire_text method in a separate thread to avoid nested loop issues.
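The run-async-in-a-worker-thread pattern the docstring describes can be sketched as follows; this is the general technique only, with a stand-in coroutine in place of the real `score_entire_text` and none of the wrapper's error handling or logging.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Run a coroutine from code that is already inside a running event loop
# by handing it to a worker thread that starts its own loop.
def run_coroutine_in_thread(coro_factory):
    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(asyncio.run, coro_factory())
        return future.result()

async def caller():
    async def score_entire_text():  # stand-in for the real async scorer
        return {"value": "Yes"}
    # Calling asyncio.run() directly here would raise RuntimeError because
    # a loop is already running; the worker thread avoids the nested loop.
    return run_coroutine_in_thread(score_entire_text)
```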

plexus.cli.evaluation.evaluations.truncate_dict_strings(d, max_length=100)

Recursively truncate long string values in a dictionary.
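A plausible implementation of the recursive truncation is sketched below; the real helper may differ in how it marks truncated values or whether it recurses into lists.

```python
# Recursively shorten long strings in nested dicts/lists; the "..."
# truncation marker and list handling here are assumptions.
def truncate_dict_strings_sketch(d, max_length=100):
    if isinstance(d, dict):
        return {k: truncate_dict_strings_sketch(v, max_length) for k, v in d.items()}
    if isinstance(d, list):
        return [truncate_dict_strings_sketch(v, max_length) for v in d]
    if isinstance(d, str) and len(d) > max_length:
        return d[:max_length] + "..."
    return d
```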

plexus.cli.evaluation.evaluations.validate_dataset_materialization(dataset: dict) → Dict[str, Any]

Validate dataset-backed accuracy readiness from canonical DataSet.file.