plexus.cli.shared.evaluation_runner module

plexus.cli.shared.evaluation_runner.create_tracker_and_evaluation(*, client: PlexusDashboardClient, account_id: str, scorecard_name: str, number_of_samples: int, sampling_method: str = 'random', score_name: str | None = None) → Tuple[TaskProgressTracker, Evaluation]

Create a TaskProgressTracker and an Evaluation record for an accuracy evaluation.

Mirrors the CLI behavior so other callers (e.g., MCP server tools) don't have to duplicate that logic. Returns the tracker (with its Task already created and claimed) and the Evaluation record.
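
A minimal usage sketch, assuming an already-configured PlexusDashboardClient and a known account ID (the client helper, account ID, and scorecard name below are hypothetical placeholders):

    from plexus.cli.shared.evaluation_runner import create_tracker_and_evaluation

    client = get_dashboard_client()   # hypothetical helper; substitute your own client setup
    account_id = "acct-123"           # hypothetical account ID

    # All arguments are keyword-only, per the signature above.
    tracker, evaluation = create_tracker_and_evaluation(
        client=client,
        account_id=account_id,
        scorecard_name="My Scorecard",
        number_of_samples=25,
        sampling_method="random",      # default
        score_name=None,               # optional (default None)
    )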

async plexus.cli.shared.evaluation_runner.run_accuracy_evaluation(*, scorecard_name: str, score_name: str | None = None, number_of_samples: int = 10, sampling_method: str = 'random', client: PlexusDashboardClient | None = None, account_id: str | None = None, fresh: bool = True, reload: bool = False, use_yaml: bool = True) → dict

Run a complete accuracy evaluation using the same logic as the CLI.

This is the shared entry point that both the CLI and the MCP server should use to avoid code duplication.
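
A minimal sketch of driving the coroutine from a script. It omits client and account_id, on the inference (from their None defaults) that the function can resolve them itself; the scorecard name is a placeholder:

    import asyncio

    from plexus.cli.shared.evaluation_runner import run_accuracy_evaluation

    async def main() -> None:
        result = await run_accuracy_evaluation(
            scorecard_name="My Scorecard",   # placeholder
            number_of_samples=10,            # default
            sampling_method="random",        # default
            fresh=True,                      # default
        )
        print(result)  # a dict describing the evaluation outcome

    asyncio.run(main())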