plexus.data.AWSDataLakeCache module

class plexus.data.AWSDataLakeCache.AWSDataLakeCache(**parameters)

Bases: DataCache

A class to handle caching and retrieval of data from AWS Athena and S3.

Initialize the DataCache instance with the given parameters.

Parameters

**parameters : dict

Arbitrary keyword arguments that are used to initialize the Parameters instance.

Raises

ValidationError

If the provided parameters do not pass validation.

class Parameters(*, class_name: str = 'DataCache', local_cache_directory: str = './.plexus_training_data_cache/')

Bases: Parameters

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

local_cache_directory: str
model_config: ClassVar[ConfigDict] = {}

Configuration for the model. It should be a dictionary conforming to pydantic.config.ConfigDict.

__init__(**parameters)

Initialize the DataCache instance with the given parameters.

Parameters

**parameters : dict

Arbitrary keyword arguments that are used to initialize the Parameters instance.

Raises

ValidationError

If the provided parameters do not pass validation.

download_content_item(scorecard_id, content_id)
execute_athena_query(query)
execute_batch_athena_queries(metadata_item, values, scorecard_id)
get_query_results(query_execution_id)
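
The query methods above are listed without descriptions. As a rough illustration only, the usual Athena lifecycle (start a query, poll its state, fetch results by execution ID) might be sketched as follows. The helper name run_athena_query, its parameters, and the use of a boto3-style Athena client are assumptions made for this example and are not taken from the plexus source.

```python
import time
from typing import Any, Dict


def run_athena_query(client: Any, query: str, database: str, output_location: str,
                     poll_seconds: float = 1.0, max_polls: int = 60) -> Dict[str, Any]:
    """Start a query, poll until it finishes, and return the first page of results.

    `client` is assumed to expose the boto3 Athena client methods
    start_query_execution, get_query_execution and get_query_results.
    """
    started = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_execution_id = started["QueryExecutionId"]

    for _ in range(max_polls):
        execution = client.get_query_execution(QueryExecutionId=query_execution_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            # Only the first page is returned here; larger result sets
            # need to be paginated with NextToken.
            return client.get_query_results(QueryExecutionId=query_execution_id)
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(
                f"Athena query {query_execution_id} ended in state {state}"
            )
        time.sleep(poll_seconds)

    raise TimeoutError(
        f"Athena query {query_execution_id} did not finish after {max_polls} polls"
    )
```

With boto3 installed, the client would typically be created with boto3.client("athena"). Passing the client in as an argument keeps the sketch testable with a stub.
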
load_dataframe(*, data, fresh=False)

Load a dataframe based on the provided parameters.

Returns

pd.DataFrame

The loaded dataframe.

This method must be implemented by all subclasses.
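
The page does not describe what the fresh flag does. A plausible reading, given the local_cache_directory parameter, is that fresh=True bypasses a local cache and reloads from the source. The sketch below illustrates that pattern under this assumption. The function load_with_cache, the loader callback, and the hash-based file naming are hypothetical and are not the plexus implementation.

```python
import hashlib
import json
import pickle
from pathlib import Path
from typing import Any, Callable, Dict


def load_with_cache(data: Dict[str, Any],
                    loader: Callable[[Dict[str, Any]], Any],
                    cache_directory: str = "./.plexus_training_data_cache/",
                    fresh: bool = False) -> Any:
    """Return a cached result for `data`, or call `loader` and cache its result."""
    # Derive a stable file name from the request parameters (assumed keying scheme).
    serialized = json.dumps(data, sort_keys=True).encode("utf-8")
    key = hashlib.sha256(serialized).hexdigest()
    cache_path = Path(cache_directory) / f"{key}.pkl"

    # Reuse the cached copy unless the caller explicitly asks for fresh data.
    if cache_path.exists() and not fresh:
        with cache_path.open("rb") as handle:
            return pickle.load(handle)

    # Cache miss or fresh=True: load from the source and overwrite the cache.
    result = loader(data)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as handle:
        pickle.dump(result, handle)
    return result
```

The loader here can return any picklable object, which includes a pandas DataFrame. The sketch avoids importing pandas so that it stays standard-library only.
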

process_content_item(scorecard_id, content_id)
split_into_batches(form_ids, batch_size=2000)
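
The signature of split_into_batches suggests that a list of form IDs is chunked into groups of at most batch_size items, presumably so that each batch fits into one query. A minimal standalone sketch of that behaviour, written as a free function rather than a method and assuming simple consecutive slicing, could look like this:

```python
from typing import List, Sequence


def split_into_batches(form_ids: Sequence[int], batch_size: int = 2000) -> List[List[int]]:
    """Split form_ids into consecutive batches of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    return [
        list(form_ids[start:start + batch_size])
        for start in range(0, len(form_ids), batch_size)
    ]


batches = split_into_batches(list(range(4500)))
print([len(batch) for batch in batches])  # [2000, 2000, 500]
```

The final batch holds whatever remains, and an empty input yields an empty list of batches.
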