plexus.data.AWSDataLakeCache module
- class plexus.data.AWSDataLakeCache.AWSDataLakeCache(**parameters)
Bases: DataCache
A class to handle caching and retrieval of data from AWS Athena and S3.
Initialize the DataCache instance with the given parameters.
Parameters
- **parameters : dict
Arbitrary keyword arguments that are used to initialize the Parameters instance.
Raises
- ValidationError
If the provided parameters do not pass validation.
- class Parameters(*, class_name: str = 'DataCache', local_cache_directory: str = './.plexus_training_data_cache/')
Bases: Parameters
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- local_cache_directory: str
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- __init__(**parameters)
Initialize the DataCache instance with the given parameters.
Parameters
- **parameters : dict
Arbitrary keyword arguments that are used to initialize the Parameters instance.
Raises
- ValidationError
If the provided parameters do not pass validation.
- download_content_item(scorecard_id, content_id)
- execute_athena_query(query)
- execute_batch_athena_queries(metadata_item, values, scorecard_id)
- get_query_results(query_execution_id)
- load_dataframe(*, data, fresh=False)
Load a dataframe based on the provided parameters.
Returns
- pd.DataFrame
The loaded dataframe.
This method must be implemented by all subclasses.
- process_content_item(scorecard_id, content_id)
- split_into_batches(form_ids, batch_size=2000)
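The reference above gives only the signature of split_into_batches, not its behavior. The following is a minimal sketch of what such a helper might do, assuming it divides form_ids into consecutive chunks of at most batch_size items. The function name split_into_batches_sketch and the ValueError check are hypothetical and are not taken from the plexus source.

```python
from typing import List, Sequence


def split_into_batches_sketch(form_ids: Sequence[int], batch_size: int = 2000) -> List[List[int]]:
    """Hypothetical stand-in: chunk form_ids into consecutive lists of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    ids = list(form_ids)
    # Step through the list in strides of batch_size; the final slice may be shorter.
    return [ids[start:start + batch_size] for start in range(0, len(ids), batch_size)]


batches = split_into_batches_sketch(range(4500))
print([len(batch) for batch in batches])  # [2000, 2000, 500]
```

Chunking of this kind is a common way to keep each query's list of IDs bounded. Whether execute_batch_athena_queries consumes the batches this way is an assumption, not something the reference states.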