aws
Tools to interact with elements of Amazon Web Services (AWS).
Specifically, data scientists tend to interact heavily with S3 object storage.
- class S3(location='eu-west-1', api_version=None, use_ssl=True, verify=True, endpoint_url=None, aws_account_id=None, aws_access_key_id=None, aws_secret_access_key=None, aws_session_token=None, **kwargs)
Bases: ArgRepr
Wraps an S3 client for delayed instantiation and config encapsulation.
For more convenient function chaining in functional compositions, instances are also callable and simply return a new client when called.
- Parameters:
location (str, optional) – The name of the region associated with the client. See the list of available locations (https://docs.aws.amazon.com/global-infrastructure/latest/regions/aws-regions.html) for choices. Defaults to eu-west-1 (Ireland).
api_version (str, optional) – The API version to use. By default, botocore will use the latest API version when creating a client. You only need to specify this parameter if you want to use a previous API version of the client. Defaults to None.
use_ssl (bool, optional) – Whether to use SSL. Defaults to True.
verify (bool or str, optional) – Whether to verify SSL certificates. Can be set to True, a path/to/cert/bundle.pem, or False. Defaults to True.
endpoint_url (str, optional) – The complete URL to use for the constructed client. Normally, botocore will automatically construct the appropriate URL to use when communicating with a service. You can specify a complete URL (including the “http/https” scheme) to override this behavior. If this value is provided, then use_ssl is ignored.
aws_account_id (str, optional) – The account id to use when creating the client. Defaults to None.
aws_access_key_id (str, optional) – The access key to use when creating the client. Defaults to None.
aws_secret_access_key (str, optional) – The secret key to use when creating the client. Defaults to None.
aws_session_token (str, optional) – The session token to use for the client. Defaults to None.
**kwargs – Additional parameters for the client. See the documentation of context and config for options. Defaults to None.
- Raises:
AttributeError – If location is not a string.
- property client
New S3 client on first request, cached for subsequent requests.
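The combination of delayed instantiation, a cached client property, and callable instances that return a new client can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation of S3: the names LazyClient and fake_factory are hypothetical, and fake_factory merely stands in for whatever actually builds the client.

```python
from functools import cached_property


class LazyClient:
    """Hypothetical stand-in for a config-holding client wrapper."""

    def __init__(self, factory, **config):
        # Only the factory and its options are stored; no client exists yet.
        self.factory = factory
        self.config = config

    def __call__(self):
        # Every call builds a fresh client from the stored options.
        return self.factory(**self.config)

    @cached_property
    def client(self):
        # Built on first access, then reused on every later access.
        return self()


def fake_factory(**config):
    # Assumption: stands in for a real client constructor.
    return dict(config)


wrapper = LazyClient(fake_factory, region_name="eu-west-1", use_ssl=True)
first = wrapper.client   # created now
second = wrapper.client  # same cached object
fresh = wrapper()        # a new, independent object
```

Keeping the configuration separate from the client itself means the wrapper can be passed around and composed with other callables before any connection is actually set up.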
- class S3Bucket(s3, bucket, location='eu-west-1', exists_ok=False, age=None)
Bases: ArgRepr
Create/retrieve a bucket on/from S3-compatible object storage.
- Parameters:
s3 (S3) – An instance of a wrapped S3 client.
bucket (str) – The (unique!) name of the bucket to create. May include any number of string placeholders (i.e., pairs of curly brackets) that will be interpolated when instances are called.
location (str, optional) – The physical datacenter location to create the bucket in. See the AWS documentation for options. Defaults to “eu-west-1”.
exists_ok (bool, optional) – Whether to quietly return the requested bucket if it already exists or to raise an exception. Defaults to False.
age (int, optional) – If set, objects older than the specified number of days will be automatically deleted. Defaults to None.
- Raises:
AttributeError – If bucket or location are not strings.
TypeError – If age cannot be cast to an integer.
ValueError – If age is less than one.
- __call__(*parts)
Create/retrieve an S3 bucket with the cached options.
- Parameters:
*parts (str) – Fragments that will be interpolated into the bucket given at instantiation. Obviously, there must be at least as many as there are placeholders in the bucket.
- Returns:
str – The name of the existing or newly created bucket.
bool – True if the requested bucket is newly created and False if an existing bucket is returned.
- Raises:
S3Error – If exists_ok is set to False but the bucket already exists.
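The call semantics described above (interpolate the parts into the bucket name, then either create the bucket or return the existing one) can be sketched as follows. This assumes that placeholders are filled with str.format, which the description suggests but does not confirm. The set of existing buckets, the function create_or_retrieve, and the exception BucketExistsError are hypothetical stand-ins, the latter for the S3Error named above.

```python
class BucketExistsError(Exception):
    """Hypothetical stand-in for the S3Error named above."""


def create_or_retrieve(existing, template, *parts, exists_ok=False):
    """Return (name, created) for a bucket name built from a template."""
    # str.format fills "{}" placeholders from left to right; surplus
    # parts are ignored, while too few parts raise an IndexError.
    name = template.format(*parts)
    if name in existing:
        if not exists_ok:
            raise BucketExistsError(name)
        return name, False
    existing.add(name)
    return name, True


buckets = {"team-raw-2024"}  # pretend these buckets already exist
new = create_or_retrieve(buckets, "{}-raw-{}", "team", "2025")
old = create_or_retrieve(buckets, "{}-raw-{}", "team", "2024", exists_ok=True)
```

Here new is the pair of a freshly created name and True, while old is the pair of an existing name and False, mirroring the documented return values.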
- property config
Minimal configuration used for bucket creation (if required).
- property lifecycle
Minimal configuration for adding a life-cycle rule (if required).
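What such minimal configurations might look like can be sketched with plain dictionaries in the shape that the boto3 S3 client accepts for create_bucket and put_bucket_lifecycle_configuration. Whether the config and lifecycle properties return exactly these structures is an assumption, and the helper names validated_age, bucket_config, and lifecycle_config as well as the rule ID are made up for illustration.

```python
def validated_age(age):
    """Cast age to a positive number of days, mirroring the documented errors."""
    if age is None:
        return None
    try:
        days = int(age)
    except (TypeError, ValueError) as error:
        raise TypeError(f"Cannot cast {age!r} to an integer!") from error
    if days < 1:
        raise ValueError("Age must be at least one day!")
    return days


def bucket_config(location):
    # Keyword arguments in the shape expected by the client's create_bucket.
    return {"CreateBucketConfiguration": {"LocationConstraint": location}}


def lifecycle_config(days):
    # One rule that expires every object (empty prefix) after the given days.
    rule = {
        "ID": f"expire-after-{days}-days",  # hypothetical rule name
        "Status": "Enabled",
        "Filter": {"Prefix": ""},
        "Expiration": {"Days": days},
    }
    return {"LifecycleConfiguration": {"Rules": [rule]}}


config = bucket_config("eu-west-1")
lifecycle = lifecycle_config(validated_age("30"))
```

With a real client, dictionaries like these would be unpacked as keyword arguments next to the Bucket name.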
- class S3Parquet2DataFrame(s3, bucket, prefix='', bear=Bears.PANDAS, get_kws=None, **kwargs)
Bases: ArgRepr, Generic
Load a single parquet file from S3 object storage into a data frame.
Type-annotate classes on instantiation with either a pandas or a polars dataframe so that static type checkers can infer the return type of the callable instances!
- Parameters:
s3 (S3) – An instance of a wrapped S3 client.
bucket (str) – The name of the bucket to download from.
prefix (str, optional) – The prefix of the parquet file to download. Since it (or part of it) can also be provided later, when the callable instance is called, it is optional here. Defaults to an empty string.
bear (str, optional) – Type of dataframe to return. Can be one of “pandas” or “polars”. Use the Bears enum to avoid typos. Defaults to “pandas”.
get_kws (dict, optional) – Keyword arguments (in addition to Bucket and Key) to pass to the get_object method of the client. Defaults to None.
**kwargs – Additional keyword arguments are passed on to the top-level read_parquet function of either pandas or polars.
- Raises:
AttributeError – If bucket, prefix, or bear are not, in fact, strings.
- __call__(path='')
Download a single parquet file from S3 object storage.
- Parameters:
path – The path to the parquet file to load. If given here, it will be appended to the prefix given at instantiation time. Defaults to an empty string.
- Returns:
A pandas or polars dataframe, depending on bear.
- Return type:
DataFrame
- property read_parquet
Top-level read_parquet function of either pandas or polars.
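The two ideas in this class, a generic type parameter that fixes the return type at instantiation and a download that hands the object body to a reader function, can be sketched without any dependencies. Everything here is illustrative: FakeClient is an in-memory stand-in for an S3 client, Loader is a hypothetical class, the reader is a plain function instead of read_parquet, and joining prefix and path by simple concatenation is an assumption.

```python
import io
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class FakeClient:
    """In-memory stand-in for an S3 client (illustration only)."""

    def __init__(self, blobs):
        self.blobs = blobs

    def get_object(self, Bucket, Key, **kwargs):
        # Like the real client, return a dict whose "Body" can be read.
        return {"Body": io.BytesIO(self.blobs[(Bucket, Key)])}


class Loader(Generic[T]):
    """Hypothetical loader whose return type is fixed by the annotation."""

    def __init__(
        self,
        client,
        bucket: str,
        prefix: str,
        reader: Callable[[io.BytesIO], T],
        get_kws: Optional[dict] = None,
    ) -> None:
        self.client = client
        self.bucket = bucket
        self.prefix = prefix
        self.reader = reader
        self.get_kws = get_kws or {}

    def __call__(self, path: str = "") -> T:
        key = self.prefix + path  # assumption: plain concatenation
        response = self.client.get_object(
            Bucket=self.bucket, Key=key, **self.get_kws
        )
        # Buffer the body in memory so that the reader can seek in it.
        return self.reader(io.BytesIO(response["Body"].read()))


client = FakeClient({("lake", "raw/day.parquet"): b"not really parquet"})
load = Loader[str](client, "lake", "raw/", lambda buffer: buffer.read().decode())
text = load("day.parquet")
```

Writing Loader[str] at instantiation lets a static type checker infer that load returns a str, which is the same mechanism the annotation advice above relies on for pandas or polars dataframes.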
- class DataFrame2S3Parquet(s3, bucket, prefix='', overwrite=False, skip=False, extra_kws=None, upload_kws=None, **kwargs)
Bases: ArgRepr
Upload a pandas or polars dataframe to an S3 bucket.
- Parameters:
s3 (S3) – An instance of a wrapped S3 client.
bucket (str) – The name of the bucket to upload to.
prefix (str, optional) – The prefix of the parquet file to upload the dataframe to. May include any number of string placeholders (i.e., pairs of curly brackets) that will be interpolated when instances are called. Defaults to an empty string.
overwrite (bool, optional) – Whether to silently overwrite the destination blob on S3. Defaults to False, which will raise an exception if it already exists.
skip (bool, optional) – Whether to silently do nothing if the destination blob on S3 already exists. Defaults to False.
extra_kws (dict, optional) – Passed on as ExtraArgs to the upload_fileobj method of the client. See the docs for all options.
upload_kws (dict, optional) – Passed on as Config to the upload_fileobj method of the client. See the docs for all options.
**kwargs – Additional keyword arguments are passed on to the to_parquet or write_parquet method of the dataframe.
- Raises:
AttributeError – If either bucket or prefix are not, in fact, strings.
- __call__(df, *parts)
Write a pandas or polars dataframe to S3 object storage.
- Parameters:
df (DataFrame) – The pandas or polars dataframe to upload.
*parts (str) – Fragments that will be interpolated into the prefix given at instantiation. Obviously, there must be at least as many as there are placeholders in the prefix.
- Returns:
An empty tuple.
- Return type:
tuple
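How overwrite and skip might interact with prefix interpolation can be sketched with an in-memory stand-in for the client. This is not the implementation of the class: FakeClient and its exists helper are hypothetical, raw bytes stand in for a serialized dataframe, FileExistsError stands in for whatever exception is actually raised, and letting skip take precedence over overwrite is an assumption.

```python
import io


class FakeClient:
    """In-memory stand-in for an S3 client (illustration only)."""

    def __init__(self):
        self.blobs = {}

    def exists(self, bucket, key):
        # Hypothetical helper; a real client has no method of this name.
        return (bucket, key) in self.blobs

    def upload_fileobj(self, Fileobj, Bucket, Key, ExtraArgs=None, Config=None):
        self.blobs[(Bucket, Key)] = Fileobj.read()


def upload(client, payload, bucket, prefix, *parts, overwrite=False, skip=False):
    """Store payload under the interpolated key and return an empty tuple."""
    key = prefix.format(*parts)
    if client.exists(bucket, key):
        if skip:
            return ()  # assumption: skip wins over overwrite
        if not overwrite:
            raise FileExistsError(key)
    client.upload_fileobj(io.BytesIO(payload), bucket, key)
    return ()


store = FakeClient()
result = upload(store, b"v1", "lake", "out/{}.parquet", "day")
```

Returning an empty tuple, as documented, makes the step easy to place at the end of a functional pipeline where no further value needs to be passed on.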