aws

Tools to interact with elements of Amazon Web Services (AWS).

Specifically, data scientists tend to interact heavily with S3 object storage.

class S3(location='eu-west-1', api_version=None, use_ssl=True, verify=True, endpoint_url=None, aws_account_id=None, aws_access_key_id=None, aws_secret_access_key=None, aws_session_token=None, **kwargs)

Bases: ArgRepr

Wraps an S3 client for delayed instantiation and config encapsulation.

For more convenient function chaining in functional compositions, instances are also callable and simply return a new client when called.

Parameters:
  • location (str, optional) – The name of the region associated with the client. See the list of available locations (https://docs.aws.amazon.com/global-infrastructure/latest/regions/aws-regions.html) for choices. Defaults to eu-west-1 (Ireland).

  • api_version (str, optional) – The API version to use. By default, botocore will use the latest API version when creating a client. You only need to specify this parameter if you want to use a previous API version of the client. Defaults to None.

  • use_ssl (bool, optional) – Whether to use SSL. Defaults to True.

  • verify (bool or str, optional) – Whether to verify SSL certificates. Can be set to True, a path/to/cert/bundle.pem, or False. Defaults to True.

  • endpoint_url (str, optional) – The complete URL to use for the constructed client. Normally, botocore will automatically construct the appropriate URL to use when communicating with a service. You can specify a complete URL (including the “http/https” scheme) to override this behavior. If this value is provided, then use_ssl is ignored.

  • aws_account_id (str, optional) – The account id to use when creating the client. Defaults to None.

  • aws_access_key_id (str, optional) – The access key to use when creating the client. Defaults to None.

  • aws_secret_access_key (str, optional) – The secret key to use when creating the client. Defaults to None.

  • aws_session_token (str, optional) – The session token to use for the client. Defaults to None.

  • **kwargs – Additional parameters for the client. See the documentation of context and config for options. Defaults to None.

Raises:

AttributeError – If location is not a string.

__call__(*_, **__)

New S3 client on every call, ignoring any (keyword) arguments.

property client

New S3 client on first request, cached for subsequent requests.
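The difference between calling an instance and accessing its client property can be illustrated with a minimal, library-independent sketch. The LazyClient class and its factory argument are hypothetical stand-ins and not the actual implementation. In practice, the factory would be a real client constructor rather than dict.

```python
from functools import cached_property
from typing import Any, Callable


class LazyClient:
    """Hypothetical stand-in mimicking the documented behaviour of S3."""

    def __init__(self, factory: Callable[..., Any], **config: Any) -> None:
        # Only store the configuration; no client is created yet.
        self.factory = factory
        self.config = config

    def __call__(self, *_: Any, **__: Any) -> Any:
        # A new client on every call, ignoring any arguments.
        return self.factory(**self.config)

    @cached_property
    def client(self) -> Any:
        # Created on first access, then cached on the instance.
        return self.factory(**self.config)


# A plain dict factory stands in for a real client constructor here.
lazy = LazyClient(dict, region_name="eu-west-1")
```

Calling lazy() twice yields two distinct objects, whereas lazy.client always refers to the same one.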

class S3Bucket(s3, bucket, location='eu-west-1', exists_ok=False, age=None)

Bases: ArgRepr

Create/retrieve a bucket on/from S3-compatible object storage.

Parameters:
  • s3 (S3) – An instance of a wrapped S3 client.

  • bucket (str) – The (unique!) name of the bucket to create. May include any number of string placeholders (i.e., pairs of curly brackets) that will be interpolated when instances are called.

  • location (str, optional) – The physical datacenter location to create the bucket in. See the AWS documentation for options. Defaults to “eu-west-1”.

  • exists_ok (bool, optional) – Whether to quietly return the requested bucket if it already exists or to raise an exception. Defaults to False.

  • age (int, optional) – If set, objects older than the specified number of days will be automatically deleted. Defaults to None.

Raises:
  • AttributeError – If bucket or location is not a string.

  • TypeError – If age cannot be cast to an integer.

  • ValueError – If age is less than one.

See also

S3

__call__(*parts)

Create/retrieve an S3 bucket with the cached options.

Parameters:

*parts (str) – Fragments that will be interpolated into the bucket given at instantiation. Obviously, there must be at least as many as there are placeholders in the bucket.

Returns:

  • str – The name of the existing or newly created bucket.

  • bool – True if the requested bucket is newly created and False if an existing bucket is returned.

Raises:

S3Error – If exists_ok is set to False but the bucket already exists.
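How placeholder interpolation and the two return values fit together can be sketched without any real object storage. The sketch assumes that placeholders are filled positionally with str.format, which would explain why surplus fragments are harmless while missing ones are an error. The get_or_create function, the set used as a registry, and the FileExistsError are hypothetical stand-ins.

```python
def get_or_create(
    existing: set[str],
    template: str,
    *parts: str,
    exists_ok: bool = False,
) -> tuple[str, bool]:
    """Return the interpolated bucket name and whether it is new."""
    # Assumption: placeholders are filled positionally via str.format.
    name = template.format(*parts)
    if name in existing:
        if not exists_ok:
            # Stand-in for the S3Error mentioned in the documentation.
            raise FileExistsError(name)
        return name, False
    existing.add(name)
    return name, True


buckets: set[str] = set()  # hypothetical registry instead of real storage
first = get_or_create(buckets, "{}-data-{}", "acme", "2024")
again = get_or_create(buckets, "{}-data-{}", "acme", "2024", exists_ok=True)
```

With str.format, passing fewer fragments than there are placeholders raises an IndexError, while additional fragments are silently ignored.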

property config

Minimal configuration used for bucket creation (if required).

property lifecycle

Minimal configuration for adding a life-cycle rule (if required).
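What such minimal configurations might look like is sketched below. The dictionary shapes follow the parameters that the create_bucket and put_bucket_lifecycle_configuration methods of a boto3 S3 client accept, but whether the two properties return exactly these structures is an assumption. The function names and the rule ID are hypothetical, as is the choice to report every failed cast of age as a TypeError.

```python
from typing import Any


def validate_age(age: Any) -> int:
    """Cast age to a positive number of days."""
    try:
        days = int(age)
    except (TypeError, ValueError) as error:
        # Assumption: any failed cast is reported as a TypeError.
        raise TypeError(f"Cannot cast {age!r} to an integer!") from error
    if days < 1:
        raise ValueError("The age must be at least one day!")
    return days


def bucket_config(location: str) -> dict[str, Any]:
    """Keyword arguments shaped for the create_bucket method."""
    return {"CreateBucketConfiguration": {"LocationConstraint": location}}


def lifecycle_config(days: int) -> dict[str, Any]:
    """Rule shaped for the put_bucket_lifecycle_configuration method."""
    return {
        "Rules": [{
            "ID": "expire-old-objects",  # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # empty prefix matches all objects
            "Expiration": {"Days": days},
        }]
    }


rule = lifecycle_config(validate_age("30"))["Rules"][0]
```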

class S3Parquet2DataFrame(s3, bucket, prefix='', bear=Bears.PANDAS, get_kws=None, **kwargs)

Bases: ArgRepr, Generic

Load a single parquet file from S3 object storage into a data frame.

Type-annotate classes on instantiation with either a pandas or a polars dataframe so that static type checkers can infer the return type of the callable instances!
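The mechanism behind this advice can be sketched with a small generic class. The Loader class below is a hypothetical stand-in; with the real class, one would presumably subscript it with the pandas or polars dataframe type in the same position.

```python
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class Loader(Generic[T]):
    """Hypothetical stand-in for a generic, callable loader class."""

    def __init__(self, read: Callable[[str], T]) -> None:
        self.read = read

    def __call__(self, path: str = "") -> T:
        return self.read(path)


# Subscripting the class on instantiation tells a static type checker
# that calling the instance returns a list of strings.
load = Loader[list[str]](lambda path: path.split("/"))
parts = load("year/month/file.parquet")
```

At runtime, the subscript has no effect on behaviour; it only informs static analysis.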

Parameters:
  • s3 (S3) – An instance of a wrapped S3 client.

  • bucket (str) – The name of the bucket to download from.

  • prefix (str, optional) – The prefix of the parquet file to download. Since it (or part of it) can also be provided later, when the callable instance is called, it is optional here. Defaults to an empty string.

  • bear (str, optional) – Type of dataframe to return. Can be one of “pandas” or “polars”. Use the Bears enum to avoid typos. Defaults to “pandas”.

  • get_kws (dict, optional) – Keyword arguments (in addition to Bucket and Key) to pass to the get_object method of the client. Defaults to None.

  • **kwargs – Additional keyword arguments are passed on to the top-level read_parquet function of either pandas or polars.

Raises:

AttributeError – If bucket, prefix, or bear are not, in fact, strings.

See also

S3, Bears

__call__(path='')

Download a single parquet file from S3 object storage.

Parameters:

path – The path to the parquet file to load. If given here, it will be appended to the prefix given at instantiation time. Defaults to an empty string.

Returns:

A pandas or polars dataframe, depending on bear.

Return type:

DataFrame
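The download described above can be sketched as follows. FakeClient is a hypothetical in-memory stand-in whose get_object method mirrors the boto3 call of the same name in that it returns the payload under the "Body" key. That prefix and path are joined by plain string concatenation is an assumption, and the reader argument takes the place of the read_parquet function of pandas or polars, both of which accept file-like objects.

```python
import io
from typing import Any, Callable, Optional


class FakeClient:
    """Hypothetical in-memory stand-in for an S3 client."""

    def __init__(self, blobs: dict[tuple[str, str], bytes]) -> None:
        self.blobs = blobs

    def get_object(self, Bucket: str, Key: str, **kwargs: Any) -> dict[str, Any]:
        # Return the payload as a readable "Body", like boto3 does.
        return {"Body": io.BytesIO(self.blobs[Bucket, Key])}


def load(
    client: Any,
    bucket: str,
    prefix: str,
    path: str,
    reader: Callable[..., Any],
    get_kws: Optional[dict[str, Any]] = None,
    **kwargs: Any,
) -> Any:
    key = prefix + path  # assumption: plain string concatenation
    response = client.get_object(Bucket=bucket, Key=key, **(get_kws or {}))
    # Buffer the whole object in memory before handing it to the reader.
    buffer = io.BytesIO(response["Body"].read())
    return reader(buffer, **kwargs)


client = FakeClient({("my-bucket", "data/file.parquet"): b"payload"})
# The lambda merely returns the raw bytes instead of parsing parquet.
result = load(client, "my-bucket", "data/", "file.parquet", lambda buf: buf.read())
```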

property read_parquet

Top-level read_parquet function of either pandas or polars.

class DataFrame2S3Parquet(s3, bucket, prefix='', overwrite=False, skip=False, extra_kws=None, upload_kws=None, **kwargs)

Bases: ArgRepr

Upload a pandas or polars dataframe to an S3 bucket.

Parameters:
  • s3 (S3) – An instance of a wrapped S3 client.

  • bucket (str) – The name of the bucket to upload to.

  • prefix (str, optional) – The prefix of the parquet file to upload the dataframe to. May include any number of string placeholders (i.e., pairs of curly brackets) that will be interpolated when instances are called. Defaults to an empty string.

  • overwrite (bool, optional) – Whether to silently overwrite the destination blob on S3. Defaults to False, which will raise an exception if it already exists.

  • skip (bool, optional) – Whether to silently do nothing if the destination blob on S3 already exists. Defaults to False.

  • extra_kws (dict, optional) – Passed on as ExtraArgs to the upload_fileobj method of the client. See the docs for all options.

  • upload_kws (dict, optional) – Passed on as Config to the upload_fileobj method of the client. See the docs for all options.

  • **kwargs – Additional keyword arguments are passed on to the to_parquet or write_parquet method of the dataframe.

Raises:

AttributeError – If either bucket or prefix are not, in fact, strings.

See also

S3

__call__(df, *parts)

Write a pandas or polars dataframe to S3 object storage.

Parameters:
  • df (DataFrame) – The pandas or polars dataframe to upload.

  • *parts (str) – Fragments that will be interpolated into the prefix given at instantiation. Obviously, there must be at least as many as there are placeholders in the prefix.

Returns:

An empty tuple.

Return type:

tuple
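The interplay of the overwrite and skip flags can be sketched with an in-memory stand-in for the client. The method names head_object and upload_fileobj exist on a boto3 S3 client, but FakeClient, the exists and upload helpers, the KeyError, and the FileExistsError are hypothetical. The sketch also assumes that skip takes precedence over overwrite when both are set, which the documentation does not specify, and it uploads raw bytes in place of a serialized dataframe.

```python
import io
from typing import Any


class FakeClient:
    """Hypothetical in-memory stand-in for an S3 client."""

    def __init__(self) -> None:
        self.blobs: dict[tuple[str, str], bytes] = {}

    def head_object(self, Bucket: str, Key: str) -> dict[str, Any]:
        if (Bucket, Key) not in self.blobs:
            raise KeyError(Key)  # a real client raises a ClientError instead
        return {}

    def upload_fileobj(self, Fileobj, Bucket, Key,
                       ExtraArgs=None, Callback=None, Config=None) -> None:
        self.blobs[Bucket, Key] = Fileobj.read()


def exists(client: Any, bucket: str, key: str) -> bool:
    try:
        client.head_object(Bucket=bucket, Key=key)
    except KeyError:
        return False
    return True


def upload(client: Any, payload: bytes, bucket: str, prefix: str,
           *parts: str, overwrite: bool = False, skip: bool = False) -> tuple:
    key = prefix.format(*parts)  # interpolate fragments into the prefix
    if exists(client, bucket, key):
        if skip:  # assumption: skip wins over overwrite
            return ()
        if not overwrite:
            raise FileExistsError(key)
    client.upload_fileobj(io.BytesIO(payload), bucket, key)
    return ()


client = FakeClient()
outcome = upload(client, b"v1", "my-bucket", "data/{}.parquet", "2024")
```

Returning an empty tuple rather than None presumably lets the callable slot into functional compositions where the result is unpacked into the next step.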