gcp

Tools to interact with elements of the Google Cloud Project (GCP).

Specifically, data scientists tend to interact mostly with Google’s BigQuery (BQ) data-warehouse solution and with Google Cloud Storage (GCS).

class Gcs(project, *args, **kwargs)[source]

Bases: ArgRepr

Wraps a Google Cloud Storage (GCS) client for delayed instantiation.

For more convenient function chaining in functional compositions, instances are also callable and simply return a new client when called.

Parameters:
  • project (str) – The name of the project the client should act on.

  • *args – Additional arguments to be passed to the client init.

  • **kwargs – Additional keyword arguments to be passed to the client init. See the reference for options.

__call__(*_, **__)[source]

New GCS client on every call, ignoring any (keyword) arguments.

property client

New GCS client on first request, cached for subsequent requests.
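
A minimal usage sketch (the project name is a placeholder and the exact import path is not part of this reference):

    # Hypothetical project name.
    gcs = Gcs('my-gcp-project')

    client = gcs.client    # first access instantiates and caches a client
    cached = gcs.client    # subsequent accesses reuse the cached client
    fresh = gcs()          # calling the instance returns a brand-new client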

class GbqDataset(project, dataset, location, name=None, description=None, table_expire_days=None, partition_expire_days=None, labels=None, access=None, case_sensitive=True, collation=None, rounding=None, max_travel_time_hours=168, billing=None, tags=None, **kwargs)[source]

Bases: object

Create a new dataset in a Google BigQuery project.

Parameters:
  • project (str) – The project to create the dataset in.

  • dataset (str) – The identifier of the dataset to create. Only letters, numbers, and underscores are permitted.

  • location (str) – The physical datacenter location to create the dataset in. See the Google Cloud Platform documentation for options.

  • name (str, optional) – A human-readable name of the dataset. Defaults to the dataset id.

  • description (str, optional) – A short description of the dataset. Defaults to None.

  • table_expire_days (int, optional) – Number of days after which tables are dropped. Defaults to None, which results in tables never being dropped.

  • partition_expire_days (int, optional) – Number of days after which partitions of partitioned tables are dropped. Defaults to None, which results in partitions never being dropped.

  • labels (dict, optional) – Any number of string-valued labels of the dataset. Defaults to None.

  • access (list of dict, optional) – Fine-grained access rights to the dataset (see the Google Cloud Platform documentation for details). If not given, default access rights are set by Google BigQuery.

  • case_sensitive (bool, optional) – Whether dataset and table names should be case-sensitive or not. Defaults to True.

  • collation (str, optional) – Default collation mode for string sorting in string columns of tables. Defaults to None, which results in case-sensitive behavior. Use the Collation enum to specify explicitly.

  • rounding (str, optional) – Default rounding mode. Defaults to None, which lets Google BigQuery choose. Use the Rounding enum to specify explicitly.

  • max_travel_time_hours (int, optional) – The duration of Google BigQuery’s “time travel” window in hours, i.e., for how long changes can be rolled back and tables can be queried “as of” some previous time. Values can be between 48 and 168 hours (2 to 7 days). Defaults to 168.

  • billing (str, optional) – Default billing mode for tables. Defaults to None, which lets Google BigQuery choose. Use the Billing enum to specify explicitly.

  • tags (dict, optional) – Associate globally defined tags with this dataset. Defaults to None, which results in no tags being associated.

  • **kwargs – Additional keyword arguments are passed to the constructor of the Google BigQuery Client (see documentation for options).

Note

Options for linked and external dataset sources, as well as for encryption configuration are deliberately omitted. You probably should not play with those without consulting your organization’s data engineers.

__call__(exists_ok=True, retry=None, timeout=None)[source]

Create a Google BigQuery dataset in a Google Cloud Platform project.

Parameters:
  • exists_ok (bool, optional) – Whether to quietly return the existing dataset (True) or to raise a Conflict exception (False) if the targeted dataset already exists. Defaults to True.

  • retry (Retry, optional) – Retry policy for the request. Defaults to None, which disables retries. See the Google Cloud Platform guide and reference for options.

  • timeout (float, optional) – The number of seconds to wait for the HTTP response to the API call before using retry, or a tuple with separate values for connection and request timeouts. Defaults to None, meaning wait forever.

Raises:

Conflict – If exists_ok is set to False and the dataset already exists.

Returns:

  • Dataset – The existing or newly created dataset. If existing, then the dataset is returned unchanged, that is, none of the specified options are applied.

  • bool – True if the requested dataset is newly created and False if an existing dataset is returned.

property api_repr

Payload for the API call to the Google Cloud Platform.

static to_ms(days)[source]

Convert integer days to millisecond string for the GCP API call.
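
A sketch of creating a dataset (project, dataset, and label names are placeholders; the two return values are unpacked as documented above):

    # Hypothetical project and dataset identifiers.
    create_staging = GbqDataset(
        project='my-gcp-project',
        dataset='staging',
        location='EU',
        description='Scratch space for intermediate tables.',
        table_expire_days=30,
        labels={'team': 'data-science'},
    )

    dataset, created = create_staging()  # existing or new dataset, plus a flag
    if not created:
        print('Dataset already existed; the specified options were not applied.')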

class GcsBucket(gcs, bucket, location='EUROPE-NORTH1', exists_ok=False, age=None, user_project=None, requester_pays=False, **kwargs)[source]

Bases: ArgRepr

Create/retrieve and configure a bucket on Google Cloud Storage (GCS).

Parameters:
  • gcs (Gcs) – An instance of a wrapped GCS client.

  • bucket (str) – The (unique!) name of the bucket to create. May include any number of string placeholders (i.e., pairs of curly brackets) that will be interpolated when instances are called.

  • location (str, optional) – The physical datacenter location to create the bucket in. See the Google Cloud Platform documentation for options. Defaults to “EUROPE-NORTH1”.

  • exists_ok (bool, optional) – Whether to quietly return the requested bucket if it already exists or to raise an exception. Defaults to False.

  • age (int, optional) – Defaults to None. If set, blobs older than the specified number of days will be automatically deleted.

  • user_project (str, optional) – The project billed for interacting with the bucket. Defaults to the project carried by the gcs client.

  • requester_pays (bool, optional) – Whether the requester will be billed for interacting with the bucket. Defaults to False, meaning that the user_project (or project) pays.

  • **kwargs – Any bucket property to set/change on the created/retrieved bucket. See the GCS docs for all options.

Raises:
  • AttributeError – If bucket or location are not strings.

  • TypeError – If age cannot be cast to an integer.

  • ValueError – If age is less than one.

See also

Gcs

__call__(*parts)[source]

Create/retrieve and configure a bucket on/from GCS.

Parameters:

*parts (str) – Fragments that will be interpolated into the bucket given at instantiation. Obviously, there must be at least as many as there are placeholders in the bucket.

Returns:

  • str – The name of the existing or newly created bucket.

  • bool – True if the requested bucket is newly created and False if an existing bucket is returned.

Raises:

GcsError – If exists_ok is set to False but the bucket already exists or if you try to set an invalid bucket property from the kwargs.

property lifecycle

Minimal configuration for adding a life-cycle rule (if required).
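
A sketch of creating a bucket whose name contains one placeholder (all names are made up for illustration):

    # The pair of curly brackets in the bucket name is interpolated on call.
    make_bucket = GcsBucket(
        gcs=Gcs('my-gcp-project'),
        bucket='my-company-{}-artifacts',
        location='EUROPE-NORTH1',
        exists_ok=True,
        age=30,  # auto-delete blobs older than 30 days
    )

    name, created = make_bucket('experiments')  # -> "my-company-experiments-artifacts"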

class GbqQuery(project, polling_interval=5, priority='BATCH', **kwargs)[source]

Bases: ArgRepr

Run a SQL query that does not return anything on Google BigQuery.

Parameters:
  • project (str) – The name of the Google billing project.

  • polling_interval (int, optional) – Job completion will be checked every polling_interval seconds. Defaults to 5 (seconds).

  • priority (str, optional) – Priority the query job should be run as. Can be either “BATCH” or “INTERACTIVE”. Defaults to “BATCH”. Use the QueryPriority enum to avoid typos.

  • **kwargs – Additional keyword arguments are passed to the constructor of the Google BigQuery Client (see documentation for options).

Note

Typical use cases would be to create, move, alter, or delete tables. As such, the possibility to route the output of the query into a destination table is not foreseen.

__call__(query, **kwargs)[source]

Run a query that does not return anything on Google BigQuery.

Parameters:
  • query (str) – The SQL query to fire using the pre-configured client.

  • **kwargs – Additional keyword arguments are passed to the constructor of the Google BigQuery QueryJobConfig. See documentation for options.

Returns:

If the job finishes without errors, an empty tuple is returned.

Return type:

tuple

Raises:

GbqError – If the QueryJob finishes and returns an error.
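
A sketch of firing a statement that returns no rows (project, dataset, and table names are placeholders; the DDL is purely illustrative):

    # Hypothetical billing project.
    run = GbqQuery('my-billing-project', polling_interval=10, priority='BATCH')
    run('CREATE TABLE IF NOT EXISTS staging.events (id INT64, payload STRING)')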

class GbqQuery2GcsParquet(project, bucket, prefix='', overwrite=False, skip=False, polling_interval=5, priority='BATCH', gbq_kws=None, gcs_kws=None)[source]

Bases: ArgRepr

Export the results of an SQL query to a bucket on Google Cloud Storage.

The SQL query will be fired against Google BigQuery. In essence, an EXPORT clause with the specified parameters interpolated is inserted into the query after the last semicolon and before everything else. As such, no semicolon-separated sub-queries are allowed, but variable declaration and setting are fine. Results will be saved as multiple, sequentially numbered, snappy-compressed parquet files.

Parameters:
  • project (str) – The name of the Google BigQuery billing project.

  • bucket (str) – The Google Cloud Storage bucket to export to. Note that, unless you have set up some different “wiring” yourself, the project of the bucket is the same as project.

  • prefix (str, optional) – Prefix of the blob location, where parquet files will reside. If none is given here, one can be provided on calling the instance. If both are given, they will be concatenated and, if neither is given, a UUID will be generated. Defaults to an empty string.

  • overwrite (bool, optional) – Blobs with the given bucket/prefix combination may already exist on Google Cloud Storage. If True these are overwritten, else an exception is raised. Defaults to False.

  • skip (bool, optional) – Blobs with the given bucket/prefix combination may already exist on Google Cloud Storage. If that is the case, and skip is True, nothing will be done at all. Defaults to False.

  • polling_interval (int, optional) – Job completion will be checked every polling_interval seconds. Defaults to 5 (seconds).

  • priority (str, optional) – Priority the query job should be run as. Can be either “BATCH” or “INTERACTIVE”. Defaults to “BATCH”. Use the QueryPriority enum to avoid typos.

  • gbq_kws (dict, optional) – Additional keyword arguments to be passed to the Google BigQuery client. Defaults to None, which results in an empty dictionary.

  • gcs_kws (dict, optional) – Additional keyword arguments to be passed to the Google Storage client. Defaults to None, which results in an empty dictionary.

__call__(query, prefix='', **kwargs)[source]

Export the results of a SQL query to Google Cloud Storage.

Parameters:
  • query (str) – The SQL query to fire using the pre-configured client.

  • prefix (str, optional) – Prefix of the blob location, where parquet files will reside. If none is given, the prefix specified at instantiation will be used. If both are given, they will be concatenated and, if neither is given, a UUID will be generated. Defaults to an empty string.

  • **kwargs – Additional keyword arguments are passed to the constructor of the Google BigQuery QueryJobConfig. See documentation for options.

Returns:

If the query finishes without errors, the given or generated prefix is returned, so that blobs with the exported data can be retrieved.

Return type:

str

Raises:

GbqError – If the submitted QueryJob finishes and returns an error.
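
A sketch of exporting query results to parquet files on a bucket (project, bucket, and prefix are placeholders):

    # Hypothetical billing project and export bucket.
    export = GbqQuery2GcsParquet(
        project='my-billing-project',
        bucket='my-export-bucket',
        prefix='exports/',
        overwrite=True,
    )

    # The returned prefix (concatenated or generated) locates the parquet blobs.
    prefix = export('SELECT id, payload FROM staging.events', 'events/')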

class GcsDir2LocalDir(project, bucket, prefix='', base_dir='/tmp', overwrite=False, skip=False, n_threads=16, chunk_size=10, **kwargs)[source]

Bases: ArgRepr

Download files from Google Cloud Storage to local directory.

Parameters:
  • project (str) – Project where the bucket and blobs reside.

  • bucket (str) – Bucket where the blobs reside.

  • prefix (str, optional) – The prefix of the blobs to download. Since it (or part of it) can also be provided later, when the callable instance is called, it is optional here. Defaults to an empty string.

  • base_dir (str, optional) – Absolute path to a base directory on the local filesystem. Defaults to “/tmp”.

  • overwrite (bool, optional) – Whether to silently overwrite local destination directory. Defaults to False, which will raise an exception if it already exists.

  • skip (bool, optional) – Whether to simply return the list of files found in the local destination directory, if it exists and contains any files. Defaults to False.

  • n_threads (int, optional) – Maximum number of blobs to download in parallel. Defaults to 16.

  • chunk_size (int, optional) – Chunk size to read from Google Cloud Storage in one API call in MiB. Defaults to 10 MiB.

  • **kwargs – Additional keyword arguments are passed to the constructor of the Google Storage Client (see documentation for options).

Note

Blobs that have any more forward slashes in their name than the instantiation and call prefixes combined, i.e., that reside in virtual “subdirectories”, are ignored and are not downloaded.

__call__(prefix='')[source]

Download files from Google Cloud Storage to local drive.

Parameters:

prefix (str, optional) – The prefix of the files to download. If given here, it will be appended to the prefix given at instantiation time. Defaults to an empty string.

Returns:

A list of the fully resolved file names on the local drive.

Return type:

list

Raises:

FileExistsError – If overwrite is False and download into an existing folder was attempted.

property chunk_bytes

Bytes to read from Google Cloud Storage in one API call.
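
A sketch of downloading all blobs under a prefix to a local directory (names and paths are placeholders):

    # Hypothetical project, bucket, and prefix.
    download = GcsDir2LocalDir(
        project='my-gcp-project',
        bucket='my-export-bucket',
        prefix='exports/',
        base_dir='/tmp/downloads',
        overwrite=True,
        n_threads=8,
    )

    files = download('events/')  # fully resolved local file names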

class GcsParquet2DataFrame(project, bucket, prefix='', n_threads=16, chunk_size=10, **kwargs)[source]

Bases: ArgRepr

Load parquet files from Google Cloud Storage into a pandas dataframe.

Parameters:
  • project (str) – Project where the bucket and parquet files reside.

  • bucket (str) – Bucket where the parquet files reside.

  • prefix (str, optional) – The prefix of the parquet files to download. Since it (or part of it) can also be provided later, when the callable instance is called, it is optional here. Defaults to an empty string.

  • n_threads (int, optional) – Maximum number of parquet files to download in parallel. Defaults to 16.

  • chunk_size (int, optional) – Chunk size to read from Google Cloud Storage in one API call in MiB. Defaults to 10 MiB.

  • **kwargs – Additional keyword arguments are passed to the constructor of the Google Storage Client (see documentation for options).

__call__(prefix='')[source]

Load parquet files from Google Cloud Storage into pandas DataFrame.

Parameters:

prefix (str, optional) – The prefix of the parquet files to load. If given here, it will be appended to the prefix given at instantiation time. Defaults to an empty string.

Returns:

Concatenated parquet files.

Return type:

DataFrame

property chunk_bytes

Bytes to read from Google Cloud Storage in one API call.
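
A sketch of loading exported parquet files into a single dataframe (names are placeholders):

    # Hypothetical project, bucket, and prefix.
    load = GcsParquet2DataFrame(
        project='my-gcp-project',
        bucket='my-export-bucket',
        prefix='exports/',
    )

    df = load('events/')  # concatenation of all matching parquet files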

class GbqQuery2DataFrame(project, **kwargs)[source]

Bases: ArgRepr

Partial of the read_gbq function in the pandas_gbq package.

As such, it may not be suitable for downloading large amounts of data (see its documentation for details).

Parameters:
  • project (str) – The project to bill for the query/retrieval.

  • **kwargs – Additional keyword arguments passed on to the read_gbq call.

__call__(query_or_table)[source]

Read Google BigQuery SQL results or table into a pandas DataFrame.

Parameters:

query_or_table (str) – Table name (including dataset id) or SQL query to be retrieved from or submitted to Google BigQuery.

Returns:

The contents of the table or the results of the SQL query.

Return type:

DataFrame
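
A sketch of pulling small query results into memory (project, dataset, and table names are placeholders):

    # Hypothetical billing project.
    query = GbqQuery2DataFrame('my-billing-project')

    df = query('SELECT id, payload FROM staging.events LIMIT 1000')
    # A table reference such as 'staging.events' works as well.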

class DataFrame2Gbq(project, dataset, table='', location='', if_exists='fail', chunksize=None, **kwargs)[source]

Bases: ArgRepr

Partial of the to_gbq function in the pandas_gbq package.

As such, it may not be suitable for uploading large amounts of data (see its documentation for details).

Parameters:
  • project (str) – The project to bill for the upload.

  • dataset (str) – The id of the dataset where the destination table resides.

  • table (str, optional) – The name of the table to load data into (excluding the dataset) or the prefix to it. Defaults to an empty string, to which the table name (or suffix) given when calling instances is appended.

  • location (str, optional) – The physical datacenter location to load the data onto. See the Google Cloud Platform documentation for options. If not given, non-existing tables will be created in the default location of the dataset. Defaults to an empty string.

  • if_exists (str, optional) – What to do if the destination table already exists. Can be one of “fail”, “replace” or “append”. Defaults to “fail”. Use the IfExists enum to specify explicitly.

  • chunksize (int, optional) – Number of dataframe rows to be inserted in each chunk. Defaults to None, resulting in the entire dataframe being inserted at once.

  • **kwargs – Additional keyword arguments passed on to the to_gbq call.

See also

IfExists

__call__(df, table='')[source]

Write a pandas DataFrame to a Google BigQuery table.

Parameters:
  • df (DataFrame) – The dataframe to upload to a Google BigQuery table.

  • table (str, optional) – The name (or suffix) of the table to load data into (excluding the dataset), to be appended to the table given at instantiation. Defaults to an empty string.

Returns:

An empty tuple.

Return type:

tuple

Raises:

GbqError – If no table was given, neither at instantiation nor when called.
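
A sketch of uploading a dataframe, appending to an existing table (all identifiers are placeholders):

    import pandas as pd

    # Hypothetical billing project, dataset, and table prefix.
    upload = DataFrame2Gbq(
        project='my-billing-project',
        dataset='staging',
        table='events_',
        if_exists='append',
    )

    df = pd.DataFrame({'id': [1, 2], 'payload': ['a', 'b']})
    upload(df, '2024')  # writes to table "events_2024" in dataset "staging"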

Enums

class Collation(*values)[source]

Bases: StrEnum

Specify string sorting in tables in a Google BigQuery dataset.

SENSITIVE = ''
INSENSITIVE = 'und:ci'

class Rounding(*values)[source]

Bases: StrEnum

Specify rounding in tables in a Google BigQuery dataset.

HALF_AWAY = 'ROUND_HALF_AWAY_FROM_ZERO'
HALF_EVEN = 'ROUND_HALF_EVEN'

class Billing(*values)[source]

Bases: StrEnum

Specify storage billing mode of tables in a Google BigQuery dataset.

PHYSICAL = 'PHYSICAL'
LOGICAL = 'LOGICAL'

class IfExists(*values)[source]

Bases: StrEnum

Specify what to do if the BigQuery table to write to already exists.

FAIL = 'fail'
REPLACE = 'replace'
APPEND = 'append'
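
A sketch of using the enums instead of raw strings (project and dataset names are placeholders):

    # Enum members are strings, so they drop in wherever a string is expected.
    dataset = GbqDataset(
        project='my-gcp-project',
        dataset='staging',
        location='EU',
        collation=Collation.INSENSITIVE,
        rounding=Rounding.HALF_EVEN,
        billing=Billing.PHYSICAL,
    )

    upload = DataFrame2Gbq(
        project='my-billing-project',
        dataset='staging',
        table='events',
        if_exists=IfExists.APPEND,
    )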