gcp

Tools to interact with elements of the Google Cloud Platform (GCP).

Specifically, data scientists tend to interact mostly with Google’s BigQuery (GBQ) data-warehouse solution and with Google Cloud Storage (GCS).

class Gcs(project, *args, **kwargs)[source]

Bases: ArgRepr

Wraps a Google Cloud Storage (GCS) client for delayed instantiation.

For more convenient function chaining in functional compositions, instances are also callable and simply return a new client when called.

Parameters:
  • project (str) – The name of the project the client should act on.

  • *args – Additional arguments to be passed to the client init.

  • **kwargs – Additional keyword arguments to be passed to the client init. See the reference for options.

__call__(*_, **__)[source]

New GCS client on every call, ignoring any (keyword) arguments.

property client

New GCS client on first request, cached for subsequent requests.
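
The difference between calling an instance and reading its client property can be illustrated with a minimal, self-contained sketch. LazyClient and FakeClient are hypothetical stand-ins that are not part of this module; the real classes wrap the Google client libraries and may be implemented differently.

```python
from functools import cached_property


class FakeClient:
    """Hypothetical stand-in for a cloud client; only records arguments."""

    def __init__(self, project, *args, **kwargs):
        self.project = project
        self.args = args
        self.kwargs = kwargs


class LazyClient:
    """Hypothetical wrapper: stores arguments now, builds the client later."""

    def __init__(self, project, *args, **kwargs):
        self.project = project
        self.args = args
        self.kwargs = kwargs

    def __call__(self, *_, **__):
        # A fresh client on every call; any call arguments are ignored.
        return FakeClient(self.project, *self.args, **self.kwargs)

    @cached_property
    def client(self):
        # Built on first access, then cached on the instance.
        return self()


lazy = LazyClient('my-project', timeout=30)
first = lazy.client
second = lazy.client   # the same cached object as `first`
fresh = lazy('ignored')  # a new object, unrelated to the cached one
```

Being callable without meaningful arguments is what lets such an instance sit at the start of a function chain, while the cached property suits repeated use within one object.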

class Gbq(project, *args, **kwargs)[source]

Bases: ArgRepr

Wraps a Google BigQuery (GBQ) client for delayed instantiation.

For more convenient function chaining in functional compositions, instances are also callable and simply return a new client when called.

Parameters:
  • project (str) – The name of the project the client should act on.

  • *args – Additional arguments to be passed to the client init.

  • **kwargs – Additional keyword arguments to be passed to the client init. See the client documentation for options.

__call__(*_, **__)[source]

New GBQ client on every call, ignoring any (keyword) arguments.

property client

New GBQ client on first request, cached for subsequent requests.

class ParquetLoadJobConfig(**kwargs)[source]

Bases: ArgRepr

A LoadJobConfig with source format locked to PARQUET.

All other options, including parquet-specific ones, can be set freely via keyword arguments. See API reference for options.

Parameters:

**kwargs – Keyword arguments passed on to LoadJobConfig. The source_format argument is ignored if given.

__call__(*_, **__)[source]

New LoadJobConfig on every call, ignoring any arguments.
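
What "locked" means for the source format can be sketched with plain dictionaries. The helper name locked_parquet_options is hypothetical, and the real class builds a LoadJobConfig rather than a dict; the sketch only shows that a caller-supplied source_format has no effect.

```python
def locked_parquet_options(**kwargs):
    """Return load-job options with source_format forced to PARQUET."""
    # Later keys win in a dict display, so whatever the caller passed
    # for source_format is overridden here.
    return {**kwargs, 'source_format': 'PARQUET'}


options = locked_parquet_options(
    source_format='CSV',                  # silently ignored
    write_disposition='WRITE_TRUNCATE',   # passed through unchanged
)
```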

class GbqDataset(gbq, dataset, location='europe-north1', exists_ok=False, name=None, description=None, table_expire_days=None, partition_expire_days=None, labels=None, access=None, case_sensitive=True, collation=None, rounding=None, max_travel_time_hours=168, billing=None, tags=None)[source]

Bases: object

Create a new dataset in a Google BigQuery project.

Parameters:
  • gbq (Gbq) – An instance of a wrapped GBQ client.

  • dataset (str) – The identifier of the dataset to create. Only letters, numbers, and underscores are permitted.

  • location (str, optional) – The physical datacenter location to create the dataset in. See the Google Cloud Platform documentation for options. Defaults to “europe-north1”.

  • exists_ok (bool, optional) – Whether to quietly return the requested dataset if it exists or raise an exception. Defaults to False.

  • name (str, optional) – A human-readable name of the dataset. Defaults to the dataset identifier.

  • description (str, optional) – A short description of the dataset. Defaults to None.

  • table_expire_days (int, optional) – Number of days after which tables are dropped. Defaults to None, which results in tables never being dropped.

  • partition_expire_days (int, optional) – Number of days after which partitions of partitioned tables are dropped. Defaults to None, which results in partitions never being dropped.

  • labels (dict, optional) – Any number of string-valued labels of the dataset. Defaults to None.

  • access (list of dict, optional) – Fine-grained access rights to the dataset (see the Google Cloud Platform documentation for details). If not given, default access rights are set by Google BigQuery.

  • case_sensitive (bool, optional) – Whether dataset and table names should be case-sensitive or not. Defaults to True.

  • collation (str, optional) – Default collation mode for string sorting in string columns of tables. Defaults to None, which results in case-sensitive behavior. Use the Collation enum to specify explicitly.

  • rounding (str, optional) – Default rounding mode. Defaults to None, which lets Google BigQuery choose. Use the Rounding enum to specify explicitly.

  • max_travel_time_hours (int, optional) – Duration of Google BigQuery’s “time travel” window in hours, i.e., for how long changes can be rolled back and tables can be queried “as of” some previous time. Values can be between 48 and 168 hours (2 to 7 days). Defaults to 168.

  • billing (str, optional) – Default billing mode for tables. Defaults to None, which lets Google BigQuery choose. Use the Billing enum to specify explicitly.

  • tags (dict, optional) – Associate globally defined tags with this dataset. Defaults to None, which results in no tags being associated.

Raises:
  • AttributeError – If dataset, location, name, or description are not strings.

  • TypeError – If any of table_expire_days, partition_expire_days, or max_travel_time_hours cannot be cast to an integer.

  • ValueError – If any of table_expire_days, partition_expire_days, or max_travel_time_hours are less than one or if any of collation, rounding, or billing are not allowed options.

Note

Options for linked and external dataset sources, as well as for encryption configuration are deliberately omitted. You probably should not play with those without consulting your organization’s data engineers.

__call__(*_, **__)[source]

Create a Google BigQuery dataset in a Google Cloud Platform project.

If the dataset already exists and exists_ok is True, it is returned unchanged, that is, none of the specified options are applied.

Raises:

GbqError – If exists_ok is set to False and the dataset already exists.

Returns:

  • str – The name of the existing or newly created dataset.

  • bool – True if the requested dataset is newly created and False if an existing dataset is returned.

property api_repr

Payload for the API call to the Google Cloud Platform.

static to_ms(days)[source]

Convert integer days to millisecond string for the GCP API call.
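
As a rough illustration of the conversion, the following sketch assumes that a day is counted as 24 hours and that the result is the decimal string of the millisecond count. The payload keys are the expiration fields of the BigQuery REST dataset resource; whether api_repr uses exactly these keys is an assumption.

```python
def to_ms(days):
    """Convert whole days into a string of milliseconds."""
    return str(int(days) * 24 * 60 * 60 * 1000)


# Hypothetical excerpt of a dataset payload using the converted values.
payload = {
    'defaultTableExpirationMs': to_ms(30),
    'defaultPartitionExpirationMs': to_ms(7),
}
```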

class GcsBucket(gcs, bucket, location='EUROPE-NORTH1', exists_ok=False, age=None, user_project=None, requester_pays=False, **kwargs)[source]

Bases: ArgRepr

Create/retrieve and configure a bucket on Google Cloud Storage (GCS).

Parameters:
  • gcs (Gcs) – An instance of a wrapped GCS client.

  • bucket (str) – The (unique!) name of the bucket to create. May include any number of string placeholders (i.e., pairs of curly brackets) that will be interpolated when instances are called.

  • location (str, optional) – The physical datacenter location to create the bucket in. See the Google Cloud Platform documentation for options. Defaults to “EUROPE-NORTH1”.

  • exists_ok (bool, optional) – Whether to quietly return the requested bucket if it exists or raise an exception. Defaults to False.

  • age (int, optional) – Defaults to None. If set, blobs older than the specified number of days will be automatically deleted.

  • user_project (str, optional) – The project billed for interacting with the bucket. Defaults to the project carried by the gcs client.

  • requester_pays (bool, optional) – Whether the requester will be billed for interacting with the bucket. Defaults to False, meaning that the (user_)project pays.

  • **kwargs – Any bucket property to set/change on the created/retrieved bucket. See the GCS docs for all options.

Raises:
  • AttributeError – If bucket or location are not strings.

  • TypeError – If age cannot be cast to an integer.

  • ValueError – If age is less than one.

See also

Gcs

__call__(*parts)[source]

Create/retrieve and configure a bucket on/from GCS.

Parameters:

*parts (str) – Fragments that will be interpolated into the bucket given at instantiation. Obviously, there must be at least as many as there are placeholders in the bucket.

Returns:

  • str – The name of the existing or newly created bucket.

  • bool – True if the requested bucket is newly created and False if an existing bucket is returned.

Raises:

GcsError – If exists_ok is set to False but the bucket already exists or if you try to set an invalid bucket property from the kwargs.
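
How fragments fill the placeholders can be sketched with plain str.format. That the class interpolates this way is an assumption, but it matches the documented requirement of at least as many fragments as placeholders. The bucket name is made up.

```python
template = 'reports-{}-{}'  # hypothetical bucket name with two placeholders

# Exactly as many fragments as placeholders.
name = template.format('sales', '2024')

# Surplus fragments are silently ignored by str.format.
same = template.format('sales', '2024', 'unused')

# Too few fragments raise an IndexError.
try:
    template.format('sales')
    too_few_raises = False
except IndexError:
    too_few_raises = True
```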

property lifecycle

Minimal configuration for adding a life-cycle rule (if required).
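
For orientation, a delete-after-age rule in the format of the GCS JSON lifecycle configuration looks like the output of the sketch below. The function name is hypothetical, and the exact structure returned by this property is not stated here, so treat the shape as an assumption.

```python
def lifecycle_rules(age=None):
    """Return a list with one delete rule, or an empty list without age."""
    if age is None:
        # No age given: no life-cycle rule is required.
        return []
    return [{'action': {'type': 'Delete'}, 'condition': {'age': int(age)}}]


rules = lifecycle_rules(30)  # delete blobs older than 30 days
```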

class GbqQuery(gbq, config=None, polling_interval=5)[source]

Bases: ArgRepr

Run a SQL query that does not return anything on Google BigQuery.

Suitable for DDL statements (CREATE, ALTER, DROP) and DML statements (INSERT, UPDATE, DELETE) where the result set is not needed.

Parameters:
  • gbq (Gbq) – An instance of a wrapped GBQ client.

  • config (QueryJobConfig | None, optional) – An instance of QueryJobConfig (see the QueryJobConfig docs). If None (the default), the default config will be used.

  • polling_interval (int, optional) – Job completion is checked every polling_interval seconds. Defaults to 5 (seconds).

Raises:
  • TypeError – If polling_interval cannot be cast to float.

  • ValueError – If polling_interval is smaller than 1.

See also

Gbq

__call__(query)[source]

Run a query that does not return anything on Google BigQuery.

Parameters:

query (str) – The SQL query to execute against BigQuery (typically DDL or DML).

Returns:

If the job finishes without errors, an empty tuple is returned.

Return type:

tuple

Raises:

GbqError – If the query execution failed for some reason.
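
The role of polling_interval can be sketched with a generic polling loop. FakeJob and wait_for are hypothetical and only mimic a job object that exposes a done() method; the actual implementation, including its error handling, may differ.

```python
import time


class FakeJob:
    """Hypothetical job that reports completion on the n-th poll."""

    def __init__(self, polls_needed):
        self.remaining = polls_needed

    def done(self):
        self.remaining -= 1
        return self.remaining <= 0


def wait_for(job, polling_interval=5, sleep=time.sleep):
    """Block until the job is done, checking every polling_interval seconds."""
    interval = float(polling_interval)  # TypeError for, e.g., None
    if interval < 1:
        raise ValueError('polling_interval must not be smaller than 1!')
    while not job.done():
        sleep(interval)
    return ()  # nothing to return for DDL/DML statements


# Record the requested naps instead of actually sleeping.
naps = []
result = wait_for(FakeJob(3), polling_interval=2, sleep=naps.append)
```

Injecting the sleep function keeps the sketch fast to run; a job that completes on the third poll is preceded by two waits.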

class GbqQuery2GcsParquet(gbq, path='{}', overwrite=False, skip=False, config=None, polling_interval=5, **kwargs)[source]

Bases: ArgRepr

Export the results of an SQL query to a bucket on Google Cloud Storage.

The SQL query will be fired against Google BigQuery. In essence, an EXPORT clause with the specified parameters interpolated is inserted into the query after its last semicolon, directly in front of the final statement. As such, no semicolon-separated sub-queries are allowed, but declaring and setting variables is fine. Results will be saved as multiple, sequentially numbered, snappy-compressed parquet files.
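
The insertion described above can be sketched as follows. The EXPORT DATA OPTIONS syntax is BigQuery's, but the helper with_export, the exact option list, and the handling of a trailing semicolon are assumptions made for illustration, not the module's actual code.

```python
EXPORT = (
    "EXPORT DATA OPTIONS(\n"
    "  uri='gs://{bucket}/{prefix}/*.parquet',\n"
    "  format='PARQUET',\n"
    "  compression='SNAPPY',\n"
    "  overwrite={flag}\n"
    ") AS\n"
)


def with_export(query, bucket, prefix, overwrite=False):
    """Insert an EXPORT clause in front of the final statement of a query."""
    # Drop a trailing semicolon so that the last remaining one separates
    # variable declarations from the final statement.
    body = query.strip().rstrip(';').rstrip()
    head, sep, tail = body.rpartition(';')
    clause = EXPORT.format(
        bucket=bucket,
        prefix=prefix,
        flag=str(bool(overwrite)).lower(),  # SQL wants true/false
    )
    return f'{head}{sep}\n{clause}{tail.strip()}'.lstrip()


query = 'DECLARE n INT64 DEFAULT 3;\nSELECT n AS value;'
exported = with_export(query, 'my-bucket', 'exports/run-1', overwrite=True)
```

A naive split like this would also trip over semicolons inside string literals, which is one more reason to keep queries to a single final statement.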

Parameters:
  • gbq (Gbq) – An instance of a wrapped GBQ client.

  • path (str, optional) – The path to the cloud storage “directory”, where parquet files will reside in “bucket/prefix/” form. May contain any number of string placeholders (i.e., pairs of curly brackets) that will be interpolated when instances are called. If the prefix part is empty after interpolation, a randomly generated UUID will be used. Defaults to “{}”.

  • overwrite (bool, optional) – Blobs with the given bucket/prefix combination may already exist on Google Cloud Storage. If True these are overwritten, else an exception is raised. Defaults to False.

  • skip (bool, optional) – Blobs with the given bucket/prefix combination may already exist on Google Cloud Storage. If that is the case, and skip is True, nothing will be done at all. Defaults to False.

  • config (QueryJobConfig | None, optional) – An instance of QueryJobConfig (see the API reference docs). If None (the default), the default config will be used.

  • polling_interval (int, optional) – Job completion is checked every polling_interval seconds. Defaults to 5 (seconds).

  • **kwargs – Additional keyword arguments are passed to the constructor of the Google Cloud Storage (GCS) client, overwriting common options that are plucked from the Google BigQuery client.

Raises:
  • TypeError – If path is not a string or polling_interval cannot be cast to float.

  • ValueError – If path is empty after sanitization or polling_interval is smaller than 1.

__call__(query, *parts)[source]

Export the results of a SQL query to Google Cloud Storage.

Parameters:
  • query (str) – The SQL query to fire using the pre-configured client.

  • *parts (str) – Fragments that will be interpolated into the path given at instantiation. Obviously, there must be at least as many as there are placeholders in the path.

Returns:

If the query finishes without errors, the given or generated prefix is returned, so that blobs with the exported data can be retrieved.

Return type:

str

Raises:
  • ValueError – If query is empty or if path is empty after inserting parts.

  • FileExistsError – If overwrite is set to False and files with the given prefix already exist in the given bucket.

  • GbqError – If the submitted QueryJob finishes and returns an error.
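
How a path in “bucket/prefix/” form might be resolved, including the fallback to a random UUID, is sketched below. The helper split_path is hypothetical, and both the sanitizing steps and the hexadecimal UUID format are assumptions.

```python
import uuid


def split_path(path, *parts):
    """Interpolate parts into path and split it into bucket and prefix."""
    filled = path.format(*parts).strip().strip('/')
    bucket, _, prefix = filled.partition('/')
    if not bucket:
        raise ValueError('Path must not be empty after inserting parts!')
    # Fall back to a random identifier if no prefix is left.
    prefix = prefix.strip('/') or uuid.uuid4().hex
    return bucket, prefix


explicit = split_path('{}/{}', 'my-bucket', 'exports/2024/')
bucket, generated = split_path('{}', 'my-bucket')
```

Returning the prefix, whether given or generated, is what allows the exported blobs to be located afterwards.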

property flag

Stringified version of the otherwise boolean overwrite option.

property options

Options used for the Google Cloud Storage client.

class GbqQuery2DataFrame(gbq, bears='pandas', config=None, polling_interval=5)[source]

Bases: ArgRepr

Results of a Google BigQuery SQL query as a pandas or polars dataframe.

Suitable for small to medium result sets that fit comfortably in memory. For large result sets, consider exporting to Google Cloud Storage and loading files from there instead.

Parameters:
  • gbq (Gbq) – An instance of a wrapped GBQ client.

  • bears (Bears, optional) – Type of dataframe to return. Can be one of “pandas” or “polars”. Use the Bears enum to avoid typos. Defaults to “pandas”.

  • config (QueryJobConfig | None, optional) – An instance of QueryJobConfig (see the API docs at https://docs.cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.job.QueryJobConfig). If None (the default), the default config will be used.

  • polling_interval (int, optional) – Job completion is checked every polling_interval seconds. Defaults to 5 (seconds).

Raises:
  • TypeError – If polling_interval cannot be cast to float.

  • ValueError – If bears is not “pandas” or “polars” or if polling_interval is smaller than 1.

__call__(query)[source]

Read the results of a Google BigQuery SQL query into a pandas or polars dataframe.

Parameters:

query (str) – The SQL query to execute against BigQuery.

Returns:

The results of the SQL query in the requested dataframe type.

Return type:

DataFrame

Raises:

GbqError – If the query execution failed for some reason.

class DataFrame2Gbq(gbq, dataset, table='{}', location='europe-north1', config=None, polling_interval=5, **kwargs)[source]

Bases: ArgRepr

Upload a pandas or polars dataframe into a Google BigQuery table.

Not suitable for uploading large amounts of data. Fine

Parameters:
  • gbq (Gbq) – An instance of a wrapped GBQ client.

  • dataset (str) – The id of the dataset where the destination table resides.

  • table (str, optional) – The name of the table to load data into (excluding the dataset) or the prefix to it. May contain any number of string placeholders (i.e., pairs of curly brackets) that will be interpolated when instances are called. Defaults to “{}”.

  • location (str, optional) – The physical datacenter location to load the data onto. See the Google Cloud Platform documentation for options. If not given, non-existing tables will be created in the default location of the dataset. Defaults to an empty string.

  • config (ParquetLoadJobConfig | None, optional) – An instance of the wrapped LoadJobConfig with source_format locked to PARQUET. All other load job options can be set freely via that wrapper. If None (the default), a default config with only source_format set will be used.

  • polling_interval (int, optional) – Job completion is checked every polling_interval seconds. Defaults to 5 (seconds).

  • **kwargs – Additional keyword arguments passed on to the dataframe method that writes to parquet file.

Raises:
  • AttributeError – If location is not a string.

  • TypeError – If dataset or table are not strings or if polling_interval cannot be cast to float.

  • ValueError – If either dataset or table are empty strings or if polling_interval is smaller than 1.

__call__(df, *parts)[source]

Write a pandas or polars dataframe to a Google BigQuery table.

Parameters:
  • df (DataFrame) – The pandas or polars dataframe to upload into a BigQuery table.

  • *parts (str) – Fragments that will be interpolated into the table given at instantiation. Obviously, there must be at least as many as there are placeholders in the table.

Returns:

An empty tuple.

Return type:

tuple

Raises:

GbqError – If the upload failed for some reason.

Enums

class Collation(*values)[source]

Bases: StrEnum

Specify string sorting in tables in a Google BigQuery dataset.

SENSITIVE = ''
INSENSITIVE = 'und:ci'

class Rounding(*values)[source]

Bases: StrEnum

Specify rounding in tables in a Google BigQuery dataset.

HALF_AWAY = 'ROUND_HALF_AWAY_FROM_ZERO'
HALF_EVEN = 'ROUND_HALF_EVEN'

class Billing(*values)[source]

Bases: StrEnum

Specify storage billing mode of tables in a Google BigQuery dataset.

PHYSICAL = 'PHYSICAL'
LOGICAL = 'LOGICAL'
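
How a string-valued enum can both supply and validate such options is sketched below. The documented base class is StrEnum; the sketch uses the equivalent str mix-in so that it also runs on Python versions older than 3.11. The helper check_rounding is hypothetical.

```python
from enum import Enum


class Rounding(str, Enum):
    """Local re-creation of the documented members, for illustration only."""
    HALF_AWAY = 'ROUND_HALF_AWAY_FROM_ZERO'
    HALF_EVEN = 'ROUND_HALF_EVEN'


def check_rounding(option):
    """Return the validated string, raising ValueError for unknown options."""
    return Rounding(option).value


by_member = check_rounding(Rounding.HALF_EVEN)
by_string = check_rounding('ROUND_HALF_EVEN')
```

Passing a member instead of a raw string avoids typos, while a look-up by value rejects anything that is not an allowed option.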