gcp
Tools to interact with elements of the Google Cloud Platform (GCP).
Specifically, data scientists tend to interact mostly with Google’s BigQuery (GBQ) data-warehouse solution and Google Cloud Storage (GCS).
- class Gcs(project, *args, **kwargs)[source]
Bases: ArgRepr
Wraps a Google Cloud Storage (GCS) client for delayed instantiation.
For more convenient function chaining in functional compositions, instances are also callable and simply return a new client when called.
- Parameters:
project (str) – The name of the project the client should act on.
*args – Additional arguments to be passed to the client init.
**kwargs – Additional keyword arguments to be passed to the client init. See the reference for options.
- property client
New GCS client on first request, cached for subsequent requests.
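The delayed-instantiation pattern described above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: FakeClient and LazyClient are hypothetical stand-ins, and no real cloud client is created.

```python
class FakeClient:
    """Hypothetical stand-in for a real cloud client; it only records its arguments."""

    def __init__(self, project, *args, **kwargs):
        self.project = project
        self.args = args
        self.kwargs = kwargs


class LazyClient:
    """Store constructor arguments now, build the client only when needed."""

    def __init__(self, project, *args, **kwargs):
        self.project = project
        self.args = args
        self.kwargs = kwargs
        self._client = None  # nothing is instantiated yet

    def __call__(self, *_, **__):
        # Calling the wrapper always builds a fresh client.
        return FakeClient(self.project, *self.args, **self.kwargs)

    @property
    def client(self):
        # The property builds a client once and reuses it afterwards.
        if self._client is None:
            self._client = self()
        return self._client


lazy = LazyClient("my-project", timeout=30)
first = lazy.client   # created on first access
second = lazy.client  # same cached object
fresh = lazy()        # a new, independent client
```

Under these assumptions, the wrapper can be created cheaply (for example at import time), while the actual client is only built when it is first needed.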
- class Gbq(project, *args, **kwargs)[source]
Bases: ArgRepr
Wraps a Google BigQuery (GBQ) client for delayed instantiation.
For more convenient function chaining in functional compositions, instances are also callable and simply return a new client when called.
- Parameters:
project (str) – The name of the project the client should act on.
*args – Additional arguments to be passed to the client init.
**kwargs – Additional keyword arguments to be passed to the client init. See the client documentation for options.
- property client
New GBQ client on first request, cached for subsequent requests.
- class ParquetLoadJobConfig(**kwargs)[source]
Bases: ArgRepr
A LoadJobConfig with source format locked to PARQUET.
All other options, including parquet-specific ones, can be set freely via keyword arguments. See API reference for options.
- Parameters:
**kwargs – Keyword arguments passed on to LoadJobConfig. The source_format argument is ignored if given.
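What "locked" means for the keyword arguments can be illustrated with plain dictionaries. This is a sketch only: parquet_load_options is a hypothetical helper, and the real class wraps a LoadJobConfig rather than returning a dict.

```python
def parquet_load_options(**kwargs):
    """Return load-job options with the source format pinned to PARQUET."""
    # Drop whatever the caller passed for source_format ...
    options = {key: value for key, value in kwargs.items() if key != "source_format"}
    # ... and set the locked value instead.
    options["source_format"] = "PARQUET"
    return options


options = parquet_load_options(
    source_format="CSV",  # silently ignored
    write_disposition="WRITE_TRUNCATE",
)
```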
- class GbqDataset(gbq, dataset, location='europe-north1', exists_ok=False, name=None, description=None, table_expire_days=None, partition_expire_days=None, labels=None, access=None, case_sensitive=True, collation=None, rounding=None, max_travel_time_hours=168, billing=None, tags=None)[source]
Bases: object
Create a new dataset in a Google BigQuery project.
- Parameters:
gbq (Gbq) – An instance of a wrapped GBQ client.
dataset (str) – The identifier of the dataset to create. Only letters, numbers, and underscores are permitted.
location (str, optional) – The physical datacenter location to create the dataset in. See the Google Cloud Platform documentation for options. Defaults to “europe-north1”.
exists_ok (bool, optional) – Whether to quietly return the requested dataset if it exists or raise an exception. Defaults to False.
name (str, optional) – A human-readable name of the dataset. Defaults to the dataset identifier.
description (str, optional) – A short description of the dataset. Defaults to None.
table_expire_days (int, optional) – Number of days after which tables are dropped. Defaults to None, which results in tables never being dropped.
partition_expire_days (int, optional) – Number of days after which partitions of partitioned tables are dropped. Defaults to None, which results in partitions never being dropped.
labels (dict, optional) – Any number of string-valued labels of the dataset. Defaults to None.
access (list of dict, optional) – Fine-grained access rights to the dataset (see the Google Cloud Platform documentation for details). If not given, default access rights are set by Google BigQuery.
case_sensitive (bool, optional) – Whether dataset and table names should be case-sensitive or not. Defaults to True.
collation (str, optional) – Default collation mode for string sorting in string columns of tables. Defaults to None, which results in case-sensitive behavior. Use the Collation enum to specify explicitly.
rounding (str, optional) – Default rounding mode. Defaults to None, which lets Google BigQuery choose. Use the Rounding enum to specify explicitly.
max_travel_time_hours (int, optional) – Duration of Google BigQuery’s “time travel” window in hours, i.e., for how long changes can be rolled back and tables can be queried “as of” some previous time. Values can be between 48 and 168 hours (2 to 7 days). Defaults to 168.
billing (str, optional) – Default billing mode for tables. Defaults to None, which lets Google BigQuery choose. Use the Billing enum to specify explicitly.
tags (dict, optional) – Associate globally defined tags with this dataset. Defaults to None, which results in no tags being associated.
- Raises:
AttributeError – If dataset, location, name, or description are not strings.
TypeError – If any of table_expire_days, partition_expire_days, or max_travel_time_hours cannot be cast to an integer.
ValueError – If any of table_expire_days, partition_expire_days, or max_travel_time_hours are less than one or if any of collation, rounding, or billing are not allowed options.
Note
Options for linked and external dataset sources, as well as for encryption configuration are deliberately omitted. You probably should not play with those without consulting your organization’s data engineers.
- __call__(*_, **__)[source]
Create a Google BigQuery dataset in a Google Cloud Platform project.
If the dataset already exists and exists_ok is True, it is returned unchanged, that is, none of the specified options are applied.
- Raises:
GbqError – If exists_ok is set to False and the dataset already exists.
- Returns:
str – The name of the existing or newly created dataset.
bool – True if the requested dataset is newly created and False if an existing dataset is returned.
- property api_repr
Payload for the API call to the Google Cloud Platform.
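How day-based options might end up in such a payload can be sketched as follows. The helper names are hypothetical, and the field names (datasetReference, defaultTableExpirationMs, maxTimeTravelHours) are assumed from the BigQuery REST dataset resource, so they should be checked against the API reference. The payload that api_repr actually produces may differ.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def checked_int(value, name):
    """Cast to int and reject values below one, mirroring the documented errors."""
    number = int(value)  # raises TypeError for, e.g., a list
    if number < 1:
        raise ValueError(f"{name} must be at least one, got {number}!")
    return number


def dataset_payload(project, dataset, table_expire_days=None, max_travel_time_hours=168):
    """Assemble a minimal, assumed payload for creating a dataset."""
    hours = checked_int(max_travel_time_hours, "max_travel_time_hours")
    payload = {
        "datasetReference": {"projectId": project, "datasetId": dataset},
        # 64-bit integers are sent as strings in the JSON representation.
        "maxTimeTravelHours": str(hours),
    }
    if table_expire_days is not None:
        days = checked_int(table_expire_days, "table_expire_days")
        # The API expects milliseconds, the constructor takes days.
        payload["defaultTableExpirationMs"] = str(days * MS_PER_DAY)
    return payload


payload = dataset_payload("my-project", "sales", table_expire_days=30)
```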
- class GcsBucket(gcs, bucket, location='EUROPE-NORTH1', exists_ok=False, age=None, user_project=None, requester_pays=False, **kwargs)[source]
Bases: ArgRepr
Create/retrieve and configure a bucket on Google Cloud Storage (GCS).
- Parameters:
gcs (Gcs) – An instance of a wrapped GCS client.
bucket (str) – The (unique!) name of the bucket to create. May include any number of string placeholders (i.e., pairs of curly brackets) that will be interpolated when instances are called.
location (str, optional) – The physical datacenter location to create the bucket in. See the Google Cloud Platform documentation for options. Defaults to “EUROPE-NORTH1”.
exists_ok (bool, optional) – Whether to quietly return the requested bucket if it exists or raise an exception. Defaults to False.
age (int, optional) – Defaults to None. If set, blobs older than the specified number of days will be automatically deleted.
user_project (str, optional) – The project billed for interacting with the bucket. Defaults to the project carried by the gcs client.
requester_pays (bool, optional) – Whether the requester will be billed for interacting with the bucket. Defaults to False, meaning that the (user_)project pays.
**kwargs – Any bucket property to set/change on the created/retrieved bucket. See the GCS docs for all options.
- Raises:
AttributeError – If bucket or location are not strings.
TypeError – If age cannot be cast to an integer.
ValueError – If age is less than one.
- __call__(*parts)[source]
Create/retrieve and configure a bucket on/from GCS.
- Parameters:
*parts (str) – Fragments that will be interpolated into the bucket given at instantiation. Obviously, there must be at least as many as there are placeholders in the bucket.
- Returns:
str – The name of the existing or newly created bucket.
bool – True if the requested bucket is newly created and False if an existing bucket is returned.
- Raises:
GcsError – If exists_ok is set to False but the bucket already exists or if you try to set an invalid bucket property from the kwargs.
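The placeholder interpolation can be pictured with Python's built-in str.format. That the library interpolates exactly this way is an assumption, and the bucket template below is made up for illustration.

```python
template = "{}-raw-{}"

# Exactly as many parts as placeholders.
name = template.format("acme", "2024")

# Surplus parts are silently ignored by str.format.
same = template.format("acme", "2024", "unused")

# Too few parts raise an IndexError.
try:
    template.format("acme")
    too_few = None
except IndexError as error:
    too_few = type(error).__name__
```

This is why at least as many parts as placeholders are needed: a missing part fails loudly, while an extra one goes unnoticed.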
- property lifecycle
Minimal configuration for adding a life-cycle rule (if required).
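A minimal life-cycle configuration for deleting blobs after age days could look like the sketch below. It assumes the JSON layout of GCS life-cycle rules (a Delete action with an age condition). The function name is hypothetical and the value the property actually returns may be shaped differently.

```python
def lifecycle(age=None):
    """Build a delete-after-age rule, or nothing if no age is given."""
    if age is None:
        return {}  # no life-cycle rule required
    days = int(age)  # raises TypeError for, e.g., a list
    if days < 1:
        raise ValueError(f"age must be at least one day, got {days}!")
    return {
        "rule": [
            {"action": {"type": "Delete"}, "condition": {"age": days}},
        ]
    }


rules = lifecycle(30)
```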
- class GbqQuery(gbq, config=None, polling_interval=5)[source]
Bases: ArgRepr
Run a SQL query that does not return anything on Google BigQuery.
Suitable for DDL statements (CREATE, ALTER, DROP) and DML statements (INSERT, UPDATE, DELETE) where the result set is not needed.
- Parameters:
gbq (Gbq) – An instance of a wrapped GBQ client.
config (QueryJobConfig | None, optional) – An instance of QueryJobConfig (see the QueryJobConfig docs). If None (the default), the default config will be used.
polling_interval (int, optional) – Job completion is checked every polling_interval seconds. Defaults to 5 (seconds).
- Raises:
TypeError – If polling_interval cannot be cast to float.
ValueError – If polling_interval is smaller than 1.
- __call__(query)[source]
Run a query that does not return anything on Google BigQuery.
- Parameters:
query (str) – The SQL query to execute against BigQuery (typically DDL or DML).
- Returns:
If the job finishes without errors, an empty tuple is returned.
- Return type:
tuple
- Raises:
GbqError – If the query execution failed for some reason.
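The role of polling_interval can be sketched with a simple wait loop. FakeJob is a hypothetical stand-in for a submitted query job, and the sleep function is injectable so the example runs instantly. The library's actual loop and error handling are not shown here.

```python
import time


class FakeJob:
    """Hypothetical job that reports completion after a fixed number of checks."""

    def __init__(self, checks_needed):
        self.checks_left = checks_needed

    def done(self):
        self.checks_left -= 1
        return self.checks_left <= 0


def checked_interval(polling_interval):
    """Validate the interval as described under Raises."""
    interval = float(polling_interval)  # raises TypeError for, e.g., None
    if interval < 1:
        raise ValueError(f"polling_interval must be >= 1, got {interval}!")
    return interval


def wait_for(job, polling_interval=5, sleep=time.sleep):
    """Check the job, sleeping between checks, and count the checks made."""
    interval = checked_interval(polling_interval)
    checks = 1
    while not job.done():
        sleep(interval)
        checks += 1
    return checks


# Replace the real sleep with a no-op to keep the example fast.
checks = wait_for(FakeJob(3), polling_interval=1, sleep=lambda seconds: None)
```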
- class GbqQuery2GcsParquet(gbq, path='{}', overwrite=False, skip=False, config=None, polling_interval=5, **kwargs)[source]
Bases: ArgRepr
Export the results of an SQL query to a bucket on Google Cloud Storage.
The SQL query will be fired against Google BigQuery. In essence, an EXPORT clause with the specified parameters interpolated is inserted into the query after the last semicolon and before everything that follows it. As such, no semicolon-separated sub-queries are allowed, but variable declaration and setting is fine. Results will be saved as multiple, sequentially numbered, snappy-compressed parquet files.
- Parameters:
gbq (Gbq) – An instance of a wrapped GBQ client.
path (str, optional) – The path to the cloud storage “directory”, where parquet files will reside in “bucket/prefix/” form. May contain any number of string placeholders (i.e., pairs of curly brackets) that will be interpolated when instances are called. If the prefix part is empty after interpolation, a randomly generated UUID will be used. Defaults to “{}”.
overwrite (bool, optional) – Blobs with the given bucket/prefix combination may already exist on Google Cloud Storage. If True, these are overwritten, else an exception is raised. Defaults to False.
skip (bool, optional) – Blobs with the given bucket/prefix combination may already exist on Google Cloud Storage. If that is the case, and skip is True, nothing will be done at all. Defaults to False.
config (QueryJobConfig | None, optional) – An instance of QueryJobConfig (see the API reference docs). If None (the default), the default config will be used.
polling_interval (int, optional) – Job completion is checked every polling_interval seconds. Defaults to 5 (seconds).
**kwargs – Additional keyword arguments are passed to the constructor of the Google Cloud Storage (GCS) client, overwriting common options that are plucked from the Google BigQuery client.
- Raises:
TypeError – If path is not a string or polling_interval cannot be cast to float.
ValueError – If path is empty after sanitation or polling_interval is < 1.
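Splitting a path in “bucket/prefix/” form and falling back to a generated UUID could be sketched like this. split_path is a hypothetical helper, and the exact sanitation rules of the library are not documented here, so the stripping below is an assumption.

```python
import uuid


def split_path(path):
    """Split 'bucket/prefix/' into bucket and prefix, generating a prefix if empty."""
    bucket, _, prefix = path.strip().strip("/").partition("/")
    if not bucket:
        raise ValueError("Path must at least contain a bucket name!")
    # An empty prefix is replaced by a random UUID.
    prefix = prefix.strip("/") or str(uuid.uuid4())
    return bucket, prefix


bucket, prefix = split_path("my-bucket/exports/2024/")
only_bucket, generated = split_path("my-bucket")
```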
- __call__(query, *parts)[source]
Export the results of a SQL query to Google Cloud Storage.
- Parameters:
query (str) – The SQL query to fire using the pre-configured client.
*parts (str) – Fragments that will be interpolated into the path given at instantiation. Obviously, there must be at least as many as there are placeholders in the path.
- Returns:
If the query finishes without errors, the given or generated prefix is returned, so that blobs with the exported data can be retrieved.
- Return type:
str
- Raises:
ValueError – If query is empty or if path is empty after inserting parts.
FileExistsError – If overwrite is set to False and files with the given prefix already exist in the given bucket.
GbqError – If the submitted QueryJob finishes and returns an error.
- property flag
Stringified version of the otherwise boolean overwrite option.
- property options
Options used for the Google Cloud Storage client.
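The insertion of the export clause after the last semicolon, together with the stringified overwrite flag, can be sketched as below. The EXPORT DATA OPTIONS(...) text follows BigQuery's documented statement syntax, but the exact clause the library generates, the file-name pattern, and the helper name with_export are assumptions.

```python
def with_export(query, bucket, prefix, overwrite=False):
    """Insert an export clause after the last semicolon of a query."""
    flag = str(bool(overwrite)).lower()  # "true" or "false"
    export = (
        "EXPORT DATA OPTIONS(\n"
        f"  uri='gs://{bucket}/{prefix}/*.parquet',\n"
        "  format='PARQUET',\n"
        "  compression='SNAPPY',\n"
        f"  overwrite={flag}\n"
        ") AS\n"
    )
    # A trailing semicolon would otherwise count as the "last" one.
    body = query.strip().rstrip(";").rstrip()
    # Note: a semicolon inside a string literal would confuse this naive split.
    head, semicolon, tail = body.rpartition(";")
    return f"{head}{semicolon}\n{export}{tail.strip()}".lstrip()


query = """
DECLARE report_date DATE DEFAULT CURRENT_DATE();
SELECT * FROM sales.orders WHERE order_date = report_date;
"""
result = with_export(query, "my-bucket", "exports")
```

Variable declarations before the last semicolon stay in front of the export clause, which matches the restriction that only the final statement may produce the exported result.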
- class GbqQuery2DataFrame(gbq, bears='pandas', config=None, polling_interval=5)[source]
Bases: ArgRepr
Results of a Google BigQuery SQL query as a pandas or polars dataframe.
Suitable for small to medium result sets that fit comfortably in memory. For large result sets, consider exporting to Google Cloud Storage and loading files from there instead.
- Parameters:
gbq (Gbq) – An instance of a wrapped GBQ client.
bears (Bears, optional) – Type of dataframe to return. Can be one of “pandas” or “polars”. Use the Bears enum to avoid typos. Defaults to “pandas”.
config (QueryJobConfig | None, optional) – An instance of QueryJobConfig (see the API docs at https://docs.cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.job.QueryJobConfig). If None (the default), the default config will be used.
polling_interval (int, optional) – Job completion is checked every polling_interval seconds. Defaults to 5 (seconds).
- Raises:
TypeError – If polling_interval cannot be cast to float.
ValueError – If bears is not “pandas” or “polars” or if polling_interval is smaller than 1.
- __call__(query)[source]
Read Google BigQuery SQL results or table into a pandas or polars dataframe.
- Parameters:
query (str) – The SQL query to execute against BigQuery.
- Returns:
The results of the SQL query in the requested dataframe type.
- Return type:
DataFrame
- Raises:
GbqError – If the query execution failed for some reason.
- class DataFrame2Gbq(gbq, dataset, table='{}', location='europe-north1', config=None, polling_interval=5, **kwargs)[source]
Bases: ArgRepr
Upload a pandas or polars dataframe into a Google BigQuery table.
Not suitable for uploading large amounts of data. Fine
- Parameters:
gbq (Gbq) – An instance of a wrapped GBQ client.
dataset (str) – The id of the dataset where the destination table resides.
table (str, optional) – The name of the table to load data into (excluding the dataset) or the prefix to it. May contain any number of string placeholders (i.e., pairs of curly brackets) that will be interpolated when instances are called. Defaults to “{}”.
location (str, optional) – The physical datacenter location to load the data onto. See the Google Cloud Platform documentation for options. If not given, non-existing tables will be created in the default location of the dataset. Defaults to an empty string.
config (ParquetLoadJobConfig | None, optional) – An instance of the wrapped LoadJobConfig with source_format locked to PARQUET. All other load job options can be set freely via that wrapper. If None (the default), a default config with only source_format set will be used.
polling_interval (int, optional) – Job completion is checked every polling_interval seconds. Defaults to 5 (seconds).
**kwargs – Additional keyword arguments passed on to the dataframe method that writes to parquet file.
- Raises:
AttributeError – If location is not a string.
TypeError – If dataset or table are not strings or if polling_interval cannot be cast to float.
ValueError – If either dataset or table are empty strings or if polling_interval is smaller than 1.
- __call__(df, *parts)[source]
Write a pandas or polars dataframe to a Google BigQuery table.
- Parameters:
df (DataFrame) – The pandas or polars dataframe to upload into a BigQuery table.
*parts (str) – Fragments that will be interpolated into the table given at instantiation. Obviously, there must be at least as many as there are placeholders in the table.
- Returns:
An empty tuple.
- Return type:
tuple
- Raises:
GbqError – If the upload failed for some reason.
Enums
- class Collation(*values)[source]
Bases: StrEnum
Specify string sorting in tables in a Google BigQuery dataset.
- SENSITIVE = ''
- INSENSITIVE = 'und:ci'
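How a string-valued enum helps validate an option such as collation can be sketched with a stand-in. The class below only mirrors the two documented members and uses the (str, Enum) mix-in so that it also runs on Python versions without StrEnum. checked_collation is a hypothetical helper, not part of the documented API.

```python
from enum import Enum


class Collation(str, Enum):
    """Stand-in mirroring the two documented members."""

    SENSITIVE = ""
    INSENSITIVE = "und:ci"


def checked_collation(value):
    """Return the plain string for an allowed option, None if not given."""
    if value is None:
        return None
    # Looking up by value raises ValueError for anything not in the enum.
    return Collation(value).value


from_member = checked_collation(Collation.INSENSITIVE)
from_string = checked_collation("und:ci")
```

Because members are strings, they compare equal to their values, so passing either the enum member or the raw string yields the same option.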