apache_beam.io.gcp.bigquery_tools module

Tools used by BigQuery sources and sinks.

Classes, constants and functions in this file are experimental and have no backwards compatibility guarantees.

These tools include wrappers and clients to interact with BigQuery APIs.

NOTHING IN THIS FILE HAS BACKWARDS COMPATIBILITY GUARANTEES.

class apache_beam.io.gcp.bigquery_tools.ExportFileFormat[source]

Bases: object

CSV = 'CSV'
JSON = 'NEWLINE_DELIMITED_JSON'
AVRO = 'AVRO'
class apache_beam.io.gcp.bigquery_tools.ExportCompression[source]

Bases: object

GZIP = 'GZIP'
DEFLATE = 'DEFLATE'
SNAPPY = 'SNAPPY'
NONE = 'NONE'
apache_beam.io.gcp.bigquery_tools.default_encoder(obj)[source]
apache_beam.io.gcp.bigquery_tools.get_hashable_destination(destination)[source]

Parses a table reference into a (project, dataset, table) tuple.

Parameters:destination – Either a TableReference object from the bigquery API. The object has the following attributes: projectId, datasetId, and tableId. Or a string representing the destination containing ‘PROJECT:DATASET.TABLE’.
Returns:A string representing the destination containing ‘PROJECT:DATASET.TABLE’.
apache_beam.io.gcp.bigquery_tools.parse_table_schema_from_json(schema_string)[source]

Parse the Table Schema provided as string.

Parameters:schema_string – String serialized table schema, should be a valid JSON.
Returns:A TableSchema of the BigQuery export from either the Query or the Table.
apache_beam.io.gcp.bigquery_tools.parse_table_reference(table, dataset=None, project=None)[source]

Parses a table reference into a (project, dataset, table) tuple.

Parameters:
  • table – The ID of the table. The ID must contain only letters (a-z, A-Z), numbers (0-9), or underscores (_). If dataset argument is None then the table argument must contain the entire table reference: ‘DATASET.TABLE’ or ‘PROJECT:DATASET.TABLE’. This argument can be a bigquery.TableReference instance in which case dataset and project are ignored and the reference is returned as a result. Additionally, for date partitioned tables, appending ‘$YYYYmmdd’ to the table name is supported, e.g. ‘DATASET.TABLE$YYYYmmdd’.
  • dataset – The ID of the dataset containing this table or null if the table reference is specified entirely by the table argument.
  • project – The ID of the project containing this table or null if the table reference is specified entirely by the table (and possibly dataset) argument.
Returns:

A TableReference object from the bigquery API. The object has the following attributes: projectId, datasetId, and tableId.

Raises:

ValueError – if the table reference as a string does not match the expected format.

class apache_beam.io.gcp.bigquery_tools.BigQueryWrapper(client=None)[source]

Bases: object

BigQuery client wrapper with utilities for querying.

The wrapper is used to organize all the BigQuery integration points and offer a common place where retry logic for failures can be controlled. In addition it offers various functions used both in sources and sinks (e.g., find and create tables, query a table, etc.).

TEMP_TABLE = 'temp_table_'
TEMP_DATASET = 'temp_dataset_'
unique_row_id

Returns a unique row ID (str) used to avoid multiple insertions.

If the row ID is provided, BigQuery will make a best effort to not insert the same row multiple times for fail and retry scenarios in which the insert request may be issued several times. This comes into play for sinks executed in a local runner.

Returns:a unique row ID string
get_query_location(project_id, query, use_legacy_sql)[source]

Get the location of tables referenced in a query.

This method returns the location of the first referenced table in the query and depends on the BigQuery service to provide error handling for queries that reference tables in multiple locations.

wait_for_bq_job(job_reference, sleep_duration_sec=5, max_retries=60)[source]

Poll job until it is DONE.

Parameters:
  • job_reference – bigquery.JobReference instance.
  • sleep_duration_sec – Specifies the delay in seconds between retries.
  • max_retries – The total number of times to retry. If equals to 0, the function waits forever.
Raises:

RuntimeError – If the job is FAILED or the number of retries has been reached.

get_table(project_id, dataset_id, table_id)[source]

Lookup a table’s metadata object.

Parameters:
  • client – bigquery.BigqueryV2 instance
  • project_id – table lookup parameter
  • dataset_id – table lookup parameter
  • table_id – table lookup parameter
Returns:

bigquery.Table instance

Raises:

HttpError – if lookup failed.

get_or_create_dataset(project_id, dataset_id, location=None)[source]
get_table_location(project_id, dataset_id, table_id)[source]
create_temporary_dataset(project_id, location)[source]
clean_up_temporary_dataset(project_id)[source]
get_job(project, job_id, location=None)[source]
perform_load_job(destination, files, job_id, schema=None, write_disposition=None, create_disposition=None, additional_load_parameters=None)[source]

Starts a job to load data into BigQuery.

Returns:bigquery.JobReference with the information about the job that was started.
perform_extract_job(destination, job_id, table_reference, destination_format, include_header=True, compression='NONE')[source]

Starts a job to export data from BigQuery.

Returns:bigquery.JobReference with the information about the job that was started.
get_or_create_table(project_id, dataset_id, table_id, schema, create_disposition, write_disposition, additional_create_parameters=None)[source]

Gets or creates a table based on create and write dispositions.

The function mimics the behavior of BigQuery import jobs when using the same create and write dispositions.

Parameters:
  • project_id – The project id owning the table.
  • dataset_id – The dataset id owning the table.
  • table_id – The table id.
  • schema – A bigquery.TableSchema instance or None.
  • create_disposition – CREATE_NEVER or CREATE_IF_NEEDED.
  • write_disposition – WRITE_APPEND, WRITE_EMPTY or WRITE_TRUNCATE.
Returns:

A bigquery.Table instance if table was found or created.

Raises:

RuntimeError – For various mismatches between the state of the table and the create/write dispositions passed in. For example if the table is not empty and WRITE_EMPTY was specified then an error will be raised since the table was expected to be empty.

run_query(project_id, query, use_legacy_sql, flatten_results, dry_run=False)[source]
insert_rows(project_id, dataset_id, table_id, rows, insert_ids=None, skip_invalid_rows=False)[source]

Inserts rows into the specified table.

Parameters:
  • project_id – The project id owning the table.
  • dataset_id – The dataset id owning the table.
  • table_id – The table id.
  • rows – A list of plain Python dictionaries. Each dictionary is a row and each key in it is the name of a field.
  • skip_invalid_rows – If there are rows with insertion errors, whether they should be skipped, and all others should be inserted successfully.
Returns:

A tuple (bool, errors). If first element is False then the second element will be a bigquery.InserttErrorsValueListEntry instance containing specific errors.

convert_row_to_dict(row, schema)[source]

Converts a TableRow instance using the schema to a Python dict.

class apache_beam.io.gcp.bigquery_tools.BigQueryReader(source, test_bigquery_client=None, use_legacy_sql=True, flatten_results=True, kms_key=None)[source]

Bases: apache_beam.runners.dataflow.native_io.iobase.NativeSourceReader

A reader for a BigQuery source.

class apache_beam.io.gcp.bigquery_tools.BigQueryWriter(sink, test_bigquery_client=None, buffer_size=None)[source]

Bases: apache_beam.runners.dataflow.native_io.iobase.NativeSinkWriter

The sink writer for a BigQuerySink.

Write(row)[source]
class apache_beam.io.gcp.bigquery_tools.RowAsDictJsonCoder[source]

Bases: apache_beam.coders.coders.Coder

A coder for a table row (represented as a dict) to/from a JSON string.

This is the default coder for sources and sinks if the coder argument is not specified.

encode(table_row)[source]
decode(encoded_table_row)[source]
class apache_beam.io.gcp.bigquery_tools.RetryStrategy[source]

Bases: object

RETRY_ALWAYS = 'RETRY_ALWAYS'
RETRY_NEVER = 'RETRY_NEVER'
RETRY_ON_TRANSIENT_ERROR = 'RETRY_ON_TRANSIENT_ERROR'
static should_retry(strategy, error_message)[source]
class apache_beam.io.gcp.bigquery_tools.AppendDestinationsFn(destination)[source]

Bases: apache_beam.transforms.core.DoFn

Adds the destination to an element, making it a KV pair.

Outputs a PCollection of KV-pairs where the key is a TableReference for the destination, and the value is the record itself.

Experimental; no backwards compatibility guarantees.

process(element, *side_inputs)[source]