apache_beam.io.gcp.bigquery_file_loads module
Functionality to perform file loads into BigQuery for Batch and Streaming pipelines.
This source is able to work around BigQuery load quotas and limitations. When destinations are dynamic, or when data for a single job is too large, the data will be split into multiple jobs.
NOTHING IN THIS FILE HAS BACKWARDS COMPATIBILITY GUARANTEES.
-
apache_beam.io.gcp.bigquery_file_loads.file_prefix_generator(with_validation=True, pipeline_gcs_location=None, temp_location=None)
-
class apache_beam.io.gcp.bigquery_file_loads.WriteRecordsToFile(schema, max_files_per_bundle=20, max_file_size=4398046511104, file_format=None)
Bases: apache_beam.transforms.core.DoFn
Write input records to files before triggering a load job.
This transform keeps up to max_files_per_bundle files open to write to. It receives (destination, record) tuples, and it writes the records to different files for each destination.
If there are more than max_files_per_bundle destinations that we need to write to, then those records are grouped by their destination, and later written to files by WriteGroupedRecordsToFile.
It outputs two PCollections.
Initialize a WriteRecordsToFile.
UNWRITTEN_RECORD_TAG = 'UnwrittenRecords'
-
WRITTEN_FILE_TAG = 'WrittenFiles'
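The routing rule described for WriteRecordsToFile can be modelled in plain Python. This is a simplified sketch, not Beam's implementation: route_records is a hypothetical helper, in-memory buffers stand in for files, JSON lines are an assumed file format, and pairing the two outputs with the two tag constants is an assumption based on their names.

```python
import io
import json

# Tag values taken from the class attributes documented above.
WRITTEN_FILE_TAG = "WrittenFiles"
UNWRITTEN_RECORD_TAG = "UnwrittenRecords"


def route_records(pairs, max_files_per_bundle=20):
    """Model one bundle: write to at most max_files_per_bundle destinations.

    Hypothetical helper for illustration only.
    """
    open_files = {}  # destination -> in-memory buffer standing in for a file
    unwritten = []   # (destination, record) pairs deferred to the grouped path
    for destination, record in pairs:
        if destination not in open_files:
            if len(open_files) >= max_files_per_bundle:
                # No slot left for a new destination: defer this record.
                unwritten.append((destination, record))
                continue
            open_files[destination] = io.StringIO()
        # Assumed format: one JSON document per line.
        open_files[destination].write(json.dumps(record) + "\n")
    written = {dest: buf.getvalue() for dest, buf in open_files.items()}
    return {WRITTEN_FILE_TAG: written, UNWRITTEN_RECORD_TAG: unwritten}
```

With max_files_per_bundle=2 and records for three destinations, the first two destinations get files and every record for the third lands in the unwritten output, where a grouped writer can handle it later.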
-
-
class apache_beam.io.gcp.bigquery_file_loads.WriteGroupedRecordsToFile(schema, max_file_size=4398046511104, file_format=None)
Bases: apache_beam.transforms.core.DoFn
Receives a collection of (destination, iterable of records) pairs and writes the records to files.
This is different from WriteRecordsToFile because it receives records grouped by destination. This means that it’s not necessary to keep multiple file descriptors open, because we know for sure when records for a single destination have been written out.
Experimental; no backwards compatibility guarantees.
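Because the records arrive already grouped, one open file at a time is enough. The sketch below illustrates that idea under stated assumptions: write_grouped is a hypothetical helper, an in-memory buffer stands in for a file, character count stands in for bytes, and starting a new file once max_file_size would be exceeded is an assumed behaviour that this page does not confirm.

```python
import io
import json


def write_grouped(destination, records, max_file_size=4398046511104):
    """Write one destination's records with a single buffer open at a time.

    Hypothetical helper for illustration only.
    """
    files = []               # finished (destination, contents) pairs
    current = io.StringIO()  # the only "file" open at any moment
    size = 0                 # characters written to the current buffer
    for record in records:
        line = json.dumps(record) + "\n"
        if size and size + len(line) > max_file_size:
            # Assumed behaviour: close the full file and start a new one.
            files.append((destination, current.getvalue()))
            current, size = io.StringIO(), 0
        current.write(line)
        size += len(line)
    if size:
        files.append((destination, current.getvalue()))
    return files
```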
-
class apache_beam.io.gcp.bigquery_file_loads.UpdateDestinationSchema(write_disposition=None, test_client=None, additional_bq_parameters=None, step_name=None, source_format=None)
Bases: apache_beam.transforms.core.DoFn
Update destination schema based on data that is about to be copied into it.
Unlike load and query jobs, BigQuery copy jobs do not support schema field addition or relaxation on the destination table. This DoFn fills that gap by updating the destination table schemas to be compatible with the data coming from the source table so that schema field modification options are respected regardless of whether data is loaded directly to the destination table or loaded into temporary tables before being copied into the destination.
This transform takes as input a (destination, job_reference) pair where the job_reference refers to a completed load job into a temporary table.
This transform emits (destination, job_reference) pairs where the job_reference refers to a submitted load job for performing the schema modification. Note that the input and output job references are not the same.
Experimental; no backwards compatibility guarantees.
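To make "field addition or relaxation" concrete, here is a small model of what a compatible destination schema looks like after such an update. It is only an illustration of the concept: merge_schema is a hypothetical helper that operates on plain dictionaries, and it is not how this DoFn performs the update (per the description above, the DoFn submits a job to BigQuery for that).

```python
def merge_schema(destination_fields, incoming_fields):
    """Return destination fields widened to accept the incoming data.

    Hypothetical helper; fields are dicts with "name", "type" and "mode".
    """
    # Copy each field so the caller's schema is left untouched.
    merged = {field["name"]: dict(field) for field in destination_fields}
    for field in incoming_fields:
        name = field["name"]
        if name not in merged:
            # Field addition: a column present only in the incoming data.
            merged[name] = dict(field)
        elif (merged[name].get("mode") == "REQUIRED"
              and field.get("mode") != "REQUIRED"):
            # Field relaxation: REQUIRED becomes NULLABLE.
            merged[name]["mode"] = "NULLABLE"
    return list(merged.values())
```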
-
class apache_beam.io.gcp.bigquery_file_loads.TriggerCopyJobs(create_disposition=None, write_disposition=None, test_client=None, step_name=None)
Bases: apache_beam.transforms.core.DoFn
Launches jobs to copy from temporary tables into the main target table.
When a job needs to write to multiple destination tables, or when a single destination table needs to have multiple load jobs to write to it, files are loaded into temporary tables, and those tables are later copied to the destination tables.
This transform emits (destination, job_reference) pairs.
TODO(BEAM-7822): In the file loads method of writing to BigQuery, copying from temp_tables to destination_table is not atomic. See: https://issues.apache.org/jira/browse/BEAM-7822
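The condition under which temporary tables (and therefore copy jobs) come into play can be written as a short predicate. This is a sketch of the rule as stated in the description of TriggerCopyJobs; needs_temp_tables and its input mapping are hypothetical and do not appear in the module.

```python
def needs_temp_tables(load_jobs_per_destination):
    """Decide whether loads should go through temporary tables.

    load_jobs_per_destination is a hypothetical mapping of
    destination table -> number of load jobs planned for it.
    """
    multiple_destinations = len(load_jobs_per_destination) > 1
    multiple_jobs = any(n > 1 for n in load_jobs_per_destination.values())
    return multiple_destinations or multiple_jobs
```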
-
class apache_beam.io.gcp.bigquery_file_loads.TriggerLoadJobs(schema=None, create_disposition=None, write_disposition=None, test_client=None, temporary_tables=False, additional_bq_parameters=None, source_format=None, step_name=None)
Bases: apache_beam.transforms.core.DoFn
Triggers the import jobs to BQ.
Experimental; no backwards compatibility guarantees.
-
TEMP_TABLES = 'TemporaryTables'
-
-
class apache_beam.io.gcp.bigquery_file_loads.PartitionFiles(max_partition_size, max_files_per_partition)
Bases: apache_beam.transforms.core.DoFn
-
MULTIPLE_PARTITIONS_TAG = 'MULTIPLE_PARTITIONS'
-
SINGLE_PARTITION_TAG = 'SINGLE_PARTITION'
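This page gives no description for PartitionFiles, so the following is an assumption drawn from its parameter and tag names: files for a destination are packed into partitions bounded by total size and file count, and the result is tagged by whether more than one partition was needed. partition_files is a hypothetical helper, and the greedy packing order is a choice made for this sketch.

```python
# Tag values taken from the class attributes documented above.
MULTIPLE_PARTITIONS_TAG = "MULTIPLE_PARTITIONS"
SINGLE_PARTITION_TAG = "SINGLE_PARTITION"


def partition_files(files, max_partition_size, max_files_per_partition):
    """Greedily pack (path, size_in_bytes) pairs into bounded partitions.

    Hypothetical helper; returns (tag, list of partitions of paths).
    """
    partitions, current, current_size = [], [], 0
    for path, size in files:
        too_big = current_size + size > max_partition_size
        too_many = len(current) >= max_files_per_partition
        if current and (too_big or too_many):
            # Close the current partition before adding this file.
            partitions.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        partitions.append(current)
    tag = MULTIPLE_PARTITIONS_TAG if len(partitions) > 1 else SINGLE_PARTITION_TAG
    return tag, partitions
```

A file larger than max_partition_size still gets a partition of its own in this sketch; how the real transform treats that case is not stated here.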
-
-
class apache_beam.io.gcp.bigquery_file_loads.WaitForBQJobs(test_client=None)
Bases: apache_beam.transforms.core.DoFn
Takes in a series of BQ job names as a side input and waits for all of them to finish.
If any job fails, this step fails. If all jobs succeed, it succeeds.
Experimental; no backwards compatibility guarantees.
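The all-or-nothing waiting behaviour can be sketched as a polling loop. This is a minimal model, not the DoFn's code: wait_for_jobs and the get_state callable are hypothetical, and the (state, error) return shape is an assumption chosen for the example.

```python
import time


def wait_for_jobs(job_names, get_state, poll_seconds=0.0):
    """Block until every job is DONE; raise as soon as any job reports an error.

    get_state is a hypothetical callable: job name -> (state, error_or_None).
    """
    pending = set(job_names)
    while pending:
        # Iterate over a sorted copy so the set can shrink during the loop.
        for name in sorted(pending):
            state, error = get_state(name)
            if error is not None:
                raise RuntimeError(f"job {name} failed: {error}")
            if state == "DONE":
                pending.discard(name)
        if pending:
            time.sleep(poll_seconds)
    return True
```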
-
class apache_beam.io.gcp.bigquery_file_loads.BigQueryBatchFileLoads(destination, schema=None, custom_gcs_temp_location=None, create_disposition=None, write_disposition=None, triggering_frequency=None, with_auto_sharding=False, temp_file_format=None, max_file_size=None, max_files_per_bundle=None, max_partition_size=None, max_files_per_partition=None, additional_bq_parameters=None, table_side_inputs=None, schema_side_inputs=None, test_client=None, validate=True, is_streaming_pipeline=False)
Bases: apache_beam.transforms.ptransform.PTransform
Takes in a set of elements and inserts them into BigQuery via batch loads.
-
DESTINATION_JOBID_PAIRS = 'destination_load_jobid_pairs'
-
DESTINATION_FILE_PAIRS = 'destination_file_pairs'
-
DESTINATION_COPY_JOBID_PAIRS = 'destination_copy_jobid_pairs'
-
COUNT = 0
-