apache_beam.io.gcp.bigquery_file_loads module
Functionality to perform file loads into BigQuery for Batch and Streaming pipelines.
This module works around BigQuery load quotas and limitations. When destinations are dynamic, or when the data for a single job is too large, the data is split across multiple jobs.
NOTHING IN THIS FILE HAS BACKWARDS COMPATIBILITY GUARANTEES.
-
apache_beam.io.gcp.bigquery_file_loads.file_prefix_generator(with_validation=True, pipeline_gcs_location=None, temp_location=None)
-
class apache_beam.io.gcp.bigquery_file_loads.WriteRecordsToFile(max_files_per_bundle=20, max_file_size=4398046511104, coder=None)
Bases: apache_beam.transforms.core.DoFn
Write input records to files before triggering a load job.
This transform keeps up to max_files_per_bundle files open to write to. It receives (destination, record) tuples and writes the records to a different file for each destination.
If there are more than max_files_per_bundle destinations to write to, then those records are grouped by destination and later written to files by WriteGroupedRecordsToFile.
It outputs two PCollections.
Initialize a WriteRecordsToFile.
Parameters: -
UNWRITTEN_RECORD_TAG = 'UnwrittenRecords'
-
WRITTEN_FILE_TAG = 'WrittenFiles'
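The bounded-open-files behaviour described for WriteRecordsToFile can be sketched in plain Python. This is an illustration only, not Beam's implementation: the helper name write_with_bounded_files, the one-file-per-destination naming, and the returned dictionaries are assumptions made for the example. Records for destinations beyond the limit are set aside rather than written, which mirrors the hand-off to WriteGroupedRecordsToFile.

```python
import os
import tempfile
from collections import defaultdict


def write_with_bounded_files(records, out_dir, max_files_per_bundle=20):
    """Write (destination, record) pairs, keeping a bounded number of files open.

    Hypothetical helper for illustration; not part of apache_beam.
    Returns (written, unwritten): a dict of destination -> file path, and
    a dict of destination -> records that were set aside.
    """
    open_files = {}                # destination -> open file handle
    unwritten = defaultdict(list)  # destination -> records not written here
    try:
        for destination, record in records:
            if destination not in open_files:
                if len(open_files) >= max_files_per_bundle:
                    # Too many destinations: leave this record for a
                    # later, grouped write.
                    unwritten[destination].append(record)
                    continue
                # Assumes destination names are safe to use as file names.
                path = os.path.join(out_dir, "%s.json" % destination)
                open_files[destination] = open(path, "w")
            open_files[destination].write(record + "\n")
    finally:
        for handle in open_files.values():
            handle.close()
    written = {dest: handle.name for dest, handle in open_files.items()}
    return written, dict(unwritten)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        rows = [("a", "1"), ("b", "2"), ("c", "3"), ("a", "4")]
        written, unwritten = write_with_bounded_files(
            rows, tmp, max_files_per_bundle=2)
        print(sorted(written))  # destinations that received a file
        print(unwritten)        # records left for the grouped writer
```

In this sketch the two return values stand in for the two tagged outputs; how the real transform tags and emits them is not shown here.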
-
-
class apache_beam.io.gcp.bigquery_file_loads.WriteGroupedRecordsToFile(max_file_size=4398046511104, coder=None)
Bases: apache_beam.transforms.core.DoFn
Receives a collection of (destination, iterable of records) pairs and writes the records to files.
This differs from WriteRecordsToFile because it receives records grouped by destination. It is therefore not necessary to keep multiple file descriptors open, because we know for sure when all records for a single destination have been written out.
Experimental; no backwards compatibility guarantees.
-
class apache_beam.io.gcp.bigquery_file_loads.TriggerCopyJobs(create_disposition=None, write_disposition=None, test_client=None)
Bases: apache_beam.transforms.core.DoFn
Launches jobs to copy from temporary tables into the main target table.
When a job needs to write to multiple destination tables, or when a single destination table needs to have multiple load jobs to write to it, files are loaded into temporary tables, and those tables are later copied to the destination tables.
This transform emits (destination, job_reference) pairs.
TODO(BEAM-7822): In the file loads method of writing to BigQuery, copying from temp_tables to destination_table is not atomic. See: https://issues.apache.org/jira/browse/BEAM-7822
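The load-then-copy flow described for TriggerCopyJobs can be sketched as a small planning function. This is a simplified reading of the paragraph above, not the module's actual logic: plan_jobs, the input shape, and the temporary table naming scheme are all hypothetical.

```python
def plan_jobs(partitions_by_destination):
    """Plan load and copy jobs for a set of destinations.

    Hypothetical helper for illustration; not part of apache_beam.
    `partitions_by_destination` maps a destination table name to a list
    of partitions, where each partition is a list of file paths.
    Returns (loads, copies): loads are (table, files) pairs and copies
    are (temp_table, destination) pairs.
    """
    # Temporary tables are used when there are several destinations, or
    # when any one destination needs more than one load job.
    use_temp_tables = (
        len(partitions_by_destination) > 1
        or any(len(parts) > 1 for parts in partitions_by_destination.values())
    )
    loads, copies = [], []
    for destination, partitions in partitions_by_destination.items():
        for index, files in enumerate(partitions):
            if use_temp_tables:
                # Naming scheme is an assumption made for this sketch.
                temp_table = "%s_temp_%d" % (destination, index)
                loads.append((temp_table, files))
                copies.append((temp_table, destination))
            else:
                loads.append((destination, files))
    return loads, copies


if __name__ == "__main__":
    loads, copies = plan_jobs({"dataset.events": [["f1"], ["f2"]]})
    print(loads)
    print(copies)
```

Because each copy in such a plan is a separate job, a failure partway through can leave the destination partially updated, which is the non-atomicity the TODO refers to.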
-
class apache_beam.io.gcp.bigquery_file_loads.TriggerLoadJobs(schema=None, create_disposition=None, write_disposition=None, test_client=None, temporary_tables=False, additional_bq_parameters=None)
Bases: apache_beam.transforms.core.DoFn
Triggers the jobs that import the written files into BigQuery.
Experimental; no backwards compatibility guarantees.
-
TEMP_TABLES = 'TemporaryTables'
-
-
class apache_beam.io.gcp.bigquery_file_loads.PartitionFiles(max_partition_size, max_files_per_partition)
Bases: apache_beam.transforms.core.DoFn
-
MULTIPLE_PARTITIONS_TAG = 'MULTIPLE_PARTITIONS'
-
SINGLE_PARTITION_TAG = 'SINGLE_PARTITION'
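PartitionFiles carries no description on this page, but its parameter names and tag constants suggest that files are packed into partitions bounded by total size and file count. The following is a minimal sketch of that idea under that assumption; partition_files and tag_for are hypothetical names, and the greedy packing strategy is a choice made for the example rather than something the source confirms.

```python
def partition_files(files, max_partition_size, max_files_per_partition):
    """Greedily pack (path, size) pairs into partitions.

    Hypothetical helper for illustration; not part of apache_beam.
    A new partition is started when adding the next file would exceed
    either the size limit or the file-count limit.
    """
    partitions = []
    current, current_size = [], 0
    for path, size in files:
        too_many = len(current) >= max_files_per_partition
        too_big = current_size + size > max_partition_size
        if current and (too_many or too_big):
            partitions.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        partitions.append(current)
    return partitions


def tag_for(partitions):
    """Pick an output tag based on how many partitions were produced."""
    if len(partitions) == 1:
        return "SINGLE_PARTITION"
    return "MULTIPLE_PARTITIONS"


if __name__ == "__main__":
    files = [("f1", 4), ("f2", 4), ("f3", 4)]
    parts = partition_files(files, max_partition_size=10,
                            max_files_per_partition=5)
    print(parts, tag_for(parts))
```

A file larger than max_partition_size still gets a partition of its own in this sketch; how the real transform treats that case is not stated here.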
-
-
class apache_beam.io.gcp.bigquery_file_loads.WaitForBQJobs(test_client)
Bases: apache_beam.transforms.core.DoFn
Takes in a series of BigQuery job names as a side input and waits for all of them.
If any job fails, this transform fails. If all jobs succeed, it succeeds.
Experimental; no backwards compatibility guarantees.
-
ALL_DONE = <object object>
-
FAILED = <object object>
-
WAITING = <object object>
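The wait-for-all behaviour, together with the three sentinel attributes, can be sketched as a polling loop. This is an illustration under assumptions, not the transform's code: check_jobs, wait_for_jobs, the get_state callable, and the 'DONE' and 'FAILED' state strings are hypothetical stand-ins for whatever the BigQuery client actually reports.

```python
import time

# Unique sentinels, mirroring the class attributes listed above.
ALL_DONE = object()
FAILED = object()
WAITING = object()


def check_jobs(job_names, get_state):
    """Return FAILED, ALL_DONE, or WAITING for a set of jobs.

    `get_state` is a hypothetical callable mapping a job name to a state
    string; 'DONE' and 'FAILED' are assumptions made for this sketch.
    """
    all_done = True
    for name in job_names:
        state = get_state(name)
        if state == "FAILED":
            return FAILED
        if state != "DONE":
            all_done = False
    return ALL_DONE if all_done else WAITING


def wait_for_jobs(job_names, get_state, poll_seconds=1.0, sleep=time.sleep):
    """Poll until every job is done; raise as soon as one has failed."""
    while True:
        status = check_jobs(job_names, get_state)
        if status is FAILED:
            raise RuntimeError("at least one job failed")
        if status is ALL_DONE:
            return
        sleep(poll_seconds)


if __name__ == "__main__":
    # Fake job states for demonstration: j1 finishes on the second poll.
    states = {"j1": iter(["RUNNING", "DONE"]), "j2": iter(["DONE", "DONE"])}
    wait_for_jobs(["j1", "j2"], lambda name: next(states[name]),
                  poll_seconds=0.01)
    print("all jobs finished")
```

Comparing sentinels with `is` rather than `==` is what makes plain `object()` instances usable as distinct status markers.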
-
-
class apache_beam.io.gcp.bigquery_file_loads.BigQueryBatchFileLoads(destination, schema=None, custom_gcs_temp_location=None, create_disposition=None, write_disposition=None, triggering_frequency=None, coder=None, max_file_size=None, max_files_per_bundle=None, max_partition_size=None, max_files_per_partition=None, additional_bq_parameters=None, table_side_inputs=None, schema_side_inputs=None, test_client=None, validate=True, is_streaming_pipeline=False)
Bases: apache_beam.transforms.ptransform.PTransform
Takes in a set of elements and inserts them into BigQuery via batch loads.
-
DESTINATION_JOBID_PAIRS = 'destination_load_jobid_pairs'
-
DESTINATION_FILE_PAIRS = 'destination_file_pairs'
-
DESTINATION_COPY_JOBID_PAIRS = 'destination_copy_jobid_pairs'
-