apache_beam.transforms.enrichment_handlers.bigquery module
class apache_beam.transforms.enrichment_handlers.bigquery.BigQueryEnrichmentHandler(project: str, *, table_name: str = '', row_restriction_template: str = '', fields: Optional[List[str]] = None, column_names: Optional[List[str]] = None, condition_value_fn: Optional[Callable[[apache_beam.pvalue.Row], List[Any]]] = None, query_fn: Optional[Callable[[apache_beam.pvalue.Row], str]] = None, min_batch_size: int = 1, max_batch_size: int = 10000, **kwargs)
Bases: apache_beam.transforms.enrichment.EnrichmentSourceHandler
Enrichment handler for Google Cloud BigQuery.
Use this handler with the apache_beam.transforms.enrichment.Enrichment transform.
To use this handler, provide one of the following parameter combinations:
- table_name, row_restriction_template, fields
- table_name, row_restriction_template, condition_value_fn
- query_fn
By default, the handler pulls all columns from the BigQuery table. To override this, use the column_names parameter to specify a list of column names to fetch.
This handler pulls data from BigQuery per element by default. To change this behavior, set the min_batch_size and max_batch_size parameters. Both values are passed to the apache_beam.transforms.util.BatchElements transform.
NOTE: Elements cannot be batched when using the query_fn parameter.
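To make the benefit of batching concrete, the following is a minimal sketch in plain Python of how the row restrictions of several elements might be folded into one WHERE clause, so that a single query serves the whole batch. The helper name `build_batched_where` and the OR-joining strategy are assumptions made for illustration; the handler's actual query construction may differ.

```python
from typing import Any, List, Sequence


def build_batched_where(template: str, batch_values: Sequence[List[Any]]) -> str:
    """Combine one filled-in restriction per element into a single clause.

    Hypothetical helper: joining the clauses with OR is an assumption,
    not confirmed behavior of BigQueryEnrichmentHandler.
    """
    clauses = [f"({template.format(*values)})" for values in batch_values]
    return " OR ".join(clauses)


# Three input elements, each contributing one value for the `{}` placeholder.
where = build_batched_where("id='{}'", [[1], [2], [3]])
print(where)
```

With a batch of three elements, one query replaces three, which is the reason to raise min_batch_size above 1 when the lookup keys are simple.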
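Example Usage:
    from apache_beam.transforms.enrichment_handlers.bigquery import BigQueryEnrichmentHandler

    # project_name and fields are defined by the caller.
    handler = BigQueryEnrichmentHandler(
        project=project_name,
        row_restriction_template="id='{}'",
        table_name='project.dataset.table',
        fields=fields,
        min_batch_size=2,
        max_batch_size=100)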
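As an alternative to the template-based configuration, a query_fn receives each input row and returns a complete SQL string. The sketch below shows the shape such a function might take. The `Request` class is a hypothetical stand-in for beam.Row so that the snippet runs without Apache Beam installed, and the table name is a placeholder.

```python
from typing import NamedTuple


class Request(NamedTuple):
    """Hypothetical stand-in for a beam.Row with a single `id` field."""
    id: int


def query_fn(row: Request) -> str:
    """Return a complete BigQuery SQL query for one input element."""
    # int() guards against interpolating arbitrary text into the query.
    # `project.dataset.table` is a placeholder table name.
    return f"SELECT * FROM `project.dataset.table` WHERE id = {int(row.id)}"


sql = query_fn(Request(id=42))
print(sql)
```

A function like this would be passed as query_fn=query_fn, without table_name, row_restriction_template, or batch-size arguments.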
Parameters:
- project (str) – Google Cloud project ID for the BigQuery table.
- table_name (str) – Fully qualified BigQuery table name in the format project.dataset.table.
- row_restriction_template (str) – A template string for the WHERE clause in the BigQuery query with placeholders ({}) to dynamically filter rows based on input data.
- fields (Optional[List[str]]) – List of field names present in the input beam.Row. These are used to construct the WHERE clause (if condition_value_fn is not provided).
- column_names (Optional[List[str]]) – Names of columns to select from the BigQuery table. If not provided, all columns (*) are selected.
- condition_value_fn (Optional[Callable[[beam.Row], List[Any]]]) – A function that takes a beam.Row and returns a list of values to fill the {} placeholders of the WHERE clause in the query.
- query_fn (Optional[Callable[[beam.Row], str]]) – A function that takes a beam.Row and returns a complete BigQuery SQL query string.
- min_batch_size (int) – Minimum number of rows to batch together when querying BigQuery. Defaults to 1 if query_fn is not specified.
- max_batch_size (int) – Maximum number of rows to batch together. Defaults to 10,000 if query_fn is not specified.
- **kwargs – Additional keyword arguments to pass to bigquery.Client.
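The relationship between row_restriction_template, fields, and condition_value_fn can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the handler's implementation: `Request` is a hypothetical stand-in for beam.Row, `row_restriction` is a hypothetical helper, and filling the placeholders with `str.format` is assumed.

```python
from typing import Any, Callable, List, NamedTuple, Optional


class Request(NamedTuple):
    """Hypothetical stand-in for a beam.Row, so the sketch runs without Beam."""
    id: int
    region: str


def row_restriction(
    template: str,
    request: Request,
    fields: Optional[List[str]] = None,
    condition_value_fn: Optional[Callable[[Request], List[Any]]] = None,
) -> str:
    """Fill the `{}` placeholders of the template for one input element."""
    if condition_value_fn is not None:
        # The function computes the placeholder values from the row.
        values = condition_value_fn(request)
    elif fields is not None:
        # Otherwise the named fields are read from the row in order.
        values = [getattr(request, name) for name in fields]
    else:
        raise ValueError("provide either fields or condition_value_fn")
    return template.format(*values)


req = Request(id=7, region="eu")
by_fields = row_restriction("id={} AND region='{}'", req, fields=["id", "region"])
by_fn = row_restriction(
    "region='{}'", req, condition_value_fn=lambda r: [r.region.upper()])
print(by_fields)
print(by_fn)
```

Using fields suits the case where the row values can be used as they are; condition_value_fn is useful when the values need transforming before they reach the query.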
Note
- min_batch_size and max_batch_size cannot be set when query_fn is provided.
- Either fields or condition_value_fn must be provided for query construction if query_fn is not provided.
- Ensure appropriate permissions are granted for BigQuery access.
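The first two constraints in the note can be expressed as a small pre-flight check. The sketch below is a hypothetical helper, not part of Apache Beam. It uses None to mean "not set" for the batch sizes, which is an assumption, since the real constructor has integer defaults.

```python
from typing import Any, Callable, List, Optional


def find_problems(
    *,
    table_name: str = "",
    row_restriction_template: str = "",
    fields: Optional[List[str]] = None,
    condition_value_fn: Optional[Callable[[Any], List[Any]]] = None,
    query_fn: Optional[Callable[[Any], str]] = None,
    min_batch_size: Optional[int] = None,
    max_batch_size: Optional[int] = None,
) -> List[str]:
    """Return human-readable problems with an argument combination."""
    problems: List[str] = []
    if query_fn is not None:
        # With query_fn, batching arguments are not allowed.
        if min_batch_size is not None or max_batch_size is not None:
            problems.append("batch sizes cannot be set together with query_fn")
        return problems
    # Without query_fn, the query is built from the table and the template.
    if not table_name or not row_restriction_template:
        problems.append("table_name and row_restriction_template are required")
    if fields is None and condition_value_fn is None:
        problems.append("either fields or condition_value_fn is required")
    return problems


print(find_problems(query_fn=lambda row: "SELECT 1", min_batch_size=2))
print(find_problems(table_name="project.dataset.table",
                    row_restriction_template="id='{}'",
                    fields=["id"]))
```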