apache_beam.transforms.enrichment_handlers.bigquery module

class apache_beam.transforms.enrichment_handlers.bigquery.BigQueryEnrichmentHandler(project: str, *, table_name: str = '', row_restriction_template: str = '', fields: Optional[List[str]] = None, column_names: Optional[List[str]] = None, condition_value_fn: Optional[Callable[[apache_beam.pvalue.Row], List[Any]]] = None, query_fn: Optional[Callable[[apache_beam.pvalue.Row], str]] = None, min_batch_size: int = 1, max_batch_size: int = 10000, **kwargs)[source]

Bases: apache_beam.transforms.enrichment.EnrichmentSourceHandler

Enrichment handler for Google Cloud BigQuery.

Use this handler with apache_beam.transforms.enrichment.Enrichment transform.

To use this handler you need either of the following combinations:
  • table_name, row_restriction_template, fields
  • table_name, row_restriction_template, condition_value_fn
  • query_fn

By default, the handler pulls all columns from the BigQuery table. To override this, use the column_name parameter to specify a list of column names to fetch.

This handler pulls data from BigQuery per element by default. To change this behavior, set the min_batch_size and max_batch_size parameters. These min and max values for batch size are sent to the apache_beam.transforms.utils.BatchElements transform.

NOTE: Elements cannot be batched when using the query_fn parameter.

Example Usage:
handler = BigQueryEnrichmentHandler(project=project_name,
row_restriction=”id=’{}’”, table_name=’project.dataset.table’, fields=fields, min_batch_size=2, max_batch_size=100)
Parameters:
  • project – Google Cloud project ID for the BigQuery table.
  • table_name (str) – Fully qualified BigQuery table name in the format project.dataset.table.
  • row_restriction_template (str) – A template string for the WHERE clause in the BigQuery query with placeholders ({}) to dynamically filter rows based on input data.
  • fields – (Optional[List[str]]) List of field names present in the input beam.Row. These are used to construct the WHERE clause (if condition_value_fn is not provided).
  • column_names – (Optional[List[str]]) Names of columns to select from the BigQuery table. If not provided, all columns (*) are selected.
  • condition_value_fn – (Optional[Callable[[beam.Row], Any]]) A function that takes a beam.Row and returns a list of value to populate in the placeholder {} of WHERE clause in the query.
  • query_fn – (Optional[Callable[[beam.Row], str]]) A function that takes a beam.Row and returns a complete BigQuery SQL query string.
  • min_batch_size (int) – Minimum number of rows to batch together when querying BigQuery. Defaults to 1 if query_fn is not specified.
  • max_batch_size (int) – Maximum number of rows to batch together. Defaults to 10,000 if query_fn is not specified.
  • **kwargs – Additional keyword arguments to pass to bigquery.Client.

Note

  • min_batch_size and max_batch_size cannot be defined if the query_fn is provided.
  • Either fields or condition_value_fn must be provided for query construction if query_fn is not provided.
  • Ensure appropriate permissions are granted for BigQuery access.
get_cache_key(request: Union[apache_beam.pvalue.Row, List[apache_beam.pvalue.Row]])[source]
batch_elements_kwargs() → Mapping[str, Any][source]

Returns a kwargs suitable for beam.BatchElements.