apache_beam.transforms.enrichment_handlers.bigquery module

class apache_beam.transforms.enrichment_handlers.bigquery.BigQueryEnrichmentHandler(project: str, *, table_name: str = '', row_restriction_template: str = '', fields: List[str] | None = None, column_names: List[str] | None = None, condition_value_fn: Callable[[Row], List[Any]] | None = None, query_fn: Callable[[Row], str] | None = None, min_batch_size: int = 1, max_batch_size: int = 10000, **kwargs)[source]

Bases: EnrichmentSourceHandler[Row | List[Row], Row | List[Row]]

Enrichment handler for Google Cloud BigQuery.

Use this handler with apache_beam.transforms.enrichment.Enrichment transform.

To use this handler you need either of the following combinations:
  • table_name, row_restriction_template, fields

  • table_name, row_restriction_template, condition_value_fn

  • query_fn

By default, the handler pulls all columns from the BigQuery table. To override this, use the column_name parameter to specify a list of column names to fetch.

This handler pulls data from BigQuery per element by default. To change this behavior, set the min_batch_size and max_batch_size parameters. These min and max values for batch size are sent to the apache_beam.transforms.utils.BatchElements transform.

NOTE: Elements cannot be batched when using the query_fn parameter.

Example Usage:
handler = BigQueryEnrichmentHandler(project=project_name,

row_restriction=”id=’{}’”, table_name=’project.dataset.table’, fields=fields, min_batch_size=2, max_batch_size=100)

Parameters:
  • project – Google Cloud project ID for the BigQuery table.

  • table_name (str) – Fully qualified BigQuery table name in the format project.dataset.table.

  • row_restriction_template (str) – A template string for the WHERE clause in the BigQuery query with placeholders ({}) to dynamically filter rows based on input data.

  • fields – (Optional[List[str]]) List of field names present in the input beam.Row. These are used to construct the WHERE clause (if condition_value_fn is not provided).

  • column_names – (Optional[List[str]]) Names of columns to select from the BigQuery table. If not provided, all columns (*) are selected.

  • condition_value_fn – (Optional[Callable[[beam.Row], Any]]) A function that takes a beam.Row and returns a list of value to populate in the placeholder {} of WHERE clause in the query.

  • query_fn – (Optional[Callable[[beam.Row], str]]) A function that takes a beam.Row and returns a complete BigQuery SQL query string.

  • min_batch_size (int) – Minimum number of rows to batch together when querying BigQuery. Defaults to 1 if query_fn is not specified.

  • max_batch_size (int) – Maximum number of rows to batch together. Defaults to 10,000 if query_fn is not specified.

  • **kwargs – Additional keyword arguments to pass to bigquery.Client.

Note

  • min_batch_size and max_batch_size cannot be defined if the query_fn is provided.

  • Either fields or condition_value_fn must be provided for query construction if query_fn is not provided.

  • Ensure appropriate permissions are granted for BigQuery access.

get_cache_key(request: Row | List[Row])[source]
batch_elements_kwargs() Mapping[str, Any][source]

Returns a kwargs suitable for beam.BatchElements.