apache_beam.dataframe.transforms module

class apache_beam.dataframe.transforms.DataframeTransform(func, proxy=None, yield_elements='schemas', include_indexes=False)[source]

Bases: apache_beam.transforms.ptransform.PTransform

A PTransform for applying function that takes and returns dataframes to one or more PCollections.

DataframeTransform will accept a PCollection with a schema and batch it into dataframes if necessary. In this case the proxy can be omitted:

(pcoll | beam.Row(key=…, foo=…, bar=…)
DataframeTransform(lambda df: df.group_by(‘key’).sum()))

It is also possible to process a PCollection of dataframes directly, in this case a proxy must be provided. For example, if pcoll is a PCollection of dataframes, one could write:

pcoll | DataframeTransform(lambda df: df.group_by('key').sum(), proxy=...)

To pass multiple PCollections, pass a tuple of PCollections wich will be passed to the callable as positional arguments, or a dictionary of PCollections, in which case they will be passed as keyword arguments.

  • yield_elements – (optional, default: “schemas”) If set to “pandas”, return PCollections containing the raw Pandas objects (DataFrames or Series), if set to “schemas”, return an element-wise PCollection, where DataFrame and Series instances are expanded to one element per row. DataFrames are converted to schema-aware PCollections, where column values can be accessed by attribute.
  • include_indexes – (optional, default: False) When yield_elements=”schemas”, if include_indexes=True, attempt to include index columns in the output schema for expanded DataFrames. Raises an error if any of the index levels are unnamed (name=None), or if any of the names are not unique among all column and index names.