apache_beam.dataframe.transforms module
class apache_beam.dataframe.transforms.DataframeTransform(func, proxy=None, yield_elements='schemas', include_indexes=False)

Bases: apache_beam.transforms.ptransform.PTransform
A PTransform for applying a function that takes and returns dataframes to one or more PCollections.
DataframeTransform will accept a PCollection with a schema and batch it into dataframes if necessary. In this case the proxy can be omitted:
(pcoll | beam.Row(key=..., foo=..., bar=...)
       | DataframeTransform(lambda df: df.groupby('key').sum()))
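The deferred dataframe operations mirror the pandas API, so the lambda above computes a per-key sum over the batched rows. As a plain-Python sketch of that computation (no Beam or pandas; the rows and column names here are made up for illustration):

```python
from collections import defaultdict

# Hypothetical rows standing in for the elements of a schema'd PCollection.
rows = [
    {"key": "a", "foo": 1, "bar": 10},
    {"key": "b", "foo": 2, "bar": 20},
    {"key": "a", "foo": 3, "bar": 30},
]

# Equivalent of df.groupby('key').sum(): sum each numeric column per key.
sums = defaultdict(lambda: {"foo": 0, "bar": 0})
for row in rows:
    group = sums[row["key"]]
    group["foo"] += row["foo"]
    group["bar"] += row["bar"]

print(dict(sums))  # {'a': {'foo': 4, 'bar': 40}, 'b': {'foo': 2, 'bar': 20}}
```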
It is also possible to process a PCollection of dataframes directly; in that case a proxy must be provided. For example, if pcoll is a PCollection of dataframes, one could write:
pcoll | DataframeTransform(lambda df: df.groupby('key').sum(), proxy=...)
To pass multiple PCollections, pass a tuple of PCollections, which will be passed to the callable as positional arguments, or a dictionary of PCollections, in which case they will be passed as keyword arguments.
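The tuple-versus-dictionary calling convention can be sketched in plain Python (a hypothetical apply_to helper for illustration, not Beam code):

```python
def apply_to(func, pcolls):
    """Illustrates the calling convention described above: a tuple of
    inputs becomes positional arguments, a dict becomes keyword
    arguments, and a single input is passed through directly."""
    if isinstance(pcolls, dict):
        return func(**pcolls)
    if isinstance(pcolls, tuple):
        return func(*pcolls)
    return func(pcolls)

# With stand-in values instead of PCollections:
assert apply_to(lambda a, b: a + b, (1, 2)) == 3
assert apply_to(lambda left, right: left - right, {"left": 5, "right": 2}) == 3
```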
Parameters:
- yield_elements – (optional, default: "schemas") If set to "pandas", return PCollections containing the raw Pandas objects (DataFrames or Series); if set to "schemas", return an element-wise PCollection, where DataFrame and Series instances are expanded to one element per row. DataFrames are converted to schema-aware PCollections, where column values can be accessed by attribute.
- include_indexes – (optional, default: False) When yield_elements="schemas", if include_indexes=True, attempt to include index columns in the output schema for expanded DataFrames. Raises an error if any of the index levels are unnamed (name=None), or if any of the names are not unique among all column and index names.
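The naming rules for include_indexes=True can be illustrated with a small stand-alone checker (a hypothetical helper sketching the documented rules, not Beam's implementation): every index level must be named, and no name may collide with any other index or column name.

```python
def check_index_names(index_names, column_names):
    """Raise ValueError under the include_indexes naming rules:
    no unnamed index levels, and all names unique overall."""
    if any(name is None for name in index_names):
        raise ValueError("cannot include unnamed index level(s)")
    all_names = list(index_names) + list(column_names)
    if len(all_names) != len(set(all_names)):
        raise ValueError("index and column names must be unique")

# A named, non-colliding index passes:
check_index_names(["key"], ["foo", "bar"])

# An unnamed level ([None]) or a collision (["foo"]) raises ValueError:
for bad_index in ([None], ["foo"]):
    try:
        check_index_names(bad_index, ["foo", "bar"])
    except ValueError:
        pass
```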