apache_beam.dataframe.schemas module

Utilities for relating schema-aware PCollections and DataFrame transforms.

The utilities here enforce the type mapping defined in apache_beam.typehints.pandas_type_compatibility.

class apache_beam.dataframe.schemas.BatchRowsAsDataFrame(*args, proxy=None, **kwargs)[source]

Bases: apache_beam.transforms.ptransform.PTransform

A transform that batches schema-aware PCollection elements into DataFrames

Batching parameters are inherited from BatchElements.

expand(pcoll)[source]
apache_beam.dataframe.schemas.generate_proxy(element_type: type) → pandas.core.frame.DataFrame[source]

Generate a proxy pandas object for the given PCollection element_type.

Currently only supports generating a DataFrame proxy from a schema-aware PCollection or a Series proxy from a primitively typed PCollection.

apache_beam.dataframe.schemas.element_type_from_dataframe(proxy: pandas.core.frame.DataFrame, include_indexes: bool = False) → type[source]

Generate an element_type for an element-wise PCollection from a proxy pandas object. Currently only supports converting the element_type for a schema-aware PCollection to a proxy DataFrame.

Currently only supports generating a DataFrame proxy from a schema-aware PCollection.

class apache_beam.dataframe.schemas.UnbatchPandas(proxy, include_indexes=False)[source]

Bases: apache_beam.transforms.ptransform.PTransform

A transform that explodes a PCollection of DataFrame or Series. DataFrame is converterd to a schema-aware PCollection, while Series is converted to its underlying type.

Parameters:include_indexes – (optional, default: False) When unbatching a DataFrame if include_indexes=True, attempt to include index columns in the output schema for expanded DataFrames. Raises an error if any of the index levels are unnamed (name=None), or if any of the names are not unique among all column and index names.
expand(pcoll)[source]