apache_beam.dataframe.schemas module¶
Utilities for relating schema-aware PCollections and DataFrame transforms.
The utilities here enforce the type mapping defined in
apache_beam.typehints.pandas_type_compatibility
.
-
class
apache_beam.dataframe.schemas.
BatchRowsAsDataFrame
(*args, proxy=None, **kwargs)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransform
A transform that batches schema-aware PCollection elements into DataFrames
Batching parameters are inherited from
BatchElements
.
-
apache_beam.dataframe.schemas.
generate_proxy
(element_type: type) → pandas.core.frame.DataFrame[source]¶ Generate a proxy pandas object for the given PCollection element_type.
Currently only supports generating a DataFrame proxy from a schema-aware PCollection or a Series proxy from a primitively typed PCollection.
-
apache_beam.dataframe.schemas.
element_type_from_dataframe
(proxy: pandas.core.frame.DataFrame, include_indexes: bool = False) → type[source]¶ Generate an element_type for an element-wise PCollection from a proxy pandas object. Currently only supports converting the element_type for a schema-aware PCollection to a proxy DataFrame.
Currently only supports generating a DataFrame proxy from a schema-aware PCollection.
-
class
apache_beam.dataframe.schemas.
UnbatchPandas
(proxy, include_indexes=False)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransform
A transform that explodes a PCollection of DataFrame or Series. DataFrame is converterd to a schema-aware PCollection, while Series is converted to its underlying type.
Parameters: include_indexes – (optional, default: False) When unbatching a DataFrame if include_indexes=True, attempt to include index columns in the output schema for expanded DataFrames. Raises an error if any of the index levels are unnamed (name=None), or if any of the names are not unique among all column and index names.