apache_beam.dataframe.schemas module
Utilities for relating schema-aware PCollections and dataframe transforms.
Imposes a mapping between native Python typings (specifically those compatible
with apache_beam.typehints.schemas), and common pandas dtypes:
pandas dtype                    Python typing
np.int{8,16,32,64}      <-----> np.int{8,16,32,64}*
pd.Int{8,16,32,64}Dtype <-----> Optional[np.int{8,16,32,64}]*
np.float{32,64}         <-----> Optional[np.float{32,64}]
                           \--- np.float{32,64}
Not supported           <------ Optional[bytes]
np.bool                 <-----> np.bool

* int, float, bool are treated the same as np.int64, np.float64, np.bool
Any unknown or unsupported types are treated as Any and shunted to
np.object:
np.object <-----> Any
bytes, unicode strings and nullable Booleans are handled differently when using
pandas 0.x vs. 1.x. pandas 0.x has no mapping for these types, so they are
shunted to np.object.
pandas 1.x only:
np.dtype('S')     <-----> bytes
pd.BooleanDtype() <-----> Optional[bool]
pd.StringDtype()  <-----> Optional[str]
                     \--- str
Pandas does not support hierarchical data natively. Currently, all structured
types (Sequence, Mapping, nested NamedTuple types) are shunted to np.object,
like all other unknown types. In the future these types may be given special
consideration.
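As an illustration of the mapping described above, the following sketch looks up the dtype that corresponds to a Python typing under pandas 1.x. It is not Beam's implementation: dtype_label, NON_NULLABLE, and NULLABLE are hypothetical names, dtypes are represented by their labels as strings so that neither numpy nor pandas is required, and treating Optional[bytes] as np.object is an assumption based on the rule for unsupported types.

```python
from typing import Any, Optional, Sequence, Union, get_args, get_origin

# Hypothetical lookup tables. Each value is the label used for the dtype in
# the tables above, kept as a string so the sketch has no dependencies.
NON_NULLABLE = {
    int: "np.int64",
    float: "np.float64",
    bool: "np.bool",
    bytes: "np.dtype('S')",
    str: "pd.StringDtype()",
}
NULLABLE = {
    int: "pd.Int64Dtype()",
    float: "np.float64",
    bool: "pd.BooleanDtype()",
    str: "pd.StringDtype()",
}


def dtype_label(typ) -> str:
    """Return the dtype label paired with a Python typing, else 'np.object'."""
    if get_origin(typ) is Union:
        inner = [arg for arg in get_args(typ) if arg is not type(None)]
        # Only Optional[T] (exactly T and None) has an entry in the tables.
        if len(inner) == 1 and len(get_args(typ)) == 2:
            return NULLABLE.get(inner[0], "np.object")
        return "np.object"
    # Any, Sequence[...], Mapping[...] and other unknown types fall through.
    return NON_NULLABLE.get(typ, "np.object")


print(dtype_label(Optional[int]))
print(dtype_label(Sequence[int]))
```

Structured types such as Sequence[int] fall through to the default, which mirrors the "shunted to np.object" behavior described above.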

class apache_beam.dataframe.schemas.BatchRowsAsDataFrame(*args, proxy=None, **kwargs)
Bases: apache_beam.transforms.ptransform.PTransform
A transform that batches schema-aware PCollection elements into DataFrames.
Batching parameters are inherited from BatchElements.
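To illustrate what batching rows into a columnar form means, here is a plain-Python sketch that groups NamedTuple rows into fixed-size batches and pivots each batch into columns. It does not use Beam or pandas. Animal, batch_rows_as_columns, and the fixed batch_size are assumptions made for illustration, whereas BatchElements chooses batch sizes dynamically.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple


class Animal(NamedTuple):  # hypothetical schema, used only for illustration
    name: str
    legs: int


def _to_columns(batch: List[tuple]) -> Dict[str, list]:
    """Pivot a list of NamedTuple rows into a field name -> values mapping."""
    fields = batch[0]._fields
    return {field: [getattr(row, field) for row in batch] for field in fields}


def batch_rows_as_columns(rows: Iterable[tuple],
                          batch_size: int) -> Iterator[Dict[str, list]]:
    """Yield column-oriented batches of at most batch_size rows each."""
    batch: List[tuple] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield _to_columns(batch)
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield _to_columns(batch)


rows = [Animal("cat", 4), Animal("bird", 2), Animal("ant", 6)]
for columns in batch_rows_as_columns(rows, batch_size=2):
    print(columns)
```

Each yielded mapping plays the role that a DataFrame plays in the real transform: one entry per schema field, holding that field's values for every row in the batch.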

apache_beam.dataframe.schemas.generate_proxy(element_type)
Generate a proxy pandas object for the given PCollection element_type.
Currently only supports generating a DataFrame proxy from a schema-aware PCollection or a Series proxy from a primitively typed PCollection.

apache_beam.dataframe.schemas.element_type_from_dataframe(proxy, include_indexes=False)
Generate an element_type for an element-wise PCollection from a proxy pandas object.
Currently only supports converting a proxy DataFrame to the element_type of a schema-aware PCollection.

class apache_beam.dataframe.schemas.UnbatchPandas(proxy, include_indexes=False)
Bases: apache_beam.transforms.ptransform.PTransform
A transform that explodes a PCollection of DataFrame or Series. A DataFrame is converted to a schema-aware PCollection, while a Series is converted to its underlying type.
Parameters: include_indexes – (optional, default: False) When unbatching a DataFrame, if include_indexes=True, attempt to include index columns in the output schema for expanded DataFrames. Raises an error if any of the index levels are unnamed (name=None), or if any of the names are not unique among all column and index names.
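The naming rule for include_indexes=True can be sketched in plain Python as follows. This is an illustration of the rule as stated above, not Beam's code: validate_index_names is a hypothetical helper, and both the ValueError exception type and the ordering of index names before column names in the result are assumptions.

```python
from typing import List, Optional, Sequence


def validate_index_names(column_names: Sequence[str],
                         index_names: Sequence[Optional[str]]) -> List[str]:
    """Return the combined field names, or raise if the rule is violated."""
    # Every index level needs a name to become a schema field.
    if any(name is None for name in index_names):
        raise ValueError("every index level must be named")
    # Assumed ordering: index levels first, then regular columns.
    all_names = list(index_names) + list(column_names)
    # Names must be unique across index levels and columns combined.
    if len(set(all_names)) != len(all_names):
        raise ValueError("index and column names must be unique")
    return all_names


print(validate_index_names(["legs"], ["name"]))
```

An unnamed index level (None) or an index level that shares a name with a column would raise instead of returning the field list.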