apache_beam.dataframe.schemas module
Utilities for relating schema-aware PCollections and dataframe transforms.
This module imposes a mapping between native Python typings (specifically those
compatible with apache_beam.typehints.schemas) and common pandas dtypes:
pandas dtype                    Python typing
np.int{8,16,32,64}      <-----> np.int{8,16,32,64}*
pd.Int{8,16,32,64}Dtype <-----> Optional[np.int{8,16,32,64}]*
np.float{32,64}         <-----> Optional[np.float{32,64}]
                           \--- np.float{32,64}
Not supported           <------ Optional[bytes]
np.bool                 <-----> np.bool
np.dtype('S')           <-----> bytes
pd.BooleanDtype()       <-----> Optional[bool]
pd.StringDtype()        <-----> Optional[str]
                           \--- str
np.object               <-----> Any

* int, float, bool are treated the same as np.int64, np.float64, np.bool
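One way to read the pairing of Optional integer typings with the pd.Int{8,16,32,64}Dtype extension dtypes is that a NumPy-backed integer column has no representation for a missing value, whereas float columns can use NaN. The sketch below uses only pandas and NumPy (not Beam) to illustrate that behaviour. It shows the pandas semantics behind the table and makes no claim about how this module is implemented.

```python
import numpy as np
import pandas as pd

# The nullable extension dtype can hold a missing integer (shown as <NA>).
nullable = pd.Series([1, None, 3], dtype=pd.Int64Dtype())

# Without an explicit dtype, pandas falls back to float64 to represent the gap.
inferred = pd.Series([1, None, 3])

# Float columns represent a missing value as NaN, so np.float64 can serve
# both the Optional and the non-Optional float typings.
floats = pd.Series([1.0, None, 3.0], dtype=np.float64)

# The nullable string dtype behaves like the nullable integer dtype.
strings = pd.Series(["a", None], dtype=pd.StringDtype())

print(nullable.dtype, inferred.dtype, floats.dtype, strings.dtype)
```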
Note that when converting to pandas dtypes, any types not specified here are
shunted to np.object.
Similarly, when converting from pandas to Python types, any types not
specified here are shunted to Any. Notably, this includes
np.datetime64.
Pandas does not support hierarchical data natively. Currently, all structured
types (Sequence, Mapping, nested NamedTuple types) are
shunted to np.object like all other unknown types. In the future, these
types may be given special consideration.
class apache_beam.dataframe.schemas.BatchRowsAsDataFrame(*args, proxy=None, **kwargs)
Bases: apache_beam.transforms.ptransform.PTransform
A transform that batches schema-aware PCollection elements into DataFrames.
Batching parameters are inherited from BatchElements.
apache_beam.dataframe.schemas.generate_proxy(element_type)
Generate a proxy pandas object for the given PCollection element_type.
Currently only supports generating a DataFrame proxy from a schema-aware PCollection or a Series proxy from a primitively typed PCollection.
apache_beam.dataframe.schemas.element_type_from_dataframe(proxy, include_indexes=False)
Generate an element_type for an element-wise PCollection from a proxy pandas object. Currently only supports converting a proxy DataFrame to the element_type of a schema-aware PCollection.
class apache_beam.dataframe.schemas.UnbatchPandas(proxy, include_indexes=False)
Bases: apache_beam.transforms.ptransform.PTransform
A transform that explodes a PCollection of DataFrame or Series. A DataFrame is converted to a schema-aware PCollection, while a Series is converted to its underlying type.
Parameters: include_indexes – (optional, default: False) When unbatching a DataFrame, if include_indexes=True, attempt to include index columns in the output schema for expanded DataFrames. Raises an error if any of the index levels are unnamed (name=None), or if any of the names are not unique among all column and index names.
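The two conditions on index names can be checked up front with plain pandas. The helper below, index_names_are_usable, is a hypothetical function written for this illustration. It is not part of Beam and not how UnbatchPandas performs its own validation.

```python
import pandas as pd


def index_names_are_usable(df: pd.DataFrame) -> bool:
    """Hypothetical pre-check mirroring the documented include_indexes rules."""
    index_names = list(df.index.names)
    # Rule 1: every index level must be named.
    if any(name is None for name in index_names):
        return False
    # Rule 2: names must be unique across index levels and columns.
    all_names = index_names + list(df.columns)
    return len(all_names) == len(set(all_names))


named = pd.DataFrame({"x": [1, 2]}, index=pd.Index([10, 20], name="id"))
unnamed = pd.DataFrame({"x": [1, 2]})  # default index has name=None
clashing = pd.DataFrame({"x": [1, 2]}, index=pd.Index([10, 20], name="x"))

print(index_names_are_usable(named))
print(index_names_are_usable(unnamed))
print(index_names_are_usable(clashing))
```

Running a check like this before passing include_indexes=True can surface a naming problem earlier than pipeline construction would.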