apache_beam.dataframe.schemas module
Utilities for relating schema-aware PCollections and dataframe transforms.
Imposes a mapping between native Python typings (specifically those compatible
with apache_beam.typehints.schemas) and common pandas dtypes:
pandas dtype                    Python typing
np.int{8,16,32,64}      <-----> np.int{8,16,32,64}*
pd.Int{8,16,32,64}Dtype <-----> Optional[np.int{8,16,32,64}]*
np.float{32,64}         <-----> Optional[np.float{32,64}]
                           \--- np.float{32,64}
Not supported           <------ Optional[bytes]
np.bool                 <-----> np.bool
np.dtype('S')           <-----> bytes
pd.BooleanDtype()       <-----> Optional[bool]
pd.StringDtype()        <-----> Optional[str]
                           \--- str
np.object               <-----> Any

* int, float, bool are treated the same as np.int64, np.float64, np.bool
Note that when converting to pandas dtypes, any types not specified here are
shunted to np.object.
Similarly, when converting from pandas to Python types, types that aren't
otherwise specified here are shunted to Any. Notably, this includes
np.datetime64.
Pandas does not support hierarchical data natively. Currently, all structured
types (Sequence, Mapping, nested NamedTuple types) are shunted to np.object
like all other unknown types. In the future these types may be given special
consideration.
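The mapping rules above can be illustrated with a small standalone sketch. It is not Beam's implementation: the helper `dtype_alias_for` and the two lookup tables are hypothetical names introduced here, the sketch covers only the built-in `int`, `float`, `bool`, `str` and `bytes` typings (which the footnote says are treated like their NumPy counterparts), and it returns pandas/NumPy dtype alias strings rather than dtype objects so that it runs with the standard library alone. Raising for `Optional[bytes]` is this sketch's own choice for the "Not supported" row.

```python
from typing import Any, Optional, Sequence, Union, get_args, get_origin

# Non-nullable Python typing -> dtype alias (per the table above).
REQUIRED = {int: "int64", float: "float64", bool: "bool", bytes: "S", str: "string"}

# Optional[T] -> dtype alias. Floats need no separate nullable dtype because
# NaN already represents a missing value.
NULLABLE = {int: "Int64", float: "float64", bool: "boolean", str: "string"}


def dtype_alias_for(py_type):
    """Hypothetical helper: return a dtype alias for a Python typing."""
    if get_origin(py_type) is Union:
        members = get_args(py_type)
        inner = [m for m in members if m is not type(None)]
        # Only the Optional[T] form (exactly T and None) gets a nullable dtype.
        if len(members) == 2 and len(inner) == 1:
            if inner[0] is bytes:
                # The table marks Optional[bytes] as not supported; raising
                # here is an assumption of this sketch.
                raise NotImplementedError("Optional[bytes] is not supported")
            return NULLABLE.get(inner[0], "object")
        return "object"
    # Anything not listed, including Any and structured types, becomes object.
    return REQUIRED.get(py_type, "object")


if __name__ == "__main__":
    for t in (int, Optional[int], float, Optional[str], bytes, Any, Sequence[int]):
        print(t, "->", dtype_alias_for(t))
```

Note how `Sequence[int]` and `Any` both fall through to `object`, mirroring the statement that structured types are currently handled like any other unknown type.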
class apache_beam.dataframe.schemas.BatchRowsAsDataFrame(*args, proxy=None, **kwargs)
Bases: apache_beam.transforms.ptransform.PTransform
A transform that batches schema-aware PCollection elements into DataFrames.
Batching parameters are inherited from BatchElements.
apache_beam.dataframe.schemas.generate_proxy(element_type)
Generate a proxy pandas object for the given PCollection element_type.
Currently only supports generating a DataFrame proxy from a schema-aware PCollection or a Series proxy from a primitively typed PCollection.
apache_beam.dataframe.schemas.element_type_from_dataframe(proxy, include_indexes=False)
Generate an element_type for an element-wise PCollection from a proxy pandas object.
Currently only supports generating the element_type of a schema-aware PCollection from a proxy DataFrame.
class apache_beam.dataframe.schemas.UnbatchPandas(proxy, include_indexes=False)
Bases: apache_beam.transforms.ptransform.PTransform
A transform that explodes a PCollection of DataFrame or Series objects. A DataFrame is converted to a schema-aware PCollection, while a Series is converted to its underlying type.
Parameters: include_indexes – (optional, default: False) When unbatching a DataFrame, if include_indexes=True, attempt to include index columns in the output schema for expanded DataFrames. Raises an error if any of the index levels are unnamed (name=None), or if any of the names are not unique among all column and index names.
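The two error conditions for include_indexes=True can be expressed as a short standalone check. This is only a sketch of the rule as documented, not the code Beam runs: `check_index_names` is a hypothetical helper, it takes plain lists of names instead of a DataFrame, and the use of `ValueError` is an assumption.

```python
def check_index_names(column_names, index_names):
    """Hypothetical helper: validate names before adding index columns to a schema.

    Raises ValueError (an assumed error type) if any index level is unnamed,
    or if any name repeats across the combined column and index names.
    """
    # Rule 1: every index level must have a name (name=None is rejected).
    if any(name is None for name in index_names):
        raise ValueError("all index levels must be named")

    # Rule 2: names must be unique among all column and index names.
    all_names = list(column_names) + list(index_names)
    duplicates = sorted({n for n in all_names if all_names.count(n) > 1})
    if duplicates:
        raise ValueError(f"names are not unique: {duplicates}")


if __name__ == "__main__":
    # Passes: one named index level that does not collide with a column.
    check_index_names(["animal", "max_speed"], ["id"])

    # Each of these violates one of the two rules.
    for cols, idx in ((["animal"], [None]), (["animal", "id"], ["id"])):
        try:
            check_index_names(cols, idx)
        except ValueError as err:
            print("rejected:", err)
```

With a real DataFrame, the corresponding inputs would be the column labels and the names of the index levels.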