apache_beam.dataframe.schemas module

Utilities for relating schema-aware PCollections and dataframe transforms.

Imposes a mapping between native Python typings (specifically those compatible with apache_beam.typehints.schemas) and common pandas dtypes:

pandas dtype                    Python typing
np.int{8,16,32,64}      <-----> np.int{8,16,32,64}*
pd.Int{8,16,32,64}Dtype <-----> Optional[np.int{8,16,32,64}]*
np.float{32,64}         <-----> Optional[np.float{32,64}]
                           \--- np.float{32,64}
Not supported           <------ Optional[bytes]
np.bool                 <-----> np.bool
np.dtype('S')           <-----> bytes
pd.BooleanDtype()       <-----> Optional[bool]
pd.StringDtype()        <-----> Optional[str]
                           \--- str
np.object               <-----> Any

* int, float, bool are treated the same as np.int64, np.float64, np.bool

Note that when converting to pandas dtypes, any types not specified here are shunted to np.object.

Similarly, when converting from pandas to Python types, types that aren’t otherwise specified here are shunted to Any. Notably, this includes np.datetime64.

Pandas does not support hierarchical data natively. Currently, all structured types (Sequence, Mapping, nested NamedTuple types) are shunted to np.object like all other unknown types. In the future these types may be given special consideration.
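The table above can be read as a lookup with a fallback. The following is a minimal sketch of that reading using only the standard library; it is not Beam's implementation. The helper name `pandas_dtype_name` and the use of pandas dtype alias strings (such as "Int64" and "string") in place of real dtype objects are assumptions made for illustration, and only the built-in typings are covered.

```python
from typing import Any, Optional, Sequence

# Typing -> pandas dtype alias, following the table above. Built-in int,
# float, and bool stand in for np.int64, np.float64, and np.bool.
_TYPING_TO_DTYPE_NAME = {
    int: "int64",
    Optional[int]: "Int64",      # nullable integer extension dtype
    float: "float64",            # shares a dtype with Optional[float]
    Optional[float]: "float64",
    bool: "bool",
    Optional[bool]: "boolean",   # pd.BooleanDtype()
    str: "string",               # shares a dtype with Optional[str]
    Optional[str]: "string",     # pd.StringDtype()
    bytes: "S",                  # np.dtype('S')
    Any: "object",
}


def pandas_dtype_name(typehint):
    """Return the dtype alias for a typing, or "object" if it is not listed.

    Assumption: Optional[bytes], which the table marks as not supported,
    falls through to "object" like any other unlisted typing.
    """
    return _TYPING_TO_DTYPE_NAME.get(typehint, "object")


print(pandas_dtype_name(Optional[int]))  # Int64
print(pandas_dtype_name(Sequence[int]))  # object (structured type)
```

Because float and str have no non-nullable pandas counterpart in the table, both the plain and Optional typings land on the same dtype, so the mapping is not reversible for those two rows.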

class apache_beam.dataframe.schemas.BatchRowsAsDataFrame(*args, proxy=None, **kwargs)

Bases: apache_beam.transforms.ptransform.PTransform

A transform that batches schema-aware PCollection elements into DataFrames.

Batching parameters are inherited from BatchElements.
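Conceptually, batching turns a stream of row-like elements into column-oriented chunks. The sketch below illustrates only that pivot with the standard library, using a dict of column lists as a stand-in for a DataFrame. The `Animal` type, the `batch_rows` helper, and the fixed `batch_size` are hypothetical; the real transform produces pandas DataFrames and takes its batch sizing from BatchElements.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, NamedTuple


class Animal(NamedTuple):
    """Hypothetical row type standing in for a schema-aware element."""
    name: str
    max_speed: float


def batch_rows(rows: Iterable[Animal], batch_size: int) -> Iterator[Dict[str, List]]:
    """Yield column-oriented batches of at most batch_size rows each."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        # Pivot the chunk from a list of rows into one list per column.
        yield {
            field: [getattr(row, field) for row in chunk]
            for field in chunk[0]._fields
        }


rows = [Animal("Falcon", 380.0), Animal("Parrot", 24.0), Animal("Horse", 55.0)]
batches = list(batch_rows(rows, batch_size=2))
# batches[0] == {"name": ["Falcon", "Parrot"], "max_speed": [380.0, 24.0]}
# batches[1] == {"name": ["Horse"], "max_speed": [55.0]}
```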


apache_beam.dataframe.schemas.generate_proxy(element_type)

Generate a proxy pandas object for the given PCollection element_type.

Currently only supports generating a DataFrame proxy from a schema-aware PCollection or a Series proxy from a primitively typed PCollection.

apache_beam.dataframe.schemas.element_type_from_dataframe(proxy, include_indexes=False)

Generate an element_type for an element-wise PCollection from a proxy pandas object.

Currently only supports converting a proxy DataFrame to the element_type of a schema-aware PCollection.

class apache_beam.dataframe.schemas.UnbatchPandas(proxy, include_indexes=False)

Bases: apache_beam.transforms.ptransform.PTransform

A transform that explodes a PCollection of DataFrames or Series into individual elements. A DataFrame is converted to a schema-aware PCollection, while a Series is converted to its underlying type.

Parameters: include_indexes – (optional, default: False) When unbatching a DataFrame with include_indexes=True, attempt to include index columns in the output schema for expanded DataFrames. Raises an error if any of the index levels are unnamed (name=None), or if any of the names are not unique among all column and index names.
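The include_indexes constraint amounts to a check over names. The following standard-library sketch mirrors the rule as stated above; the function name `check_index_names`, its arguments, and the choice of ValueError are assumptions for illustration, not Beam's actual code or error type.

```python
from collections import Counter
from typing import Optional, Sequence


def check_index_names(column_names: Sequence[str],
                      index_names: Sequence[Optional[str]]) -> None:
    """Raise ValueError if the index levels cannot join the output schema."""
    # Every index level needs a name to become a schema field.
    if any(name is None for name in index_names):
        raise ValueError("all index levels must be named")
    # Names must be unique across columns and index levels combined.
    counts = Counter(list(column_names) + list(index_names))
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"names are not unique: {duplicates}")


# A named index level that does not clash with any column passes silently.
check_index_names(["max_speed"], ["name"])
```

Under this reading, a DataFrame with a default unnamed index would need its index named (or reset) before unbatching with include_indexes=True.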