apache_beam.dataframe.schemas module

Utilities for relating schema-aware PCollections and dataframe transforms.

Imposes a mapping between native Python typings (specifically those compatible with apache_beam.typehints.schemas) and common pandas dtypes:

pandas dtype                    Python typing
np.int{8,16,32,64}      <-----> np.int{8,16,32,64}*
pd.Int{8,16,32,64}Dtype <-----> Optional[np.int{8,16,32,64}]*
np.float{32,64}         <-----> Optional[np.float{32,64}]
                           \--- np.float{32,64}
Not supported           <------ Optional[bytes]
np.bool                 <-----> np.bool

* int, float, bool are treated the same as np.int64, np.float64, np.bool

Any unknown or unsupported types are treated as Any and shunted to np.object:

np.object               <-----> Any

bytes, unicode strings and nullable Booleans are handled differently when using pandas 0.x vs. 1.x. pandas 0.x has no mapping for these types, so they are shunted to np.object.

pandas 1.x Only:

np.dtype('S')     <-----> bytes
pd.BooleanDtype() <-----> Optional[bool]
pd.StringDtype()  <-----> Optional[str]
                     \--- str

Pandas does not support hierarchical data natively. Currently, all structured types (Sequence, Mapping, nested NamedTuple types) are shunted to np.object, like all other unknown types. In the future these types may be given special consideration.
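
The mapping described above can be illustrated with a small lookup function. This is a minimal sketch that uses only the standard library, not the module's actual implementation: dtype_label_for and both tables are hypothetical names, the dtypes are represented by the labels used in the tables above rather than real pandas objects, and the pandas 1.x rows are assumed. Falling back to np.object for Optional[bytes] is also an assumption, since the table only lists it as not supported.

```python
import typing
from typing import Any, Optional, Sequence

# Hypothetical tables mirroring the documented mapping (pandas 1.x rows).
# Values are labels, not real dtype objects, so no pandas/numpy is needed.
_PLAIN = {
    int: "np.int64",
    float: "np.float64",
    bool: "np.bool",
    bytes: "np.dtype('S')",
    str: "pd.StringDtype()",
}
_OPTIONAL = {
    int: "pd.Int64Dtype()",
    float: "np.float64",
    bool: "pd.BooleanDtype()",
    str: "pd.StringDtype()",
}


def dtype_label_for(typehint) -> str:
    """Return the label of the dtype that the tables pair with typehint."""
    args = typing.get_args(typehint)
    is_optional = (
        typing.get_origin(typehint) is typing.Union and type(None) in args
    )
    if is_optional:
        inner = [arg for arg in args if arg is not type(None)]
        if len(inner) == 1 and inner[0] in _OPTIONAL:
            return _OPTIONAL[inner[0]]
        # Assumption: anything else, including Optional[bytes], becomes object.
        return "np.object"
    # Any, Sequence, Mapping and nested NamedTuple types all land here.
    return _PLAIN.get(typehint, "np.object")
```

Under these assumptions, dtype_label_for(Optional[int]) returns the nullable integer label, while dtype_label_for(Sequence[int]) and dtype_label_for(Any) both fall back to np.object.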

class apache_beam.dataframe.schemas.BatchRowsAsDataFrame(*args, **kwargs)[source]

Bases: apache_beam.transforms.ptransform.PTransform

A transform that batches schema-aware PCollection elements into DataFrames.

Batching parameters are inherited from BatchElements.
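
Conceptually, batching turns a stream of row-oriented elements into column-oriented chunks. The sketch below shows that idea in plain Python and is not how the transform is implemented: Reading and batch_rows_as_columns are hypothetical names, each dict of lists stands in for one DataFrame, and the batch size is fixed rather than governed by the parameters inherited from BatchElements.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, NamedTuple


class Reading(NamedTuple):
    """Hypothetical schema, used only for illustration."""
    sensor: str
    value: float


def batch_rows_as_columns(
    rows: Iterable[Reading], batch_size: int
) -> Iterator[Dict[str, List]]:
    """Yield column-oriented batches; each dict stands in for a DataFrame."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        # One list per schema field, in field order, like DataFrame columns.
        yield {
            name: [getattr(row, name) for row in chunk]
            for name in Reading._fields
        }


rows = [Reading("a", 1.0), Reading("b", 2.5), Reading("a", 3.0)]
batches = list(batch_rows_as_columns(rows, batch_size=2))
```

With three rows and a batch size of two, the sketch produces one full batch and one partial batch, each keyed by the schema's field names.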

expand(pcoll)[source]

apache_beam.dataframe.schemas.generate_proxy(element_type)[source]

Generate a proxy pandas object for the given PCollection element_type.

Currently only supports generating a DataFrame proxy from a schema-aware PCollection.

apache_beam.dataframe.schemas.element_type_from_dataframe(proxy, include_indexes=False)[source]

Generate an element_type for an element-wise PCollection from a proxy pandas object. Currently only supports converting a proxy DataFrame to the element_type for a schema-aware PCollection.

class apache_beam.dataframe.schemas.UnbatchPandas(proxy, include_indexes=False)[source]

Bases: apache_beam.transforms.ptransform.PTransform

A transform that explodes a PCollection of DataFrames or Series into individual elements. A DataFrame is converted to a schema-aware PCollection, while a Series is converted to its underlying type.

Parameters: include_indexes – (optional, default: False) When unbatching a DataFrame with include_indexes=True, attempt to include index columns in the output schema. Raises an error if any index level is unnamed (name=None), or if any name is not unique among all column and index names.
expand(pcoll)[source]
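
The naming rules for include_indexes can be sketched as a standalone check. This is a hedged illustration in plain Python, not the library's code: output_field_names is a hypothetical helper, the error messages are invented, and placing index levels before columns in the result is an assumption.

```python
from typing import List, Optional, Sequence


def output_field_names(
    columns: Sequence[str],
    index_names: Sequence[Optional[str]],
    include_indexes: bool = False,
) -> List[str]:
    """Return the field names of the output schema for a DataFrame proxy."""
    if not include_indexes:
        # Index levels are dropped, so their names are never checked.
        return list(columns)
    if any(name is None for name in index_names):
        raise ValueError(
            "include_indexes=True requires every index level to be named"
        )
    # Assumption: index levels come first, followed by the columns.
    names = list(index_names) + list(columns)
    if len(set(names)) != len(names):
        raise ValueError("index and column names must be unique")
    return names
```

For example, a proxy with an index named "id" and a column "a" yields both names when include_indexes=True, while an unnamed index or an index that shares a name with a column raises ValueError.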