apache_beam.dataframe.convert module

apache_beam.dataframe.convert.to_dataframe(pcoll, proxy=None, label=None)[source]

Converts a PCollection to a deferred dataframe-like object, which can manipulated with pandas methods like filter and groupby.

For example, one might write:

pcoll = ...
df = to_dataframe(pcoll, proxy=...)
result = df.groupby('col').sum()
pcoll_result = to_pcollection(result)

A proxy object must be given if the schema for the PCollection is not known.

apache_beam.dataframe.convert.to_pcollection(*dataframes, **kwargs)[source]

Converts one or more deferred dataframe-like objects back to a PCollection.

This method creates and applies the actual Beam operations that compute the given deferred dataframes, returning a PCollection of their results. By default the resulting PCollections are schema-aware PCollections where each element is one row from the output dataframes, excluding indexes. This behavior can be modified with the yield_elements and include_indexes arguments.

If more than one (related) result is desired, it can be more efficient to pass them all at the same time to this method.

  • always_return_tuple – (optional, default: False) If true, always return a tuple of PCollections, even if there’s only one output.
  • yield_elements – (optional, default: “schemas”) If set to “pandas”, return PCollections containing the raw Pandas objects (DataFrames or Series), if set to “schemas”, return an element-wise PCollection, where DataFrame and Series instances are expanded to one element per row. DataFrames are converted to schema-aware PCollections, where column values can be accessed by attribute.
  • include_indexes – (optional, default: False) When yield_elements=”schemas”, if include_indexes=True, attempt to include index columns in the output schema for expanded DataFrames. Raises an error if any of the index levels are unnamed (name=None), or if any of the names are not unique among all column and index names.