to_dataframe(pcoll, proxy=None, label=None)¶
Converts a PCollection to a deferred dataframe-like object, which can manipulated with pandas methods like filter and groupby.
For example, one might write:
pcoll = ... df = to_dataframe(pcoll, proxy=...) result = df.groupby('col').sum() pcoll_result = to_pcollection(result)
A proxy object must be given if the schema for the PCollection is not known.
Converts one or more deferred dataframe-like objects back to a PCollection.
This method creates and applies the actual Beam operations that compute the given deferred dataframes, returning a PCollection of their results. By default the resulting PCollections are schema-aware PCollections where each element is one row from the output dataframes, excluding indexes. This behavior can be modified with the yield_elements and include_indexes arguments.
If more than one (related) result is desired, it can be more efficient to pass them all at the same time to this method.
- always_return_tuple – (optional, default: False) If true, always return a tuple of PCollections, even if there’s only one output.
- yield_elements – (optional, default: “schemas”) If set to “pandas”, return PCollections containing the raw Pandas objects (DataFrames or Series), if set to “schemas”, return an element-wise PCollection, where DataFrame and Series instances are expanded to one element per row. DataFrames are converted to schema-aware PCollections, where column values can be accessed by attribute.
- include_indexes – (optional, default: False) When yield_elements=”schemas”, if include_indexes=True, attempt to include index columns in the output schema for expanded DataFrames. Raises an error if any of the index levels are unnamed (name=None), or if any of the names are not unique among all column and index names.