to_dataframe(pcoll, proxy=None, label=None)¶
Converts a PCollection to a deferred dataframe-like object, which can be manipulated with pandas methods like filter and groupby.
For example, one might write:
    pcoll = ...
    df = to_dataframe(pcoll, proxy=...)
    result = df.groupby('col').sum()
    pcoll_result = to_pcollection(result)
A proxy object must be given if the schema for the PCollection is not known.
to_pcollection(*dataframes, label=None, always_return_tuple=False, yield_elements='schemas', include_indexes=False, pipeline=None) → Union[apache_beam.pvalue.PCollection, Tuple[apache_beam.pvalue.PCollection, ...]]¶
Converts one or more deferred dataframe-like objects back to a PCollection.
This method creates and applies the actual Beam operations that compute the given deferred dataframes, returning a PCollection of their results. By default the resulting PCollections are schema-aware PCollections where each element is one row from the output dataframes, excluding indexes. This behavior can be modified with the yield_elements and include_indexes arguments.
Also accepts non-deferred pandas dataframes, which are converted to deferred, schema’d PCollections. In this case the contents of the entire dataframe are serialized into the graph, so for large amounts of data it is preferable to write them to disk and read them with one of the read methods.
If more than one (related) result is desired, it can be more efficient to pass them all at the same time to this method.
- label – (optional, default: “ToPCollection(…)”) the label to use for the conversion transform.
- always_return_tuple – (optional, default: False) If true, always return a tuple of PCollections, even if there’s only one output.
- yield_elements – (optional, default: “schemas”) If set to “pandas”, return PCollections containing the raw pandas objects (DataFrames or Series). If set to “schemas”, return an element-wise PCollection in which DataFrame and Series instances are expanded to one element per row; DataFrames are converted to schema-aware PCollections, where column values can be accessed by attribute.
- include_indexes – (optional, default: False) When yield_elements=“schemas”, if include_indexes=True, attempt to include index columns in the output schema for expanded DataFrames. Raises an error if any of the index levels are unnamed (name=None), or if any of the names are not unique among all column and index names.
- pipeline – (optional, unless non-deferred dataframes are passed) Used when creating a PCollection from a non-deferred dataframe.
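The naming rule described for include_indexes can be sketched in plain Python. This is only an illustration of the rule as stated above, not Beam's implementation; the function name `check_index_names` and the error messages are hypothetical.

```python
def check_index_names(index_names, column_names):
    """Return the combined field names, or raise ValueError.

    Mirrors the documented rule: every index level must be named, and no
    name may repeat across index levels and columns.
    """
    if any(name is None for name in index_names):
        raise ValueError('every index level needs a name')

    all_names = list(index_names) + list(column_names)
    duplicates = sorted({n for n in all_names if all_names.count(n) > 1})
    if duplicates:
        raise ValueError(f'names are not unique: {duplicates}')
    return all_names


# A named index that does not clash with any column passes the check.
print(check_index_names(['col'], ['value']))
```

Under this rule, a default unnamed index or an index level that shares a name with a column would need to be renamed before being passed with include_indexes=True.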