GroupBy

Pydoc

Takes a collection of elements and produces a collection grouped by properties of those elements.

Unlike GroupByKey, the key is dynamically created from the elements themselves.

Grouping Examples

In the following example, we create a pipeline with a PCollection of fruits.

We use GroupBy to group all fruits by the first letter of their name.

We can group by a composite key consisting of multiple properties if desired.

The resulting key is a named tuple with the two requested attributes, and the values are grouped accordingly.

If the property one wishes to group by is an attribute, a string naming that attribute may be passed to GroupBy in place of a callable expression.

It is possible to mix and match attributes and expressions in a single GroupBy.

Aggregation

Grouping is often used in conjunction with aggregation, and the aggregate_field method of the GroupBy transform can be used to accomplish this easily. This method takes three parameters: the field (or expression) to aggregate, the CombineFn (or associative callable) with which to aggregate, and finally a field name in which to store the result. For example, one could use it to compute the amount of each fruit to buy.

As with the grouping parameters of GroupBy, one can aggregate multiple fields, and the value to aggregate can be given as an expression rather than a field name.

One can, of course, aggregate the same field multiple times as well. This example also illustrates a global grouping, as the grouping key is empty.

Related transforms

CombinePerKey for combining with a single CombineFn.
GroupByKey for grouping with a known key.
CoGroupByKey for multiple input collections.

Last updated on 2025/07/02
