Takes a collection of elements and produces a collection grouped by properties of those elements.
Unlike GroupByKey, the key is dynamically created from the elements themselves.
For example, given a pipeline with a PCollection of fruits, we can use
GroupBy to group all fruits by the first letter of their name.
We can group by a composite key consisting of multiple properties if desired.
The resulting key is a named tuple with the two requested attributes, and the values are grouped accordingly.
In the case that the property one wishes to group by is an attribute, a string
may be passed to GroupBy in the place of a callable expression.
It is possible to mix and match attributes and expressions in a single grouping.
Grouping is often used in conjunction with aggregation, and the
aggregate_field method of the GroupBy transform can be used to accomplish
this easily.
This method takes three parameters: the field (or expression) to
aggregate, the CombineFn (or associative callable) with which to
aggregate, and finally a field name in which to store the result.
For example, suppose one wanted to compute the amount of each fruit to buy.
This can be expressed by grouping on the fruit and summing its quantities.
As with the parameters of GroupBy, one can also aggregate multiple fields
and aggregate by expressions rather than plain field names.
One can, of course, aggregate the same field multiple times as well. This example also illustrates a global grouping, as the grouping key is empty.
- CombinePerKey for combining with a single CombineFn.
- GroupByKey for grouping with a known key.
- CoGroupByKey for multiple input collections.
Last updated on 2023/11/30