GroupBy
Takes a collection of elements and produces a collection grouped by properties of those elements.
Unlike GroupByKey, the key is dynamically created from the elements themselves.
Grouping Examples
In the following example, we create a pipeline with a PCollection of fruits.
We use GroupBy to group all fruits by the first letter of their name.
Output:
We can group by a composite key consisting of multiple properties if desired.
The resulting key is a named tuple with the two requested attributes, and the values are grouped accordingly.
Output:
If the property to group by is an attribute of the elements, a string naming that attribute may be passed to GroupBy in place of a callable expression. For example,
suppose we have the following data:
import apache_beam as beam

GROCERY_LIST = [
    beam.Row(recipe='pie', fruit='strawberry', quantity=3, unit_price=1.50),
    beam.Row(recipe='pie', fruit='raspberry', quantity=1, unit_price=3.50),
    beam.Row(recipe='pie', fruit='blackberry', quantity=1, unit_price=4.00),
    beam.Row(recipe='pie', fruit='blueberry', quantity=1, unit_price=2.00),
    beam.Row(recipe='muffin', fruit='blueberry', quantity=2, unit_price=2.00),
    beam.Row(recipe='muffin', fruit='banana', quantity=3, unit_price=1.00),
]
We can then group these rows by their recipe attribute.
Output:
    (
        'pie',
        [
            beam.Row(recipe='pie', fruit='strawberry', quantity=3, unit_price=1.50),
            beam.Row(recipe='pie', fruit='raspberry', quantity=1, unit_price=3.50),
            beam.Row(recipe='pie', fruit='blackberry', quantity=1, unit_price=4.00),
            beam.Row(recipe='pie', fruit='blueberry', quantity=1, unit_price=2.00),
        ]),
    (
        'muffin',
        [
            beam.Row(recipe='muffin', fruit='blueberry', quantity=2, unit_price=2.00),
            beam.Row(recipe='muffin', fruit='banana', quantity=3, unit_price=1.00),
        ]),
Attributes and expressions can be mixed in a single GroupBy, for example grouping by the recipe attribute together with a computed is_berry flag.
Output:
    (
        NamedTuple(recipe='pie', is_berry=True),
        [
            beam.Row(recipe='pie', fruit='strawberry', quantity=3, unit_price=1.50),
            beam.Row(recipe='pie', fruit='raspberry', quantity=1, unit_price=3.50),
            beam.Row(recipe='pie', fruit='blackberry', quantity=1, unit_price=4.00),
            beam.Row(recipe='pie', fruit='blueberry', quantity=1, unit_price=2.00),
        ]),
    (
        NamedTuple(recipe='muffin', is_berry=True),
        [
            beam.Row(recipe='muffin', fruit='blueberry', quantity=2, unit_price=2.00),
        ]),
    (
        NamedTuple(recipe='muffin', is_berry=False),
        [
            beam.Row(recipe='muffin', fruit='banana', quantity=3, unit_price=1.00),
        ]),
Aggregation
Grouping is often used in conjunction with aggregation, and the aggregate_field method of the GroupBy transform can be used to accomplish this easily.
This method takes three parameters: the field (or expression) to aggregate, the CombineFn (or associative callable) with which to aggregate, and finally the field name in which to store the result.
For example, to compute the total quantity of each fruit to buy, one could group by the fruit attribute and sum the quantity field.
Output:
As with the parameters of GroupBy, one can aggregate multiple fields, and aggregate over expressions as well as attributes.
Output:
One can, of course, aggregate the same field multiple times as well. This example also illustrates a global grouping, as the grouping key is empty.
Output:
Related transforms
- CombinePerKey for combining with a single CombineFn.
- GroupByKey for grouping with a known key.
- CoGroupByKey for multiple input collections.
Last updated on 2023/06/05