CoGroupByKey
Aggregates all input elements by their key and allows downstream processing
to consume all values associated with the key. While GroupByKey performs
this operation over a single input collection and thus a single type of input
values, CoGroupByKey operates over multiple input collections. Consequently,
the result for each key is a tuple of the values associated with that key in
each input collection.
See more information in the Beam Programming Guide.
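The grouping behavior described above can be modeled in plain Python without Beam. This is only a sketch of the semantics, not how Beam implements the transform; the helper `co_group` and the sample data are hypothetical names introduced for illustration.

```python
from collections import defaultdict


def co_group(*inputs):
    """Model co-grouping: map each key to a tuple with one value list per input."""
    # Every new key starts with one empty list for each input collection.
    grouped = defaultdict(lambda: tuple([] for _ in inputs))
    for index, pairs in enumerate(inputs):
        for key, value in pairs:
            grouped[key][index].append(value)
    return dict(grouped)


# Hypothetical keyed collections, each a list of (key, value) pairs.
orders = [('amy', 'book'), ('amy', 'pen'), ('bob', 'lamp')]
cities = [('amy', 'Oslo'), ('carl', 'Lima')]

result = co_group(orders, cities)
print(result['amy'])   # (['book', 'pen'], ['Oslo'])
print(result['bob'])   # (['lamp'], [])
print(result['carl'])  # ([], ['Lima'])
```

A key that is missing from one input still appears in the result, with an empty list in that input's position.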
Examples
In the following example, we create a pipeline with two PCollections of produce, one with icons and one with durations, both with a common key of the produce name.
Then, we apply CoGroupByKey to join both PCollections using their keys.
CoGroupByKey expects a dictionary of named keyed PCollections, and produces elements joined by their keys.
The values of each output element are dictionaries where the names correspond to the input dictionary, with lists of all the values found for that key.
import apache_beam as beam

with beam.Pipeline() as pipeline:
  icon_pairs = pipeline | 'Create icons' >> beam.Create([
      ('Apple', '🍎'),
      ('Apple', '🍏'),
      ('Eggplant', '🍆'),
      ('Tomato', '🍅'),
  ])

  duration_pairs = pipeline | 'Create durations' >> beam.Create([
      ('Apple', 'perennial'),
      ('Carrot', 'biennial'),
      ('Tomato', 'perennial'),
      ('Tomato', 'annual'),
  ])

  plants = (
      {'icons': icon_pairs, 'durations': duration_pairs}
      | 'Merge' >> beam.CoGroupByKey()
      | beam.Map(print))
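Output (element order may vary):
('Apple', {'icons': ['🍎', '🍏'], 'durations': ['perennial']})
('Carrot', {'icons': [], 'durations': ['biennial']})
('Tomato', {'icons': ['🍅'], 'durations': ['perennial', 'annual']})
('Eggplant', {'icons': ['🍆'], 'durations': []})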
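Each joined element is a pair of the key and a dictionary of value lists, as described above. The following sketch shows one way a downstream step might consume such an element. The function `summarize` is a hypothetical helper, and it assumes the element shape `(name, {'icons': [...], 'durations': [...]})` produced by the example pipeline.

```python
def summarize(element):
    """Format one co-grouped element as a single readable line."""
    name, grouped = element
    # A key absent from one input yields an empty collection for that name,
    # so fall back to a placeholder instead of an empty string.
    icons = ''.join(grouped['icons']) or '(no icon)'
    durations = ', '.join(grouped['durations']) or '(no duration)'
    return f'{name}: {icons} | {durations}'


# 'Carrot' appears only in the durations input of the example.
sample = ('Carrot', {'icons': [], 'durations': ['biennial']})
line = summarize(sample)
print(line)  # Carrot: (no icon) | biennial
```

In the example pipeline, a function like this could be applied with `beam.Map(summarize)` before printing.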
Related transforms
- CombineGlobally to combine elements.
- GroupByKey takes one input collection.
Last updated on 2024/11/20