K - the type of the keys of the input and output
PCollectionsV - the type of the values of the input PCollection
and the elements of the Iterables in the output
PCollectionpublic class GroupByKey<K,V> extends PTransform<PCollection<KV<K,V>>,PCollection<KV<K,java.lang.Iterable<V>>>>
GroupByKey<K, V> takes a PCollection<KV<K, V>>,
groups the values by key and windows, and returns a
PCollection<KV<K, Iterable<V>>> representing a map from
each distinct key and window of the input PCollection to an
Iterable over all the values associated with that key in
the input per window. Absent repeatedly-firing
triggering, each key in the output
PCollection is unique within each window.
GroupByKey is analogous to converting a multi-map into
a uni-map, and related to GROUP BY in SQL. It corresponds
to the "shuffle" step between the Mapper and the Reducer in the
MapReduce framework.
Two keys of type K are compared for equality
not by regular Java Object.equals(java.lang.Object), but instead by
first encoding each of the keys using the Coder of the
keys of the input PCollection, and then comparing the
encoded bytes. This admits efficient parallel evaluation. Note that
this requires that the Coder of the keys be deterministic (see
Coder.verifyDeterministic()). If the key Coder is not
deterministic, an exception is thrown at pipeline construction time.
By default, the Coder of the keys of the output
PCollection is the same as that of the keys of the input,
and the Coder of the elements of the Iterable
values of the output PCollection is the same as the
Coder of the values of the input.
Example of use:
PCollection<KV<String, Doc>> urlDocPairs = ...;
PCollection<KV<String, Iterable<Doc>>> urlToDocs =
urlDocPairs.apply(GroupByKey.<String, Doc>create());
PCollection<R> results =
urlToDocs.apply(ParDo.of(new DoFn<KV<String, Iterable<Doc>>, R>() {
{@literal @}ProcessElement
public void processElement(ProcessContext c) {
String url = c.element().getKey();
Iterable<Doc> docsWithThatUrl = c.element().getValue();
... process all docs having that url ...
}}));
GroupByKey is a key primitive in data-parallel
processing, since it is the main way to efficiently bring
associated data together into one location. It is also a key
determiner of the performance of a data-parallel pipeline.
See CoGroupByKey
for a way to group multiple input PCollections by a common key at once.
See Combine.PerKey for a common pattern of
GroupByKey followed by Combine.GroupedValues.
When grouping, windows that can be merged according to the WindowFn
of the input PCollection will be merged together, and a window pane
corresponding to the new, merged window will be created. The items in this pane
will be emitted when a trigger fires. By default this will be when the input
sources estimate there will be no more data for the window. See
AfterWatermark
for details on the estimation.
The timestamp for each emitted pane is determined by the
Window.withTimestampCombiner(TimestampCombiner) windowing operation}.
The output PCollection will have the same WindowFn
as the input.
If the input PCollection contains late data or the
requested TriggerFn can fire before
the watermark, then there may be multiple elements
output by a GroupByKey that correspond to the same key and window.
If the WindowFn of the input requires merging, it is not
valid to apply another GroupByKey without first applying a new
WindowFn or applying Window.remerge().
name| Modifier and Type | Method and Description |
|---|---|
static void |
applicableTo(PCollection<?> input) |
static <K,V> GroupByKey<K,V> |
create()
Returns a
GroupByKey<K, V> PTransform. |
PCollection<KV<K,java.lang.Iterable<V>>> |
expand(PCollection<KV<K,V>> input)
Override this method to specify how this
PTransform should be expanded
on the given InputT. |
boolean |
fewKeys()
Returns whether it groups just few keys.
|
static <K,V> Coder<V> |
getInputValueCoder(Coder<KV<K,V>> inputCoder)
Returns the
Coder of the values of the input to this transform. |
static <K,V> Coder<K> |
getKeyCoder(Coder<KV<K,V>> inputCoder)
Returns the
Coder of the keys of the input to this
transform, which is also used as the Coder of the keys of
the output of this transform. |
static <K,V> KvCoder<K,java.lang.Iterable<V>> |
getOutputKvCoder(Coder<KV<K,V>> inputCoder)
Returns the
Coder of the output of this transform. |
void |
populateDisplayData(DisplayData.Builder builder)
Register display data for the given transform or component.
|
WindowingStrategy<?,?> |
updateWindowingStrategy(WindowingStrategy<?,?> inputStrategy) |
getAdditionalInputs, getDefaultOutputCoder, getDefaultOutputCoder, getDefaultOutputCoder, getKindString, getName, toString, validatepublic static <K,V> GroupByKey<K,V> create()
GroupByKey<K, V> PTransform.K - the type of the keys of the input and output
PCollectionsV - the type of the values of the input PCollection
and the elements of the Iterables in the output
PCollectionpublic boolean fewKeys()
public static void applicableTo(PCollection<?> input)
public WindowingStrategy<?,?> updateWindowingStrategy(WindowingStrategy<?,?> inputStrategy)
public PCollection<KV<K,java.lang.Iterable<V>>> expand(PCollection<KV<K,V>> input)
PTransformPTransform should be expanded
on the given InputT.
NOTE: This method should not be called directly. Instead apply the
PTransform should be applied to the InputT using the apply
method.
Composite transforms, which are defined in terms of other transforms, should return the output of one of the composed transforms. Non-composite transforms, which do not apply any transforms internally, should return a new unbound output and register evaluators (via backend-specific registration methods).
expand in class PTransform<PCollection<KV<K,V>>,PCollection<KV<K,java.lang.Iterable<V>>>>public static <K,V> Coder<K> getKeyCoder(Coder<KV<K,V>> inputCoder)
Coder of the keys of the input to this
transform, which is also used as the Coder of the keys of
the output of this transform.public static <K,V> Coder<V> getInputValueCoder(Coder<KV<K,V>> inputCoder)
Coder of the values of the input to this transform.public static <K,V> KvCoder<K,java.lang.Iterable<V>> getOutputKvCoder(Coder<KV<K,V>> inputCoder)
Coder of the output of this transform.public void populateDisplayData(DisplayData.Builder builder)
PTransformpopulateDisplayData(DisplayData.Builder) is invoked by Pipeline runners to collect
display data via DisplayData.from(HasDisplayData). Implementations may call
super.populateDisplayData(builder) in order to register display data in the current
namespace, but should otherwise use subcomponent.populateDisplayData(builder) to use
the namespace of the subcomponent.
By default, does not register any display data. Implementors may override this method to provide their own display data.
populateDisplayData in interface HasDisplayDatapopulateDisplayData in class PTransform<PCollection<KV<K,V>>,PCollection<KV<K,java.lang.Iterable<V>>>>builder - The builder to populate with display data.HasDisplayData