Class GroupByKey<K,V>
- Type Parameters:
K- the type of the keys of the input and outputPCollectionsV- the type of the values of the inputPCollectionand the elements of theIterables in the outputPCollection
- All Implemented Interfaces:
Serializable,HasDisplayData
GroupByKey<K, V> takes a PCollection<KV<K, V>>, groups the values by key and
windows, and returns a PCollection<KV<K, Iterable<V>>> representing a map from each
distinct key and window of the input PCollection to an Iterable over all the
values associated with that key in the input per window. Absent repeatedly-firing triggering, each key in the output PCollection is unique within each
window.
GroupByKey is analogous to converting a multi-map into a uni-map, and related to
GROUP BY in SQL. It corresponds to the "shuffle" step between the Mapper and the Reducer
in the MapReduce framework.
Two keys of type K are compared for equality not by regular Java Object.equals(java.lang.Object), but instead by first encoding each of the keys using the Coder of the
keys of the input PCollection, and then comparing the encoded bytes. This admits
efficient parallel evaluation. Note that this requires that the Coder of the keys be
deterministic (see Coder.verifyDeterministic()). If the key Coder is not
deterministic, an exception is thrown at pipeline construction time.
By default, the Coder of the keys of the output PCollection is the same as
that of the keys of the input, and the Coder of the elements of the Iterable
values of the output PCollection is the same as the Coder of the values of the
input.
Example of use:
PCollection<KV<String, Doc>> urlDocPairs = ...; PCollection<KV<String, Iterable<Doc>>> urlToDocs = urlDocPairs.apply(GroupByKey.<String, Doc>create()); PCollection<R> results = urlToDocs.apply(ParDo.of(new DoFn<KV<String, Iterable<Doc>>, R>(){@ProcessElement public void processElement(ProcessContext c) { String url = c.element().getKey(); Iterable<Doc> docsWithThatUrl = c.element().getValue(); ... process all docs having that url ... }}));
GroupByKey is a key primitive in data-parallel processing, since it is the main way to
efficiently bring associated data together into one location. It is also a key determiner of the
performance of a data-parallel pipeline.
See CoGroupByKey for a way to group multiple input
PCollections by a common key at once.
See Combine.PerKey for a common pattern of GroupByKey followed by Combine.GroupedValues.
When grouping, windows that can be merged according to the WindowFn of the input
PCollection will be merged together, and a window pane corresponding to the new, merged
window will be created. The items in this pane will be emitted when a trigger fires. By default
this will be when the input sources estimate there will be no more data for the window. See
AfterWatermark for details on the estimation.
The timestamp for each emitted pane is determined by the Window.withTimestampCombiner(TimestampCombiner) windowing operation}. The output
PCollection will have the same WindowFn as the input.
If the input PCollection contains late data or the requested
TriggerFn can fire before the watermark, then there may be multiple elements output by a
GroupByKey that correspond to the same key and window.
If the WindowFn of the input requires merging, it is not valid to apply another
GroupByKey without first applying a new WindowFn or applying Window.remerge().
- See Also:
-
Field Summary
Fields inherited from class org.apache.beam.sdk.transforms.PTransform
annotations, displayData, name, resourceHints -
Method Summary
Modifier and TypeMethodDescriptionstatic voidapplicableTo(PCollection<?> input) static <K,V> GroupByKey <K, V> create()Returns aGroupByKey<K, V>PTransform.PCollection<KV<K, Iterable<V>>> expand(PCollection<KV<K, V>> input) Override this method to specify how thisPTransformshould be expanded on the givenInputT.booleanfewKeys()Returns whether it groups just few keys.static <K,V> Coder <V> getInputValueCoder(Coder<KV<K, V>> inputCoder) Returns theCoderof the values of the input to this transform.static <K,V> Coder <K> getKeyCoder(Coder<KV<K, V>> inputCoder) Returns theCoderof the keys of the input to this transform, which is also used as theCoderof the keys of the output of this transform.getOutputKvCoder(Coder<KV<K, V>> inputCoder) Returns theCoderof the output of this transform.voidpopulateDisplayData(DisplayData.Builder builder) Register display data for the given transform or component.voidFor Beam internal use only.booleanFor Beam internal use only.WindowingStrategy<?, ?> updateWindowingStrategy(WindowingStrategy<?, ?> inputStrategy) voidvalidate(@Nullable PipelineOptions options, Map<TupleTag<?>, PCollection<?>> inputs, Map<TupleTag<?>, PCollection<?>> outputs) Called before running the Pipeline to verify this transform, its inputs, and outputs are fully and correctly specified.Methods inherited from class org.apache.beam.sdk.transforms.PTransform
addAnnotation, compose, compose, getAdditionalInputs, getAnnotations, getDefaultOutputCoder, getDefaultOutputCoder, getDefaultOutputCoder, getKindString, getName, getResourceHints, setDisplayData, setResourceHints, toString, validate
-
Method Details
-
create
Returns aGroupByKey<K, V>PTransform.- Type Parameters:
K- the type of the keys of the input and outputPCollectionsV- the type of the values of the inputPCollectionand the elements of theIterables in the outputPCollection
-
fewKeys
public boolean fewKeys()Returns whether it groups just few keys. -
setInsideGBEK
public void setInsideGBEK()For Beam internal use only. Tells runner that this is an inner GBK inside a GroupByEncryptedKey -
surroundsGBEK
public boolean surroundsGBEK()For Beam internal use only. Tells runner that this is a GBK wrapped around of a GroupByEncryptedKey -
applicableTo
-
validate
public void validate(@Nullable PipelineOptions options, Map<TupleTag<?>, PCollection<?>> inputs, Map<TupleTag<?>, PCollection<?>> outputs) Description copied from class:PTransformCalled before running the Pipeline to verify this transform, its inputs, and outputs are fully and correctly specified.By default, delegates to
PTransform.validate(PipelineOptions).- Overrides:
validatein classPTransform<PCollection<KV<K,V>>, PCollection<KV<K, Iterable<V>>>>
-
updateWindowingStrategy
-
expand
Description copied from class:PTransformOverride this method to specify how thisPTransformshould be expanded on the givenInputT.NOTE: This method should not be called directly. Instead apply the
PTransformshould be applied to theInputTusing theapplymethod.Composite transforms, which are defined in terms of other transforms, should return the output of one of the composed transforms. Non-composite transforms, which do not apply any transforms internally, should return a new unbound output and register evaluators (via backend-specific registration methods).
- Specified by:
expandin classPTransform<PCollection<KV<K,V>>, PCollection<KV<K, Iterable<V>>>>
-
getKeyCoder
Returns theCoderof the keys of the input to this transform, which is also used as theCoderof the keys of the output of this transform. -
getInputValueCoder
Returns theCoderof the values of the input to this transform. -
getOutputKvCoder
Returns theCoderof the output of this transform. -
populateDisplayData
Description copied from class:PTransformRegister display data for the given transform or component.populateDisplayData(DisplayData.Builder)is invoked by Pipeline runners to collect display data viaDisplayData.from(HasDisplayData). Implementations may callsuper.populateDisplayData(builder)in order to register display data in the current namespace, but should otherwise usesubcomponent.populateDisplayData(builder)to use the namespace of the subcomponent.By default, does not register any display data. Implementors may override this method to provide their own display data.
- Specified by:
populateDisplayDatain interfaceHasDisplayData- Overrides:
populateDisplayDatain classPTransform<PCollection<KV<K,V>>, PCollection<KV<K, Iterable<V>>>> - Parameters:
builder- The builder to populate with display data.- See Also:
-