apache_beam.transforms.util module

Simple utility PTransforms.

class apache_beam.transforms.util.CoGroupByKey(**kwargs)[source]

Bases: apache_beam.transforms.ptransform.PTransform

Groups results across several PCollections by key.

Given an input dict mapping serializable keys (called “tags”) to 0 or more PCollections of (key, value) tuples, e.g.:

{'pc1': pcoll1, 'pc2': pcoll2, 33333: pcoll3}

creates a single output PCollection of (key, value) tuples whose keys are the unique input keys from all inputs, and whose values are dicts mapping each tag to an iterable of whatever values were under the key in the corresponding PCollection:

('some key', {'pc1': ['value 1 under "some key" in pcoll1',
                      'value 2 under "some key" in pcoll1'],
              'pc2': [],
              33333: ['only value under "some key" in pcoll3']})

Note that pcoll2 had no values associated with “some key”.

CoGroupByKey also works for tuples, lists, or other flat iterables of PCollections, in which case the values of the resulting PCollections will be tuples whose nth value is the list of values from the nth PCollection—conceptually, the “tags” are the indices into the input. Thus, for this input:

(pcoll1, pcoll2, pcoll3)

the output PCollection’s value for “some key” is:

('some key', (['value 1 under "some key" in pcoll1',
               'value 2 under "some key" in pcoll1'],
              [],
              ['only value under "some key" in pcoll3']))
Parameters:
  • label – name of this transform instance. Useful while monitoring and debugging a pipeline execution.
  • **kwargs – Accepts a single named argument “pipeline”, which specifies the pipeline that “owns” this PTransform. Ordinarily CoGroupByKey can obtain this information from one of the input PCollections, but if there are none (or if there’s a chance there may be none), this argument is the only way to provide pipeline information, and should be considered mandatory.
expand(pcolls)[source]

Performs CoGroupByKey on argument pcolls; see class docstring.

apache_beam.transforms.util.Keys(label='Keys')[source]

Produces a PCollection of first elements of 2-tuples in a PCollection.

apache_beam.transforms.util.Values(label='Values')[source]

Produces a PCollection of second elements of 2-tuples in a PCollection.

apache_beam.transforms.util.KvSwap(label='KvSwap')[source]

Produces a PCollection reversing 2-tuples in a PCollection.

apache_beam.transforms.util.RemoveDuplicates(*args, **kwargs)