apache_beam.transforms.util module
Simple utility PTransforms.
-
class
apache_beam.transforms.util.
CoGroupByKey
(**kwargs) Bases:
apache_beam.transforms.ptransform.PTransform
Groups results across several PCollections by key.
Given an input dict of serializable keys (called “tags”) to 0 or more PCollections of (key, value) tuples, it creates a single output PCollection of (key, value) tuples whose keys are the unique input keys from all inputs, and whose values are dicts mapping each tag to an iterable of whatever values were under the key in the corresponding PCollection, in this manner:
('some key', {'tag1': ['value 1 under "some key" in pcoll1', 'value 2 under "some key" in pcoll1', ...], 'tag2': ... , ... })
For example, given:
{'tag1': pc1, 'tag2': pc2, 333: pc3}
where:
pc1 = [(k1, v1)]
pc2 = []
pc3 = [(k1, v31), (k1, v32), (k2, v33)]
The output PCollection would be:
[(k1, {'tag1': [v1], 'tag2': [], 333: [v31, v32]}), (k2, {'tag1': [], 'tag2': [], 333: [v33]})]
CoGroupByKey also works for tuples, lists, or other flat iterables of PCollections, in which case the values of the resulting PCollections will be tuples whose nth value is the list of values from the nth PCollection—conceptually, the “tags” are the indices into the input. Thus, for this input:
(pc1, pc2, pc3)
the output would be:
[(k1, ([v1], [], [v31, v32])), (k2, ([], [], [v33]))]
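The output shapes described above can be modeled in plain Python, without Beam. This is only a sketch of the grouping rule: `cogroup_by_key` is a hypothetical helper, not part of apache_beam, and it ignores windowing, distribution, and the fact that a PCollection has no defined element order.

```python
def cogroup_by_key(inputs):
    """Plain-Python model of the CoGroupByKey output shape (hypothetical helper).

    `inputs` is either a dict mapping tags to lists of (key, value) pairs,
    or a tuple/list of such lists.
    """
    tagged = isinstance(inputs, dict)
    # Treat positional inputs as if they were tagged by their index.
    items = list(inputs.items()) if tagged else list(enumerate(inputs))

    # Collect the unique keys across all inputs, in first-seen order.
    keys = []
    for _tag, pairs in items:
        for key, _value in pairs:
            if key not in keys:
                keys.append(key)

    result = []
    for key in keys:
        # Every tag appears in the output, even if it has no values for this key.
        grouped = {tag: [v for k, v in pairs if k == key] for tag, pairs in items}
        if tagged:
            result.append((key, grouped))
        else:
            # Positional inputs yield a tuple whose nth entry comes from the nth input.
            result.append((key, tuple(grouped[i] for i in range(len(items)))))
    return result


pc1 = [("k1", "v1")]
pc2 = []
pc3 = [("k1", "v31"), ("k1", "v32"), ("k2", "v33")]

by_tag = cogroup_by_key({"tag1": pc1, "tag2": pc2, 333: pc3})
by_position = cogroup_by_key((pc1, pc2, pc3))
```

Here `by_tag` reproduces the dict-valued output shown for the tagged input, and `by_position` reproduces the tuple-valued output for the positional input.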
-
**kwargs
Accepts a single named argument “pipeline”, which specifies the pipeline that “owns” this PTransform. Ordinarily CoGroupByKey can obtain this information from one of the input PCollections, but if there are none (or if there’s a chance there may be none), this argument is the only way to provide pipeline information, and should be considered mandatory.
-
-
apache_beam.transforms.util.
Keys
(label='Keys') Produces a PCollection of first elements of 2-tuples in a PCollection.
-
apache_beam.transforms.util.
Values
(label='Values') Produces a PCollection of second elements of 2-tuples in a PCollection.
-
apache_beam.transforms.util.
KvSwap
(label='KvSwap') Produces a PCollection reversing 2-tuples in a PCollection.
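As a rough illustration, the per-element effect of Keys, Values, and KvSwap can be written with list comprehensions over a plain Python list of 2-tuples. This sketch does not use Beam, and the sample data is made up; a real PCollection has no defined element order.

```python
pairs = [("a", 1), ("b", 2), ("a", 3)]

keys = [k for k, _ in pairs]           # first elements, as Keys selects
values = [v for _, v in pairs]         # second elements, as Values selects
swapped = [(v, k) for k, v in pairs]   # reversed 2-tuples, as KvSwap produces
```

Note that duplicates are kept: none of these three operations deduplicates its output.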
-
apache_beam.transforms.util.
Distinct
(pcoll) Produces a PCollection containing distinct elements of a PCollection.
-
apache_beam.transforms.util.
RemoveDuplicates
(pcoll) Produces a PCollection containing distinct elements of a PCollection.
-
class
apache_beam.transforms.util.
BatchElements
(min_batch_size=1, max_batch_size=10000, target_batch_overhead=0.05, target_batch_duration_secs=1, variance=0.25, clock=<built-in function time>) Bases:
apache_beam.transforms.ptransform.PTransform
A Transform that batches elements for amortized processing.
This transform is designed to precede operations whose processing cost is of the form
time = fixed_cost + num_elements * per_element_cost
where the per-element cost is (often significantly) smaller than the fixed cost, so the fixed cost can be amortized over multiple elements. It consumes a PCollection of element type T and produces a PCollection of element type List[T].
This transform attempts to find the best batch size between the minimum and maximum parameters by profiling the time taken by (fused) downstream operations. For a fixed batch size, set the min and max to be equal.
Elements are batched per window, and each batch is emitted in the window corresponding to its contents.
Parameters:
- min_batch_size – (optional) the smallest number of elements per batch
- max_batch_size – (optional) the largest number of elements per batch
- target_batch_overhead – (optional) a target for fixed_cost / time, as used in the formula above
- target_batch_duration_secs – (optional) a target for total time per bundle, in seconds
- variance – (optional) the permitted (relative) amount of deviation from the (estimated) ideal batch size used to produce a wider base for linear interpolation
- clock – (optional) an alternative to time.time for measuring the cost of downstream operations (mostly for testing)
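To make the role of target_batch_overhead concrete, the cost formula above can be solved for the smallest batch size whose fixed-cost share does not exceed the target. The sketch below is plain arithmetic under assumed, made-up costs; the helper names are hypothetical, and it is not the estimator BatchElements uses, which measures timings at run time.

```python
import math


def overhead_fraction(num_elements, fixed_cost, per_element_cost):
    """Share of a batch's processing time that is spent on the fixed cost."""
    total = fixed_cost + num_elements * per_element_cost
    return fixed_cost / total


def smallest_batch_for_overhead(fixed_cost, per_element_cost, target_overhead,
                                min_batch_size=1, max_batch_size=10000):
    """Smallest batch size whose overhead fraction is at most target_overhead.

    fixed / (fixed + n * per) <= target  <=>  n >= fixed * (1 - target) / (target * per)
    """
    n = math.ceil(fixed_cost * (1 - target_overhead)
                  / (target_overhead * per_element_cost))
    # Clamp the result to the configured batch size bounds.
    return max(min_batch_size, min(max_batch_size, n))


# Assumed costs for illustration: 1 s fixed cost, 0.125 s per element.
n = smallest_batch_for_overhead(1.0, 0.125, target_overhead=0.25)
```

With these assumed costs, a batch of `n` elements spends exactly the target share of its time on the fixed cost, and any smaller batch spends more.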
-
class
apache_beam.transforms.util.
Reshuffle
(label=None) Bases:
apache_beam.transforms.ptransform.PTransform
PTransform that returns a PCollection equivalent to its input, but operationally provides some of the side effects of a GroupByKey, in particular preventing fusion of the surrounding transforms, checkpointing, and deduplication by id.
Reshuffle adds a temporary random key to each element, performs a ReshufflePerKey, and finally removes the temporary key.
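The three steps just described (add a random key, group per key, drop the key) can be mimicked in plain Python. This is a conceptual model only: `reshuffle_model`, `num_keys`, and `seed` are hypothetical names introduced here, and the grouping step is a stand-in for ReshufflePerKey rather than its implementation.

```python
import random
from collections import defaultdict


def reshuffle_model(elements, num_keys=4, seed=None):
    """Plain-Python model of the Reshuffle steps (hypothetical helper)."""
    rng = random.Random(seed)
    # 1. Attach a temporary random key to each element.
    keyed = [(rng.randrange(num_keys), element) for element in elements]
    # 2. Group by the temporary key (stand-in for the per-key reshuffle).
    groups = defaultdict(list)
    for key, element in keyed:
        groups[key].append(element)
    # 3. Drop the temporary key and flatten the groups back into elements.
    return [element for group in groups.values() for element in group]


shuffled = reshuffle_model(range(10), seed=0)
```

The output holds the same elements as the input, possibly in a different order, which matches the statement that the result is equivalent to the input.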
Reshuffle is experimental. No backwards compatibility guarantees.
-
apache_beam.transforms.util.
WithKeys
(pcoll, k) PTransform that takes a PCollection, and either a constant key or a callable, and returns a PCollection of (K, V), where each of the values in the input PCollection has been paired with either the constant key or a key computed from the value.
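A minimal plain-Python model of the two modes follows. The helper `with_keys` is hypothetical and operates on a list rather than a PCollection; it only illustrates the constant-key versus computed-key distinction.

```python
def with_keys(values, k):
    """Pair each value with a constant key, or with k(value) if k is callable."""
    if callable(k):
        return [(k(v), v) for v in values]
    return [(k, v) for v in values]


constant = with_keys(["apple", "kiwi"], "fruit")  # same key for every value
computed = with_keys(["apple", "kiwi"], len)      # key computed from each value
```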
-
class
apache_beam.transforms.util.
ToString
Bases:
future.types.newobject.newobject
PTransform for converting a PCollection element, KV or PCollection Iterable to string.
-
class
Kvs
(delimiter=None) Bases:
apache_beam.transforms.ptransform.PTransform
Transforms each (key, value) element of the PCollection into a string consisting of the key, followed by the specified delimiter, followed by the value.
-
class
Element
(delimiter=None) Bases:
apache_beam.transforms.ptransform.PTransform
Transforms each element of the PCollection to a string.
-
class
Iterables
(delimiter=None) Bases:
apache_beam.transforms.ptransform.PTransform
Transforms each iterable element of the input PCollection into a string made of its items separated by the delimiter. There is no trailing delimiter.
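The three conversions can be modeled on plain lists as follows. The function names are hypothetical, and the delimiter is passed explicitly here because the value used when delimiter=None is not stated in this reference.

```python
def kvs_to_string(pairs, delimiter):
    """Model of Kvs: key, then delimiter, then value."""
    return [f"{key}{delimiter}{value}" for key, value in pairs]


def element_to_string(elements):
    """Model of Element: the string form of each element."""
    return [str(element) for element in elements]


def iterables_to_string(iterables, delimiter):
    """Model of Iterables: items joined by the delimiter, none trailing."""
    return [delimiter.join(str(item) for item in items) for items in iterables]


kv_strings = kvs_to_string([("a", 1), ("b", 2)], ",")
element_strings = element_to_string([1, 2.5, "x"])
iterable_strings = iterables_to_string([[1, 2, 3], ["x"]], ",")
```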
-
class
apache_beam.transforms.util.
Reify
Bases:
future.types.newobject.newobject
PTransforms for converting between explicit and implicit form of various Beam values.
-
class
Timestamp
(label=None) Bases:
apache_beam.transforms.ptransform.PTransform
PTransform to wrap a value in a TimestampedValue with its associated timestamp.
-
class
Window
(label=None) Bases:
apache_beam.transforms.ptransform.PTransform
PTransform to convert an element in a PCollection into a tuple of (element, timestamp, window), wrapped in a TimestampedValue with its associated timestamp.
-
class
TimestampInValue
(label=None) Bases:
apache_beam.transforms.ptransform.PTransform
PTransform to wrap the Value in a KV pair in a TimestampedValue with the element’s associated timestamp.
-
class
WindowInValue
(label=None) Bases:
apache_beam.transforms.ptransform.PTransform
PTransform to convert the Value in a KV pair into a tuple of (value, timestamp, window), with the whole element being wrapped inside a TimestampedValue.
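The four Reify variants differ only in where the timestamp and window end up. The sketch below models that with namedtuples in plain Python; `ImplicitElement`, the local `TimestampedValue` stand-in, the `reify_*` helpers, and the window label "w0" are all hypothetical and are not Beam's own types.

```python
from collections import namedtuple

# Hypothetical stand-ins: an element with implicit metadata, and an
# explicit (value, timestamp) wrapper.
ImplicitElement = namedtuple("ImplicitElement", ["value", "timestamp", "window"])
TimestampedValue = namedtuple("TimestampedValue", ["value", "timestamp"])


def reify_timestamp(e):
    # Model of Timestamp: wrap the value together with its timestamp.
    return TimestampedValue(e.value, e.timestamp)


def reify_window(e):
    # Model of Window: (element, timestamp, window), itself timestamped.
    return TimestampedValue((e.value, e.timestamp, e.window), e.timestamp)


def reify_timestamp_in_value(e):
    # Model of TimestampInValue: only the value of the KV pair is wrapped.
    key, value = e.value
    return (key, TimestampedValue(value, e.timestamp))


def reify_window_in_value(e):
    # Model of WindowInValue: the value becomes (value, timestamp, window),
    # and the whole KV pair is wrapped with the element's timestamp.
    key, value = e.value
    return TimestampedValue((key, (value, e.timestamp, e.window)), e.timestamp)


kv = ImplicitElement(("k", "v"), 10, "w0")
plain = ImplicitElement("x", 10, "w0")
```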
-
class