apache_beam.transforms.core module

Core PTransform subclasses, such as FlatMap, GroupByKey, and Map.

class apache_beam.transforms.core.RestrictionProvider[source]

Bases: future.types.newobject.newobject

Provides methods for generating and manipulating restrictions.

This class should be implemented to support Splittable DoFn in the Python SDK. See https://s.apache.org/splittable-do-fn for more details about Splittable DoFn.

To mark a DoFn class as a Splittable DoFn, the DoFn.process() method of that class should have exactly one parameter whose default value is an instance of RestrictionProvider.

The provided RestrictionProvider instance must provide suitable overrides for the following methods:
  • create_tracker()
  • initial_restriction()

Optionally, RestrictionProvider may override the default implementations of the following methods:
  • restriction_coder()
  • restriction_size()
  • split()
  • split_and_size()

Pausing and resuming processing of an element

As the last element produced by the iterator returned by the DoFn.process() method, a Splittable DoFn may return an object of type ProcessContinuation.

If provided, the ProcessContinuation object specifies that the runner should later re-invoke the DoFn.process() method to resume processing the current element, and the manner in which the re-invocation should be performed. A ProcessContinuation object must only be specified as the last element of the iterator. If a ProcessContinuation object is not provided, the runner will assume that the current input element has been fully processed.
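The pause-and-resume contract above can be modelled in plain Python. This is only a sketch and does not use Beam: ResumeLater is a hypothetical stand-in for ProcessContinuation, the state dict is a hypothetical substitute for the position that a real RestrictionTracker would hold, and run_to_completion imitates what a runner might do.

```python
class ResumeLater(object):
    """Hypothetical marker standing in for ProcessContinuation."""


def process(element, state, max_outputs=2):
    """Emit up to max_outputs items, then ask to be resumed if not done.

    `state` is a hypothetical dict that remembers the next position to read.
    """
    emitted = 0
    while state["position"] < len(element) and emitted < max_outputs:
        yield element[state["position"]]
        state["position"] += 1
        emitted += 1
    if state["position"] < len(element):
        # The marker is only ever the last item produced by the iterator.
        yield ResumeLater()


def run_to_completion(element):
    """Hypothetical driver: re-invoke process() while it asks to resume."""
    state = {"position": 0}
    outputs, invocations = [], 0
    while True:
        invocations += 1
        results = list(process(element, state))
        if results and isinstance(results[-1], ResumeLater):
            outputs.extend(results[:-1])
        else:
            # No marker: the element is treated as fully processed.
            outputs.extend(results)
            return outputs, invocations


outputs, invocations = run_to_completion("abcde")
```

With a budget of two outputs per invocation, the five-item element needs three invocations, and only the last one ends without the marker.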

Updating output watermark

The DoFn.process() method of a Splittable DoFn may contain a parameter with default value DoFn.WatermarkReporterParam. If specified, this asks the runner to provide a function that can be used to give the runner a (best-effort) lower bound on the timestamps of future output associated with the current element processed by the DoFn. If the DoFn has multiple outputs, the watermark applies to all of them. The provided function must be invoked with a single parameter, either a Timestamp or an integer that gives the watermark in seconds.

create_tracker(restriction)[source]

Produces a new RestrictionTracker for the given restriction.

Parameters: restriction – an object that defines a restriction as identified by a Splittable DoFn that utilizes the current RestrictionProvider. For example, a tuple that gives a range of positions for a Splittable DoFn that reads files based on byte positions.

Returns: an object of type RestrictionTracker.

initial_restriction(element)[source]

Produces an initial restriction for the given element.

split(element, restriction)[source]

Splits the given element and restriction.

Returns an iterator of restrictions. The total set of elements produced by reading the input element for each of the returned restrictions should be the same as the total set of elements produced by reading the input element for the input restriction.

restriction_coder()[source]

Returns a Coder for restrictions.

The returned Coder will be used for the restrictions produced by the current RestrictionProvider.

Returns: an object of type Coder.
restriction_size(element, restriction)[source]

Returns the size of a restriction with respect to the given element.

By default, asks a newly-created restriction tracker for the default size of the restriction.

split_and_size(element, restriction)[source]

Like split, but also does sizing, returning (restriction, size) pairs.
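As an illustration of the contracts described for initial_restriction(), split(), restriction_size() and split_and_size(), here is a minimal plain-Python sketch. It does not subclass the real RestrictionProvider; the class name, the (start, stop) tuple restriction, the chunk_size parameter and the read() helper are all assumptions made for the example.

```python
class OffsetRangeProviderSketch(object):
    """Plain-Python stand-in that mirrors RestrictionProvider method names.

    Restrictions here are hypothetical (start, stop) tuples of positions.
    """

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size

    def initial_restriction(self, element):
        # Assumption: the element is a sequence, so all work is [0, len).
        return (0, len(element))

    def split(self, element, restriction):
        # Yield consecutive sub-ranges that together cover the input range.
        start, stop = restriction
        while start < stop:
            end = min(start + self.chunk_size, stop)
            yield (start, end)
            start = end

    def restriction_size(self, element, restriction):
        start, stop = restriction
        return stop - start

    def split_and_size(self, element, restriction):
        # Like split, but pairs every sub-restriction with its size.
        for part in self.split(element, restriction):
            yield part, self.restriction_size(element, part)


def read(element, restriction):
    """Hypothetical reader: the items of element covered by restriction."""
    start, stop = restriction
    return list(element[start:stop])


provider = OffsetRangeProviderSketch(chunk_size=4)
element = "abcdefghij"
whole = provider.initial_restriction(element)
parts = list(provider.split(element, whole))
sized_parts = list(provider.split_and_size(element, whole))

# Reading every sub-restriction yields the same items as reading the whole.
pieces = [item for part in parts for item in read(element, part)]
```

The final comparison of pieces against read(element, whole) is the property that split() is required to preserve.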

class apache_beam.transforms.core.DoFn(*unused_args, **unused_kwargs)[source]

Bases: apache_beam.typehints.decorators.WithTypeHints, apache_beam.transforms.display.HasDisplayData, apache_beam.utils.urns.RunnerApiFn

A function object used by a transform with custom processing.

The ParDo transform is such a transform. The ParDo.apply method will take an object of type DoFn and apply it to all elements of a PCollection object.

In order to have concrete DoFn objects one has to subclass from DoFn and define the desired behavior (start_bundle/finish_bundle and process) or wrap a callable object using the CallableWrapperDoFn class.

ElementParam = ElementParam
SideInputParam = SideInputParam
TimestampParam = TimestampParam
WindowParam = WindowParam
WatermarkReporterParam = WatermarkReporterParam
BundleFinalizerParam

alias of _BundleFinalizerParam

DoFnProcessParams = [ElementParam, SideInputParam, TimestampParam, WindowParam, WatermarkReporterParam, <class 'apache_beam.transforms.core._BundleFinalizerParam'>]
StateParam

alias of _StateDoFnParam

TimerParam

alias of _TimerDoFnParam

RestrictionParam

alias of _RestrictionDoFnParam

static from_callable(fn)[source]
default_label()[source]
process(element, *args, **kwargs)[source]

Method to use for processing elements.

This is invoked by DoFnRunner for each element of an input PCollection.

If specified, the following default arguments are used by the DoFnRunner to be able to pass the parameters correctly.

  • DoFn.ElementParam: element to be processed, should not be mutated.
  • DoFn.SideInputParam: a side input that may be used when processing.
  • DoFn.TimestampParam: timestamp of the input element.
  • DoFn.WindowParam: Window the input element belongs to.
  • DoFn.TimerParam: a userstate.RuntimeTimer object defined by the spec of the parameter.
  • DoFn.StateParam: a userstate.RuntimeState object defined by the spec of the parameter.
  • DoFn.RestrictionParam: an iobase.RestrictionTracker will be provided here to allow treatment as a Splittable DoFn. The restriction tracker will be derived from the restriction provider in the parameter.
  • DoFn.WatermarkReporterParam: a function that can be used to report output watermark of Splittable DoFn implementations.

Parameters:
  • element – The element to be processed
  • *args – side inputs
  • **kwargs – other keyword arguments.
start_bundle()[source]

Called before a bundle of elements is processed on a worker.

Elements to be processed are split into bundles and distributed to workers. Before a worker calls process() on the first element of its bundle, it calls this method.

finish_bundle()[source]

Called after a bundle of elements is processed on a worker.
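The order in which start_bundle(), process() and finish_bundle() are called can be sketched without Beam. The class below only mirrors the DoFn method names (a real implementation would subclass DoFn), and run_bundle is a hypothetical driver that imitates what a runner does for one bundle.

```python
class UpperCaseWordsDoFnSketch(object):
    """Plain-Python stand-in that mirrors the DoFn method names."""

    def __init__(self):
        self.calls = []
        self.seen = 0

    def start_bundle(self):
        # Runs once, before the first process() call of the bundle.
        self.calls.append("start_bundle")
        self.seen = 0

    def process(self, element):
        # A generator, so each call returns an iterable of zero or more items.
        self.calls.append("process")
        self.seen += 1
        for word in element.split():
            yield word.upper()

    def finish_bundle(self):
        # Runs once, after the last element of the bundle.
        self.calls.append("finish_bundle")


def run_bundle(dofn, bundle):
    """Hypothetical driver imitating what a runner does for one bundle."""
    outputs = []
    dofn.start_bundle()
    for element in bundle:
        outputs.extend(dofn.process(element))
    dofn.finish_bundle()
    return outputs


dofn = UpperCaseWordsDoFnSketch()
outputs = run_bundle(dofn, ["a b", "", "c"])
```

The empty string shows that process() may legitimately produce no output for an element.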

get_function_arguments(func)[source]
infer_output_type(input_type)[source]
is_process_bounded()[source]

Checks if an object is a bound method on an instance.

to_runner_api_parameter(context)
class apache_beam.transforms.core.CombineFn(*unused_args, **unused_kwargs)[source]

Bases: apache_beam.typehints.decorators.WithTypeHints, apache_beam.transforms.display.HasDisplayData, apache_beam.utils.urns.RunnerApiFn

A function object used by a Combine transform with custom processing.

A CombineFn specifies how multiple values in all or part of a PCollection can be merged into a single value—essentially providing the same kind of information as the arguments to the Python “reduce” builtin (except for the input argument, which is an instance of CombineFnProcessContext). The combining process proceeds as follows:

  1. Input values are partitioned into one or more batches.
  2. For each batch, the create_accumulator method is invoked to create a fresh initial “accumulator” value representing the combination of zero values.
  3. For each input value in the batch, the add_input method is invoked to combine more values with the accumulator for that batch.
  4. The merge_accumulators method is invoked to combine accumulators from separate batches into a single combined output accumulator value, once all of the accumulators have had all the input values in their batches added to them. This operation is invoked repeatedly, until there is only one accumulator value left.
  5. The extract_output operation is invoked on the final accumulator to get the output value.

Note: If this CombineFn is used with a transform that has defaults, apply will be called with an empty list at expansion time to get the default value.
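The five steps above can be walked through with a plain-Python model of a mean computation. The class mirrors the CombineFn method names but does not subclass it, and combine() is a hypothetical driver, not a Beam function; in a real pipeline the runner decides how inputs are batched.

```python
class MeanCombineFnSketch(object):
    """Plain-Python stand-in that mirrors the CombineFn method names."""

    def create_accumulator(self):
        # The combination of zero values: (running sum, count).
        return (0.0, 0)

    def add_input(self, accumulator, element):
        total, count = accumulator
        return (total + element, count + 1)

    def merge_accumulators(self, accumulators):
        accumulators = list(accumulators)
        total = sum(acc[0] for acc in accumulators)
        count = sum(acc[1] for acc in accumulators)
        return (total, count)

    def extract_output(self, accumulator):
        total, count = accumulator
        return total / count if count else float("nan")


def combine(combine_fn, batches):
    """Hypothetical driver that walks through steps 2 to 5."""
    accumulators = []
    for batch in batches:
        accumulator = combine_fn.create_accumulator()  # step 2
        for value in batch:
            accumulator = combine_fn.add_input(accumulator, value)  # step 3
        accumulators.append(accumulator)
    merged = combine_fn.merge_accumulators(accumulators)  # step 4
    return combine_fn.extract_output(merged)  # step 5


# Step 1 is the batching itself; the result must not depend on it.
mean_two_batches = combine(MeanCombineFnSketch(), [[1, 2, 3], [4, 5]])
mean_one_batch = combine(MeanCombineFnSketch(), [[1, 2, 3, 4, 5]])
```

Keeping the accumulator as a (sum, count) pair rather than a running mean is what lets partial results from different batches be merged exactly.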

default_label()[source]
create_accumulator(*args, **kwargs)[source]

Return a fresh, empty accumulator for the combine operation.

Parameters:
  • *args – Additional arguments and side inputs.
  • **kwargs – Additional arguments and side inputs.
add_input(mutable_accumulator, element, *args, **kwargs)[source]

Return result of folding element into accumulator.

CombineFn implementors must override add_input.

Parameters:
  • mutable_accumulator – the current accumulator, may be modified and returned for efficiency
  • element – the element to add, should not be mutated
  • *args – Additional arguments and side inputs.
  • **kwargs – Additional arguments and side inputs.
add_inputs(mutable_accumulator, elements, *args, **kwargs)[source]

Returns the result of folding each element in elements into accumulator.

This is provided in case the implementation affords more efficient bulk addition of elements. The default implementation simply loops over the inputs invoking add_input for each one.

Parameters:
  • mutable_accumulator – the current accumulator, may be modified and returned for efficiency
  • elements – the elements to add, should not be mutated
  • *args – Additional arguments and side inputs.
  • **kwargs – Additional arguments and side inputs.
merge_accumulators(accumulators, *args, **kwargs)[source]

Returns the result of merging several accumulators to a single accumulator value.

Parameters:
  • accumulators – the accumulators to merge. Only the first accumulator may be modified and returned for efficiency; the other accumulators should not be mutated, because they may be shared with other code and mutating them could lead to incorrect results or data corruption.
  • *args – Additional arguments and side inputs.
  • **kwargs – Additional arguments and side inputs.
compact(accumulator, *args, **kwargs)[source]

Optionally returns a more compact representation of the accumulator.

This is called before an accumulator is sent across the wire, and can be useful in cases where values are buffered or otherwise lazily kept unprocessed when added to the accumulator. Should return an equivalent, though possibly modified, accumulator.

By default returns the accumulator unmodified.

Parameters:
  • accumulator – the current accumulator
  • *args – Additional arguments and side inputs.
  • **kwargs – Additional arguments and side inputs.
extract_output(accumulator, *args, **kwargs)[source]

Return result of converting accumulator into the output value.

Parameters:
  • accumulator – the final accumulator value computed by this CombineFn for the entire input key or PCollection. Can be modified for efficiency.
  • *args – Additional arguments and side inputs.
  • **kwargs – Additional arguments and side inputs.
apply(elements, *args, **kwargs)[source]

Returns result of applying this CombineFn to the input values.

Parameters:
  • elements – the set of values to combine.
  • *args – Additional arguments and side inputs.
  • **kwargs – Additional arguments and side inputs.
for_input_type(input_type)[source]

Returns a specialized implementation of self, if it exists.

Otherwise, returns self.

Parameters: input_type – the type of input elements.
static from_callable(fn)[source]
static maybe_from_callable(fn, has_side_inputs=True)[source]
get_accumulator_coder()[source]
to_runner_api_parameter(context)
class apache_beam.transforms.core.PartitionFn(*unused_args, **unused_kwargs)[source]

Bases: apache_beam.typehints.decorators.WithTypeHints

A function object used by a Partition transform.

A PartitionFn specifies how individual values in a PCollection will be placed into separate partitions, indexed by an integer.

default_label()[source]
partition_for(element, num_partitions, *args, **kwargs)[source]

Specify which partition will receive this element.

Parameters:
  • element – An element of the input PCollection.
  • num_partitions – Number of partitions, i.e., output PCollections.
  • *args – optional parameters and side inputs.
  • **kwargs – optional parameters and side inputs.
Returns:

An integer in [0, num_partitions).
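A minimal plain-Python sketch of the partition_for() contract follows. The remainder-based rule and the apply_partition driver are assumptions for illustration only; the driver also checks that the returned index lies in [0, num_partitions).

```python
def partition_for(element, num_partitions):
    """Hypothetical rule: route integers by remainder."""
    return element % num_partitions


def apply_partition(elements, num_partitions, fn):
    """Hypothetical driver: build one output list per partition index."""
    partitions = [[] for _ in range(num_partitions)]
    for element in elements:
        index = fn(element, num_partitions)
        if not 0 <= index < num_partitions:
            raise ValueError("partition index %r out of range" % (index,))
        partitions[index].append(element)
    return partitions


partitions = apply_partition(range(7), 3, partition_for)
```

Each element lands in exactly one output, and the outputs are addressed by position in the resulting list.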

class apache_beam.transforms.core.ParDo(fn, *args, **kwargs)[source]

Bases: apache_beam.transforms.ptransform.PTransformWithSideInputs

A ParDo transform.

Processes an input PCollection by applying a DoFn to each element and returning the accumulated results into an output PCollection. The type of the elements is not fixed as long as the DoFn can deal with it. In practice the type is constrained to some extent because the elements sometimes must be persisted to external storage. See the expand() method comments for a detailed description of all possible arguments.

Note that the DoFn must return an iterable for each element of the input PCollection. An easy way to do this is to use the yield keyword in the process method.

Parameters:
  • pcoll (PCollection) – a PCollection to be processed.
  • fn (DoFn) – a DoFn object to be applied to each element of pcoll argument.
  • *args – positional arguments passed to the DoFn object.
  • **kwargs – keyword arguments passed to the DoFn object.

Note that the positional and keyword arguments will be processed in order to detect PCollections that will be computed as side inputs to the transform. During pipeline execution, whenever the DoFn object gets executed (its DoFn.process() method gets called), the PCollection arguments will be replaced by values from the PCollection in the exact positions where they appear in the argument lists.

default_type_hints()[source]
infer_output_type(input_type)[source]
make_fn(fn, has_side_inputs)[source]
display_data()[source]
expand(pcoll)[source]
with_outputs(*tags, **main_kw)[source]

Returns a tagged tuple allowing access to the outputs of a ParDo.

The resulting object supports access to the PCollection associated with a tag (e.g. o.tag, o[tag]) and iterating over the available tags (e.g. for tag in o: ...).

Parameters:
  • *tags – if non-empty, list of valid tags. If a list of valid tags is given, it will be an error to use an undeclared tag later in the pipeline.
  • **main_kw – dictionary empty or with one key 'main' defining the tag to be used for the main output (which will not have a tag associated with it).
Returns:

An object of type DoOutputsTuple that bundles together all the outputs of a ParDo transform and allows accessing the individual PCollections for each output using an object.tag syntax.

Return type:

DoOutputsTuple

Raises:
to_runner_api_parameter(context)[source]
static from_runner_api_parameter(pardo_payload, context)[source]
runner_api_requires_keyed_input()[source]
get_restriction_coder()[source]

Returns the restriction coder if the DoFn of this ParDo is an SDF.

Returns None otherwise.

apache_beam.transforms.core.FlatMap(fn, *args, **kwargs)[source]

FlatMap() is like ParDo except it takes a callable to specify the transformation.

The callable must return an iterable for each element of the input PCollection. The elements of these iterables will be flattened into the output PCollection.

Parameters:
  • fn (callable) – a callable object.
  • *args – positional arguments passed to the transform callable.
  • **kwargs – keyword arguments passed to the transform callable.
Returns:

A PCollection containing the FlatMap() outputs.

Return type:

PCollection

Raises:

TypeError – If the fn passed as argument is not a callable. A typical error is to pass a DoFn instance, which is supported only for ParDo.

apache_beam.transforms.core.Map(fn, *args, **kwargs)[source]

Map() is like FlatMap() except its callable returns only a single element.

Parameters:
  • fn (callable) – a callable object.
  • *args – positional arguments passed to the transform callable.
  • **kwargs – keyword arguments passed to the transform callable.
Returns:

A PCollection containing the Map() outputs.

Return type:

PCollection

Raises:

TypeError – If the fn passed as argument is not a callable. A typical error is to pass a DoFn instance, which is supported only for ParDo.

apache_beam.transforms.core.Filter(fn, *args, **kwargs)[source]

Filter() is a FlatMap() with its callable filtering out elements.

Parameters:
  • fn (callable) – a callable object.
  • *args – positional arguments passed to the transform callable.
  • **kwargs – keyword arguments passed to the transform callable.
Returns:

A PCollection containing the Filter() outputs.

Return type:

PCollection

Raises:

TypeError – If the fn passed as argument is not a callable. A typical error is to pass a DoFn instance, which is supported only for ParDo.
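The relationship between FlatMap(), Map() and Filter() described above can be modelled on in-memory lists. The three helper functions are hypothetical and are not the Beam transforms; they only show that Map and Filter behave like FlatMap with a callable that returns one or zero elements.

```python
def flat_map(fn, elements):
    """Model of FlatMap: fn returns an iterable; its items are flattened."""
    return [output for element in elements for output in fn(element)]


def map_(fn, elements):
    """Model of Map: exactly one output per input."""
    return flat_map(lambda element: [fn(element)], elements)


def filter_(predicate, elements):
    """Model of Filter: keep the element when the predicate holds."""
    return flat_map(
        lambda element: [element] if predicate(element) else [], elements)


lines = ["to be", "or not", "to be"]
words = flat_map(str.split, lines)
lengths = map_(len, words)
long_words = filter_(lambda word: len(word) > 2, words)
```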

class apache_beam.transforms.core.CombineGlobally(fn, *args, **kwargs)[source]

Bases: apache_beam.transforms.ptransform.PTransform

A CombineGlobally transform.

Reduces a PCollection to a single value by progressively applying a CombineFn to portions of the PCollection (and to intermediate values created thereby). See the documentation in CombineFn for details on how CombineFns are applied.

Parameters:
  • pcoll (PCollection) – a PCollection to be reduced into a single value.
  • fn (callable) – a CombineFn object that will be called to progressively reduce the PCollection into single values, or a callable suitable for wrapping by CallableWrapperCombineFn.
  • *args – positional arguments passed to the CombineFn object.
  • **kwargs – keyword arguments passed to the CombineFn object.
Raises:

TypeError – If the output type of the input PCollection is not compatible with Iterable[A].

Returns:

A single-element PCollection containing the main output of the CombineGlobally transform.

Return type:

PCollection

Note that the positional and keyword arguments will be processed in order to detect PValues that will be computed as side inputs to the transform. During pipeline execution, whenever the CombineFn object gets executed (i.e. any of the CombineFn methods get called), the PValue arguments will be replaced by their actual value in the exact position where they appear in the argument lists.

has_defaults = True
as_view = False
fanout = None
display_data()[source]
default_label()[source]
with_fanout(fanout)[source]
with_defaults(has_defaults=True)[source]
without_defaults()[source]
as_singleton_view()[source]
expand(pcoll)[source]
class apache_beam.transforms.core.CombinePerKey(fn, *args, **kwargs)[source]

Bases: apache_beam.transforms.ptransform.PTransformWithSideInputs

A per-key Combine transform.

Identifies sets of values associated with the same key in the input PCollection, then applies a CombineFn to condense those sets to single values. See the documentation in CombineFn for details on how CombineFns are applied.

Parameters:
  • pcoll – input pcollection.
  • fn – instance of CombineFn to apply to all values under the same key in pcoll, or a callable whose signature is f(iterable, *args, **kwargs) (e.g., sum, max).
  • *args – arguments and side inputs, passed directly to the CombineFn.
  • **kwargs – arguments and side inputs, passed directly to the CombineFn.
Returns:

A PObject holding the result of the combine operation.

with_hot_key_fanout(fanout)[source]

A per-key combine operation like self but with two levels of aggregation.

If a given key is produced by too many upstream bundles, the final reduction can become a bottleneck despite partial combining being lifted pre-GroupByKey. In these cases it can be helpful to perform intermediate partial aggregations in parallel and then re-group to perform a final (per-key) combine. This is also useful for high-volume keys in streaming where combiners are not generally lifted for latency reasons.

Note that a fanout greater than 1 requires the data to be sent through two GroupByKeys, and a high fanout can also result in more shuffle data due to less per-bundle combining. Setting the fanout for a key at 1 or less places values on the “cold key” path that skip the intermediate level of aggregation.

Parameters: fanout – either None, for no fanout, an int, for a constant-degree fanout, or a callable mapping keys to a key-specific degree of fanout.
Returns: A per-key combining PTransform with the specified fanout.
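The two levels of aggregation can be illustrated with a plain-Python model. Both functions are hypothetical and operate on in-memory lists; the round-robin shard assignment is an assumption made for the example and is not claimed to be how Beam distributes values. The sketch also assumes a combine callable, such as sum, whose result does not depend on how the values are grouped.

```python
from collections import defaultdict


def combine_per_key(pairs, combine):
    """Model of a single-level per-key combine."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: combine(values) for key, values in grouped.items()}


def combine_per_key_with_fanout(pairs, combine, fanout):
    """Model of two-level aggregation; shard choice here is round-robin."""
    # Level 1: spread each key over `fanout` sub-keys and combine partially.
    sharded = [
        ((key, index % fanout), value)
        for index, (key, value) in enumerate(pairs)
    ]
    partial = combine_per_key(sharded, combine)
    # Level 2: drop the shard index and combine the partial results per key.
    regrouped = [(key, value) for (key, _shard), value in partial.items()]
    return combine_per_key(regrouped, combine)


pairs = [("hot", n) for n in range(100)] + [("cold", 7)]
direct = combine_per_key(pairs, sum)
fanned_out = combine_per_key_with_fanout(pairs, sum, fanout=4)
```

In this model the final per-key step for "hot" receives four partial sums instead of one hundred raw values, while the result is unchanged.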
display_data()[source]
make_fn(fn, has_side_inputs)[source]
default_label()[source]
expand(pcoll)[source]
default_type_hints()[source]
to_runner_api_parameter(context)[source]
static from_runner_api_parameter(combine_payload, context)[source]
runner_api_requires_keyed_input()[source]
class apache_beam.transforms.core.CombineValues(fn, *args, **kwargs)[source]

Bases: apache_beam.transforms.ptransform.PTransformWithSideInputs

make_fn(fn, has_side_inputs)[source]
expand(pcoll)[source]
to_runner_api_parameter(context)[source]
static from_runner_api_parameter(combine_payload, context)[source]
class apache_beam.transforms.core.GroupByKey(label=None)[source]

Bases: apache_beam.transforms.ptransform.PTransform

A group by key transform.

Processes an input PCollection consisting of key/value pairs represented as a tuple pair. The result is a PCollection where values having a common key are grouped together. For example, (a, 1), (b, 2), (a, 3) results in (a, [1, 3]), (b, [2]).

The implementation here is used only when run on the local direct runner.
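The grouping in the example above can be reproduced with a small plain-Python model. The group_by_key function is hypothetical and works on an in-memory list; it preserves insertion order only to make the result easy to compare, which is an assumption of the sketch rather than a guarantee of the transform.

```python
from collections import OrderedDict


def group_by_key(pairs):
    """Model of GroupByKey over an in-memory list of (key, value) tuples."""
    grouped = OrderedDict()
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return list(grouped.items())


grouped = group_by_key([("a", 1), ("b", 2), ("a", 3)])
```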

class ReifyWindows(*unused_args, **unused_kwargs)[source]

Bases: apache_beam.transforms.core.DoFn

process(element, window=WindowParam, timestamp=TimestampParam)[source]
infer_output_type(input_type)[source]
expand(pcoll)[source]
infer_output_type(input_type)[source]
to_runner_api_parameter(unused_context)[source]
static from_runner_api_parameter(unused_payload, unused_context)[source]
runner_api_requires_keyed_input()[source]
class apache_beam.transforms.core.Partition(fn, *args, **kwargs)[source]

Bases: apache_beam.transforms.ptransform.PTransformWithSideInputs

Split a PCollection into several partitions.

Uses the specified PartitionFn to separate an input PCollection into the specified number of sub-PCollections.

When apply()d, a Partition() PTransform requires the following:

Parameters:
  • partitionfn – a PartitionFn, or a callable with the signature described in CallableWrapperPartitionFn.
  • n – number of output partitions.

The result of this PTransform is a simple list of the output PCollections representing each of n partitions, in order.

class ApplyPartitionFnFn(*unused_args, **unused_kwargs)[source]

Bases: apache_beam.transforms.core.DoFn

A DoFn that applies a PartitionFn.

process(element, partitionfn, n, *args, **kwargs)[source]
make_fn(fn, has_side_inputs)[source]
expand(pcoll)[source]
class apache_beam.transforms.core.Windowing(windowfn, triggerfn=None, accumulation_mode=None, timestamp_combiner=None)[source]

Bases: future.types.newobject.newobject

is_default()[source]
to_runner_api(context)[source]
static from_runner_api(proto, context)[source]
class apache_beam.transforms.core.WindowInto(windowfn, **kwargs)[source]

Bases: apache_beam.transforms.core.ParDo

A window transform assigning windows to each element of a PCollection.

Transforms an input PCollection by applying a windowing function to each element. Each transformed element in the result will be a WindowedValue element with the same input value and timestamp, with its new set of windows determined by the windowing function.

Initializes a WindowInto transform.

Parameters:
  • windowfn – Function to be used for windowing
  • trigger – (optional) Trigger used for windowing, or None for default.
  • accumulation_mode – (optional) Accumulation mode used for windowing, required for non-trivial triggers.
  • timestamp_combiner – (optional) Timestamp combiner used for windowing, or None for default.
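To make the effect of a windowing function concrete, here is a plain-Python model that assigns each timestamped value to a fixed-size window. Both functions are hypothetical: fixed_window is an assumed windowing rule with windows aligned at zero, and window_into models only the behavior described above (same value and timestamp, new set of windows), without triggers or accumulation modes.

```python
def fixed_window(timestamp, size):
    """Hypothetical windowing function.

    Returns the [start, end) interval of length `size` that contains
    `timestamp` (both in seconds).
    """
    start = timestamp - timestamp % size
    return (start, start + size)


def window_into(timestamped_values, size):
    """Model of WindowInto: keep value and timestamp, attach new windows."""
    return [
        (value, timestamp, [fixed_window(timestamp, size)])
        for value, timestamp in timestamped_values
    ]


events = [("a", 3), ("b", 59), ("c", 60), ("d", 125)]
windowed = window_into(events, size=60)
```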
class WindowIntoFn(windowing)[source]

Bases: apache_beam.transforms.core.DoFn

A DoFn that applies a WindowInto operation.

process(element, timestamp=TimestampParam, window=WindowParam)[source]
get_windowing(unused_inputs)[source]
infer_output_type(input_type)[source]
expand(pcoll)[source]
to_runner_api_parameter(context)[source]
static from_runner_api_parameter(proto, context)[source]
class apache_beam.transforms.core.Flatten(**kwargs)[source]

Bases: apache_beam.transforms.ptransform.PTransform

Merges several PCollections into a single PCollection.

Copies all elements in 0 or more PCollections into a single output PCollection. If there are no input PCollections, the resulting PCollection will be empty (but see also kwargs below).

Parameters: **kwargs – Accepts a single named argument “pipeline”, which specifies the pipeline that “owns” this PTransform. Ordinarily Flatten can obtain this information from one of the input PCollections, but if there are none (or if there’s a chance there may be none), this argument is the only way to provide pipeline information and should be considered mandatory.
expand(pcolls)[source]
get_windowing(inputs)[source]
to_runner_api_parameter(context)[source]
static from_runner_api_parameter(unused_parameter, unused_context)[source]
class apache_beam.transforms.core.Create(values, reshuffle=True)[source]

Bases: apache_beam.transforms.ptransform.PTransform

A transform that creates a PCollection from an iterable.

Initializes a Create transform.

Parameters: values – An iterable of values for the PCollection.
to_runner_api_parameter(context)[source]
infer_output_type(unused_input_type)[source]
get_output_type()[source]
expand(pbegin)[source]
get_windowing(unused_inputs)[source]
class apache_beam.transforms.core.Impulse(label=None)[source]

Bases: apache_beam.transforms.ptransform.PTransform

Impulse primitive.

expand(pbegin)[source]
get_windowing(inputs)[source]
infer_output_type(unused_input_type)[source]
to_runner_api_parameter(unused_context)[source]
static from_runner_api_parameter(unused_parameter, unused_context)[source]