apache_beam.transforms.core module¶
Core PTransform subclasses, such as FlatMap, GroupByKey, and Map.
-
class
apache_beam.transforms.core.
RestrictionProvider
[source]¶ Bases:
object
Provides methods for generating and manipulating restrictions.
This class should be implemented to support Splittable
DoFn
in Python SDK. See https://s.apache.org/splittable-do-fn for more details about SplittableDoFn
.To denote a
DoFn
class to be SplittableDoFn
,DoFn.process()
method of that class should have exactly one parameter whose default value is an instance ofRestrictionProvider
.The provided
RestrictionProvider
instance must provide suitable overrides for the following methods: * create_tracker() * initial_restriction()Optionally,
RestrictionProvider
may override default implementations of following methods: * restriction_coder() * restriction_size() * split() * split_and_size()** Pausing and resuming processing of an element **
As the last element produced by the iterator returned by the
DoFn.process()
method, a SplittableDoFn
may return an object of typeProcessContinuation
.If provided,
ProcessContinuation
object specifies that runner should later re-invokeDoFn.process()
method to resume processing the current element and the manner in which the re-invocation should be performed. AProcessContinuation
object must only be specified as the last element of the iterator. If aProcessContinuation
object is not provided the runner will assume that the current input element has been fully processed.** Updating output watermark **
DoFn.process()
method of SplittableDoFn``s could contain a parameter with default value ``DoFn.WatermarkReporterParam
. If specified this asks the runner to provide a function that can be used to give the runner a (best-effort) lower bound about the timestamps of future output associated with the current element processed by theDoFn
. If theDoFn
has multiple outputs, the watermark applies to all of them. Provided function must be invoked with a single parameter of typeTimestamp
or as an integer that gives the watermark in number of seconds.-
create_tracker
(restriction)[source]¶ Produces a new
RestrictionTracker
for the given restriction.This API is required to be implemented.
Parameters: restriction – an object that defines a restriction as identified by a Splittable DoFn
that utilizes the currentRestrictionProvider
. For example, a tuple that gives a range of positions for a SplittableDoFn
that reads files based on byte positions.Returns: an object of type
RestrictionTracker
.
-
initial_restriction
(element)[source]¶ Produces an initial restriction for the given element.
This API is required to be implemented.
-
split
(element, restriction)[source]¶ Splits the given element and restriction.
Returns an iterator of restrictions. The total set of elements produced by reading input element for each of the returned restrictions should be the same as the total set of elements produced by reading the input element for the input restriction.
This API is optional if
split_and_size
has been implemented.
-
restriction_coder
()[source]¶ Returns a
Coder
for restrictions.Returned``Coder`` will be used for the restrictions produced by the current
RestrictionProvider
.Returns: an object of type Coder
.
-
-
class
apache_beam.transforms.core.
WatermarkEstimatorProvider
[source]¶ Bases:
object
Provides methods for generating WatermarkEstimator.
This class should be implemented if wanting to providing output_watermark information within an SDF.
In order to make an SDF.process() access to the typical WatermarkEstimator, the SDF author should pass a DoFn.WatermarkEstimatorParam with a default value of one WatermarkEstimatorProvider instance.
-
initial_estimator_state
(element, restriction)[source]¶ Returns the initial state of the WatermarkEstimator with given element and restriction. This function is called by the system.
-
-
class
apache_beam.transforms.core.
DoFn
(*unused_args, **unused_kwargs)[source]¶ Bases:
apache_beam.typehints.decorators.WithTypeHints
,apache_beam.transforms.display.HasDisplayData
,apache_beam.utils.urns.RunnerApiFn
A function object used by a transform with custom processing.
The ParDo transform is such a transform. The ParDo.apply method will take an object of type DoFn and apply it to all elements of a PCollection object.
In order to have concrete DoFn objects one has to subclass from DoFn and define the desired behavior (start_bundle/finish_bundle and process) or wrap a callable object using the CallableWrapperDoFn class.
-
ElementParam
= ElementParam¶
-
SideInputParam
= SideInputParam¶
-
TimestampParam
= TimestampParam¶
-
WindowParam
= WindowParam¶
-
PaneInfoParam
= PaneInfoParam¶
-
WatermarkEstimatorParam
¶ alias of
_WatermarkEstimatorParam
-
BundleFinalizerParam
¶ alias of
_BundleFinalizerParam
-
KeyParam
= KeyParam¶
-
StateParam
¶ alias of
_StateDoFnParam
-
TimerParam
¶ alias of
_TimerDoFnParam
-
DoFnProcessParams
= [ElementParam, SideInputParam, TimestampParam, WindowParam, <class 'apache_beam.transforms.core._WatermarkEstimatorParam'>, PaneInfoParam, <class 'apache_beam.transforms.core._BundleFinalizerParam'>, KeyParam, <class 'apache_beam.transforms.core._StateDoFnParam'>, <class 'apache_beam.transforms.core._TimerDoFnParam'>]¶
-
RestrictionParam
¶ alias of
_RestrictionDoFnParam
-
process
(element, *args, **kwargs)[source]¶ Method to use for processing elements.
This is invoked by
DoFnRunner
for each element of a inputPCollection
.If specified, following default arguments are used by the
DoFnRunner
to be able to pass the parameters correctly.DoFn.ElementParam
: element to be processed, should not be mutated.DoFn.SideInputParam
: a side input that may be used when processing.DoFn.TimestampParam
: timestamp of the input element.DoFn.WindowParam
:Window
the input element belongs to.DoFn.TimerParam
: auserstate.RuntimeTimer
object defined by the spec of the parameter.DoFn.StateParam
: auserstate.RuntimeState
object defined by the spec of the parameter.DoFn.KeyParam
: key associated with the element.DoFn.RestrictionParam
: aniobase.RestrictionTracker
will be provided here to allow treatment as a SplittableDoFn
. The restriction tracker will be derived from the restriction provider in the parameter.DoFn.WatermarkEstimatorParam
: a function that can be used to track output watermark of SplittableDoFn
implementations.Parameters: - element – The element to be processed
- *args – side inputs
- **kwargs – other keyword arguments.
Returns: An Iterable of output elements or None.
-
setup
()[source]¶ Called to prepare an instance for processing bundles of elements.
This is a good place to initialize transient in-memory resources, such as network connections. The resources can then be disposed in
DoFn.teardown
.
-
start_bundle
()[source]¶ Called before a bundle of elements is processed on a worker.
Elements to be processed are split into bundles and distributed to workers. Before a worker calls process() on the first element of its bundle, it calls this method.
-
teardown
()[source]¶ Called to use to clean up this instance before it is discarded.
A runner will do its best to call this method on any given instance to prevent leaks of transient resources, however, there may be situations where this is impossible (e.g. process crash, hardware failure, etc.) or unnecessary (e.g. the pipeline is shutting down and the process is about to be killed anyway, so all transient resources will be released automatically by the OS). In these cases, the call may not happen. It will also not be retried, because in such situations the DoFn instance no longer exists, so there’s no instance to retry it on.
Thus, all work that depends on input elements, and all externally important side effects, must be performed in
DoFn.process
orDoFn.finish_bundle
.
-
to_runner_api_parameter
(context)¶
-
-
class
apache_beam.transforms.core.
CombineFn
(*unused_args, **unused_kwargs)[source]¶ Bases:
apache_beam.typehints.decorators.WithTypeHints
,apache_beam.transforms.display.HasDisplayData
,apache_beam.utils.urns.RunnerApiFn
A function object used by a Combine transform with custom processing.
A CombineFn specifies how multiple values in all or part of a PCollection can be merged into a single value—essentially providing the same kind of information as the arguments to the Python “reduce” builtin (except for the input argument, which is an instance of CombineFnProcessContext). The combining process proceeds as follows:
- Input values are partitioned into one or more batches.
- For each batch, the create_accumulator method is invoked to create a fresh initial “accumulator” value representing the combination of zero values.
- For each input value in the batch, the add_input method is invoked to combine more values with the accumulator for that batch.
- The merge_accumulators method is invoked to combine accumulators from separate batches into a single combined output accumulator value, once all of the accumulators have had all the input value in their batches added to them. This operation is invoked repeatedly, until there is only one accumulator value left.
- The extract_output operation is invoked on the final accumulator to get the output value.
Note: If this CombineFn is used with a transform that has defaults, apply will be called with an empty list at expansion time to get the default value.
-
create_accumulator
(*args, **kwargs)[source]¶ Return a fresh, empty accumulator for the combine operation.
Parameters: - *args – Additional arguments and side inputs.
- **kwargs – Additional arguments and side inputs.
-
add_input
(mutable_accumulator, element, *args, **kwargs)[source]¶ Return result of folding element into accumulator.
CombineFn implementors must override add_input.
Parameters: - mutable_accumulator – the current accumulator, may be modified and returned for efficiency
- element – the element to add, should not be mutated
- *args – Additional arguments and side inputs.
- **kwargs – Additional arguments and side inputs.
-
add_inputs
(mutable_accumulator, elements, *args, **kwargs)[source]¶ Returns the result of folding each element in elements into accumulator.
This is provided in case the implementation affords more efficient bulk addition of elements. The default implementation simply loops over the inputs invoking add_input for each one.
Parameters: - mutable_accumulator – the current accumulator, may be modified and returned for efficiency
- elements – the elements to add, should not be mutated
- *args – Additional arguments and side inputs.
- **kwargs – Additional arguments and side inputs.
-
merge_accumulators
(accumulators, *args, **kwargs)[source]¶ Returns the result of merging several accumulators to a single accumulator value.
Parameters: - accumulators – the accumulators to merge. Only the first accumulator may be modified and returned for efficiency; the other accumulators should not be mutated, because they may be shared with other code and mutating them could lead to incorrect results or data corruption.
- *args – Additional arguments and side inputs.
- **kwargs – Additional arguments and side inputs.
-
compact
(accumulator, *args, **kwargs)[source]¶ Optionally returns a more compact represenation of the accumulator.
This is called before an accumulator is sent across the wire, and can be useful in cases where values are buffered or otherwise lazily kept unprocessed when added to the accumulator. Should return an equivalent, though possibly modified, accumulator.
By default returns the accumulator unmodified.
Parameters: - accumulator – the current accumulator
- *args – Additional arguments and side inputs.
- **kwargs – Additional arguments and side inputs.
-
extract_output
(accumulator, *args, **kwargs)[source]¶ Return result of converting accumulator into the output value.
Parameters: - accumulator – the final accumulator value computed by this CombineFn for the entire input key or PCollection. Can be modified for efficiency.
- *args – Additional arguments and side inputs.
- **kwargs – Additional arguments and side inputs.
-
apply
(elements, *args, **kwargs)[source]¶ Returns result of applying this CombineFn to the input values.
Parameters: - elements – the set of values to combine.
- *args – Additional arguments and side inputs.
- **kwargs – Additional arguments and side inputs.
-
for_input_type
(input_type)[source]¶ Returns a specialized implementation of self, if it exists.
Otherwise, returns self.
Parameters: input_type – the type of input elements.
-
to_runner_api_parameter
(context)¶
-
class
apache_beam.transforms.core.
PartitionFn
(*unused_args, **unused_kwargs)[source]¶ Bases:
apache_beam.typehints.decorators.WithTypeHints
A function object used by a Partition transform.
A PartitionFn specifies how individual values in a PCollection will be placed into separate partitions, indexed by an integer.
-
partition_for
(element, num_partitions, *args, **kwargs)[source]¶ Specify which partition will receive this element.
Parameters: - element – An element of the input PCollection.
- num_partitions – Number of partitions, i.e., output PCollections.
- *args – optional parameters and side inputs.
- **kwargs – optional parameters and side inputs.
Returns: An integer in [0, num_partitions).
-
-
class
apache_beam.transforms.core.
ParDo
(fn, *args, **kwargs)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransformWithSideInputs
A
ParDo
transform.Processes an input
PCollection
by applying aDoFn
to each element and returning the accumulated results into an outputPCollection
. The type of the elements is not fixed as long as theDoFn
can deal with it. In reality the type is restrained to some extent because the elements sometimes must be persisted to external storage. See theexpand()
method comments for a detailed description of all possible arguments.Note that the
DoFn
must return an iterable for each element of the inputPCollection
. An easy way to do this is to use theyield
keyword in the process method.Parameters: - pcoll (PCollection) – a
PCollection
to be processed. - fn (typing.Union[DoFn, typing.Callable]) – a
DoFn
object to be applied to each element of pcoll argument, or a Callable. - *args – positional arguments passed to the
DoFn
object. - **kwargs – keyword arguments passed to the
DoFn
object.
Note that the positional and keyword arguments will be processed in order to detect
PCollection
s that will be computed as side inputs to the transform. During pipeline execution whenever theDoFn
object gets executed (itsDoFn.process()
method gets called) thePCollection
arguments will be replaced by values from thePCollection
in the exact positions where they appear in the argument lists.-
with_outputs
(*tags, **main_kw)[source]¶ Returns a tagged tuple allowing access to the outputs of a
ParDo
.The resulting object supports access to the
PCollection
associated with a tag (e.g.o.tag
,o[tag]
) and iterating over the available tags (e.g.for tag in o: ...
).Parameters: - *tags – if non-empty, list of valid tags. If a list of valid tags is given, it will be an error to use an undeclared tag later in the pipeline.
- **main_kw – dictionary empty or with one key
'main'
defining the tag to be used for the main output (which will not have a tag associated with it).
Returns: An object of type
DoOutputsTuple
that bundles together all the outputs of aParDo
transform and allows accessing the individualPCollection
s for each output using anobject.tag
syntax.Return type: DoOutputsTuple
Raises: TypeError
– if the self object is not aPCollection
that is the result of aParDo
transform.ValueError
– if main_kw contains any key other than'main'
.
- pcoll (PCollection) – a
-
apache_beam.transforms.core.
FlatMap
(fn, *args, **kwargs)[source]¶ FlatMap()
is likeParDo
except it takes a callable to specify the transformation.The callable must return an iterable for each element of the input
PCollection
. The elements of these iterables will be flattened into the outputPCollection
.Parameters: - fn (callable) – a callable object.
- *args – positional arguments passed to the transform callable.
- **kwargs – keyword arguments passed to the transform callable.
Returns: A
PCollection
containing theFlatMap()
outputs.Return type: Raises: TypeError
– If the fn passed as argument is not a callable. Typical error is to pass aDoFn
instance which is supported only forParDo
.
-
apache_beam.transforms.core.
Map
(fn, *args, **kwargs)[source]¶ Map()
is likeFlatMap()
except its callable returns only a single element.Parameters: - fn (callable) – a callable object.
- *args – positional arguments passed to the transform callable.
- **kwargs – keyword arguments passed to the transform callable.
Returns: A
PCollection
containing theMap()
outputs.Return type: Raises: TypeError
– If the fn passed as argument is not a callable. Typical error is to pass aDoFn
instance which is supported only forParDo
.
-
apache_beam.transforms.core.
MapTuple
(fn, *args, **kwargs)[source]¶ MapTuple()
is likeMap()
but expects tuple inputs and flattens them into multiple input arguments.beam.MapTuple(lambda a, b, …: …)is equivalent to Python 2
beam.Map(lambda (a, b, …), …: …)In other words
beam.MapTuple(fn)is equivalent to
beam.Map(lambda element, …: fn(*element, …))This can be useful when processing a PCollection of tuples (e.g. key-value pairs).
Parameters: - fn (callable) – a callable object.
- *args – positional arguments passed to the transform callable.
- **kwargs – keyword arguments passed to the transform callable.
Returns: A
PCollection
containing theMapTuple()
outputs.Return type: Raises: TypeError
– If the fn passed as argument is not a callable. Typical error is to pass aDoFn
instance which is supported only forParDo
.
-
apache_beam.transforms.core.
FlatMapTuple
(fn, *args, **kwargs)[source]¶ FlatMapTuple()
is likeFlatMap()
but expects tuple inputs and flattens them into multiple input arguments.beam.FlatMapTuple(lambda a, b, …: …)is equivalent to Python 2
beam.FlatMap(lambda (a, b, …), …: …)In other words
beam.FlatMapTuple(fn)is equivalent to
beam.FlatMap(lambda element, …: fn(*element, …))This can be useful when processing a PCollection of tuples (e.g. key-value pairs).
Parameters: - fn (callable) – a callable object.
- *args – positional arguments passed to the transform callable.
- **kwargs – keyword arguments passed to the transform callable.
Returns: A
PCollection
containing theFlatMapTuple()
outputs.Return type: Raises: TypeError
– If the fn passed as argument is not a callable. Typical error is to pass aDoFn
instance which is supported only forParDo
.
-
apache_beam.transforms.core.
Filter
(fn, *args, **kwargs)[source]¶ Filter()
is aFlatMap()
with its callable filtering out elements.Parameters: - fn (
Callable[..., bool]
) – a callable object. First argument will be an element. - *args – positional arguments passed to the transform callable.
- **kwargs – keyword arguments passed to the transform callable.
Returns: A
PCollection
containing theFilter()
outputs.Return type: Raises: TypeError
– If the fn passed as argument is not a callable. Typical error is to pass aDoFn
instance which is supported only forParDo
.- fn (
-
class
apache_beam.transforms.core.
CombineGlobally
(fn, *args, **kwargs)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransform
A
CombineGlobally
transform.Reduces a
PCollection
to a single value by progressively applying aCombineFn
to portions of thePCollection
(and to intermediate values created thereby). See documentation inCombineFn
for details on the specifics on howCombineFn
s are applied.Parameters: - pcoll (PCollection) – a
PCollection
to be reduced into a single value. - fn (callable) – a
CombineFn
object that will be called to progressively reduce thePCollection
into single values, or a callable suitable for wrapping byCallableWrapperCombineFn
. - *args – positional arguments passed to the
CombineFn
object. - **kwargs – keyword arguments passed to the
CombineFn
object.
Raises: TypeError
– If the output type of the inputPCollection
is not compatible withIterable[A]
.Returns: A single-element
PCollection
containing the main output of theCombineGlobally
transform.Return type: Note that the positional and keyword arguments will be processed in order to detect
PValue
s that will be computed as side inputs to the transform. During pipeline execution whenever theCombineFn
object gets executed (i.e. any of theCombineFn
methods get called), thePValue
arguments will be replaced by their actual value in the exact position where they appear in the argument lists.-
has_defaults
= True¶
-
as_view
= False¶
-
fanout
= None¶
- pcoll (PCollection) – a
-
class
apache_beam.transforms.core.
CombinePerKey
(fn, *args, **kwargs)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransformWithSideInputs
A per-key Combine transform.
Identifies sets of values associated with the same key in the input PCollection, then applies a CombineFn to condense those sets to single values. See documentation in CombineFn for details on the specifics on how CombineFns are applied.
Parameters: - pcoll – input pcollection.
- fn – instance of CombineFn to apply to all values under the same key in
pcoll, or a callable whose signature is
f(iterable, *args, **kwargs)
(e.g., sum, max). - *args – arguments and side inputs, passed directly to the CombineFn.
- **kwargs – arguments and side inputs, passed directly to the CombineFn.
Returns: A PObject holding the result of the combine operation.
-
with_hot_key_fanout
(fanout)[source]¶ A per-key combine operation like self but with two levels of aggregation.
If a given key is produced by too many upstream bundles, the final reduction can become a bottleneck despite partial combining being lifted pre-GroupByKey. In these cases it can be helpful to perform intermediate partial aggregations in parallel and then re-group to peform a final (per-key) combine. This is also useful for high-volume keys in streaming where combiners are not generally lifted for latency reasons.
Note that a fanout greater than 1 requires the data to be sent through two GroupByKeys, and a high fanout can also result in more shuffle data due to less per-bundle combining. Setting the fanout for a key at 1 or less places values on the “cold key” path that skip the intermediate level of aggregation.
Parameters: fanout – either None, for no fanout, an int, for a constant-degree fanout, or a callable mapping keys to a key-specific degree of fanout. Returns: A per-key combining PTransform with the specified fanout.
-
class
apache_beam.transforms.core.
CombineValues
(fn, *args, **kwargs)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransformWithSideInputs
-
class
apache_beam.transforms.core.
GroupByKey
(label=None)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransform
A group by key transform.
Processes an input PCollection consisting of key/value pairs represented as a tuple pair. The result is a PCollection where values having a common key are grouped together. For example (a, 1), (b, 2), (a, 3) will result into (a, [1, 3]), (b, [2]).
The implementation here is used only when run on the local direct runner.
-
class
apache_beam.transforms.core.
Partition
(fn, *args, **kwargs)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransformWithSideInputs
Split a PCollection into several partitions.
Uses the specified PartitionFn to separate an input PCollection into the specified number of sub-PCollections.
When apply()d, a Partition() PTransform requires the following:
Parameters: - partitionfn – a PartitionFn, or a callable with the signature described in CallableWrapperPartitionFn.
- n – number of output partitions.
The result of this PTransform is a simple list of the output PCollections representing each of n partitions, in order.
-
class
ApplyPartitionFnFn
(*unused_args, **unused_kwargs)[source]¶ Bases:
apache_beam.transforms.core.DoFn
A DoFn that applies a PartitionFn.
-
class
apache_beam.transforms.core.
Windowing
(windowfn, triggerfn=None, accumulation_mode=None, timestamp_combiner=None, allowed_lateness=0, environment_id=None)[source]¶ Bases:
object
Class representing the window strategy.
Parameters: - windowfn – Window assign function.
- triggerfn – Trigger function.
- accumulation_mode – a AccumulationMode, controls what to do with data when a trigger fires multiple times.
- timestamp_combiner – a TimestampCombiner, determines how output timestamps of grouping operations are assigned.
- allowed_lateness – Maximum delay in seconds after end of window allowed for any late data to be processed without being discarded directly.
- environment_id – Environment where the current window_fn should be applied in.
-
class
apache_beam.transforms.core.
WindowInto
(windowfn, trigger=None, accumulation_mode=None, timestamp_combiner=None, allowed_lateness=0)[source]¶ Bases:
apache_beam.transforms.core.ParDo
A window transform assigning windows to each element of a PCollection.
Transforms an input PCollection by applying a windowing function to each element. Each transformed element in the result will be a WindowedValue element with the same input value and timestamp, with its new set of windows determined by the windowing function.
Initializes a WindowInto transform.
Parameters: - windowfn (Windowing, WindowFn) – Function to be used for windowing.
- trigger – (optional) Trigger used for windowing, or None for default.
- accumulation_mode – (optional) Accumulation mode used for windowing, required for non-trivial triggers.
- timestamp_combiner – (optional) Timestamp combniner used for windowing, or None for default.
-
class
WindowIntoFn
(windowing)[source]¶ Bases:
apache_beam.transforms.core.DoFn
A DoFn that applies a WindowInto operation.
-
class
apache_beam.transforms.core.
Flatten
(**kwargs)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransform
Merges several PCollections into a single PCollection.
Copies all elements in 0 or more PCollections into a single output PCollection. If there are no input PCollections, the resulting PCollection will be empty (but see also kwargs below).
Parameters: **kwargs – Accepts a single named argument “pipeline”, which specifies the pipeline that “owns” this PTransform. Ordinarily Flatten can obtain this information from one of the input PCollections, but if there are none (or if there’s a chance there may be none), this argument is the only way to provide pipeline information and should be considered mandatory.
-
class
apache_beam.transforms.core.
Create
(values, reshuffle=True)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransform
A transform that creates a PCollection from an iterable.
Initializes a Create transform.
Parameters: values – An object of values for the PCollection
-
class
apache_beam.transforms.core.
Impulse
(label=None)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransform
Impulse primitive.