apache_beam.transforms.core module
Core PTransform subclasses, such as FlatMap, GroupByKey, and Map.
-
class
apache_beam.transforms.core.
RestrictionProvider
[source]¶ Bases:
object
Provides methods for generating and manipulating restrictions.
This class should be implemented to support Splittable DoFn in the Python SDK. See https://s.apache.org/splittable-do-fn for more details about Splittable DoFn.
To denote a DoFn class as a Splittable DoFn, the DoFn.process() method of that class should have exactly one parameter whose default value is an instance of RestrictionParam. This RestrictionParam can either be constructed with an explicit RestrictionProvider, or, if no RestrictionProvider is provided, the DoFn itself must be a RestrictionProvider.
The provided RestrictionProvider instance must provide suitable overrides for the following methods:
- create_tracker()
- initial_restriction()
- restriction_size()
Optionally, RestrictionProvider may override default implementations of the following methods:
- restriction_coder()
- split()
- split_and_size()
- truncate()
** Pausing and resuming processing of an element **
As the last element produced by the iterator returned by the DoFn.process() method, a Splittable DoFn may return an object of type ProcessContinuation.
If restriction_tracker.defer_remainder is called in DoFn.process(), the runner should later re-invoke the DoFn.process() method to resume processing the current element; the call also indicates the manner in which the re-invocation should be performed.
** Updating output watermark **
The DoFn.process() method of a Splittable DoFn can contain a parameter with default value DoFn.WatermarkReporterParam. If specified, this asks the runner to provide a function that can be used to give the runner a (best-effort) lower bound on the timestamps of future output associated with the current element processed by the DoFn. If the DoFn has multiple outputs, the watermark applies to all of them. The provided function must be invoked with a single parameter, either of type Timestamp or an integer that gives the watermark in number of seconds.
-
create_tracker
(restriction) Produces a new RestrictionTracker for the given restriction.
This API is required to be implemented.
Parameters: restriction – an object that defines a restriction as identified by a Splittable DoFn that utilizes the current RestrictionProvider. For example, a tuple that gives a range of positions for a Splittable DoFn that reads files based on byte positions.
Returns: an object of type RestrictionTracker.
-
initial_restriction
(element)[source]¶ Produces an initial restriction for the given element.
This API is required to be implemented.
-
split
(element, restriction) Splits the given element and restriction initially.
This method enables runners to perform bulk splitting initially, allowing for a rapid increase in parallelism. Note that initial split is a different concept from the split during element processing time. Please refer to iobase.RestrictionTracker.try_split for details about splitting when the current element and restriction are actively being processed.
Returns an iterator of restrictions. The total set of elements produced by reading the input element for each of the returned restrictions should be the same as the total set of elements produced by reading the input element for the input restriction.
This API is optional if split_and_size has been implemented.
If this method is not overridden, no initial splitting happens on each restriction.
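As an illustration of initial splitting, the sketch below models a restriction as a hypothetical half-open `(start, stop)` offset tuple and cuts it into fixed-size chunks. It is plain Python rather than a real RestrictionProvider subclass; the function name and the `chunk_size` parameter are assumptions made for this example. The chunks together cover exactly the positions of the input restriction, which is the property the method description asks for.

```python
def split_offset_range(restriction, chunk_size):
    """Yield (start, stop) sub-ranges that together cover `restriction`.

    `restriction` is a hypothetical half-open (start, stop) tuple.
    """
    start, stop = restriction
    while start < stop:
        end = min(start + chunk_size, stop)
        yield (start, end)
        start = end


parts = list(split_offset_range((0, 10), chunk_size=4))
print(parts)  # [(0, 4), (4, 8), (8, 10)]

# The sub-ranges cover the same positions as the original restriction.
covered = [pos for lo, hi in parts for pos in range(lo, hi)]
```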
-
restriction_coder
() Returns a Coder for restrictions.
The returned Coder will be used for the restrictions produced by the current RestrictionProvider.
Returns: an object of type Coder.
-
restriction_size
(element, restriction)[source]¶ Returns the size of a restriction with respect to the given element.
By default, asks a newly-created restriction tracker for the default size of the restriction.
The return value must be non-negative.
Must be thread safe. Will be invoked concurrently during bundle processing due to runner initiated splitting and progress estimation.
This API is required to be implemented.
-
split_and_size
(element, restriction) Like split, but also does sizing, returning (restriction, size) pairs.
For each pair, size must be non-negative.
This API is optional if split and restriction_size have been implemented.
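One way to picture how the three methods relate is to pair every restriction produced by a split function with its size. The sketch below does this in plain Python; the standalone functions, the `(start, stop)` restriction format, and the `chunk_size` default are assumptions for illustration and are not taken from the Beam source.

```python
def split(element, restriction, chunk_size=4):
    """Hypothetical split: cut a (start, stop) range into chunks."""
    start, stop = restriction
    while start < stop:
        end = min(start + chunk_size, stop)
        yield (start, end)
        start = end


def restriction_size(element, restriction):
    """Hypothetical size: number of positions in the range (never negative)."""
    start, stop = restriction
    return max(stop - start, 0)


def split_and_size(element, restriction):
    """Pair each sub-restriction with its size."""
    for part in split(element, restriction):
        yield part, restriction_size(element, part)


pairs = list(split_and_size("some-file", (0, 10)))
print(pairs)  # [((0, 4), 4), ((4, 8), 4), ((8, 10), 2)]
```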
-
truncate
(element, restriction) Truncates the provided restriction into a restriction representing a finite amount of work when the pipeline is draining. See https://docs.google.com/document/d/1NExwHlj-2q2WUGhSO4jTu8XGhDPmm3cllSN8IMmWci8/edit# for additional details about drain.
By default, if the restriction is bounded then the restriction will be returned; otherwise None will be returned.
This API is optional and should only be implemented if more granularity is required.
Return a truncated finite restriction if further processing is required; otherwise return None to represent that no further processing of this restriction is required.
The default behavior when a pipeline is being drained is that bounded restrictions process entirely while unbounded restrictions process till a checkpoint is possible.
-
-
class
apache_beam.transforms.core.
WatermarkEstimatorProvider
[source]¶ Bases:
object
Provides methods for generating WatermarkEstimator.
This class should be implemented to provide output watermark information within an SDF.
To give SDF.process() access to a WatermarkEstimator, the SDF author should declare an argument whose default value is a DoFn.WatermarkEstimatorParam instance. This DoFn.WatermarkEstimatorParam can either be constructed with an explicit WatermarkEstimatorProvider, or, if no WatermarkEstimatorProvider is provided, the DoFn itself must be a WatermarkEstimatorProvider.
-
initial_estimator_state
(element, restriction)[source]¶ Returns the initial state of the WatermarkEstimator with given element and restriction. This function is called by the system.
-
-
class
apache_beam.transforms.core.
DoFn
(*unused_args, **unused_kwargs)[source]¶ Bases:
apache_beam.typehints.decorators.WithTypeHints
,apache_beam.transforms.display.HasDisplayData
,apache_beam.utils.urns.RunnerApiFn
A function object used by a transform with custom processing.
The ParDo transform is such a transform. The ParDo.apply method will take an object of type DoFn and apply it to all elements of a PCollection object.
In order to have concrete DoFn objects one has to subclass from DoFn and define the desired behavior (start_bundle/finish_bundle and process) or wrap a callable object using the CallableWrapperDoFn class.
-
ElementParam
= ElementParam¶
-
SideInputParam
= SideInputParam¶
-
TimestampParam
= TimestampParam¶
-
WindowParam
= WindowParam¶
-
PaneInfoParam
= PaneInfoParam¶
-
WatermarkEstimatorParam
¶ alias of
_WatermarkEstimatorParam
-
BundleFinalizerParam
¶ alias of
_BundleFinalizerParam
-
KeyParam
= KeyParam¶
-
BundleContextParam
¶ alias of
_BundleContextParam
-
SetupContextParam
¶ alias of
_SetupContextParam
-
StateParam
¶ alias of
_StateDoFnParam
-
TimerParam
¶ alias of
_TimerDoFnParam
-
DynamicTimerTagParam
= DynamicTimerTagParam¶
-
DoFnProcessParams
= [ElementParam, SideInputParam, TimestampParam, WindowParam, <class 'apache_beam.transforms.core._WatermarkEstimatorParam'>, PaneInfoParam, <class 'apache_beam.transforms.core._BundleFinalizerParam'>, KeyParam, <class 'apache_beam.transforms.core._StateDoFnParam'>, <class 'apache_beam.transforms.core._TimerDoFnParam'>, <class 'apache_beam.transforms.core._BundleContextParam'>, <class 'apache_beam.transforms.core._SetupContextParam'>]¶
-
RestrictionParam
¶ alias of
_RestrictionDoFnParam
-
static
unbounded_per_element
()[source]¶ A decorator on process fn specifying that the fn performs an unbounded amount of work per input element.
-
static
yields_elements
(fn) A decorator to apply to process_batch indicating it yields elements.
By default process_batch is assumed to both consume and produce “batches”, which are collections of multiple logical Beam elements. This decorator indicates that process_batch produces individual elements at a time. process_batch is always expected to consume batches.
-
static
yields_batches
(fn) A decorator to apply to process indicating it yields batches.
By default process is assumed to both consume and produce individual elements at a time. This decorator indicates that process produces “batches”, which are collections of multiple logical Beam elements.
-
process
(element, *args, **kwargs) Method to use for processing elements.
This is invoked by DoFnRunner for each element of an input PCollection.
The following parameters can be used as default values on process arguments to indicate that a DoFn accepts the corresponding parameters. For example, a DoFn might accept the element and its timestamp with the following signature:
def process(element=DoFn.ElementParam, timestamp=DoFn.TimestampParam): ...
The full set of parameters is:
- DoFn.ElementParam: element to be processed, should not be mutated.
- DoFn.SideInputParam: a side input that may be used when processing.
- DoFn.TimestampParam: timestamp of the input element.
- DoFn.WindowParam: Window the input element belongs to.
- DoFn.TimerParam: a userstate.RuntimeTimer object defined by the spec of the parameter.
- DoFn.StateParam: a userstate.RuntimeState object defined by the spec of the parameter.
- DoFn.KeyParam: key associated with the element.
- DoFn.RestrictionParam: an iobase.RestrictionTracker will be provided here to allow treatment as a Splittable DoFn. The restriction tracker will be derived from the restriction provider in the parameter.
- DoFn.WatermarkEstimatorParam: a function that can be used to track output watermark of Splittable DoFn implementations.
- DoFn.BundleContextParam: allows a shared context manager to be used per bundle.
- DoFn.SetupContextParam: allows a shared context manager to be used per DoFn.
Parameters: - element – The element to be processed
- *args – side inputs
- **kwargs – other keyword arguments.
Returns: An Iterable of output elements or None.
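The return contract of process (an iterable of outputs, or None) can be pictured with a small plain-Python model. The two classes below only mimic the shape of a DoFn and do not subclass it; `apply_do_fn` is a hypothetical driver standing in for what a runner does, so treat this as a sketch of the contract rather than of Beam's execution.

```python
class ExtractLongWords:
    """Plain-Python stand-in for a DoFn subclass (hypothetical example)."""

    def process(self, element):
        # A generator yields zero or more outputs per input element.
        for word in element.split():
            if len(word) > 3:
                yield word


class RecordOnly:
    """process() may also return None, meaning there are no outputs."""

    def __init__(self):
        self.seen = []

    def process(self, element):
        self.seen.append(element)  # side effect only; implicitly returns None


def apply_do_fn(do_fn, elements):
    """Hypothetical driver: collect the outputs of process() per element."""
    outputs = []
    for element in elements:
        result = do_fn.process(element)
        if result is not None:
            outputs.extend(result)
    return outputs


lines = ["a quick brown fox", "to be", "jumps over"]
long_words = apply_do_fn(ExtractLongWords(), lines)
print(long_words)  # ['quick', 'brown', 'jumps', 'over']

recorder = RecordOnly()
nothing = apply_do_fn(recorder, lines)
print(nothing)  # []
```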
-
setup
()[source]¶ Called to prepare an instance for processing bundles of elements.
This is a good place to initialize transient in-memory resources, such as network connections. The resources can then be disposed in
DoFn.teardown
.
-
start_bundle
()[source]¶ Called before a bundle of elements is processed on a worker.
Elements to be processed are split into bundles and distributed to workers. Before a worker calls process() on the first element of its bundle, it calls this method.
-
teardown
() Called to clean up this instance before it is discarded.
A runner will do its best to call this method on any given instance to prevent leaks of transient resources, however, there may be situations where this is impossible (e.g. process crash, hardware failure, etc.) or unnecessary (e.g. the pipeline is shutting down and the process is about to be killed anyway, so all transient resources will be released automatically by the OS). In these cases, the call may not happen. It will also not be retried, because in such situations the DoFn instance no longer exists, so there’s no instance to retry it on.
Thus, all work that depends on input elements, and all externally important side effects, must be performed in DoFn.process or DoFn.finish_bundle.
-
get_input_batch_type
(input_element_type) → Union[apache_beam.typehints.typehints.TypeConstraint, type, None] Determine the batch type expected as input to process_batch.
The default implementation of get_input_batch_type simply observes the input typehint for the first parameter of process_batch. A Batched DoFn may override this method if a dynamic approach is required.
Parameters: input_element_type – The element type of the input PCollection this DoFn is being applied to.
Returns: None if this DoFn cannot accept batches, else a Beam typehint or a native Python typehint.
-
get_output_batch_type
(input_element_type) → Union[apache_beam.typehints.typehints.TypeConstraint, type, None] Determine the batch type produced by this DoFn’s process_batch implementation and/or its process implementation with @yields_batches.
The default implementation of this method observes the return type annotations on process_batch and/or process. A Batched DoFn may override this method if a dynamic approach is required.
Parameters: input_element_type – The element type of the input PCollection this DoFn is being applied to.
Returns: None if this DoFn will never yield batches, else a Beam typehint or a native Python typehint.
-
to_runner_api_parameter
(context)¶
-
-
class
apache_beam.transforms.core.
CombineFn
(*unused_args, **unused_kwargs)[source]¶ Bases:
apache_beam.typehints.decorators.WithTypeHints
,apache_beam.transforms.display.HasDisplayData
,apache_beam.utils.urns.RunnerApiFn
A function object used by a Combine transform with custom processing.
A CombineFn specifies how multiple values in all or part of a PCollection can be merged into a single value—essentially providing the same kind of information as the arguments to the Python “reduce” builtin (except for the input argument, which is an instance of CombineFnProcessContext). The combining process proceeds as follows:
- Input values are partitioned into one or more batches.
- For each batch, the setup method is invoked.
- For each batch, the create_accumulator method is invoked to create a fresh initial “accumulator” value representing the combination of zero values.
- For each input value in the batch, the add_input method is invoked to combine more values with the accumulator for that batch.
- The merge_accumulators method is invoked to combine accumulators from separate batches into a single combined output accumulator value, once all of the accumulators have had all the input values in their batches added to them. This operation is invoked repeatedly, until there is only one accumulator value left.
- The extract_output operation is invoked on the final accumulator to get the output value.
- The teardown method is invoked.
Note: If this CombineFn is used with a transform that has defaults, apply will be called with an empty list at expansion time to get the default value.
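The combining steps listed above can be traced with a small plain-Python model that computes a mean. `MeanFn` uses the same method names as a CombineFn but does not subclass it, and `combine` is a hypothetical driver that mimics the order of calls; setup and teardown are omitted because this sketch holds no external resources.

```python
class MeanFn:
    """Plain-Python stand-in using CombineFn method names."""

    def create_accumulator(self):
        # (running sum, count): the combination of zero values.
        return (0.0, 0)

    def add_input(self, accumulator, element):
        total, count = accumulator
        return total + element, count + 1

    def merge_accumulators(self, accumulators):
        total, count = 0.0, 0
        for part_total, part_count in accumulators:
            total += part_total
            count += part_count
        return total, count

    def extract_output(self, accumulator):
        total, count = accumulator
        return total / count if count else float("nan")


def combine(fn, batches):
    """Hypothetical driver that follows the steps described above."""
    accumulators = []
    for batch in batches:
        acc = fn.create_accumulator()        # one fresh accumulator per batch
        for value in batch:
            acc = fn.add_input(acc, value)   # fold each value into it
        accumulators.append(acc)
    merged = fn.merge_accumulators(accumulators)
    return fn.extract_output(merged)


result = combine(MeanFn(), [[1, 2], [3], [4, 5, 6]])
print(result)  # 3.5
```

Carrying both the sum and the count in the accumulator is what makes the result independent of how the input happens to be batched.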
-
setup
(*args, **kwargs)[source]¶ Called to prepare an instance for combining.
This method can be useful if there is some state that needs to be loaded before executing any of the other methods. The resources can then be disposed of in CombineFn.teardown.
If you are using Dataflow, you need to enable Dataflow Runner V2 before using this feature.
Parameters: - *args – Additional arguments and side inputs.
- **kwargs – Additional arguments and side inputs.
-
create_accumulator
(*args, **kwargs)[source]¶ Return a fresh, empty accumulator for the combine operation.
Parameters: - *args – Additional arguments and side inputs.
- **kwargs – Additional arguments and side inputs.
-
add_input
(mutable_accumulator, element, *args, **kwargs)[source]¶ Return result of folding element into accumulator.
CombineFn implementors must override add_input.
Parameters: - mutable_accumulator – the current accumulator, may be modified and returned for efficiency
- element – the element to add, should not be mutated
- *args – Additional arguments and side inputs.
- **kwargs – Additional arguments and side inputs.
-
add_inputs
(mutable_accumulator, elements, *args, **kwargs)[source]¶ Returns the result of folding each element in elements into accumulator.
This is provided in case the implementation affords more efficient bulk addition of elements. The default implementation simply loops over the inputs invoking add_input for each one.
Parameters: - mutable_accumulator – the current accumulator, may be modified and returned for efficiency
- elements – the elements to add, should not be mutated
- *args – Additional arguments and side inputs.
- **kwargs – Additional arguments and side inputs.
-
merge_accumulators
(accumulators, *args, **kwargs)[source]¶ Returns the result of merging several accumulators to a single accumulator value.
Parameters: - accumulators – the accumulators to merge. Only the first accumulator may be modified and returned for efficiency; the other accumulators should not be mutated, because they may be shared with other code and mutating them could lead to incorrect results or data corruption.
- *args – Additional arguments and side inputs.
- **kwargs – Additional arguments and side inputs.
-
compact
(accumulator, *args, **kwargs) Optionally returns a more compact representation of the accumulator.
This is called before an accumulator is sent across the wire, and can be useful in cases where values are buffered or otherwise lazily kept unprocessed when added to the accumulator. Should return an equivalent, though possibly modified, accumulator.
By default returns the accumulator unmodified.
Parameters: - accumulator – the current accumulator
- *args – Additional arguments and side inputs.
- **kwargs – Additional arguments and side inputs.
-
extract_output
(accumulator, *args, **kwargs)[source]¶ Return result of converting accumulator into the output value.
Parameters: - accumulator – the final accumulator value computed by this CombineFn for the entire input key or PCollection. Can be modified for efficiency.
- *args – Additional arguments and side inputs.
- **kwargs – Additional arguments and side inputs.
-
teardown
(*args, **kwargs)[source]¶ Called to clean up an instance before it is discarded.
If you are using Dataflow, you need to enable Dataflow Runner V2 before using this feature.
Parameters: - *args – Additional arguments and side inputs.
- **kwargs – Additional arguments and side inputs.
-
apply
(elements, *args, **kwargs)[source]¶ Returns result of applying this CombineFn to the input values.
Parameters: - elements – the set of values to combine.
- *args – Additional arguments and side inputs.
- **kwargs – Additional arguments and side inputs.
-
for_input_type
(input_type)[source]¶ Returns a specialized implementation of self, if it exists.
Otherwise, returns self.
Parameters: input_type – the type of input elements.
-
to_runner_api_parameter
(context)¶
-
class
apache_beam.transforms.core.
PartitionFn
(*unused_args, **unused_kwargs)[source]¶ Bases:
apache_beam.typehints.decorators.WithTypeHints
A function object used by a Partition transform.
A PartitionFn specifies how individual values in a PCollection will be placed into separate partitions, indexed by an integer.
-
partition_for
(element, num_partitions, *args, **kwargs)[source]¶ Specify which partition will receive this element.
Parameters: - element – An element of the input PCollection.
- num_partitions – Number of partitions, i.e., output PCollections.
- *args – optional parameters and side inputs.
- **kwargs – optional parameters and side inputs.
Returns: An integer in [0, num_partitions).
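A minimal sketch of the contract, written as a standalone function rather than a PartitionFn subclass: the function must map every element to an integer in [0, num_partitions). The modulo rule and the list-of-lists routing below are assumptions chosen for the example.

```python
def partition_for(element, num_partitions):
    """Hypothetical rule: route integers by remainder."""
    # Python's % always returns a value in [0, num_partitions) for a
    # positive divisor, including for negative elements.
    return element % num_partitions


num_partitions = 3
partitions = [[] for _ in range(num_partitions)]
for element in range(7):
    partitions[partition_for(element, num_partitions)].append(element)

print(partitions)  # [[0, 3, 6], [1, 4], [2, 5]]
```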
-
-
class
apache_beam.transforms.core.
ParDo
(fn, *args, **kwargs)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransformWithSideInputs
A ParDo transform.
Processes an input PCollection by applying a DoFn to each element and returning the accumulated results into an output PCollection. The type of the elements is not fixed as long as the DoFn can deal with it. In reality the type is restrained to some extent because the elements sometimes must be persisted to external storage. See the expand() method comments for a detailed description of all possible arguments.
Note that the DoFn must return an iterable for each element of the input PCollection. An easy way to do this is to use the yield keyword in the process method.
Parameters: - pcoll (PCollection) – a PCollection to be processed.
- fn (typing.Union[DoFn, typing.Callable]) – a DoFn object to be applied to each element of pcoll argument, or a Callable.
- *args – positional arguments passed to the DoFn object.
- **kwargs – keyword arguments passed to the DoFn object.
Note that the positional and keyword arguments will be processed in order to detect PCollections that will be computed as side inputs to the transform. During pipeline execution whenever the DoFn object gets executed (its DoFn.process() method gets called) the PCollection arguments will be replaced by values from the PCollection in the exact positions where they appear in the argument lists.
-
with_exception_handling
(main_tag='good', dead_letter_tag='bad', *, exc_class=<class 'Exception'>, partial=False, use_subprocess=False, threshold=1, threshold_windowing=None, timeout=None)[source]¶ Automatically provides a dead letter output for skipping bad records. This can allow a pipeline to continue successfully rather than fail or continuously throw errors on retry when bad elements are encountered.
This returns a tagged output with two PCollections, the first being the results of successfully processing the input PCollection, and the second being the set of bad records (those which threw exceptions during processing) along with information about the errors raised.
For example, one would write:
good, bad = Map(maybe_error_raising_function).with_exception_handling()
and good will be a PCollection of mapped records and bad will contain those that raised exceptions.
Parameters: - main_tag – tag to be used for the main (good) output of the DoFn, useful to avoid possible conflicts if this DoFn already produces multiple outputs. Optional, defaults to ‘good’.
- dead_letter_tag – tag to be used for the bad records, useful to avoid possible conflicts if this DoFn already produces multiple outputs. Optional, defaults to ‘bad’.
- exc_class – An exception class, or tuple of exception classes, to catch. Optional, defaults to ‘Exception’.
- partial – Whether to emit outputs for an element as they’re produced (which could result in partial outputs for a ParDo or FlatMap that throws an error part way through execution) or buffer all outputs until successful processing of the entire element. Optional, defaults to False.
- use_subprocess – Whether to execute the DoFn logic in a subprocess. This allows one to recover from errors that can crash the calling process (e.g. from an underlying C/C++ library causing a segfault), but is slower as elements and results must cross a process boundary. Note that this starts up a long-running process that is used to handle all the elements (until hard failure, which should be rare) rather than a new process per element, so the overhead should be minimal (and can be amortized if there’s any per-process or per-bundle initialization that needs to be done). Optional, defaults to False.
- threshold – An upper bound on the ratio of records that can be bad before aborting the entire pipeline. Optional, defaults to 1.0 (meaning up to 100% of records can be bad and the pipeline will still succeed).
- threshold_windowing – Event-time windowing to use for threshold. Optional, defaults to the windowing of the input.
- timeout – If the element has not finished processing in timeout seconds, raise a TimeoutError. Defaults to None, meaning no time limit.
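The dead-letter idea can be modeled without Beam: apply a function to each record, keep successes in one list, and route failures with error information to another. The helper below is a hypothetical stand-in; its name, the `(element, repr(exception))` shape of bad records, and the way the threshold is checked after processing are assumptions and do not describe Beam's actual output format.

```python
def map_with_dead_letter(fn, elements, exc_class=Exception, threshold=1.0):
    """Apply fn to each element, separating failures from successes."""
    good, bad = [], []
    for element in elements:
        try:
            good.append(fn(element))
        except exc_class as exc:
            # Keep the offending input next to a description of the error.
            bad.append((element, repr(exc)))
    total = len(good) + len(bad)
    if total and len(bad) / total > threshold:
        raise RuntimeError("ratio of bad records exceeds threshold")
    return good, bad


good, bad = map_with_dead_letter(int, ["1", "x", "3"])
print(good)        # [1, 3]
print(bad[0][0])   # x
```

With `threshold=0.1`, the same input raises RuntimeError in this model, because one bad record out of three exceeds the allowed ratio.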
-
with_outputs
(*tags, main=None, allow_unknown_tags=None) Returns a tagged tuple allowing access to the outputs of a ParDo.
The resulting object supports access to the PCollection associated with a tag (e.g. o.tag, o[tag]) and iterating over the available tags (e.g. for tag in o: ...).
Parameters: - *tags – if non-empty, list of valid tags. If a list of valid tags is given, it will be an error to use an undeclared tag later in the pipeline.
- **main_kw – dictionary empty or with one key 'main' defining the tag to be used for the main output (which will not have a tag associated with it).
Returns: An object of type DoOutputsTuple that bundles together all the outputs of a ParDo transform and allows accessing the individual PCollections for each output using an object.tag syntax.
Return type: DoOutputsTuple
Raises: - TypeError – if the self object is not a PCollection that is the result of a ParDo transform.
- ValueError – if main_kw contains any key other than 'main'.
-
apache_beam.transforms.core.
FlatMap
(fn=<function identity>, *args, **kwargs) FlatMap() is like ParDo except it takes a callable to specify the transformation.
The callable must return an iterable for each element of the input PCollection. The elements of these iterables will be flattened into the output PCollection. If no callable is given, then all elements of the input PCollection must already be iterables themselves and will be flattened into the output PCollection.
Parameters: - fn (callable) – a callable object.
- *args – positional arguments passed to the transform callable.
- **kwargs – keyword arguments passed to the transform callable.
Returns: A PCollection containing the FlatMap() outputs.
Raises: TypeError – If the fn passed as argument is not a callable. Typical error is to pass a DoFn instance which is supported only for ParDo.
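The flattening behaviour can be modeled on ordinary lists: call the function on each element and chain the resulting iterables together. `flat_map` below is a hypothetical plain-Python helper, not the Beam transform, and it ignores side inputs and extra arguments.

```python
from itertools import chain


def identity(x):
    return x


def flat_map(fn=identity, elements=()):
    """Model of FlatMap semantics over an in-memory list."""
    return list(chain.from_iterable(fn(e) for e in elements))


words = flat_map(str.split, ["the quick", "brown fox"])
print(words)  # ['the', 'quick', 'brown', 'fox']

# With no callable, the elements must already be iterables.
flattened = flat_map(elements=[[1, 2], [], [3]])
print(flattened)  # [1, 2, 3]
```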
-
apache_beam.transforms.core.
Map
(fn, *args, **kwargs) Map() is like FlatMap() except its callable returns only a single element.
Parameters: - fn (callable) – a callable object.
- *args – positional arguments passed to the transform callable.
- **kwargs – keyword arguments passed to the transform callable.
Returns: A PCollection containing the Map() outputs.
Raises: TypeError – If the fn passed as argument is not a callable. Typical error is to pass a DoFn instance which is supported only for ParDo.
-
apache_beam.transforms.core.
MapTuple
(fn, *args, **kwargs) MapTuple() is like Map() but expects tuple inputs and flattens them into multiple input arguments.
beam.MapTuple(lambda a, b, …: …)
In other words
beam.MapTuple(fn)
is equivalent to
beam.Map(lambda element, …: fn(*element, …))
This can be useful when processing a PCollection of tuples (e.g. key-value pairs).
Parameters: - fn (callable) – a callable object.
- *args – positional arguments passed to the transform callable.
- **kwargs – keyword arguments passed to the transform callable.
Returns: A PCollection containing the MapTuple() outputs.
Raises: TypeError – If the fn passed as argument is not a callable. Typical error is to pass a DoFn instance which is supported only for ParDo.
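The equivalence stated above, fn(*element, …), can be checked on plain lists. `map_tuple` is a hypothetical helper that models only the argument unpacking; it is not the Beam transform.

```python
def map_tuple(fn, elements, *args, **kwargs):
    """Model of MapTuple: unpack each tuple into positional arguments."""
    return [fn(*element, *args, **kwargs) for element in elements]


pairs = [("a", 1), ("b", 2)]

# Extra positional arguments follow the unpacked tuple fields.
formatted = map_tuple(lambda key, value, sep: f"{key}{sep}{value}", pairs, "=")
print(formatted)  # ['a=1', 'b=2']

# The same result written in the Map style, taking the whole element.
same = [(lambda element, sep: f"{element[0]}{sep}{element[1]}")(p, "=") for p in pairs]
```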
-
apache_beam.transforms.core.
FlatMapTuple
(fn, *args, **kwargs) FlatMapTuple() is like FlatMap() but expects tuple inputs and flattens them into multiple input arguments.
beam.FlatMapTuple(lambda a, b, …: …)
is equivalent to Python 2
beam.FlatMap(lambda (a, b, …), …: …)
In other words
beam.FlatMapTuple(fn)
is equivalent to
beam.FlatMap(lambda element, …: fn(*element, …))
This can be useful when processing a PCollection of tuples (e.g. key-value pairs).
Parameters: - fn (callable) – a callable object.
- *args – positional arguments passed to the transform callable.
- **kwargs – keyword arguments passed to the transform callable.
Returns: A PCollection containing the FlatMapTuple() outputs.
Raises: TypeError – If the fn passed as argument is not a callable. Typical error is to pass a DoFn instance which is supported only for ParDo.
-
apache_beam.transforms.core.
Filter
(fn, *args, **kwargs) Filter() is a FlatMap() with its callable filtering out elements.
Filter accepts a function that keeps elements that return True, and filters out the remaining elements.
Parameters: - fn (Callable[..., bool]) – a callable object. First argument will be an element.
- *args – positional arguments passed to the transform callable.
- **kwargs – keyword arguments passed to the transform callable.
Returns: A PCollection containing the Filter() outputs.
Raises: TypeError – If the fn passed as argument is not a callable. Typical error is to pass a DoFn instance which is supported only for ParDo.
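Describing Filter as a FlatMap can be made concrete: for each element, the wrapped callable returns a one-element list when the predicate holds and an empty list otherwise. The helper below is a hypothetical plain-Python model of that idea, not Beam's implementation.

```python
def filter_via_flat_map(predicate, elements, *args, **kwargs):
    """Model of Filter expressed as a flat-map over an in-memory list."""
    kept = []
    for element in elements:
        # One-element list keeps the element; empty list drops it.
        kept.extend([element] if predicate(element, *args, **kwargs) else [])
    return kept


evens = filter_via_flat_map(lambda x: x % 2 == 0, range(6))
print(evens)  # [0, 2, 4]

# Extra arguments are passed to the predicate after the element.
above = filter_via_flat_map(lambda x, limit: x > limit, [1, 5, 9], 4)
print(above)  # [5, 9]
```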
-
class
apache_beam.transforms.core.
CombineGlobally
(fn, *args, **kwargs)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransform
A CombineGlobally transform.
Reduces a PCollection to a single value by progressively applying a CombineFn to portions of the PCollection (and to intermediate values created thereby). See documentation in CombineFn for details on the specifics on how CombineFns are applied.
Parameters: - pcoll (PCollection) – a PCollection to be reduced into a single value.
- fn (callable) – a CombineFn object that will be called to progressively reduce the PCollection into single values, or a callable suitable for wrapping by CallableWrapperCombineFn.
- *args – positional arguments passed to the CombineFn object.
- **kwargs – keyword arguments passed to the CombineFn object.
Raises: TypeError – If the output type of the input PCollection is not compatible with Iterable[A].
Returns: A single-element PCollection containing the main output of the CombineGlobally transform.
Note that the positional and keyword arguments will be processed in order to detect PValues that will be computed as side inputs to the transform. During pipeline execution whenever the CombineFn object gets executed (i.e. any of the CombineFn methods get called), the PValue arguments will be replaced by their actual value in the exact position where they appear in the argument lists.
-
has_defaults
= True¶
-
as_view
= False¶
-
fanout
= None¶
-
class
apache_beam.transforms.core.
CombinePerKey
(fn, *args, **kwargs)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransformWithSideInputs
A per-key Combine transform.
Identifies sets of values associated with the same key in the input PCollection, then applies a CombineFn to condense those sets to single values. See documentation in CombineFn for details on the specifics on how CombineFns are applied.
Parameters: - pcoll – input pcollection.
- fn – instance of CombineFn to apply to all values under the same key in pcoll, or a callable whose signature is f(iterable, *args, **kwargs) (e.g., sum, max).
- *args – arguments and side inputs, passed directly to the CombineFn.
- **kwargs – arguments and side inputs, passed directly to the CombineFn.
Returns: A PObject holding the result of the combine operation.
-
with_hot_key_fanout
(fanout)[source]¶ A per-key combine operation like self but with two levels of aggregation.
If a given key is produced by too many upstream bundles, the final reduction can become a bottleneck despite partial combining being lifted pre-GroupByKey. In these cases it can be helpful to perform intermediate partial aggregations in parallel and then re-group to perform a final (per-key) combine. This is also useful for high-volume keys in streaming where combiners are not generally lifted for latency reasons.
Note that a fanout greater than 1 requires the data to be sent through two GroupByKeys, and a high fanout can also result in more shuffle data due to less per-bundle combining. Setting the fanout for a key at 1 or less places values on the “cold key” path that skip the intermediate level of aggregation.
Parameters: fanout – either None, for no fanout, an int, for a constant-degree fanout, or a callable mapping keys to a key-specific degree of fanout. Returns: A per-key combining PTransform with the specified fanout.
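The two levels of aggregation can be sketched in plain Python: values for each key are first spread over `fanout` sub-keys and partially combined, then the partial results are combined per key. The round-robin sharding and the helper name are assumptions for illustration; the sketch also assumes a combining function such as sum, whose result does not depend on how values are grouped.

```python
from collections import defaultdict


def combine_per_key_with_fanout(pairs, combine, fanout):
    """Model of a per-key combine with an intermediate aggregation level."""
    # Level 1: spread each key's values across `fanout` shards.
    shards = defaultdict(list)
    for index, (key, value) in enumerate(pairs):
        shards[(key, index % fanout)].append(value)

    # Partially combine each shard; these could run in parallel.
    partials = defaultdict(list)
    for (key, _shard), values in shards.items():
        partials[key].append(combine(values))

    # Level 2: final per-key combine over the partial results.
    return {key: combine(values) for key, values in partials.items()}


pairs = [("hot", n) for n in range(1, 101)] + [("cold", 7)]
totals = combine_per_key_with_fanout(pairs, sum, fanout=4)
print(totals)  # {'hot': 5050, 'cold': 7}
```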
-
class
apache_beam.transforms.core.
CombineValues
(fn, *args, **kwargs)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransformWithSideInputs
-
class
apache_beam.transforms.core.
GroupByKey
(label=None)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransform
A group by key transform.
Processes an input PCollection consisting of key/value pairs represented as a tuple pair. The result is a PCollection where values having a common key are grouped together. For example (a, 1), (b, 2), (a, 3) will result in (a, [1, 3]), (b, [2]).
The implementation here is used only when run on the local direct runner.
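The grouping described above can be reproduced on an in-memory list with a dictionary of lists. This is a plain-Python model of the result shape only; it says nothing about windowing, ordering guarantees, or how a runner shuffles data.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Model of GroupByKey over a list of (key, value) tuples."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


grouped = group_by_key([("a", 1), ("b", 2), ("a", 3)])
print(grouped)  # {'a': [1, 3], 'b': [2]}
```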
-
class
apache_beam.transforms.core.
GroupBy
(*fields, **kwargs)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransform
Groups a PCollection by one or more expressions, used to derive the key.
GroupBy(expr) is roughly equivalent to
beam.Map(lambda v: (expr(v), v)) | beam.GroupByKey()
but provides several conveniences, e.g.
- Several arguments may be provided, as positional or keyword arguments, resulting in a tuple-like key. For example GroupBy(a=expr1, b=expr2) groups by a key with attributes a and b computed by applying expr1 and expr2 to each element.
- Strings can be used as a shorthand for accessing an attribute, e.g. GroupBy('some_field') is equivalent to GroupBy(lambda v: getattr(v, 'some_field')).
The GroupBy operation can be made into an aggregating operation by invoking its aggregate_field method.
-
aggregate_field
(field, combine_fn, dest)[source]¶ Returns a grouping operation that also aggregates grouped values.
Parameters: - field – indicates the field to be aggregated
- combine_fn – indicates the aggregation function to be used
- dest – indicates the name that will be used for the aggregate in the output
May be called repeatedly to aggregate multiple fields, e.g.
GroupBy('key')
  .aggregate_field('some_attr', sum, 'sum_attr')
  .aggregate_field(lambda v: …, MeanCombineFn, 'mean')
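To make the grouping and aggregation behavior concrete, here is a minimal plain-Python model. It is a sketch under assumptions, not Beam code: `Sale`, `as_callable`, and `group_and_aggregate` are hypothetical names, and the output is a nested dict rather than the schema'd rows a real pipeline would produce.

```python
from collections import defaultdict, namedtuple
from operator import attrgetter

Sale = namedtuple("Sale", ["store", "amount"])  # hypothetical element type


def as_callable(expr):
    # A string is shorthand for attribute access, mirroring GroupBy('some_field').
    return attrgetter(expr) if isinstance(expr, str) else expr


def group_and_aggregate(elements, key_expr, aggregations):
    """Groups elements by key_expr, then applies each aggregation per group.

    aggregations: list of (field, combine_fn, dest) triples, in the same
    order as the arguments of aggregate_field.
    """
    key_fn = as_callable(key_expr)
    groups = defaultdict(list)
    for element in elements:
        groups[key_fn(element)].append(element)

    results = {}
    for key, members in groups.items():
        results[key] = {
            dest: combine_fn([as_callable(field)(m) for m in members])
            for field, combine_fn, dest in aggregations
        }
    return results


sales = [Sale("north", 10), Sale("south", 5), Sale("north", 7)]
totals = group_and_aggregate(
    sales, "store",
    [("amount", sum, "total"), ("amount", max, "largest")])
```

Each triple plays the role of one aggregate_field call, so repeating the call in Beam corresponds to adding another triple here.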
-
class
apache_beam.transforms.core.
Select
(*args, **kwargs)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransform
Converts the elements of a PCollection into a schema’d PCollection of Rows.
Select(…) is roughly equivalent to Map(lambda x: Row(…)) where each argument of Select (which may be a string or callable) is applied to x. For example,
pcoll | beam.Select('a', b=lambda x: foo(x))
is the same as
pcoll | beam.Map(lambda x: beam.Row(a=x.a, b=foo(x)))
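The equivalence above can be modeled in plain Python. In this sketch `select` is a hypothetical helper and `SimpleNamespace` stands in for `beam.Row`; it only illustrates how string and callable arguments are applied to each element.

```python
from operator import attrgetter
from types import SimpleNamespace


def select(elements, *args, **kwargs):
    """Plain-Python model of Select; SimpleNamespace stands in for beam.Row."""
    # A positional string names the attribute to copy. A keyword argument
    # names the output field and supplies a string or callable to compute it.
    fields = {name: attrgetter(name) for name in args}
    for name, expr in kwargs.items():
        fields[name] = attrgetter(expr) if isinstance(expr, str) else expr
    return [
        SimpleNamespace(**{name: fn(x) for name, fn in fields.items()})
        for x in elements
    ]


points = [SimpleNamespace(a=1, c=10), SimpleNamespace(a=2, c=20)]
rows = select(points, "a", b=lambda x: x.c * 2)
```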
-
class
apache_beam.transforms.core.
Partition
(fn, *args, **kwargs)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransformWithSideInputs
Split a PCollection into several partitions.
Uses the specified PartitionFn to separate an input PCollection into the specified number of sub-PCollections.
When apply()d, a Partition() PTransform requires the following:
Parameters: - partitionfn – a PartitionFn, or a callable with the signature described in CallableWrapperPartitionFn.
- n – number of output partitions.
The result of this PTransform is a simple list of the output PCollections representing each of n partitions, in order.
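A plain-Python model may help clarify the shape of the result. It assumes the partition function is called with the element and the number of partitions and returns an index in the range [0, n); `partition` and `by_remainder` are hypothetical names used only for illustration.

```python
def partition(elements, partition_fn, n):
    """Plain-Python model of Partition: returns a list of n lists, in order."""
    outputs = [[] for _ in range(n)]
    for element in elements:
        # Assumption: the function receives (element, number of partitions).
        index = partition_fn(element, n)
        if not 0 <= index < n:
            raise ValueError(f"partition index {index} is outside [0, {n})")
        outputs[index].append(element)
    return outputs


def by_remainder(value, num_partitions):
    # Hypothetical partition function: route integers by remainder.
    return value % num_partitions


parts = partition(range(7), by_remainder, 3)
```

Because the result is an ordinary list, the individual partitions can be unpacked positionally, for example `evens, odds = partition(range(4), by_remainder, 2)`.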
-
class
ApplyPartitionFnFn
(*unused_args, **unused_kwargs)[source]¶ Bases:
apache_beam.transforms.core.DoFn
A DoFn that applies a PartitionFn.
-
class
apache_beam.transforms.core.
Windowing
(windowfn, triggerfn=None, accumulation_mode=None, timestamp_combiner=None, allowed_lateness=0, environment_id=None)[source]¶ Bases:
object
Class representing the window strategy.
Parameters: - windowfn – Window assign function.
- triggerfn – Trigger function.
- accumulation_mode – an AccumulationMode, controls what to do with data when a trigger fires multiple times.
- timestamp_combiner – a TimestampCombiner, determines how output timestamps of grouping operations are assigned.
- allowed_lateness – Maximum delay in seconds after end of window allowed for any late data to be processed without being discarded directly.
- environment_id – Environment in which the current window_fn should be applied.
-
class
apache_beam.transforms.core.
WindowInto
(windowfn, trigger=None, accumulation_mode=None, timestamp_combiner=None, allowed_lateness=0)[source]¶ Bases:
apache_beam.transforms.core.ParDo
A window transform assigning windows to each element of a PCollection.
Transforms an input PCollection by applying a windowing function to each element. Each transformed element in the result will be a WindowedValue element with the same input value and timestamp, with its new set of windows determined by the windowing function.
Initializes a WindowInto transform.
Parameters: - windowfn (Windowing, WindowFn) – Function to be used for windowing.
- trigger – (optional) Trigger used for windowing, or None for default.
- accumulation_mode – (optional) Accumulation mode used for windowing, required for non-trivial triggers.
- timestamp_combiner – (optional) Timestamp combiner used for windowing, or None for default.
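The effect of applying a windowing function to each element can be sketched in plain Python. Everything here is an assumption made for illustration: `Window` and `WindowedValue` are simple stand-ins for Beam's types, and `fixed_windows` is a hypothetical window function that places each timestamp in one fixed-size interval.

```python
from collections import namedtuple

# Hypothetical stand-ins for an interval window and Beam's WindowedValue.
Window = namedtuple("Window", ["start", "end"])
WindowedValue = namedtuple("WindowedValue", ["value", "timestamp", "windows"])


def fixed_windows(size):
    """Returns a function assigning a timestamp to one [start, end) window."""
    def assign(timestamp):
        start = timestamp - timestamp % size
        return (Window(start, start + size),)
    return assign


def window_into(elements, window_fn):
    # Value and timestamp are preserved; only the set of windows is replaced.
    return [
        WindowedValue(e.value, e.timestamp, window_fn(e.timestamp))
        for e in elements
    ]


events = [
    WindowedValue("x", 5, ()),
    WindowedValue("y", 65, ()),
    WindowedValue("z", 119, ()),
]
windowed = window_into(events, fixed_windows(60))
```

Triggers, accumulation mode, and timestamp combining are deliberately left out of this model; they affect when and how grouped results are emitted, not which windows an element is assigned to.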
-
class
WindowIntoFn
(windowing)[source]¶ Bases:
apache_beam.transforms.core.DoFn
A DoFn that applies a WindowInto operation.
-
class
apache_beam.transforms.core.
Flatten
(**kwargs)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransform
Merges several PCollections into a single PCollection.
Copies all elements in 0 or more PCollections into a single output PCollection. If there are no input PCollections, the resulting PCollection will be empty (but see also kwargs below).
Parameters: **kwargs – Accepts a single named argument “pipeline”, which specifies the pipeline that “owns” this PTransform. Ordinarily Flatten can obtain this information from one of the input PCollections, but if there are none (or if there’s a chance there may be none), this argument is the only way to provide pipeline information and should be considered mandatory.
-
class
apache_beam.transforms.core.
Create
(values, reshuffle=True)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransform
A transform that creates a PCollection from an iterable.
Initializes a Create transform.
Parameters: values – An iterable of values for the PCollection
-
class
apache_beam.transforms.core.
Impulse
(label=None)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransform
Impulse primitive.