apache_beam.transforms package¶
Submodules¶
apache_beam.transforms.combiners module¶
A library of basic combiner PTransform subclasses.
-
class
apache_beam.transforms.combiners.
Count
[source]¶ Bases:
object
Combiners for counting elements.
-
class
Globally
(label=None)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransform
combiners.Count.Globally counts the total number of elements.
-
class
PerElement
(label=None)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransform
combiners.Count.PerElement counts how many times each element occurs.
-
class
PerKey
(label=None)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransform
combiners.Count.PerKey counts how many elements each unique key has.
-
class
-
class
apache_beam.transforms.combiners.
Mean
[source]¶ Bases:
object
Combiners for computing arithmetic means of elements.
-
class
Globally
(label=None)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransform
combiners.Mean.Globally computes the arithmetic mean of the elements.
-
class
PerKey
(label=None)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransform
combiners.Mean.PerKey finds the means of the values for each key.
-
class
-
class
apache_beam.transforms.combiners.
Sample
[source]¶ Bases:
object
Combiners for sampling n elements without replacement.
-
FixedSizeGlobally
= <CallablePTransform(PTransform) label=[FixedSizeGlobally]>¶
-
FixedSizePerKey
= <CallablePTransform(PTransform) label=[FixedSizePerKey]>¶
-
-
class
apache_beam.transforms.combiners.
Top
[source]¶ Bases:
object
Combiners for obtaining extremal elements.
-
Largest
= <CallablePTransform(PTransform) label=[Largest]>¶
-
LargestPerKey
= <CallablePTransform(PTransform) label=[LargestPerKey]>¶
-
Of
= <CallablePTransform(PTransform) label=[Of]>¶
-
PerKey
= <CallablePTransform(PTransform) label=[PerKey]>¶
-
Smallest
= <CallablePTransform(PTransform) label=[Smallest]>¶
-
SmallestPerKey
= <CallablePTransform(PTransform) label=[SmallestPerKey]>¶
-
-
class
apache_beam.transforms.combiners.
ToDict
(label='ToDict')[source]¶ Bases:
apache_beam.transforms.ptransform.PTransform
A global CombineFn that condenses a PCollection into a single dict.
PCollections should consist of 2-tuples, notionally (key, value) pairs. If multiple values are associated with the same key, only one of the values will be present in the resulting dict.
-
class
apache_beam.transforms.combiners.
ToList
(label='ToList')[source]¶ Bases:
apache_beam.transforms.ptransform.PTransform
A global CombineFn that condenses a PCollection into a single list.
apache_beam.transforms.core module¶
Core PTransform subclasses, such as FlatMap, GroupByKey, and Map.
-
class
apache_beam.transforms.core.
DoFn
(*unused_args, **unused_kwargs)[source]¶ Bases:
apache_beam.typehints.decorators.WithTypeHints
,apache_beam.transforms.display.HasDisplayData
A function object used by a transform with custom processing.
The ParDo transform is such a transform. The ParDo.apply method will take an object of type DoFn and apply it to all elements of a PCollection object.
In order to have concrete DoFn objects one has to subclass from DoFn and define the desired behavior (start_bundle/finish_bundle and process) or wrap a callable object using the CallableWrapperDoFn class.
-
DoFnParams
= ['ElementParam', 'SideInputParam', 'TimestampParam', 'WindowParam']¶
-
ElementParam
= 'ElementParam'¶
-
SideInputParam
= 'SideInputParam'¶
-
TimestampParam
= 'TimestampParam'¶
-
WindowParam
= 'WindowParam'¶
-
get_function_arguments
(func)[source]¶ Return the function arguments based on the name provided. If they have a _inspect_function attached to the class then use that otherwise default to the python inspect library.
-
-
class
apache_beam.transforms.core.
CombineFn
(*unused_args, **unused_kwargs)[source]¶ Bases:
apache_beam.typehints.decorators.WithTypeHints
,apache_beam.transforms.display.HasDisplayData
A function object used by a Combine transform with custom processing.
A CombineFn specifies how multiple values in all or part of a PCollection can be merged into a single value—essentially providing the same kind of information as the arguments to the Python “reduce” builtin (except for the input argument, which is an instance of CombineFnProcessContext). The combining process proceeds as follows:
- Input values are partitioned into one or more batches.
- For each batch, the create_accumulator method is invoked to create a fresh initial “accumulator” value representing the combination of zero values.
- For each input value in the batch, the add_input method is invoked to combine more values with the accumulator for that batch.
- The merge_accumulators method is invoked to combine accumulators from separate batches into a single combined output accumulator value, once all of the accumulators have had all the input value in their batches added to them. This operation is invoked repeatedly, until there is only one accumulator value left.
- The extract_output operation is invoked on the final accumulator to get the output value.
-
add_input
(accumulator, element, *args, **kwargs)[source]¶ Return result of folding element into accumulator.
CombineFn implementors must override add_input.
Parameters: - accumulator – the current accumulator
- element – the element to add
- *args – Additional arguments and side inputs.
- **kwargs – Additional arguments and side inputs.
-
add_inputs
(accumulator, elements, *args, **kwargs)[source]¶ Returns the result of folding each element in elements into accumulator.
This is provided in case the implementation affords more efficient bulk addition of elements. The default implementation simply loops over the inputs invoking add_input for each one.
Parameters: - accumulator – the current accumulator
- elements – the elements to add
- *args – Additional arguments and side inputs.
- **kwargs – Additional arguments and side inputs.
-
apply
(elements, *args, **kwargs)[source]¶ Returns result of applying this CombineFn to the input values.
Parameters: - elements – the set of values to combine.
- *args – Additional arguments and side inputs.
- **kwargs – Additional arguments and side inputs.
-
create_accumulator
(*args, **kwargs)[source]¶ Return a fresh, empty accumulator for the combine operation.
Parameters: - *args – Additional arguments and side inputs.
- **kwargs – Additional arguments and side inputs.
-
extract_output
(accumulator, *args, **kwargs)[source]¶ Return result of converting accumulator into the output value.
Parameters: - accumulator – the final accumulator value computed by this CombineFn for the entire input key or PCollection.
- *args – Additional arguments and side inputs.
- **kwargs – Additional arguments and side inputs.
-
class
apache_beam.transforms.core.
PartitionFn
(*unused_args, **unused_kwargs)[source]¶ Bases:
apache_beam.typehints.decorators.WithTypeHints
A function object used by a Partition transform.
A PartitionFn specifies how individual values in a PCollection will be placed into separate partitions, indexed by an integer.
-
partition_for
(element, num_partitions, *args, **kwargs)[source]¶ Specify which partition will receive this element.
Parameters: - element – An element of the input PCollection.
- num_partitions – Number of partitions, i.e., output PCollections.
- *args – optional parameters and side inputs.
- **kwargs – optional parameters and side inputs.
Returns: An integer in [0, num_partitions).
-
-
class
apache_beam.transforms.core.
ParDo
(fn, *args, **kwargs)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransformWithSideInputs
A ParDo transform.
Processes an input PCollection by applying a DoFn to each element and returning the accumulated results into an output PCollection. The type of the elements is not fixed as long as the DoFn can deal with it. In reality the type is restrained to some extent because the elements sometimes must be persisted to external storage. See the expand() method comments for a detailed description of all possible arguments.
Note that the DoFn must return an iterable for each element of the input PCollection. An easy way to do this is to use the yield keyword in the process method.
Parameters: - pcoll – a PCollection to be processed.
- fn – a DoFn object to be applied to each element of pcoll argument.
- *args – positional arguments passed to the dofn object.
- **kwargs – keyword arguments passed to the dofn object.
Note that the positional and keyword arguments will be processed in order to detect PCollections that will be computed as side inputs to the transform. During pipeline execution whenever the DoFn object gets executed (its apply() method gets called) the PCollection arguments will be replaced by values from the PCollection in the exact positions where they appear in the argument lists.
-
with_outputs
(*tags, **main_kw)[source]¶ Returns a tagged tuple allowing access to the outputs of a ParDo.
The resulting object supports access to the PCollection associated with a tag (e.g., o.tag, o[tag]) and iterating over the available tags (e.g., for tag in o: ...).
Parameters: - *tags – if non-empty, list of valid tags. If a list of valid tags is given, it will be an error to use an undeclared tag later in the pipeline.
- **main_kw – dictionary empty or with one key ‘main’ defining the tag to be used for the main output (which will not have a tag associated with it).
Returns: An object of type DoOutputsTuple that bundles together all the outputs of a ParDo transform and allows accessing the individual PCollections for each output using an object.tag syntax.
Raises: TypeError
– if the self object is not a PCollection that is the result of a ParDo transform.ValueError
– if main_kw contains any key other than ‘main’.
-
apache_beam.transforms.core.
FlatMap
(fn, *args, **kwargs)[source]¶ FlatMap is like ParDo except it takes a callable to specify the transformation.
The callable must return an iterable for each element of the input PCollection. The elements of these iterables will be flattened into the output PCollection.
Parameters: - fn – a callable object.
- *args – positional arguments passed to the transform callable.
- **kwargs – keyword arguments passed to the transform callable.
Returns: A PCollection containing the Map outputs.
Raises: TypeError
– If the fn passed as argument is not a callable. Typical error is to pass a DoFn instance which is supported only for ParDo.
-
apache_beam.transforms.core.
Map
(fn, *args, **kwargs)[source]¶ Map is like FlatMap except its callable returns only a single element.
Parameters: - fn – a callable object.
- *args – positional arguments passed to the transform callable.
- **kwargs – keyword arguments passed to the transform callable.
Returns: A PCollection containing the Map outputs.
Raises: TypeError
– If the fn passed as argument is not a callable. Typical error is to pass a DoFn instance which is supported only for ParDo.
-
apache_beam.transforms.core.
Filter
(fn, *args, **kwargs)[source]¶ Filter is a FlatMap with its callable filtering out elements.
Parameters: - fn – a callable object.
- *args – positional arguments passed to the transform callable.
- **kwargs – keyword arguments passed to the transform callable.
Returns: A PCollection containing the Filter outputs.
Raises: TypeError
– If the fn passed as argument is not a callable. Typical error is to pass a DoFn instance which is supported only for FlatMap.
-
class
apache_beam.transforms.core.
CombineGlobally
(fn, *args, **kwargs)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransform
A CombineGlobally transform.
Reduces a PCollection to a single value by progressively applying a CombineFn to portions of the PCollection (and to intermediate values created thereby). See documentation in CombineFn for details on the specifics on how CombineFns are applied.
Parameters: - pcoll – a PCollection to be reduced into a single value.
- fn – a CombineFn object that will be called to progressively reduce the PCollection into single values, or a callable suitable for wrapping by CallableWrapperCombineFn.
- *args – positional arguments passed to the CombineFn object.
- **kwargs – keyword arguments passed to the CombineFn object.
Raises: TypeError
– If the output type of the input PCollection is not compatible with Iterable[A].Returns: A single-element PCollection containing the main output of the Combine transform.
Note that the positional and keyword arguments will be processed in order to detect PObjects that will be computed as side inputs to the transform. During pipeline execution whenever the CombineFn object gets executed (i.e., any of the CombineFn methods get called), the PObject arguments will be replaced by their actual value in the exact position where they appear in the argument lists.
-
as_view
= False¶
-
has_defaults
= True¶
-
class
apache_beam.transforms.core.
CombinePerKey
(fn, *args, **kwargs)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransformWithSideInputs
A per-key Combine transform.
Identifies sets of values associated with the same key in the input PCollection, then applies a CombineFn to condense those sets to single values. See documentation in CombineFn for details on the specifics on how CombineFns are applied.
Parameters: - pcoll – input pcollection.
- fn – instance of CombineFn to apply to all values under the same key in
pcoll, or a callable whose signature is
f(iterable, *args, **kwargs)
(e.g., sum, max). - *args – arguments and side inputs, passed directly to the CombineFn.
- **kwargs – arguments and side inputs, passed directly to the CombineFn.
Returns: A PObject holding the result of the combine operation.
-
class
apache_beam.transforms.core.
CombineValues
(fn, *args, **kwargs)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransformWithSideInputs
-
class
apache_beam.transforms.core.
GroupByKey
(label=None)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransform
A group by key transform.
Processes an input PCollection consisting of key/value pairs represented as a tuple pair. The result is a PCollection where values having a common key are grouped together. For example (a, 1), (b, 2), (a, 3) will result into (a, [1, 3]), (b, [2]).
The implementation here is used only when run on the local direct runner.
-
class
apache_beam.transforms.core.
Partition
(fn, *args, **kwargs)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransformWithSideInputs
Split a PCollection into several partitions.
Uses the specified PartitionFn to separate an input PCollection into the specified number of sub-PCollections.
When apply()d, a Partition() PTransform requires the following:
Parameters: - partitionfn – a PartitionFn, or a callable with the signature described in CallableWrapperPartitionFn.
- n – number of output partitions.
The result of this PTransform is a simple list of the output PCollections representing each of n partitions, in order.
-
class
ApplyPartitionFnFn
(*unused_args, **unused_kwargs)[source]¶ Bases:
apache_beam.transforms.core.DoFn
A DoFn that applies a PartitionFn.
-
class
apache_beam.transforms.core.
Windowing
(windowfn, triggerfn=None, accumulation_mode=None, timestamp_combiner=None)[source]¶ Bases:
object
-
class
apache_beam.transforms.core.
WindowInto
(windowfn, **kwargs)[source]¶ Bases:
apache_beam.transforms.core.ParDo
A window transform assigning windows to each element of a PCollection.
Transforms an input PCollection by applying a windowing function to each element. Each transformed element in the result will be a WindowedValue element with the same input value and timestamp, with its new set of windows determined by the windowing function.
-
class
WindowIntoFn
(windowing)[source]¶ Bases:
apache_beam.transforms.core.DoFn
A DoFn that applies a WindowInto operation.
-
class
-
class
apache_beam.transforms.core.
Flatten
(**kwargs)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransform
Merges several PCollections into a single PCollection.
Copies all elements in 0 or more PCollections into a single output PCollection. If there are no input PCollections, the resulting PCollection will be empty (but see also kwargs below).
Parameters: **kwargs – Accepts a single named argument “pipeline”, which specifies the pipeline that “owns” this PTransform. Ordinarily Flatten can obtain this information from one of the input PCollections, but if there are none (or if there’s a chance there may be none), this argument is the only way to provide pipeline information and should be considered mandatory.
-
class
apache_beam.transforms.core.
Create
(value)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransform
A transform that creates a PCollection from an iterable.
apache_beam.transforms.cy_combiners module¶
A library of basic cythonized CombineFn subclasses.
For internal use only; no backwards-compatibility guarantees.
-
class
apache_beam.transforms.cy_combiners.
AccumulatorCombineFn
(*unused_args, **unused_kwargs)[source]¶
-
class
apache_beam.transforms.cy_combiners.
AllCombineFn
(*unused_args, **unused_kwargs)[source]¶ Bases:
apache_beam.transforms.cy_combiners.AccumulatorCombineFn
-
class
apache_beam.transforms.cy_combiners.
AnyCombineFn
(*unused_args, **unused_kwargs)[source]¶ Bases:
apache_beam.transforms.cy_combiners.AccumulatorCombineFn
-
class
apache_beam.transforms.cy_combiners.
CountCombineFn
(*unused_args, **unused_kwargs)[source]¶ Bases:
apache_beam.transforms.cy_combiners.AccumulatorCombineFn
-
class
apache_beam.transforms.cy_combiners.
MaxFloatFn
(*unused_args, **unused_kwargs)[source]¶ Bases:
apache_beam.transforms.cy_combiners.AccumulatorCombineFn
-
class
apache_beam.transforms.cy_combiners.
MaxInt64Fn
(*unused_args, **unused_kwargs)[source]¶ Bases:
apache_beam.transforms.cy_combiners.AccumulatorCombineFn
-
class
apache_beam.transforms.cy_combiners.
MeanFloatFn
(*unused_args, **unused_kwargs)[source]¶ Bases:
apache_beam.transforms.cy_combiners.AccumulatorCombineFn
-
class
apache_beam.transforms.cy_combiners.
MeanInt64Fn
(*unused_args, **unused_kwargs)[source]¶ Bases:
apache_beam.transforms.cy_combiners.AccumulatorCombineFn
-
class
apache_beam.transforms.cy_combiners.
MinFloatFn
(*unused_args, **unused_kwargs)[source]¶ Bases:
apache_beam.transforms.cy_combiners.AccumulatorCombineFn
-
class
apache_beam.transforms.cy_combiners.
MinInt64Fn
(*unused_args, **unused_kwargs)[source]¶ Bases:
apache_beam.transforms.cy_combiners.AccumulatorCombineFn
-
class
apache_beam.transforms.cy_combiners.
SumFloatFn
(*unused_args, **unused_kwargs)[source]¶ Bases:
apache_beam.transforms.cy_combiners.AccumulatorCombineFn
-
class
apache_beam.transforms.cy_combiners.
SumInt64Fn
(*unused_args, **unused_kwargs)[source]¶ Bases:
apache_beam.transforms.cy_combiners.AccumulatorCombineFn
apache_beam.transforms.display module¶
DisplayData, its classes, interfaces and methods.
The classes in this module allow users and transform developers to define static display data to be displayed when a pipeline runs. PTransforms, DoFns and other pipeline components are subclasses of the HasDisplayData mixin. To add static display data to a component, you can override the display_data method of the HasDisplayData class.
Available classes:
- HasDisplayData - Components that inherit from this class can have static
- display data shown in the UI.
- DisplayDataItem - This class represents static display data elements.
- DisplayData - Internal class that is used to create display data and
- communicate it to the API.
-
class
apache_beam.transforms.display.
HasDisplayData
[source]¶ Bases:
object
Basic mixin for elements that contain display data.
It implements only the display_data method and a _namespace method.
-
display_data
()[source]¶ Returns the display data associated to a pipeline component.
It should be reimplemented in pipeline components that wish to have static display data.
Returns: value pairs. The value might be an integer, float or string value; a DisplayDataItem for values that have more data (e.g. short value, label, url); or a HasDisplayData instance that has more display data that should be picked up. For example: - { ‘key1’: ‘string_value’,
- ‘key2’: 1234, ‘key3’: 3.14159265, ‘key4’: DisplayDataItem(‘apache.org’, url=’http://apache.org‘), ‘key5’: subComponent }
Return type: A dictionary containing key
-
-
class
apache_beam.transforms.display.
DisplayDataItem
(value, url=None, label=None, namespace=None, key=None, shortValue=None)[source]¶ Bases:
object
A DisplayDataItem represents a unit of static display data.
Each item is identified by a key and the namespace of the component the display item belongs to.
-
drop_if_default
(default)[source]¶ The item should be dropped if its value is equal to its default.
Returns: Returns self.
-
get_dict
()[source]¶ Returns the internal-API dictionary representing the DisplayDataItem.
Returns: A dictionary. The internal-API dictionary representing the DisplayDataItem Raises: ValueError
– if the item is not valid.
-
is_valid
()[source]¶ Checks that all the necessary fields of the DisplayDataItem are filled in. It checks that neither key, namespace, value or type are None.
Raises: ValueError
– If the item does not have a key, namespace, value or type.
-
should_drop
()[source]¶ Return True if the item should be dropped, or False if it should not be dropped. This depends on the drop_if_none, and drop_if_default calls.
Returns: True or False; depending on whether the item should be dropped or kept.
-
typeDict
= {<type 'datetime.datetime'>: 'TIMESTAMP', <type 'bool'>: 'BOOLEAN', <type 'str'>: 'STRING', <type 'datetime.timedelta'>: 'DURATION', <type 'unicode'>: 'STRING', <type 'int'>: 'INTEGER', <type 'float'>: 'FLOAT'}¶
-
-
class
apache_beam.transforms.display.
DisplayData
(namespace, display_data_dict)[source]¶ Bases:
object
Static display data associated with a pipeline component.
-
classmethod
create_from
(has_display_data)[source]¶ Creates DisplayData from a HasDisplayData instance.
Returns: A DisplayData instance with populated items. Raises: ValueError
– If the has_display_data argument is not an instance of HasDisplayData.
-
classmethod
create_from_options
(pipeline_options)[source]¶ Creates DisplayData from a PipelineOptions instance.
When creating DisplayData, this method will convert the value of any item of a non-supported type to its string representation. The normal DisplayData.create_from method rejects those items.
Returns: A DisplayData instance with populated items. Raises: ValueError
– If the has_display_data argument is not an instance of HasDisplayData.
-
classmethod
apache_beam.transforms.ptransform module¶
PTransform and descendants.
A PTransform is an object describing (not executing) a computation. The actual execution semantics for a transform is captured by a runner object. A transform object always belongs to a pipeline object.
A PTransform derived class needs to define the expand() method that describes how one or more PValues are created by the transform.
The module defines a few standard transforms: FlatMap (parallel do), GroupByKey (group by key), etc. Note that the expand() methods for these classes contain code that will add nodes to the processing graph associated with a pipeline.
As support for the FlatMap transform, the module also defines a DoFn class and wrapper class that allows lambda functions to be used as FlatMap processing functions.
-
class
apache_beam.transforms.ptransform.
PTransform
(label=None)[source]¶ Bases:
apache_beam.typehints.decorators.WithTypeHints
,apache_beam.transforms.display.HasDisplayData
A transform object used to modify one or more PCollections.
Subclasses must define an expand() method that will be used when the transform is applied to some arguments. Typical usage pattern will be:
input | CustomTransform(...)The expand() method of the CustomTransform object passed in will be called with input as an argument.
-
get_windowing
(inputs)[source]¶ Returns the window function to be associated with transform’s output.
By default most transforms just return the windowing function associated with the input PCollection (or the first input if several).
-
label
¶
-
pipeline
= None¶
-
side_inputs
= ()¶
-
with_input_types
(input_type_hint)[source]¶ Annotates the input type of a PTransform with a type-hint.
Parameters: input_type_hint – An instance of an allowed built-in type, a custom class, or an instance of a typehints.TypeConstraint. Raises: TypeError
– If ‘type_hint’ is not a valid type-hint. See typehints.validate_composite_type_param for further details.Returns: A reference to the instance of this particular PTransform object. This allows chaining type-hinting related methods.
-
with_output_types
(type_hint)[source]¶ Annotates the output type of a PTransform with a type-hint.
Parameters: type_hint – An instance of an allowed built-in type, a custom class, or a typehints.TypeConstraint. Raises: TypeError
– If ‘type_hint’ is not a valid type-hint. See typehints.validate_composite_type_param for further details.Returns: A reference to the instance of this particular PTransform object. This allows chaining type-hinting related methods.
-
-
apache_beam.transforms.ptransform.
ptransform_fn
(fn)[source]¶ A decorator for a function-based PTransform.
Experimental; no backwards-compatibility guarantees.
Parameters: fn – A function implementing a custom PTransform. Returns: A CallablePTransform instance wrapping the function-based PTransform. This wrapper provides an alternative, simpler way to define a PTransform. The standard method is to subclass from PTransform and override the expand() method. An equivalent effect can be obtained by defining a function that an input PCollection and additional optional arguments and returns a resulting PCollection. For example:
@ptransform_fn def CustomMapper(pcoll, mapfn): return pcoll | ParDo(mapfn)
The equivalent approach using PTransform subclassing:
class CustomMapper(PTransform): def __init__(self, mapfn): super(CustomMapper, self).__init__() self.mapfn = mapfn def expand(self, pcoll): return pcoll | ParDo(self.mapfn)
With either method the custom PTransform can be used in pipelines as if it were one of the “native” PTransforms:
result_pcoll = input_pcoll | 'Label' >> CustomMapper(somefn)
Note that for both solutions the underlying implementation of the pipe operator (i.e., |) will inject the pcoll argument in its proper place (first argument if no label was specified and second argument otherwise).
apache_beam.transforms.sideinputs module¶
Internal side input transforms and implementations.
For internal use only; no backwards-compatibility guarantees.
Important: this module is an implementation detail and should not be used directly by pipeline writers. Instead, users should use the helper methods AsSingleton, AsIter, AsList and AsDict in apache_beam.pvalue.
apache_beam.transforms.timeutil module¶
Timestamp utilities.
apache_beam.transforms.trigger module¶
Support for Dataflow triggers.
Triggers control when in processing time windows get emitted.
-
class
apache_beam.transforms.trigger.
AccumulationMode
[source]¶ Bases:
object
Controls what to do with data when a trigger fires multiple times.
-
ACCUMULATING
= 1¶
-
DISCARDING
= 0¶
-
-
class
apache_beam.transforms.trigger.
TriggerFn
[source]¶ Bases:
object
A TriggerFn determines when window (panes) are emitted.
See https://beam.apache.org/documentation/programming-guide/#triggers
-
on_element
(element, window, context)[source]¶ Called when a new element arrives in a window.
Parameters: - element – the element being added
- window – the window to which the element is being added
- context – a context (e.g. a TriggerContext instance) for managing state and setting timers
-
on_fire
(watermark, window, context)[source]¶ Called when a trigger actually fires.
Parameters: - watermark – (a lower bound on) the watermark of the system
- window – the window whose trigger is being fired
- context – a context (e.g. a TriggerContext instance) for managing state and setting timers
Returns: whether this trigger is finished
-
on_merge
(to_be_merged, merge_result, context)[source]¶ Called when multiple windows are merged.
Parameters: - to_be_merged – the set of windows to be merged
- merge_result – the window into which the windows are being merged
- context – a context (e.g. a TriggerContext instance) for managing state and setting timers
-
should_fire
(watermark, window, context)[source]¶ Whether this trigger should cause the window to fire.
Parameters: - watermark – (a lower bound on) the watermark of the system
- window – the window whose trigger is being considered
- context – a context (e.g. a TriggerContext instance) for managing state and setting timers
Returns: whether this trigger should cause a firing
-
-
class
apache_beam.transforms.trigger.
DefaultTrigger
[source]¶ Bases:
apache_beam.transforms.trigger.TriggerFn
Semantically Repeatedly(AfterWatermark()), but more optimized.
-
class
apache_beam.transforms.trigger.
AfterWatermark
(early=None, late=None)[source]¶ Bases:
apache_beam.transforms.trigger.TriggerFn
Fire exactly once when the watermark passes the end of the window.
Parameters: - early – if not None, a speculative trigger to repeatedly evaluate before the watermark passes the end of the window
- late – if not None, a speculative trigger to repeatedly evaluate after the watermark passes the end of the window
-
LATE_TAG
= CombiningValueStateTag(is_late, CallableWrapperCombineFn(<built-in function any>))¶
-
class
apache_beam.transforms.trigger.
AfterCount
(count)[source]¶ Bases:
apache_beam.transforms.trigger.TriggerFn
Fire when there are at least count elements in this window pane.
AfterCount is experimental. No backwards compatibility guarantees.
-
COUNT_TAG
= CombiningValueStateTag(count, <apache_beam.transforms.combiners.CountCombineFn object>)¶
-
-
class
apache_beam.transforms.trigger.
Repeatedly
(underlying)[source]¶ Bases:
apache_beam.transforms.trigger.TriggerFn
Repeatedly invoke the given trigger, never finishing.
-
class
apache_beam.transforms.trigger.
AfterAny
(*triggers)[source]¶ Bases:
apache_beam.transforms.trigger._ParallelTriggerFn
Fires when any subtrigger fires.
Also finishes when any subtrigger finishes.
-
combine_op
()¶ any(iterable) -> bool
Return True if bool(x) is True for any x in the iterable. If the iterable is empty, return False.
-
-
class
apache_beam.transforms.trigger.
AfterAll
(*triggers)[source]¶ Bases:
apache_beam.transforms.trigger._ParallelTriggerFn
Fires when all subtriggers have fired.
Also finishes when all subtriggers have finished.
-
combine_op
()¶ all(iterable) -> bool
Return True if bool(x) is True for all values x in the iterable. If the iterable is empty, return True.
-
-
class
apache_beam.transforms.trigger.
AfterEach
(*triggers)[source]¶ Bases:
apache_beam.transforms.trigger.TriggerFn
-
INDEX_TAG
= CombiningValueStateTag(index, CallableWrapperCombineFn(<function <lambda>>))¶
-
apache_beam.transforms.util module¶
Simple utility PTransforms.
-
class
apache_beam.transforms.util.
CoGroupByKey
(**kwargs)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransform
Groups results across several PCollections by key.
Given an input dict mapping serializable keys (called “tags”) to 0 or more PCollections of (key, value) tuples, e.g.:
{'pc1': pcoll1, 'pc2': pcoll2, 33333: pcoll3}
creates a single output PCollection of (key, value) tuples whose keys are the unique input keys from all inputs, and whose values are dicts mapping each tag to an iterable of whatever values were under the key in the corresponding PCollection:
('some key', {'pc1': ['value 1 under "some key" in pcoll1', 'value 2 under "some key" in pcoll1'], 'pc2': [], 33333: ['only value under "some key" in pcoll3']})
Note that pcoll2 had no values associated with “some key”.
CoGroupByKey also works for tuples, lists, or other flat iterables of PCollections, in which case the values of the resulting PCollections will be tuples whose nth value is the list of values from the nth PCollection—conceptually, the “tags” are the indices into the input. Thus, for this input:
(pcoll1, pcoll2, pcoll3)
the output PCollection’s value for “some key” is:
('some key', (['value 1 under "some key" in pcoll1', 'value 2 under "some key" in pcoll1'], [], ['only value under "some key" in pcoll3']))
Parameters: - label – name of this transform instance. Useful while monitoring and debugging a pipeline execution.
- **kwargs – Accepts a single named argument “pipeline”, which specifies the pipeline that “owns” this PTransform. Ordinarily CoGroupByKey can obtain this information from one of the input PCollections, but if there are none (or if there’s a chance there may be none), this argument is the only way to provide pipeline information, and should be considered mandatory.
-
apache_beam.transforms.util.
Keys
(label='Keys')[source]¶ Produces a PCollection of first elements of 2-tuples in a PCollection.
apache_beam.transforms.window module¶
Windowing concepts.
A WindowInto transform logically divides up or groups the elements of a PCollection into finite windows according to a windowing function (derived from WindowFn).
The output of WindowInto contains the same elements as input, but they have been logically assigned to windows. The next GroupByKey(s) transforms, including one within a composite transform, will group by the combination of keys and windows.
Windowing a PCollection allows chunks of it to be processed individually, before the entire PCollection is available. This is especially important for PCollection(s) with unbounded size, since the full PCollection is never available at once, since more data is continually arriving. For PCollection(s) with a bounded size (aka. conventional batch mode), by default, all data is implicitly in a single window (see GlobalWindows), unless WindowInto is applied.
For example, a simple form of windowing divides up the data into fixed-width time intervals, using FixedWindows.
Seconds are used as the time unit for the built-in windowing primitives here. Integer or floating point seconds can be passed to these primitives.
Internally, seconds, with microsecond granularity, are stored as timeutil.Timestamp and timeutil.Duration objects. This is done to avoid precision errors that would occur with floating point representations.
Custom windowing function classes can be created, by subclassing from WindowFn.
-
class
apache_beam.transforms.window.
TimestampCombiner
[source]¶ Bases:
object
Determines how output timestamps of grouping operations are assigned.
-
OUTPUT_AT_EARLIEST
= 2¶
-
OUTPUT_AT_EARLIEST_TRANSFORMED
= 'OUTPUT_AT_EARLIEST_TRANSFORMED'¶
-
OUTPUT_AT_EOW
= 0¶
-
OUTPUT_AT_LATEST
= 1¶
-
-
class
apache_beam.transforms.window.
WindowFn
[source]¶ Bases:
apache_beam.utils.urns.RunnerApiFn
An abstract windowing function defining a basic assign and merge.
-
class
AssignContext
(timestamp, element=None)[source]¶ Bases:
object
Context passed to WindowFn.assign().
-
class
MergeContext
(windows)[source]¶ Bases:
object
Context passed to WindowFn.merge() to perform merging, if any.
-
get_transformed_output_time
(window, input_timestamp)[source]¶ Given input time and output window, returns output time for window.
If TimestampCombiner.OUTPUT_AT_EARLIEST_TRANSFORMED is used in the Windowing, the output timestamp for the given window will be the earliest of the timestamps returned by get_transformed_output_time() for elements of the window.
Parameters: - window – Output window of element.
- input_timestamp – Input timestamp of element as a timeutil.Timestamp object.
Returns: Transformed timestamp.
-
to_runner_api_parameter
(context)¶
-
class
-
class
apache_beam.transforms.window.
BoundedWindow
(end)[source]¶ Bases:
object
A window for timestamps in range (-infinity, end).
-
end
¶ End of window.
-
-
class
apache_beam.transforms.window.
IntervalWindow
(start, end)[source]¶ Bases:
apache_beam.transforms.window.BoundedWindow
A window for timestamps in range [start, end).
-
start
¶ Start of window as seconds since Unix epoch.
-
end
¶ End of window as seconds since Unix epoch.
-
-
class
apache_beam.transforms.window.
TimestampedValue
(value, timestamp)[source]¶ Bases:
object
A timestamped value having a value and a timestamp.
-
value
¶ The underlying value.
-
timestamp
¶ Timestamp associated with the value as seconds since Unix epoch.
-
-
class
apache_beam.transforms.window.
GlobalWindow
[source]¶ Bases:
apache_beam.transforms.window.BoundedWindow
The default window into which all data is placed (via GlobalWindows).
-
class
apache_beam.transforms.window.
GlobalWindows
[source]¶ Bases:
apache_beam.transforms.window.NonMergingWindowFn
A windowing function that assigns everything to one global window.
-
class
apache_beam.transforms.window.
FixedWindows
(size, offset=0)[source]¶ Bases:
apache_beam.transforms.window.NonMergingWindowFn
A windowing function that assigns each element to one time interval.
The attributes size and offset determine in what time interval a timestamp will be slotted. The time intervals have the following formula: [N * size + offset, (N + 1) * size + offset)
-
size
¶ Size of the window as seconds.
-
offset
¶ Offset of this window as seconds since Unix epoch. Windows start at t=N * size + offset where t=0 is the epoch. The offset must be a value in range [0, size). If it is not it will be normalized to this range.
-
-
class
apache_beam.transforms.window.
SlidingWindows
(size, period, offset=0)[source]¶ Bases:
apache_beam.transforms.window.NonMergingWindowFn
A windowing function that assigns each element to a set of sliding windows.
The attributes size and offset determine in what time interval a timestamp will be slotted. The time intervals have the following formula: [N * period + offset, N * period + offset + size)
-
size
¶ Size of the window as seconds.
-
period
¶ Period of the windows as seconds.
-
offset
¶ Offset of this window as seconds since Unix epoch. Windows start at t=N * period + offset where t=0 is the epoch. The offset must be a value in range [0, period). If it is not it will be normalized to this range.
-
-
class
apache_beam.transforms.window.
Sessions
(gap_size)[source]¶ Bases:
apache_beam.transforms.window.WindowFn
A windowing function that groups elements into sessions.
A session is defined as a series of consecutive events separated by a specified gap size.
-
gap_size
¶ Size of the gap between windows as floating-point seconds.
-
Module contents¶
PTransform and descendants.