apache_beam.transforms package

Submodules

apache_beam.transforms.combiners module

A library of basic combiner PTransform subclasses.

class apache_beam.transforms.combiners.Count[source]

Bases: object

Combiners for counting elements.

class Globally(label=None)[source]

Bases: apache_beam.transforms.ptransform.PTransform

combiners.Count.Globally counts the total number of elements.

expand(pcoll)[source]
class PerElement(label=None)[source]

Bases: apache_beam.transforms.ptransform.PTransform

combiners.Count.PerElement counts how many times each element occurs.

expand(pcoll)[source]
class PerKey(label=None)[source]

Bases: apache_beam.transforms.ptransform.PTransform

combiners.Count.PerKey counts how many elements each unique key has.

expand(pcoll)[source]
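
For example, a minimal pipeline sketch might use all three counting transforms (element values are illustrative):

import apache_beam as beam
from apache_beam.transforms import combiners

p = beam.Pipeline()
words = p | beam.Create(['cat', 'dog', 'cat'])

total = words | combiners.Count.Globally()          # -> [3]
per_element = words | combiners.Count.PerElement()  # -> [('cat', 2), ('dog', 1)]

pairs = p | 'pairs' >> beam.Create([('a', 1), ('a', 2), ('b', 3)])
per_key = pairs | combiners.Count.PerKey()          # -> [('a', 2), ('b', 1)]

p.run()
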
class apache_beam.transforms.combiners.Mean[source]

Bases: object

Combiners for computing arithmetic means of elements.

class Globally(label=None)[source]

Bases: apache_beam.transforms.ptransform.PTransform

combiners.Mean.Globally computes the arithmetic mean of the elements.

expand(pcoll)[source]
class PerKey(label=None)[source]

Bases: apache_beam.transforms.ptransform.PTransform

combiners.Mean.PerKey finds the means of the values for each key.

expand(pcoll)[source]
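
A small sketch of both transforms (results shown as comments):

import apache_beam as beam
from apache_beam.transforms import combiners

p = beam.Pipeline()
mean = (p | beam.Create([1, 2, 3, 4])
          | combiners.Mean.Globally())  # -> [2.5]
mean_per_key = (p | 'kv' >> beam.Create([('a', 1), ('a', 3), ('b', 10)])
                  | combiners.Mean.PerKey())  # -> [('a', 2.0), ('b', 10.0)]
p.run()
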
class apache_beam.transforms.combiners.Sample[source]

Bases: object

Combiners for sampling n elements without replacement.

FixedSizeGlobally = <CallablePTransform(PTransform) label=[FixedSizeGlobally]>
FixedSizePerKey = <CallablePTransform(PTransform) label=[FixedSizePerKey]>
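
For instance, the following sketch draws a fixed-size random sample; the result is a single-element PCollection holding a list of the sampled elements:

import apache_beam as beam
from apache_beam.transforms import combiners

p = beam.Pipeline()
nums = p | beam.Create(range(100))
# One output element: a list of 3 elements chosen without replacement.
sample = nums | combiners.Sample.FixedSizeGlobally(3)
p.run()
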
class apache_beam.transforms.combiners.Top[source]

Bases: object

Combiners for obtaining extremal elements.

Largest = <CallablePTransform(PTransform) label=[Largest]>
LargestPerKey = <CallablePTransform(PTransform) label=[LargestPerKey]>
Of = <CallablePTransform(PTransform) label=[Of]>
PerKey = <CallablePTransform(PTransform) label=[PerKey]>
Smallest = <CallablePTransform(PTransform) label=[Smallest]>
SmallestPerKey = <CallablePTransform(PTransform) label=[SmallestPerKey]>
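
A brief sketch; each transform yields a single-element PCollection containing a list of the n extremal elements:

import apache_beam as beam
from apache_beam.transforms import combiners

p = beam.Pipeline()
nums = p | beam.Create([5, 1, 9, 3])
largest = nums | combiners.Top.Largest(2)    # -> [[9, 5]]
smallest = nums | combiners.Top.Smallest(2)  # -> [[1, 3]]
p.run()
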
class apache_beam.transforms.combiners.ToDict(label='ToDict')[source]

Bases: apache_beam.transforms.ptransform.PTransform

A global combine transform that condenses a PCollection into a single dict.

PCollections should consist of 2-tuples, notionally (key, value) pairs. If multiple values are associated with the same key, only one of the values will be present in the resulting dict.

expand(pcoll)[source]
class apache_beam.transforms.combiners.ToList(label='ToList')[source]

Bases: apache_beam.transforms.ptransform.PTransform

A global combine transform that condenses a PCollection into a single list.

expand(pcoll)[source]
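
Both transforms condense a whole PCollection into one element, as in this sketch:

import apache_beam as beam
from apache_beam.transforms import combiners

p = beam.Pipeline()
as_list = (p | beam.Create([3, 1, 2])
             | combiners.ToList())  # -> [[3, 1, 2]] (element order arbitrary)
as_dict = (p | 'kv' >> beam.Create([('a', 1), ('b', 2)])
             | combiners.ToDict())  # -> [{'a': 1, 'b': 2}]
p.run()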

apache_beam.transforms.core module

Core PTransform subclasses, such as FlatMap, GroupByKey, and Map.

class apache_beam.transforms.core.DoFn(*unused_args, **unused_kwargs)[source]

Bases: apache_beam.typehints.decorators.WithTypeHints, apache_beam.transforms.display.HasDisplayData

A function object used by a transform with custom processing.

The ParDo transform is one such transform. Its apply method takes an object of type DoFn and applies it to all elements of a PCollection object.

To create a concrete DoFn, subclass DoFn and define the desired behavior (start_bundle/finish_bundle and process), or wrap a callable object using the CallableWrapperDoFn class.

DoFnParams = ['ElementParam', 'SideInputParam', 'TimestampParam', 'WindowParam']
ElementParam = 'ElementParam'
SideInputParam = 'SideInputParam'
TimestampParam = 'TimestampParam'
WindowParam = 'WindowParam'
default_label()[source]
finish_bundle()[source]

Called after a bundle of elements is processed on a worker.

static from_callable(fn)[source]
get_function_arguments(func)[source]

Returns the function arguments based on the name provided. If the class has an _inspect_function attribute attached, that is used; otherwise it defaults to the standard Python inspect library.

infer_output_type(input_type)[source]
is_process_bounded()[source]

Checks whether this DoFn's process method is a bound method on an instance.

process(element, *args, **kwargs)[source]

Called for each element of a pipeline. The default arguments are needed for the DoFnRunner to be able to pass the parameters correctly.

Parameters:
  • element – The element to be processed
  • *args – side inputs
  • **kwargs – keyword side inputs
start_bundle()[source]

Called before a bundle of elements is processed on a worker.

Elements to be processed are split into bundles and distributed to workers. Before a worker calls process() on the first element of its bundle, it calls this method.
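
As a sketch, a concrete DoFn subclass (SplitWordsFn is a hypothetical name) that emits zero or more outputs per input by using yield in process:

import apache_beam as beam

class SplitWordsFn(beam.DoFn):
  """Emits each whitespace-separated word of a line as a separate element."""

  def process(self, element):
    # process() must return an iterable per element; yield achieves this.
    for word in element.split():
      yield word

p = beam.Pipeline()
words = (p | beam.Create(['the quick fox', 'jumps'])
           | beam.ParDo(SplitWordsFn()))  # -> ['the', 'quick', 'fox', 'jumps']
p.run()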

class apache_beam.transforms.core.CombineFn(*unused_args, **unused_kwargs)[source]

Bases: apache_beam.typehints.decorators.WithTypeHints, apache_beam.transforms.display.HasDisplayData

A function object used by a Combine transform with custom processing.

A CombineFn specifies how multiple values in all or part of a PCollection can be merged into a single value—essentially providing the same kind of information as the arguments to the Python “reduce” builtin (except for the input argument, which is an instance of CombineFnProcessContext). The combining process proceeds as follows:

  1. Input values are partitioned into one or more batches.
  2. For each batch, the create_accumulator method is invoked to create a fresh initial “accumulator” value representing the combination of zero values.
  3. For each input value in the batch, the add_input method is invoked to combine more values with the accumulator for that batch.
  4. Once all of the accumulators have had all the input values in their batches added to them, the merge_accumulators method is invoked to combine accumulators from separate batches into a single combined output accumulator value. This operation is invoked repeatedly, until there is only one accumulator value left.
  5. The extract_output operation is invoked on the final accumulator to get the output value.
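
The sketch below (a hypothetical MeanCombineFn) shows how these methods cooperate, carrying a (sum, count) pair as the accumulator:

import apache_beam as beam

class MeanCombineFn(beam.CombineFn):
  """Computes an arithmetic mean via a (sum, count) accumulator."""

  def create_accumulator(self):
    return (0.0, 0)  # Represents the combination of zero values.

  def add_input(self, accumulator, element):
    total, count = accumulator
    return total + element, count + 1

  def merge_accumulators(self, accumulators):
    totals, counts = zip(*accumulators)
    return sum(totals), sum(counts)

  def extract_output(self, accumulator):
    total, count = accumulator
    return total / count if count else float('NaN')

Such a CombineFn can then be used with, e.g., CombineGlobally(MeanCombineFn()).
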
add_input(accumulator, element, *args, **kwargs)[source]

Return result of folding element into accumulator.

CombineFn implementors must override add_input.

Parameters:
  • accumulator – the current accumulator
  • element – the element to add
  • *args – Additional arguments and side inputs.
  • **kwargs – Additional arguments and side inputs.
add_inputs(accumulator, elements, *args, **kwargs)[source]

Returns the result of folding each element in elements into accumulator.

This is provided in case the implementation affords more efficient bulk addition of elements. The default implementation simply loops over the inputs invoking add_input for each one.

Parameters:
  • accumulator – the current accumulator
  • elements – the elements to add
  • *args – Additional arguments and side inputs.
  • **kwargs – Additional arguments and side inputs.
apply(elements, *args, **kwargs)[source]

Returns result of applying this CombineFn to the input values.

Parameters:
  • elements – the set of values to combine.
  • *args – Additional arguments and side inputs.
  • **kwargs – Additional arguments and side inputs.
create_accumulator(*args, **kwargs)[source]

Return a fresh, empty accumulator for the combine operation.

Parameters:
  • *args – Additional arguments and side inputs.
  • **kwargs – Additional arguments and side inputs.
default_label()[source]
extract_output(accumulator, *args, **kwargs)[source]

Return result of converting accumulator into the output value.

Parameters:
  • accumulator – the final accumulator value computed by this CombineFn for the entire input key or PCollection.
  • *args – Additional arguments and side inputs.
  • **kwargs – Additional arguments and side inputs.
for_input_type(input_type)[source]

Returns a specialized implementation of self, if it exists.

Otherwise, returns self.

Parameters:input_type – the type of input elements.
static from_callable(fn)[source]
static maybe_from_callable(fn)[source]
merge_accumulators(accumulators, *args, **kwargs)[source]

Returns the result of merging several accumulators to a single accumulator value.

Parameters:
  • accumulators – the accumulators to merge
  • *args – Additional arguments and side inputs.
  • **kwargs – Additional arguments and side inputs.
class apache_beam.transforms.core.PartitionFn(*unused_args, **unused_kwargs)[source]

Bases: apache_beam.typehints.decorators.WithTypeHints

A function object used by a Partition transform.

A PartitionFn specifies how individual values in a PCollection will be placed into separate partitions, indexed by an integer.

default_label()[source]
partition_for(element, num_partitions, *args, **kwargs)[source]

Specify which partition will receive this element.

Parameters:
  • element – An element of the input PCollection.
  • num_partitions – Number of partitions, i.e., output PCollections.
  • *args – optional parameters and side inputs.
  • **kwargs – optional parameters and side inputs.
Returns:

An integer in [0, num_partitions).
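
A minimal sketch of a concrete PartitionFn (the class name and threshold are illustrative):

import apache_beam as beam
from apache_beam.transforms.core import PartitionFn

class ByLengthPartitionFn(PartitionFn):
  """Routes strings shorter than 5 characters to partition 0, others to 1."""

  def partition_for(self, element, num_partitions):
    return 0 if len(element) < 5 else 1

p = beam.Pipeline()
words = p | beam.Create(['cat', 'elephant'])
short, longer = words | beam.Partition(ByLengthPartitionFn(), 2)
p.run()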

class apache_beam.transforms.core.ParDo(fn, *args, **kwargs)[source]

Bases: apache_beam.transforms.ptransform.PTransformWithSideInputs

A ParDo transform.

Processes an input PCollection by applying a DoFn to each element and returning the accumulated results into an output PCollection. The type of the elements is not fixed as long as the DoFn can deal with it. In practice, however, the type is constrained to some extent because the elements sometimes must be persisted to external storage. See the expand() method comments for a detailed description of all possible arguments.

Note that the DoFn must return an iterable for each element of the input PCollection. An easy way to do this is to use the yield keyword in the process method.

Parameters:
  • pcoll – a PCollection to be processed.
  • fn – a DoFn object to be applied to each element of pcoll argument.
  • *args – positional arguments passed to the dofn object.
  • **kwargs – keyword arguments passed to the dofn object.

Note that the positional and keyword arguments will be processed in order to detect PCollections that will be computed as side inputs to the transform. During pipeline execution, whenever the DoFn object gets executed (i.e., its process() method gets called), the PCollection arguments will be replaced by their actual values in the exact positions where they appear in the argument lists.
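
For instance, the following sketch passes a single-element PCollection as a side input (AsSingleton is the helper from apache_beam.pvalue); at execution time the DoFn sees its plain value:

import apache_beam as beam
from apache_beam.pvalue import AsSingleton

class FilterUsingLimitFn(beam.DoFn):
  def process(self, element, limit):
    # 'limit' is the side input's resolved value at execution time.
    if element < limit:
      yield element

p = beam.Pipeline()
nums = p | beam.Create([1, 5, 10])
limit = p | 'limit' >> beam.Create([7])
small = nums | beam.ParDo(FilterUsingLimitFn(), AsSingleton(limit))  # -> [1, 5]
p.run()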

default_type_hints()[source]
display_data()[source]
expand(pcoll)[source]
infer_output_type(input_type)[source]
make_fn(fn)[source]
with_outputs(*tags, **main_kw)[source]

Returns a tagged tuple allowing access to the outputs of a ParDo.

The resulting object supports access to the PCollection associated with a tag (e.g., o.tag, o[tag]) and iterating over the available tags (e.g., for tag in o: ...).

Parameters:
  • *tags – if non-empty, list of valid tags. If a list of valid tags is given, it will be an error to use an undeclared tag later in the pipeline.
  • **main_kw – dictionary empty or with one key ‘main’ defining the tag to be used for the main output (which will not have a tag associated with it).
Returns:

An object of type DoOutputsTuple that bundles together all the outputs of a ParDo transform and allows accessing the individual PCollections for each output using an object.tag syntax.

Raises:
  • TypeError – if the self object is not a PCollection that is the result of a ParDo transform.
  • ValueError – if main_kw contains any key other than ‘main’.
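
A sketch of tagged outputs; note that the wrapper used to tag an emitted value is named TaggedOutput in recent SDK releases, so the exact name may differ in this version:

import apache_beam as beam
from apache_beam import pvalue

class ClassifyFn(beam.DoFn):
  def process(self, element):
    if len(element) < 4:
      # Route to the 'short' tag (wrapper name is version-dependent).
      yield pvalue.TaggedOutput('short', element)
    else:
      yield element  # Untagged values go to the main output.

p = beam.Pipeline()
words = p | beam.Create(['hi', 'elephant'])
results = words | beam.ParDo(ClassifyFn()).with_outputs('short', main='long')
short, longer = results.short, results.long  # or results['short'], results['long']
p.run()
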
apache_beam.transforms.core.FlatMap(fn, *args, **kwargs)[source]

FlatMap is like ParDo except it takes a callable to specify the transformation.

The callable must return an iterable for each element of the input PCollection. The elements of these iterables will be flattened into the output PCollection.

Parameters:
  • fn – a callable object.
  • *args – positional arguments passed to the transform callable.
  • **kwargs – keyword arguments passed to the transform callable.
Returns:

A PCollection containing the FlatMap outputs.

Raises:

TypeError – If the fn passed as argument is not a callable. A typical error is to pass a DoFn instance, which is supported only for ParDo.
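
For example, splitting lines into words with a lambda (each returned iterable is flattened):

import apache_beam as beam

p = beam.Pipeline()
words = (p | beam.Create(['a b', 'c'])
           | beam.FlatMap(lambda line: line.split()))  # -> ['a', 'b', 'c']
p.run()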

apache_beam.transforms.core.Map(fn, *args, **kwargs)[source]

Map is like FlatMap except its callable returns only a single element.

Parameters:
  • fn – a callable object.
  • *args – positional arguments passed to the transform callable.
  • **kwargs – keyword arguments passed to the transform callable.
Returns:

A PCollection containing the Map outputs.

Raises:

TypeError – If the fn passed as argument is not a callable. A typical error is to pass a DoFn instance, which is supported only for ParDo.
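
A one-to-one sketch:

import apache_beam as beam

p = beam.Pipeline()
lengths = (p | beam.Create(['a', 'bb', 'ccc'])
             | beam.Map(len))  # exactly one output per input -> [1, 2, 3]
p.run()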

apache_beam.transforms.core.Filter(fn, *args, **kwargs)[source]

Filter is a FlatMap whose callable acts as a predicate: only elements for which the callable returns a true value are retained.

Parameters:
  • fn – a callable object.
  • *args – positional arguments passed to the transform callable.
  • **kwargs – keyword arguments passed to the transform callable.
Returns:

A PCollection containing the Filter outputs.

Raises:

TypeError – If the fn passed as argument is not a callable. A typical error is to pass a DoFn instance, which is supported only for ParDo.
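
For instance, keeping only the elements for which the predicate holds:

import apache_beam as beam

p = beam.Pipeline()
evens = (p | beam.Create([1, 2, 3, 4])
           | beam.Filter(lambda x: x % 2 == 0))  # -> [2, 4]
p.run()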

class apache_beam.transforms.core.CombineGlobally(fn, *args, **kwargs)[source]

Bases: apache_beam.transforms.ptransform.PTransform

A CombineGlobally transform.

Reduces a PCollection to a single value by progressively applying a CombineFn to portions of the PCollection (and to intermediate values created thereby). See documentation in CombineFn for details on the specifics on how CombineFns are applied.

Parameters:
  • pcoll – a PCollection to be reduced into a single value.
  • fn – a CombineFn object that will be called to progressively reduce the PCollection into single values, or a callable suitable for wrapping by CallableWrapperCombineFn.
  • *args – positional arguments passed to the CombineFn object.
  • **kwargs – keyword arguments passed to the CombineFn object.
Raises:

TypeError – If the output type of the input PCollection is not compatible with Iterable[A].

Returns:

A single-element PCollection containing the main output of the Combine transform.

Note that the positional and keyword arguments will be processed in order to detect PObjects that will be computed as side inputs to the transform. During pipeline execution whenever the CombineFn object gets executed (i.e., any of the CombineFn methods get called), the PObject arguments will be replaced by their actual value in the exact position where they appear in the argument lists.
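
For example, reducing with the built-in sum callable (wrapped internally by CallableWrapperCombineFn):

import apache_beam as beam

p = beam.Pipeline()
total = (p | beam.Create([1, 2, 3])
           | beam.CombineGlobally(sum))  # single-element PCollection -> [6]
p.run()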

as_singleton_view()[source]
as_view = False
default_label()[source]
display_data()[source]
expand(pcoll)[source]
has_defaults = True
with_defaults(has_defaults=True)[source]
without_defaults()[source]
class apache_beam.transforms.core.CombinePerKey(fn, *args, **kwargs)[source]

Bases: apache_beam.transforms.ptransform.PTransformWithSideInputs

A per-key Combine transform.

Identifies sets of values associated with the same key in the input PCollection, then applies a CombineFn to condense those sets to single values. See documentation in CombineFn for details on the specifics on how CombineFns are applied.

Parameters:
  • pcoll – input pcollection.
  • fn – instance of CombineFn to apply to all values under the same key in pcoll, or a callable whose signature is f(iterable, *args, **kwargs) (e.g., sum, max).
  • *args – arguments and side inputs, passed directly to the CombineFn.
  • **kwargs – arguments and side inputs, passed directly to the CombineFn.
Returns:

A PObject holding the result of the combine operation.
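
For example, summing values per key with the built-in sum:

import apache_beam as beam

p = beam.Pipeline()
sums = (p | beam.Create([('a', 1), ('a', 2), ('b', 5)])
          | beam.CombinePerKey(sum))  # -> [('a', 3), ('b', 5)]
p.run()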

default_label()[source]
display_data()[source]
expand(pcoll)[source]
make_fn(fn)[source]
class apache_beam.transforms.core.CombineValues(fn, *args, **kwargs)[source]

Bases: apache_beam.transforms.ptransform.PTransformWithSideInputs

expand(pcoll)[source]
make_fn(fn)[source]
class apache_beam.transforms.core.GroupByKey(label=None)[source]

Bases: apache_beam.transforms.ptransform.PTransform

A group by key transform.

Processes an input PCollection consisting of key/value pairs represented as a tuple pair. The result is a PCollection where values having a common key are grouped together. For example, (a, 1), (b, 2), (a, 3) will result in (a, [1, 3]), (b, [2]).

The implementation here is used only when run on the local direct runner.
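
A sketch of the example above as pipeline code (grouped values are iterables, shown here as lists):

import apache_beam as beam

p = beam.Pipeline()
grouped = (p | beam.Create([('a', 1), ('b', 2), ('a', 3)])
             | beam.GroupByKey())  # -> [('a', [1, 3]), ('b', [2])]
p.run()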

class GroupAlsoByWindow(windowing)[source]

Bases: apache_beam.transforms.core.DoFn

infer_output_type(input_type)[source]
process(element)[source]
start_bundle()[source]
class ReifyWindows(*unused_args, **unused_kwargs)[source]

Bases: apache_beam.transforms.core.DoFn

infer_output_type(input_type)[source]
process(element, window='WindowParam', timestamp='TimestampParam')[source]
expand(pcoll)[source]
class apache_beam.transforms.core.Partition(fn, *args, **kwargs)[source]

Bases: apache_beam.transforms.ptransform.PTransformWithSideInputs

Split a PCollection into several partitions.

Uses the specified PartitionFn to separate an input PCollection into the specified number of sub-PCollections.

When applied, a Partition() PTransform requires the following:

Parameters:
  • partitionfn – a PartitionFn, or a callable with the signature described in CallableWrapperPartitionFn.
  • n – number of output partitions.

The result of this PTransform is a simple list of the output PCollections representing each of n partitions, in order.
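
Besides a PartitionFn instance, a plain callable taking (element, num_partitions) works, as in this sketch:

import apache_beam as beam

p = beam.Pipeline()
nums = p | beam.Create([1, 2, 3, 4])
# Index 0 collects even numbers, index 1 odd numbers.
evens, odds = nums | beam.Partition(lambda n, num_partitions: n % 2, 2)
p.run()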

class ApplyPartitionFnFn(*unused_args, **unused_kwargs)[source]

Bases: apache_beam.transforms.core.DoFn

A DoFn that applies a PartitionFn.

process(element, partitionfn, n, *args, **kwargs)[source]
expand(pcoll)[source]
make_fn(fn)[source]
class apache_beam.transforms.core.Windowing(windowfn, triggerfn=None, accumulation_mode=None, timestamp_combiner=None)[source]

Bases: object

static from_runner_api(proto, context)[source]
is_default()[source]
to_runner_api(context)[source]
class apache_beam.transforms.core.WindowInto(windowfn, **kwargs)[source]

Bases: apache_beam.transforms.core.ParDo

A window transform assigning windows to each element of a PCollection.

Transforms an input PCollection by applying a windowing function to each element. Each transformed element in the result will be a WindowedValue element with the same input value and timestamp, with its new set of windows determined by the windowing function.
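
A sketch applying fixed 60-second windows; timestamps are attached first by emitting TimestampedValue objects, since elements from Create carry no meaningful timestamps:

import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

p = beam.Pipeline()
windowed = (p
            | beam.Create([('a', 1)])
            | beam.Map(lambda kv: TimestampedValue(kv, 1234))
            | beam.WindowInto(FixedWindows(60)))  # 60-second fixed windows
p.run()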

class WindowIntoFn(windowing)[source]

Bases: apache_beam.transforms.core.DoFn

A DoFn that applies a WindowInto operation.

process(element, timestamp='TimestampParam')[source]
expand(pcoll)[source]
static from_runner_api_parameter(proto, context)[source]
get_windowing(unused_inputs)[source]
infer_output_type(input_type)[source]
to_runner_api_parameter(context)[source]
class apache_beam.transforms.core.Flatten(**kwargs)[source]

Bases: apache_beam.transforms.ptransform.PTransform

Merges several PCollections into a single PCollection.

Copies all elements in 0 or more PCollections into a single output PCollection. If there are no input PCollections, the resulting PCollection will be empty (but see also kwargs below).

Parameters:**kwargs – Accepts a single named argument “pipeline”, which specifies the pipeline that “owns” this PTransform. Ordinarily Flatten can obtain this information from one of the input PCollections, but if there are none (or if there’s a chance there may be none), this argument is the only way to provide pipeline information and should be considered mandatory.
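
For example, merging two PCollections from the same pipeline:

import apache_beam as beam

p = beam.Pipeline()
first = p | 'first' >> beam.Create([1, 2])
second = p | 'second' >> beam.Create([3])
merged = (first, second) | beam.Flatten()  # -> [1, 2, 3]
p.run()
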
expand(pcolls)[source]
static from_runner_api_parameter(unused_parameter, unused_context)[source]
get_windowing(inputs)[source]
to_runner_api_parameter(context)[source]
class apache_beam.transforms.core.Create(value)[source]

Bases: apache_beam.transforms.ptransform.PTransform

A transform that creates a PCollection from an iterable.

expand(pbegin)[source]
get_windowing(unused_inputs)[source]
infer_output_type(unused_input_type)[source]

apache_beam.transforms.cy_combiners module

A library of basic cythonized CombineFn subclasses.

For internal use only; no backwards-compatibility guarantees.

class apache_beam.transforms.cy_combiners.AccumulatorCombineFn(*unused_args, **unused_kwargs)[source]

Bases: apache_beam.transforms.core.CombineFn

static add_input(accumulator, element)[source]
create_accumulator()[source]
static extract_output(accumulator)[source]
merge_accumulators(accumulators)[source]
class apache_beam.transforms.cy_combiners.AllAccumulator[source]

Bases: object

add_input(element)[source]
extract_output()[source]
merge(accumulators)[source]
class apache_beam.transforms.cy_combiners.AllCombineFn(*unused_args, **unused_kwargs)[source]

Bases: apache_beam.transforms.cy_combiners.AccumulatorCombineFn

class apache_beam.transforms.cy_combiners.AnyAccumulator[source]

Bases: object

add_input(element)[source]
extract_output()[source]
merge(accumulators)[source]
class apache_beam.transforms.cy_combiners.AnyCombineFn(*unused_args, **unused_kwargs)[source]

Bases: apache_beam.transforms.cy_combiners.AccumulatorCombineFn

class apache_beam.transforms.cy_combiners.CountAccumulator[source]

Bases: object

add_input(unused_element)[source]
extract_output()[source]
merge(accumulators)[source]
class apache_beam.transforms.cy_combiners.CountCombineFn(*unused_args, **unused_kwargs)[source]

Bases: apache_beam.transforms.cy_combiners.AccumulatorCombineFn

class apache_beam.transforms.cy_combiners.MaxDoubleAccumulator[source]

Bases: object

add_input(element)[source]
extract_output()[source]
merge(accumulators)[source]
class apache_beam.transforms.cy_combiners.MaxFloatFn(*unused_args, **unused_kwargs)[source]

Bases: apache_beam.transforms.cy_combiners.AccumulatorCombineFn

class apache_beam.transforms.cy_combiners.MaxInt64Accumulator[source]

Bases: object

add_input(element)[source]
extract_output()[source]
merge(accumulators)[source]
class apache_beam.transforms.cy_combiners.MaxInt64Fn(*unused_args, **unused_kwargs)[source]

Bases: apache_beam.transforms.cy_combiners.AccumulatorCombineFn

class apache_beam.transforms.cy_combiners.MeanDoubleAccumulator[source]

Bases: object

add_input(element)[source]
extract_output()[source]
merge(accumulators)[source]
class apache_beam.transforms.cy_combiners.MeanFloatFn(*unused_args, **unused_kwargs)[source]

Bases: apache_beam.transforms.cy_combiners.AccumulatorCombineFn

class apache_beam.transforms.cy_combiners.MeanInt64Accumulator[source]

Bases: object

add_input(element)[source]
extract_output()[source]
merge(accumulators)[source]
class apache_beam.transforms.cy_combiners.MeanInt64Fn(*unused_args, **unused_kwargs)[source]

Bases: apache_beam.transforms.cy_combiners.AccumulatorCombineFn

class apache_beam.transforms.cy_combiners.MinDoubleAccumulator[source]

Bases: object

add_input(element)[source]
extract_output()[source]
merge(accumulators)[source]
class apache_beam.transforms.cy_combiners.MinFloatFn(*unused_args, **unused_kwargs)[source]

Bases: apache_beam.transforms.cy_combiners.AccumulatorCombineFn

class apache_beam.transforms.cy_combiners.MinInt64Accumulator[source]

Bases: object

add_input(element)[source]
extract_output()[source]
merge(accumulators)[source]
class apache_beam.transforms.cy_combiners.MinInt64Fn(*unused_args, **unused_kwargs)[source]

Bases: apache_beam.transforms.cy_combiners.AccumulatorCombineFn

class apache_beam.transforms.cy_combiners.SumDoubleAccumulator[source]

Bases: object

add_input(element)[source]
extract_output()[source]
merge(accumulators)[source]
class apache_beam.transforms.cy_combiners.SumFloatFn(*unused_args, **unused_kwargs)[source]

Bases: apache_beam.transforms.cy_combiners.AccumulatorCombineFn

class apache_beam.transforms.cy_combiners.SumInt64Accumulator[source]

Bases: object

add_input(element)[source]
extract_output()[source]
merge(accumulators)[source]
class apache_beam.transforms.cy_combiners.SumInt64Fn(*unused_args, **unused_kwargs)[source]

Bases: apache_beam.transforms.cy_combiners.AccumulatorCombineFn

apache_beam.transforms.display module

DisplayData, its classes, interfaces and methods.

The classes in this module allow users and transform developers to define static display data to be displayed when a pipeline runs. PTransforms, DoFns and other pipeline components are subclasses of the HasDisplayData mixin. To add static display data to a component, you can override the display_data method of the HasDisplayData class.

Available classes:

  • HasDisplayData - Components that inherit from this class can have static
    display data shown in the UI.
  • DisplayDataItem - This class represents static display data elements.
  • DisplayData - Internal class that is used to create display data and
    communicate it to the API.
class apache_beam.transforms.display.HasDisplayData[source]

Bases: object

Basic mixin for elements that contain display data.

It implements only the display_data method and a _namespace method.

display_data()[source]

Returns the display data associated to a pipeline component.

It should be reimplemented in pipeline components that wish to have static display data.

Returns:A dictionary containing key:value pairs. The value might be an integer, float or string value; a DisplayDataItem for values that have more data (e.g. short value, label, url); or a HasDisplayData instance that has more display data that should be picked up. For example:

{'key1': 'string_value',
 'key2': 1234,
 'key3': 3.14159265,
 'key4': DisplayDataItem('apache.org', url='http://apache.org'),
 'key5': subComponent}
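
As a sketch, a pipeline component (the MatchFn class and its keys are illustrative) overriding display_data:

import apache_beam as beam
from apache_beam.transforms.display import DisplayDataItem

class MatchFn(beam.DoFn):
  """A hypothetical DoFn exposing its configuration as display data."""

  def __init__(self, pattern):
    self.pattern = pattern

  def display_data(self):
    return {
        'pattern': self.pattern,  # plain values are allowed
        'homepage': DisplayDataItem('http://apache.org',
                                    label='Project homepage',
                                    url='http://apache.org'),
    }

  def process(self, element):
    if self.pattern in element:
      yield element
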
class apache_beam.transforms.display.DisplayDataItem(value, url=None, label=None, namespace=None, key=None, shortValue=None)[source]

Bases: object

A DisplayDataItem represents a unit of static display data.

Each item is identified by a key and the namespace of the component the display item belongs to.

drop_if_default(default)[source]

The item should be dropped if its value is equal to its default.

Returns:Returns self.
drop_if_none()[source]

The item should be dropped if its value is None.

Returns:Returns self.
get_dict()[source]

Returns the internal-API dictionary representing the DisplayDataItem.

Returns:A dictionary. The internal-API dictionary representing the DisplayDataItem
Raises:ValueError – if the item is not valid.
is_valid()[source]

Checks that all the necessary fields of the DisplayDataItem are filled in. It checks that neither key, namespace, value or type are None.

Raises:ValueError – If the item does not have a key, namespace, value or type.
should_drop()[source]

Returns True if the item should be dropped and False otherwise, depending on any previous drop_if_none and drop_if_default calls.

Returns:True or False; depending on whether the item should be dropped or kept.
typeDict = {<type 'datetime.datetime'>: 'TIMESTAMP', <type 'bool'>: 'BOOLEAN', <type 'str'>: 'STRING', <type 'datetime.timedelta'>: 'DURATION', <type 'unicode'>: 'STRING', <type 'int'>: 'INTEGER', <type 'float'>: 'FLOAT'}
class apache_beam.transforms.display.DisplayData(namespace, display_data_dict)[source]

Bases: object

Static display data associated with a pipeline component.

classmethod create_from(has_display_data)[source]

Creates DisplayData from a HasDisplayData instance.

Returns:A DisplayData instance with populated items.
Raises:ValueError – If the has_display_data argument is not an instance of HasDisplayData.
classmethod create_from_options(pipeline_options)[source]

Creates DisplayData from a PipelineOptions instance.

When creating DisplayData, this method will convert the value of any item of a non-supported type to its string representation. The normal DisplayData.create_from method rejects those items.

Returns:A DisplayData instance with populated items.
Raises:ValueError – If the pipeline_options argument is not an instance of PipelineOptions.

apache_beam.transforms.ptransform module

PTransform and descendants.

A PTransform is an object describing (not executing) a computation. The actual execution semantics for a transform is captured by a runner object. A transform object always belongs to a pipeline object.

A PTransform derived class needs to define the expand() method that describes how one or more PValues are created by the transform.

The module defines a few standard transforms: FlatMap (parallel do), GroupByKey (group by key), etc. Note that the expand() methods for these classes contain code that will add nodes to the processing graph associated with a pipeline.

As support for the FlatMap transform, the module also defines a DoFn class and wrapper class that allows lambda functions to be used as FlatMap processing functions.

class apache_beam.transforms.ptransform.PTransform(label=None)[source]

Bases: apache_beam.typehints.decorators.WithTypeHints, apache_beam.transforms.display.HasDisplayData

A transform object used to modify one or more PCollections.

Subclasses must define an expand() method that will be used when the transform is applied to some arguments. Typical usage pattern will be:

input | CustomTransform(...)

The expand() method of the CustomTransform object passed in will be called with input as an argument.

default_label()[source]
expand(input_or_inputs)[source]
classmethod from_runner_api(proto, context)[source]
static from_runner_api_parameter(spec_parameter, unused_context)[source]
get_windowing(inputs)[source]

Returns the window function to be associated with transform’s output.

By default most transforms just return the windowing function associated with the input PCollection (or the first input if several).

infer_output_type(unused_input_type)[source]
label
pipeline = None
classmethod register_urn(urn, parameter_type, constructor)[source]
side_inputs = ()
to_runner_api(context)[source]
to_runner_api_parameter(context)[source]
type_check_inputs(pvalueish)[source]
type_check_inputs_or_outputs(pvalueish, input_or_output)[source]
type_check_outputs(pvalueish)[source]
with_input_types(input_type_hint)[source]

Annotates the input type of a PTransform with a type-hint.

Parameters:input_type_hint – An instance of an allowed built-in type, a custom class, or an instance of a typehints.TypeConstraint.
Raises:TypeError – If ‘type_hint’ is not a valid type-hint. See typehints.validate_composite_type_param for further details.
Returns:A reference to the instance of this particular PTransform object. This allows chaining type-hinting related methods.
with_output_types(type_hint)[source]

Annotates the output type of a PTransform with a type-hint.

Parameters:type_hint – An instance of an allowed built-in type, a custom class, or a typehints.TypeConstraint.
Raises:TypeError – If ‘type_hint’ is not a valid type-hint. See typehints.validate_composite_type_param for further details.
Returns:A reference to the instance of this particular PTransform object. This allows chaining type-hinting related methods.
apache_beam.transforms.ptransform.ptransform_fn(fn)[source]

A decorator for a function-based PTransform.

Experimental; no backwards-compatibility guarantees.

Parameters:fn – A function implementing a custom PTransform.
Returns:A CallablePTransform instance wrapping the function-based PTransform.

This wrapper provides an alternative, simpler way to define a PTransform. The standard method is to subclass from PTransform and override the expand() method. An equivalent effect can be obtained by defining a function that takes an input PCollection and additional optional arguments and returns the resulting PCollection. For example:

@ptransform_fn
def CustomMapper(pcoll, mapfn):
  return pcoll | ParDo(mapfn)

The equivalent approach using PTransform subclassing:

class CustomMapper(PTransform):

  def __init__(self, mapfn):
    super(CustomMapper, self).__init__()
    self.mapfn = mapfn

  def expand(self, pcoll):
    return pcoll | ParDo(self.mapfn)

With either method the custom PTransform can be used in pipelines as if it were one of the “native” PTransforms:

result_pcoll = input_pcoll | 'Label' >> CustomMapper(somefn)

Note that for both solutions the underlying implementation of the pipe operator (i.e., |) will inject the pcoll argument in its proper place (first argument if no label was specified and second argument otherwise).

apache_beam.transforms.ptransform.label_from_callable(fn)[source]

apache_beam.transforms.sideinputs module

Internal side input transforms and implementations.

For internal use only; no backwards-compatibility guarantees.

Important: this module is an implementation detail and should not be used directly by pipeline writers. Instead, users should use the helper methods AsSingleton, AsIter, AsList and AsDict in apache_beam.pvalue.

class apache_beam.transforms.sideinputs.SideInputMap(view_class, view_options, iterable)[source]

Bases: object

Represents a mapping of windows to side input values.

is_globally_windowed()[source]
apache_beam.transforms.sideinputs.default_window_mapping_fn(target_window_fn)[source]

apache_beam.transforms.timeutil module

Timestamp utilities.

class apache_beam.transforms.timeutil.TimeDomain[source]

Bases: object

Time domain for streaming timers.

DEPENDENT_REAL_TIME = 'DEPENDENT_REAL_TIME'
REAL_TIME = 'REAL_TIME'
WATERMARK = 'WATERMARK'
static from_string(domain)[source]

apache_beam.transforms.trigger module

Support for Dataflow triggers.

Triggers control at what point in processing time the contents of a window are emitted.

class apache_beam.transforms.trigger.AccumulationMode[source]

Bases: object

Controls what to do with data when a trigger fires multiple times.

ACCUMULATING = 1
DISCARDING = 0
class apache_beam.transforms.trigger.TriggerFn[source]

Bases: object

A TriggerFn determines when window panes are emitted.

See https://beam.apache.org/documentation/programming-guide/#triggers

static from_runner_api(proto, context)[source]
on_element(element, window, context)[source]

Called when a new element arrives in a window.

Parameters:
  • element – the element being added
  • window – the window to which the element is being added
  • context – a context (e.g. a TriggerContext instance) for managing state and setting timers
on_fire(watermark, window, context)[source]

Called when a trigger actually fires.

Parameters:
  • watermark – (a lower bound on) the watermark of the system
  • window – the window whose trigger is being fired
  • context – a context (e.g. a TriggerContext instance) for managing state and setting timers
Returns:

whether this trigger is finished

on_merge(to_be_merged, merge_result, context)[source]

Called when multiple windows are merged.

Parameters:
  • to_be_merged – the set of windows to be merged
  • merge_result – the window into which the windows are being merged
  • context – a context (e.g. a TriggerContext instance) for managing state and setting timers
reset(window, context)[source]

Clear any state and timers used by this TriggerFn.

should_fire(watermark, window, context)[source]

Whether this trigger should cause the window to fire.

Parameters:
  • watermark – (a lower bound on) the watermark of the system
  • window – the window whose trigger is being considered
  • context – a context (e.g. a TriggerContext instance) for managing state and setting timers
Returns:

whether this trigger should cause a firing

to_runner_api(unused_context)[source]
class apache_beam.transforms.trigger.DefaultTrigger[source]

Bases: apache_beam.transforms.trigger.TriggerFn

Semantically Repeatedly(AfterWatermark()), but more optimized.

static from_runner_api(proto, context)[source]
on_element(element, window, context)[source]
on_fire(watermark, window, context)[source]
on_merge(to_be_merged, merge_result, context)[source]
reset(window, context)[source]
should_fire(watermark, window, context)[source]
to_runner_api(unused_context)[source]
class apache_beam.transforms.trigger.AfterWatermark(early=None, late=None)[source]

Bases: apache_beam.transforms.trigger.TriggerFn

Fire exactly once when the watermark passes the end of the window.

Parameters:
  • early – if not None, a speculative trigger to repeatedly evaluate before the watermark passes the end of the window
  • late – if not None, a trigger to repeatedly evaluate after the watermark passes the end of the window, to handle late-arriving data
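
A sketch combining AfterWatermark with an early AfterCount firing; it assumes WindowInto forwards trigger and accumulation_mode keyword arguments to Windowing, as its **kwargs signature suggests:

import apache_beam as beam
from apache_beam.transforms.trigger import (AccumulationMode, AfterCount,
                                            AfterWatermark)
from apache_beam.transforms.window import FixedWindows, TimestampedValue

p = beam.Pipeline()
counted = (p
           | beam.Create([('k', 1), ('k', 2)])
           | beam.Map(lambda kv: TimestampedValue(kv, 10))
           # Speculative pane every 100 elements, final pane at the watermark.
           | beam.WindowInto(FixedWindows(60),
                             trigger=AfterWatermark(early=AfterCount(100)),
                             accumulation_mode=AccumulationMode.DISCARDING)
           | beam.GroupByKey())
p.run()
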
LATE_TAG = CombiningValueStateTag(is_late, CallableWrapperCombineFn(<built-in function any>))
static from_runner_api(proto, context)[source]
is_late(context)[source]
on_element(element, window, context)[source]
on_fire(watermark, window, context)[source]
on_merge(to_be_merged, merge_result, context)[source]
reset(window, context)[source]
should_fire(watermark, window, context)[source]
to_runner_api(context)[source]
class apache_beam.transforms.trigger.AfterCount(count)[source]

Bases: apache_beam.transforms.trigger.TriggerFn

Fire when there are at least count elements in this window pane.

AfterCount is experimental. No backwards compatibility guarantees.

COUNT_TAG = CombiningValueStateTag(count, <apache_beam.transforms.combiners.CountCombineFn object>)
static from_runner_api(proto, unused_context)[source]
on_element(element, window, context)[source]
on_fire(watermark, window, context)[source]
on_merge(to_be_merged, merge_result, context)[source]
reset(window, context)[source]
should_fire(watermark, window, context)[source]
to_runner_api(unused_context)[source]
class apache_beam.transforms.trigger.Repeatedly(underlying)[source]

Bases: apache_beam.transforms.trigger.TriggerFn

Repeatedly invoke the given trigger, never finishing.

static from_runner_api(proto, context)[source]
on_element(element, window, context)[source]
on_fire(watermark, window, context)[source]
on_merge(to_be_merged, merge_result, context)[source]
reset(window, context)[source]
should_fire(watermark, window, context)[source]
to_runner_api(context)[source]
class apache_beam.transforms.trigger.AfterAny(*triggers)[source]

Bases: apache_beam.transforms.trigger._ParallelTriggerFn

Fires when any subtrigger fires.

Also finishes when any subtrigger finishes.

combine_op()

any(iterable) -> bool

Return True if bool(x) is True for any x in the iterable. If the iterable is empty, return False.

class apache_beam.transforms.trigger.AfterAll(*triggers)[source]

Bases: apache_beam.transforms.trigger._ParallelTriggerFn

Fires when all subtriggers have fired.

Also finishes when all subtriggers have finished.

combine_op()

all(iterable) -> bool

Return True if bool(x) is True for all values x in the iterable. If the iterable is empty, return True.

class apache_beam.transforms.trigger.AfterEach(*triggers)[source]

Bases: apache_beam.transforms.trigger.TriggerFn

INDEX_TAG = CombiningValueStateTag(index, CallableWrapperCombineFn(<function <lambda>>))
static from_runner_api(proto, context)[source]
on_element(element, window, context)[source]
on_fire(watermark, window, context)[source]
on_merge(to_be_merged, merge_result, context)[source]
reset(window, context)[source]
should_fire(watermark, window, context)[source]
to_runner_api(context)[source]
class apache_beam.transforms.trigger.OrFinally(*triggers)[source]

Bases: apache_beam.transforms.trigger.AfterAny

static from_runner_api(proto, context)[source]
to_runner_api(context)[source]

apache_beam.transforms.util module

Simple utility PTransforms.

class apache_beam.transforms.util.CoGroupByKey(**kwargs)[source]

Bases: apache_beam.transforms.ptransform.PTransform

Groups results across several PCollections by key.

Given an input dict mapping serializable keys (called “tags”) to 0 or more PCollections of (key, value) tuples, e.g.:

{'pc1': pcoll1, 'pc2': pcoll2, 33333: pcoll3}

creates a single output PCollection of (key, value) tuples whose keys are the unique input keys from all inputs, and whose values are dicts mapping each tag to an iterable of whatever values were under the key in the corresponding PCollection:

('some key', {'pc1': ['value 1 under "some key" in pcoll1',
                      'value 2 under "some key" in pcoll1'],
              'pc2': [],
              33333: ['only value under "some key" in pcoll3']})

Note that pcoll2 had no values associated with “some key”.

CoGroupByKey also works for tuples, lists, or other flat iterables of PCollections, in which case the values of the resulting PCollections will be tuples whose nth value is the list of values from the nth PCollection—conceptually, the “tags” are the indices into the input. Thus, for this input:

(pcoll1, pcoll2, pcoll3)

the output PCollection’s value for “some key” is:

('some key', (['value 1 under "some key" in pcoll1',
               'value 2 under "some key" in pcoll1'],
              [],
              ['only value under "some key" in pcoll3']))
Parameters:
  • label – name of this transform instance. Useful while monitoring and debugging a pipeline execution.
  • **kwargs – Accepts a single named argument “pipeline”, which specifies the pipeline that “owns” this PTransform. Ordinarily CoGroupByKey can obtain this information from one of the input PCollections, but if there are none (or if there’s a chance there may be none), this argument is the only way to provide pipeline information, and should be considered mandatory.
expand(pcolls)[source]

Performs CoGroupByKey on argument pcolls; see class docstring.
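
A sketch of the dict form described above (tags and values are illustrative):

import apache_beam as beam

p = beam.Pipeline()
emails = p | 'emails' >> beam.Create([('amy', 'amy@example.com')])
phones = p | 'phones' >> beam.Create([('amy', '555-1234'), ('carl', '555-9876')])
joined = {'emails': emails, 'phones': phones} | beam.CoGroupByKey()
# -> ('amy', {'emails': ['amy@example.com'], 'phones': ['555-1234']}),
#    ('carl', {'emails': [], 'phones': ['555-9876']})
p.run()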

apache_beam.transforms.util.Keys(label='Keys')[source]

Produces a PCollection of first elements of 2-tuples in a PCollection.

apache_beam.transforms.util.KvSwap(label='KvSwap')[source]

Produces a PCollection reversing 2-tuples in a PCollection.

apache_beam.transforms.util.Values(label='Values')[source]

Produces a PCollection of second elements of 2-tuples in a PCollection.
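
All three helpers operate on a PCollection of 2-tuples, as in this sketch:

import apache_beam as beam

p = beam.Pipeline()
pairs = p | beam.Create([('a', 1), ('b', 2)])
keys = pairs | beam.Keys()       # -> ['a', 'b']
values = pairs | beam.Values()   # -> [1, 2]
swapped = pairs | beam.KvSwap()  # -> [(1, 'a'), (2, 'b')]
p.run()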

apache_beam.transforms.window module

Windowing concepts.

A WindowInto transform logically divides up or groups the elements of a PCollection into finite windows according to a windowing function (derived from WindowFn).

The output of WindowInto contains the same elements as input, but they have been logically assigned to windows. The next GroupByKey(s) transforms, including one within a composite transform, will group by the combination of keys and windows.

Windowing a PCollection allows chunks of it to be processed individually, before the entire PCollection is available. This is especially important for PCollections of unbounded size, since the full PCollection is never available at once: more data is continually arriving. For PCollections of bounded size (i.e., conventional batch mode), all data is by default implicitly in a single window (see GlobalWindows), unless WindowInto is applied.

For example, a simple form of windowing divides up the data into fixed-width time intervals, using FixedWindows.

Seconds are used as the time unit for the built-in windowing primitives here. Integer or floating point seconds can be passed to these primitives.

Internally, seconds, with microsecond granularity, are stored as timeutil.Timestamp and timeutil.Duration objects. This is done to avoid precision errors that would occur with floating point representations.

Custom windowing function classes can be created, by subclassing from WindowFn.

class apache_beam.transforms.window.TimestampCombiner[source]

Bases: object

Determines how output timestamps of grouping operations are assigned.

OUTPUT_AT_EARLIEST = 2
OUTPUT_AT_EARLIEST_TRANSFORMED = 'OUTPUT_AT_EARLIEST_TRANSFORMED'
OUTPUT_AT_EOW = 0
OUTPUT_AT_LATEST = 1
static get_impl(timestamp_combiner, window_fn)[source]
class apache_beam.transforms.window.WindowFn[source]

Bases: apache_beam.utils.urns.RunnerApiFn

An abstract windowing function defining a basic assign and merge.

class AssignContext(timestamp, element=None)[source]

Bases: object

Context passed to WindowFn.assign().

class MergeContext(windows)[source]

Bases: object

Context passed to WindowFn.merge() to perform merging, if any.

merge(to_be_merged, merge_result)[source]
assign(assign_context)[source]

Associates a timestamp to an element.

get_transformed_output_time(window, input_timestamp)[source]

Given input time and output window, returns output time for window.

If TimestampCombiner.OUTPUT_AT_EARLIEST_TRANSFORMED is used in the Windowing, the output timestamp for the given window will be the earliest of the timestamps returned by get_transformed_output_time() for elements of the window.

Parameters:
  • window – Output window of element.
  • input_timestamp – Input timestamp of element as a timeutil.Timestamp object.
Returns:

Transformed timestamp.

get_window_coder()[source]
is_merging()[source]

Returns whether this WindowFn merges windows.

merge(merge_context)[source]

Returns a window that is the result of merging a set of windows.

to_runner_api_parameter(context)
class apache_beam.transforms.window.BoundedWindow(end)[source]

Bases: object

A window for timestamps in range (-infinity, end).

end

End of window.

max_timestamp()[source]
class apache_beam.transforms.window.IntervalWindow(start, end)[source]

Bases: apache_beam.transforms.window.BoundedWindow

A window for timestamps in range [start, end).

start

Start of window as seconds since Unix epoch.

end

End of window as seconds since Unix epoch.

intersects(other)[source]
union(other)[source]
class apache_beam.transforms.window.TimestampedValue(value, timestamp)[source]

Bases: object

A timestamped value having a value and a timestamp.

value

The underlying value.

timestamp

Timestamp associated with the value as seconds since Unix epoch.

class apache_beam.transforms.window.GlobalWindow[source]

Bases: apache_beam.transforms.window.BoundedWindow

The default window into which all data is placed (via GlobalWindows).

class apache_beam.transforms.window.NonMergingWindowFn[source]

Bases: apache_beam.transforms.window.WindowFn

is_merging()[source]
merge(merge_context)[source]
class apache_beam.transforms.window.GlobalWindows[source]

Bases: apache_beam.transforms.window.NonMergingWindowFn

A windowing function that assigns everything to one global window.

assign(assign_context)[source]
static from_runner_api_parameter(unused_fn_parameter, unused_context)[source]
get_window_coder()[source]
to_runner_api_parameter(context)[source]
classmethod windowed_value(value, timestamp=Timestamp(-9223372036854.775808))[source]
class apache_beam.transforms.window.FixedWindows(size, offset=0)[source]

Bases: apache_beam.transforms.window.NonMergingWindowFn

A windowing function that assigns each element to one time interval.

The attributes size and offset determine the time interval into which a timestamp will be slotted. The time intervals have the following formula: [N * size + offset, (N + 1) * size + offset). For example, with size=60 and offset=0, a timestamp of 125 is assigned to the window [120, 180).

size

Size of the window as seconds.

offset

Offset of this window as seconds since Unix epoch. Windows start at t=N * size + offset where t=0 is the epoch. The offset must be a value in range [0, size). If it is not it will be normalized to this range.

assign(context)[source]
static from_runner_api_parameter(fn_parameter, unused_context)[source]
get_window_coder()[source]
to_runner_api_parameter(context)[source]
class apache_beam.transforms.window.SlidingWindows(size, period, offset=0)[source]

Bases: apache_beam.transforms.window.NonMergingWindowFn

A windowing function that assigns each element to a set of sliding windows.

The attributes size, period and offset determine the set of time intervals into which a timestamp will be slotted. The time intervals have the following formula: [N * period + offset, N * period + offset + size). For example, with size=60, period=30 and offset=0, a timestamp of 95 falls into the windows [60, 120) and [90, 150).

size

Size of the window as seconds.

period

Period of the windows as seconds.

offset

Offset of this window as seconds since Unix epoch. Windows start at t=N * period + offset where t=0 is the epoch. The offset must be a value in range [0, period). If it is not it will be normalized to this range.

assign(context)[source]
static from_runner_api_parameter(fn_parameter, unused_context)[source]
get_window_coder()[source]
to_runner_api_parameter(context)[source]
class apache_beam.transforms.window.Sessions(gap_size)[source]

Bases: apache_beam.transforms.window.WindowFn

A windowing function that groups elements into sessions.

A session is defined as a series of consecutive events separated by a specified gap size.

gap_size

Size of the gap between windows as floating-point seconds.
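
For instance, grouping events into sessions separated by 10-minute gaps (timestamp values are illustrative):

import apache_beam as beam
from apache_beam.transforms.window import Sessions, TimestampedValue

p = beam.Pipeline()
sessions = (p
            | beam.Create([('user', 'click1'), ('user', 'click2')])
            | beam.Map(lambda kv: TimestampedValue(kv, 100))
            # Events closer together than 10 minutes share a session window.
            | beam.WindowInto(Sessions(10 * 60))
            | beam.GroupByKey())
p.run()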

assign(context)[source]
static from_runner_api_parameter(fn_parameter, unused_context)[source]
get_window_coder()[source]
merge(merge_context)[source]
to_runner_api_parameter(context)[source]

Module contents

PTransform and descendants.