apache_beam.pipeline module

Pipeline, the top-level Beam object.

A pipeline holds a DAG of data transforms. Conceptually the nodes of the DAG are transforms (PTransform objects) and the edges are values (mostly PCollection objects). The transforms take one or more PValues as inputs and produce one or more PValues as outputs.

The pipeline offers functionality to traverse the graph. The actual operation to be executed for each node visited is specified through a runner object.

Typical usage:

# Import the Beam SDK.
import apache_beam as beam

# Create a pipeline object using a local runner for execution.
with beam.Pipeline('DirectRunner') as p:

  # Add to the pipeline a "Create" transform. When executed this
  # transform will produce a PCollection object with the specified values.
  pcoll = p | 'Create' >> beam.Create([1, 2, 3])

  # Another transform could be applied to pcoll, e.g., writing to a text file.
  # For other transforms, refer to transforms/ directory.
  pcoll | 'Write' >> beam.io.WriteToText('./output')

  # Leaving the with block calls run(), which executes the DAG stored in the
  # pipeline. The nodes visited are executed using the specified local runner.

class apache_beam.pipeline.Pipeline(runner=None, options=None, argv=None)

Bases: object

A pipeline object that manages a DAG of PValues and their PTransforms.

Conceptually the PValues are the DAG’s nodes and the PTransforms computing the PValues are the edges.

All the transforms applied to the pipeline must have distinct full labels. If the same transform instance needs to be applied more than once, the right shift operator should be used to give each application a new name (e.g. input | "label" >> my_transform).

Initialize a pipeline object.

Parameters:
  • runner (PipelineRunner) – An object of type PipelineRunner that will be used to execute the pipeline. For registered runners, the runner name can be specified instead; otherwise a runner object must be supplied.
  • options (PipelineOptions) – A configured PipelineOptions object containing arguments that should be used for running the Beam job.
  • argv (List[str]) – A list of arguments (such as sys.argv) to be used for building a PipelineOptions object. This is only used if the options argument is None.
Raises:
  • ValueError – if either the runner or options argument is not of the expected type.

options
replace_all(replacements)

Dynamically replaces PTransforms in the currently populated hierarchy.

Currently this only works for replacements where input and output types are exactly the same.

TODO: Update this to also work for transform overrides where input and output types are different.

Parameters: replacements (List[PTransformOverride]) – a list of PTransformOverride objects.
run(test_runner_api=True)

Runs the pipeline and returns the result object produced by the runner.

visit(visitor)

Visits every node of the pipeline’s DAG in depth-first order.

Runner-internal implementation detail; no backwards-compatibility guarantees.

Parameters:
  • visitor (PipelineVisitor) – PipelineVisitor object whose callbacks will be called for each node visited. See PipelineVisitor comments.

Raises:
  • TypeError – if node is specified and is not a PValue.
  • PipelineError – if node is specified and does not belong to this pipeline instance.
apply(transform, pvalueish=None, label=None)

Applies a custom transform using the pvalueish specified.

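Parameters:
  • transform (PTransform) – the PTransform to apply.
  • pvalueish (PCollection) – the input for the PTransform (typically a PCollection).
  • label (str) – label of the PTransform.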
Raises:
  • TypeError – if the transform object extracted from the argument list is not a PTransform.
  • RuntimeError – if the transform object was already applied to this pipeline and needs to be cloned in order to apply again.
to_runner_api(return_context=False, context=None)

For internal use only; no backwards-compatibility guarantees.

static from_runner_api(proto, runner, options, return_context=False)

For internal use only; no backwards-compatibility guarantees.

class apache_beam.pipeline.PTransformOverride

Bases: object

For internal use only; no backwards-compatibility guarantees.

Gives a matcher and replacements for matching PTransforms.

TODO: Update this to support cases where input and/or output types are different.

matches(applied_ptransform)

Determines whether the given AppliedPTransform matches.

Note that the matching will happen after Runner API proto translation. If matching is done via type checks, to/from_runner_api[_parameter] methods must be implemented to preserve the type (and other data) through proto serialization.

Consider URN-based translation instead.

Parameters: applied_ptransform – AppliedPTransform to be matched.
Returns: a bool indicating whether the given AppliedPTransform is a match.
get_replacement_transform(ptransform)

Provides a runner-specific override for a given PTransform.

Parameters: ptransform – PTransform to be replaced.
Returns: A PTransform that will be the replacement for the PTransform given as an argument.