apache_beam.pipeline module
Pipeline, the top-level Beam object.
A pipeline holds a DAG of data transforms. Conceptually the nodes of the DAG are transforms (PTransform objects) and the edges are values (mostly PCollection objects). The transforms take as inputs one or more PValues and output one or more PValues.
The pipeline offers functionality to traverse the graph. The actual operation to be executed for each node visited is specified through a runner object.
Typical usage:
import apache_beam as beam

# Create a pipeline object using a local runner for execution.
with beam.Pipeline('DirectRunner') as p:
    # Add to the pipeline a "Create" transform. When executed this
    # transform will produce a PCollection object with the specified values.
    pcoll = p | 'Create' >> beam.Create([1, 2, 3])
    # Another transform could be applied to pcoll, e.g., writing to a text file.
    # For other transforms, refer to transforms/ directory.
    pcoll | 'Write' >> beam.io.WriteToText('./output')
# Leaving the with block calls run(), which executes the DAG stored in the
# pipeline. The execution of the nodes visited is done using the specified
# local runner.
class apache_beam.pipeline.Pipeline(runner=None, options=None, argv=None)
Bases: object
A pipeline object that manages a DAG of PValues and their PTransforms.
Conceptually the PValues are the DAG’s nodes and the PTransforms computing the PValues are the edges.
All the transforms applied to the pipeline must have distinct full labels. If the same transform instance needs to be applied more than once, use the right shift operator to give each application a new name (e.g. input | "label" >> my_transform).
Initialize a pipeline object.
Parameters:
- runner (PipelineRunner) – An object of type PipelineRunner that will be used to execute the pipeline. For registered runners, the runner name can be specified; otherwise a runner object must be supplied.
- options (PipelineOptions) – A configured PipelineOptions object containing arguments that should be used for running the Beam job.
- argv (List[str]) – A list of arguments (such as sys.argv) to be used for building a PipelineOptions object. This will only be used if the options argument is None.
Raises: ValueError – if either the runner or options argument is not of the expected type.
options
replace_all(replacements)
Dynamically replaces PTransforms in the currently populated hierarchy.
Currently this only works for replacements where input and output types are exactly the same.
TODO: Update this to also work for transform overrides where input and output types are different.
Parameters: replacements (List[PTransformOverride]) – a list of PTransformOverride objects.
run(test_runner_api=True)
Runs the pipeline. Returns whatever the pipeline's runner returns after running.
visit(visitor)
Visits depth-first every node of a pipeline’s DAG.
Runner-internal implementation detail; no backwards-compatibility guarantees.
Parameters: visitor (PipelineVisitor) – PipelineVisitor object whose callbacks will be called for each node visited. See PipelineVisitor comments.
Raises:
- TypeError – if node is specified and is not a PValue.
- PipelineError – if node is specified and does not belong to this pipeline instance.
apply(transform, pvalueish=None, label=None)
Applies a custom transform using the pvalueish specified.
Parameters:
- transform (PTransform) – the PTransform to apply.
- pvalueish (PCollection) – the input for the PTransform (typically a PCollection).
- label (str) – label of the PTransform.
Raises:
- TypeError – if the transform object extracted from the argument list is not a PTransform.
- RuntimeError – if the transform object was already applied to this pipeline and needs to be cloned in order to apply again.
class apache_beam.pipeline.PTransformOverride
Bases: object
For internal use only; no backwards-compatibility guarantees.
Gives a matcher and replacements for matching PTransforms.
TODO: Update this to support cases where input and/or output types are different.