apache_beam.pipeline module
Pipeline, the top-level Beam object.
A pipeline holds a DAG of data transforms. Conceptually the nodes of the DAG
are transforms (PTransform
objects)
and the edges are values (mostly PCollection
objects). The transforms take as inputs one or more PValues and output one or
more PValue
s.
The pipeline offers functionality to traverse the graph. The actual operation to be executed for each node visited is specified through a runner object.
Typical usage:
# Create a pipeline object using a local runner for execution.
with beam.Pipeline('DirectRunner') as p:
# Add to the pipeline a "Create" transform. When executed this
# transform will produce a PCollection object with the specified values.
pcoll = p | 'Create' >> beam.Create([1, 2, 3])
# Another transform could be applied to pcoll, e.g., writing to a text file.
# For other transforms, refer to transforms/ directory.
pcoll | 'Write' >> beam.io.WriteToText('./output')
# run() will execute the DAG stored in the pipeline. The execution of the
# nodes visited is done using the specified local runner.
- class apache_beam.pipeline.Pipeline(runner: str | PipelineRunner | None = None, options: PipelineOptions | None = None, argv: List[str] | None = None, display_data: Dict[str, Any] | None = None)[source]
Bases:
HasDisplayData
A pipeline object that manages a DAG of
PValue
s and theirPTransform
s.Conceptually the
PValue
s are the DAG’s nodes and thePTransform
s computing thePValue
s are the edges.All the transforms applied to the pipeline must have distinct full labels. If same transform instance needs to be applied then the right shift operator should be used to designate new names (e.g.
input | "label" >> my_transform
).Initialize a pipeline object.
- Parameters:
runner (PipelineRunner) – An object of type
PipelineRunner
that will be used to execute the pipeline. For registered runners, the runner name can be specified, otherwise a runner object must be supplied.options (PipelineOptions) – A configured
PipelineOptions
object containing arguments that should be used for running the Beam job.argv (List[str]) – a list of arguments (such as
sys.argv
) to be used for building aPipelineOptions
object. This will only be used if argument options isNone
.display_data (Dict[str, Any]) – a dictionary of static data associated with this pipeline that can be displayed when it runs.
- Raises:
ValueError – if either the runner or options argument is not of the expected type.
- property options: PipelineOptions
- replace_all(replacements: Iterable[PTransformOverride]) None [source]
Dynamically replaces PTransforms in the currently populated hierarchy.
Currently this only works for replacements where input and output types are exactly the same.
TODO: Update this to also work for transform overrides where input and output types are different.
- Parameters:
replacements (List[PTransformOverride]) – a list of
PTransformOverride
objects.
- run(test_runner_api: bool | str = 'AUTO') PipelineResult [source]
Runs the pipeline. Returns whatever our runner returns after running.
- visit(visitor: PipelineVisitor) None [source]
Visits depth-first every node of a pipeline’s DAG.
Runner-internal implementation detail; no backwards-compatibility guarantees
- Parameters:
visitor (PipelineVisitor) –
PipelineVisitor
object whose callbacks will be called for each node visited. SeePipelineVisitor
comments.- Raises:
TypeError – if node is specified and is not a
PValue
.PipelineError – if node is specified and does not belong to this pipeline instance.
- apply(transform: PTransform, pvalueish: PValue | None = None, label: str | None = None) PValue [source]
Applies a custom transform using the pvalueish specified.
- Parameters:
transform (PTransform) – the
PTransform
to apply.pvalueish (PCollection) – the input for the
PTransform
(typically aPCollection
).label (str) – label of the
PTransform
.
- Raises:
TypeError – if the transform object extracted from the argument list is not a
PTransform
.RuntimeError – if the transform object was already applied to this pipeline and needs to be cloned in order to apply again.
- to_runner_api(return_context: bool = False, context: PipelineContext | None = None, use_fake_coders: bool = False, default_environment: environments.Environment | None = None) beam_runner_api_pb2.Pipeline [source]
For internal use only; no backwards-compatibility guarantees.
- static merge_compatible_environments(proto)[source]
Tries to minimize the number of distinct environments by merging those that are compatible (currently defined as identical).
Mutates proto as contexts may have references to proto.components.
- static from_runner_api(proto: Pipeline, runner: PipelineRunner, options: PipelineOptions, return_context: bool = False) Pipeline [source]
For internal use only; no backwards-compatibility guarantees.
- class apache_beam.pipeline.PTransformOverride[source]
Bases:
object
For internal use only; no backwards-compatibility guarantees.
Gives a matcher and replacements for matching PTransforms.
TODO: Update this to support cases where input and/our output types are different.
- abstract matches(applied_ptransform: AppliedPTransform) bool [source]
Determines whether the given AppliedPTransform matches.
Note that the matching will happen after Runner API proto translation. If matching is done via type checks, to/from_runner_api[_parameter] methods must be implemented to preserve the type (and other data) through proto serialization.
Consider URN-based translation instead.
- Parameters:
applied_ptransform – AppliedPTransform to be matched.
- Returns:
a bool indicating whether the given AppliedPTransform is a match.
- get_replacement_transform_for_applied_ptransform(applied_ptransform: AppliedPTransform) PTransform [source]
Provides a runner specific override for a given AppliedPTransform.
- Parameters:
applied_ptransform – AppliedPTransform containing the PTransform to be replaced.
- Returns:
A PTransform that will be the replacement for the PTransform inside the AppliedPTransform given as an argument.
- get_replacement_transform(ptransform: PTransform | None) PTransform [source]
Provides a runner specific override for a given PTransform.
- Parameters:
ptransform – PTransform to be replaced.
- Returns:
A PTransform that will be the replacement for the PTransform given as an argument.
- get_replacement_inputs(applied_ptransform: AppliedPTransform) Iterable[PValue] [source]
Provides inputs that will be passed to the replacement PTransform.
- Parameters:
applied_ptransform – Original AppliedPTransform containing the PTransform to be replaced.
- Returns:
An iterable of PValues that will be passed to the expand() method of the replacement PTransform.