Class Pipeline
- Direct Known Subclasses:
TestPipeline
Pipeline
manages a directed acyclic graph of PTransforms
, and the
PCollections
that the PTransforms
consume and produce.
Each Pipeline
is self-contained and isolated from any other Pipeline
. The
PValues
that are inputs and outputs of each of a Pipeline's
PTransforms
are also owned by that Pipeline
. A PValue
owned by
one Pipeline
can be read only by PTransforms
also owned by that Pipeline
. Pipelines
can safely be executed concurrently.
Here is a typical example of use:
// Start by defining the options for the pipeline.
PipelineOptions options = PipelineOptionsFactory.create();
// Then create the pipeline. The runner is determined by the options.
Pipeline p = Pipeline.create(options);
// A root PTransform, like TextIO.Read or Create, gets added
// to the Pipeline by being applied:
PCollection<String> lines =
p.apply(TextIO.read().from("gs://bucket/dir/file*.txt"));
// A Pipeline can have multiple root transforms:
PCollection<String> moreLines =
p.apply(TextIO.read().from("gs://bucket/other/dir/file*.txt"));
PCollection<String> yetMoreLines =
p.apply(Create.of("yet", "more", "lines").withCoder(StringUtf8Coder.of()));
// Further PTransforms can be applied, in an arbitrary (acyclic) graph.
// Subsequent PTransforms (and intermediate PCollections etc.) are
// implicitly part of the same Pipeline.
PCollection<String> allLines =
PCollectionList.of(lines).and(moreLines).and(yetMoreLines)
.apply(new Flatten<String>());
PCollection<KV<String, Integer>> wordCounts =
allLines
.apply(ParDo.of(new ExtractWords()))
.apply(new Count<String>());
PCollection<String> formattedWordCounts =
wordCounts.apply(ParDo.of(new FormatCounts()));
formattedWordCounts.apply(TextIO.write().to("gs://bucket/dir/counts.txt"));
// PTransforms aren't executed when they're applied, rather they're
// just added to the Pipeline. Once the whole Pipeline of PTransforms
// is constructed, the Pipeline's PTransforms can be run using a
// PipelineRunner. The default PipelineRunner executes the Pipeline
// directly, sequentially, in this one process, which is useful for
// unit tests and simple experiments:
p.run();
-
Nested Class Summary
Nested ClassesModifier and TypeClassDescriptionstatic class
static interface
For internal use only; no backwards-compatibility guarantees. -
Constructor Summary
Constructors -
Method Summary
Modifier and TypeMethodDescription<OutputT extends POutput>
OutputTapply
(String name, PTransform<? super PBegin, OutputT> root) <OutputT extends POutput>
OutputTapply
(PTransform<? super PBegin, OutputT> root) Likeapply(String, PTransform)
but the transform node in thePipeline
graph will be named according toPTransform.getName()
.applyTransform
(InputT input, PTransform<? super InputT, OutputT> transform) For internal use only; no backwards-compatibility guarantees.applyTransform
(String name, InputT input, PTransform<? super InputT, OutputT> transform) For internal use only; no backwards-compatibility guarantees.begin()
Returns aPBegin
owned by this Pipeline.static Pipeline
create()
Constructs a pipeline from defaultPipelineOptions
.static Pipeline
create
(PipelineOptions options) Constructs a pipeline from the providedPipelineOptions
.static Pipeline
forTransformHierarchy
(org.apache.beam.sdk.runners.TransformHierarchy transforms, PipelineOptions options) Returns theCoderRegistry
that thisPipeline
uses.<OutputT extends POutput>
ErrorHandler.BadRecordErrorHandler<OutputT> registerBadRecordErrorHandler
(PTransform<PCollection<BadRecord>, OutputT> sinkTransform) void
replaceAll
(List<org.apache.beam.sdk.runners.PTransformOverride> overrides) For internal use only; no backwards-compatibility guarantees.run()
Runs thisPipeline
according to thePipelineOptions
used to create thePipeline
viacreate(PipelineOptions)
.run
(PipelineOptions options) Runs thisPipeline
using the givenPipelineOptions
, using the runner specified by the options.void
setCoderRegistry
(CoderRegistry coderRegistry) Deprecated.toString()
void
For internal use only; no backwards-compatibility guarantees.
-
Constructor Details
-
Pipeline
-
-
Method Details
-
create
Constructs a pipeline from defaultPipelineOptions
. -
create
Constructs a pipeline from the providedPipelineOptions
. -
begin
Returns aPBegin
owned by this Pipeline. This serves as the input of a rootPTransform
such asRead
orCreate
. -
apply
Likeapply(String, PTransform)
but the transform node in thePipeline
graph will be named according toPTransform.getName()
.- See Also:
-
apply
public <OutputT extends POutput> OutputT apply(String name, PTransform<? super PBegin, OutputT> root) Adds a rootPTransform
, such asRead
orCreate
, to thisPipeline
.The node in the
Pipeline
graph will use the providedname
. This name is used in various places, including the monitoring UI, logging, and to stably identify this node in thePipeline
graph upon update.Alias for
begin().apply(name, root)
. -
forTransformHierarchy
@Internal public static Pipeline forTransformHierarchy(org.apache.beam.sdk.runners.TransformHierarchy transforms, PipelineOptions options) -
getOptions
-
replaceAll
For internal use only; no backwards-compatibility guarantees.Replaces all nodes that match a
PTransformOverride
in this pipeline. Overrides are applied in the order they are present within the list. -
run
Runs thisPipeline
according to thePipelineOptions
used to create thePipeline
viacreate(PipelineOptions)
. -
run
Runs thisPipeline
using the givenPipelineOptions
, using the runner specified by the options. -
getCoderRegistry
Returns theCoderRegistry
that thisPipeline
uses. -
getSchemaRegistry
-
registerBadRecordErrorHandler
public <OutputT extends POutput> ErrorHandler.BadRecordErrorHandler<OutputT> registerBadRecordErrorHandler(PTransform<PCollection<BadRecord>, OutputT> sinkTransform) -
setCoderRegistry
Deprecated.this should never be used - everyPipeline
has a registry throughout its lifetime. -
traverseTopologically
For internal use only; no backwards-compatibility guarantees.Invokes the
PipelineVisitor's
Pipeline.PipelineVisitor.visitPrimitiveTransform(org.apache.beam.sdk.runners.TransformHierarchy.Node)
andPipeline.PipelineVisitor.visitValue(org.apache.beam.sdk.values.PValue, org.apache.beam.sdk.runners.TransformHierarchy.Node)
operations on each of thisPipeline's
transform and value nodes, in forward topological order.Traversal of the
Pipeline
causesPTransforms
andPValues
owned by thePipeline
to be marked as finished, at which point they may no longer be modified.Typically invoked by
PipelineRunner
subclasses. -
applyTransform
@Internal public static <InputT extends PInput,OutputT extends POutput> OutputT applyTransform(InputT input, PTransform<? super InputT, OutputT> transform) For internal use only; no backwards-compatibility guarantees.Like
applyTransform(String, PInput, PTransform)
but defaulting to the name provided by thePTransform
. -
applyTransform
@Internal public static <InputT extends PInput,OutputT extends POutput> OutputT applyTransform(String name, InputT input, PTransform<? super InputT, OutputT> transform) For internal use only; no backwards-compatibility guarantees.Applies the given
PTransform
to this inputInputT
and returns itsOutputT
. This usesname
to identify this specific application of the transform. This name is used in various places, including the monitoring UI, logging, and to stably identify this application node in thePipeline
graph during update.Each
PInput
subclass that provides anapply
method should delegate to this method to ensure proper registration with thePipelineRunner
. -
toString
-
Pipeline
has a registry throughout its lifetime.