Class Pipeline
- Direct Known Subclasses:
TestPipeline
Pipeline manages a directed acyclic graph of PTransforms, and the
PCollections that the PTransforms consume and produce.
Each Pipeline is self-contained and isolated from any other Pipeline. The
PValues that are inputs and outputs of each of a Pipeline's
PTransforms are also owned by that Pipeline. A PValue owned by
one Pipeline can be read only by PTransforms also owned by that Pipeline. Pipelines can safely be executed concurrently.
Here is a typical example of use:
// Start by defining the options for the pipeline.
PipelineOptions options = PipelineOptionsFactory.create();
// Then create the pipeline. The runner is determined by the options.
Pipeline p = Pipeline.create(options);
// A root PTransform, like TextIO.Read or Create, gets added
// to the Pipeline by being applied:
PCollection<String> lines =
p.apply(TextIO.read().from("gs://bucket/dir/file*.txt"));
// A Pipeline can have multiple root transforms:
PCollection<String> moreLines =
p.apply(TextIO.read().from("gs://bucket/other/dir/file*.txt"));
PCollection<String> yetMoreLines =
p.apply(Create.of("yet", "more", "lines").withCoder(StringUtf8Coder.of()));
// Further PTransforms can be applied, in an arbitrary (acyclic) graph.
// Subsequent PTransforms (and intermediate PCollections etc.) are
// implicitly part of the same Pipeline.
PCollection<String> allLines =
PCollectionList.of(lines).and(moreLines).and(yetMoreLines)
.apply(new Flatten<String>());
PCollection<KV<String, Integer>> wordCounts =
allLines
.apply(ParDo.of(new ExtractWords()))
.apply(new Count<String>());
PCollection<String> formattedWordCounts =
wordCounts.apply(ParDo.of(new FormatCounts()));
formattedWordCounts.apply(TextIO.write().to("gs://bucket/dir/counts.txt"));
// PTransforms aren't executed when they're applied, rather they're
// just added to the Pipeline. Once the whole Pipeline of PTransforms
// is constructed, the Pipeline's PTransforms can be run using a
// PipelineRunner. The default PipelineRunner executes the Pipeline
// directly, sequentially, in this one process, which is useful for
// unit tests and simple experiments:
p.run();
-
Nested Class Summary
Nested ClassesModifier and TypeClassDescriptionstatic classstatic interfaceFor internal use only; no backwards-compatibility guarantees. -
Constructor Summary
Constructors -
Method Summary
Modifier and TypeMethodDescription<OutputT extends POutput>
OutputTapply(String name, PTransform<? super PBegin, OutputT> root) <OutputT extends POutput>
OutputTapply(PTransform<? super PBegin, OutputT> root) Likeapply(String, PTransform)but the transform node in thePipelinegraph will be named according toPTransform.getName().applyTransform(InputT input, PTransform<? super InputT, OutputT> transform) For internal use only; no backwards-compatibility guarantees.applyTransform(String name, InputT input, PTransform<? super InputT, OutputT> transform) For internal use only; no backwards-compatibility guarantees.begin()Returns aPBeginowned by this Pipeline.static Pipelinecreate()Constructs a pipeline from defaultPipelineOptions.static Pipelinecreate(PipelineOptions options) Constructs a pipeline from the providedPipelineOptions.static PipelineforTransformHierarchy(org.apache.beam.sdk.runners.TransformHierarchy transforms, PipelineOptions options) Returns theCoderRegistrythat thisPipelineuses.<OutputT extends POutput>
ErrorHandler.BadRecordErrorHandler<OutputT> registerBadRecordErrorHandler(PTransform<PCollection<BadRecord>, OutputT> sinkTransform) voidreplaceAll(List<org.apache.beam.sdk.runners.PTransformOverride> overrides) For internal use only; no backwards-compatibility guarantees.run()Runs thisPipelineaccording to thePipelineOptionsused to create thePipelineviacreate(PipelineOptions).run(PipelineOptions options) Runs thisPipelineusing the givenPipelineOptions, using the runner specified by the options.voidsetCoderRegistry(CoderRegistry coderRegistry) Deprecated.toString()voidFor internal use only; no backwards-compatibility guarantees.
-
Constructor Details
-
Pipeline
-
-
Method Details
-
create
Constructs a pipeline from defaultPipelineOptions. -
create
Constructs a pipeline from the providedPipelineOptions. -
begin
Returns aPBeginowned by this Pipeline. This serves as the input of a rootPTransformsuch asReadorCreate. -
apply
Likeapply(String, PTransform)but the transform node in thePipelinegraph will be named according toPTransform.getName().- See Also:
-
apply
public <OutputT extends POutput> OutputT apply(String name, PTransform<? super PBegin, OutputT> root) Adds a rootPTransform, such asReadorCreate, to thisPipeline.The node in the
Pipelinegraph will use the providedname. This name is used in various places, including the monitoring UI, logging, and to stably identify this node in thePipelinegraph upon update.Alias for
begin().apply(name, root). -
forTransformHierarchy
@Internal public static Pipeline forTransformHierarchy(org.apache.beam.sdk.runners.TransformHierarchy transforms, PipelineOptions options) -
getOptions
-
replaceAll
For internal use only; no backwards-compatibility guarantees.Replaces all nodes that match a
PTransformOverridein this pipeline. Overrides are applied in the order they are present within the list. -
run
Runs thisPipelineaccording to thePipelineOptionsused to create thePipelineviacreate(PipelineOptions). -
run
Runs thisPipelineusing the givenPipelineOptions, using the runner specified by the options. -
getCoderRegistry
Returns theCoderRegistrythat thisPipelineuses. -
getSchemaRegistry
-
registerBadRecordErrorHandler
public <OutputT extends POutput> ErrorHandler.BadRecordErrorHandler<OutputT> registerBadRecordErrorHandler(PTransform<PCollection<BadRecord>, OutputT> sinkTransform) -
setCoderRegistry
Deprecated.this should never be used - everyPipelinehas a registry throughout its lifetime. -
traverseTopologically
For internal use only; no backwards-compatibility guarantees.Invokes the
PipelineVisitor'sPipeline.PipelineVisitor.visitPrimitiveTransform(org.apache.beam.sdk.runners.TransformHierarchy.Node)andPipeline.PipelineVisitor.visitValue(org.apache.beam.sdk.values.PValue, org.apache.beam.sdk.runners.TransformHierarchy.Node)operations on each of thisPipeline'stransform and value nodes, in forward topological order.Traversal of the
PipelinecausesPTransformsandPValuesowned by thePipelineto be marked as finished, at which point they may no longer be modified.Typically invoked by
PipelineRunnersubclasses. -
applyTransform
@Internal public static <InputT extends PInput,OutputT extends POutput> OutputT applyTransform(InputT input, PTransform<? super InputT, OutputT> transform) For internal use only; no backwards-compatibility guarantees.Like
applyTransform(String, PInput, PTransform)but defaulting to the name provided by thePTransform. -
applyTransform
@Internal public static <InputT extends PInput,OutputT extends POutput> OutputT applyTransform(String name, InputT input, PTransform<? super InputT, OutputT> transform) For internal use only; no backwards-compatibility guarantees.Applies the given
PTransformto this inputInputTand returns itsOutputT. This usesnameto identify this specific application of the transform. This name is used in various places, including the monitoring UI, logging, and to stably identify this application node in thePipelinegraph during update.Each
PInputsubclass that provides anapplymethod should delegate to this method to ensure proper registration with thePipelineRunner. -
toString
-
Pipelinehas a registry throughout its lifetime.