public class Pipeline
extends java.lang.Object
Pipeline manages a directed acyclic graph of PTransforms, and the
PCollections that the PTransforms consume and produce.
Each Pipeline is self-contained and isolated from any other Pipeline. The
PValues that are inputs and outputs of each of a Pipeline's
PTransforms are also owned by that Pipeline. A PValue owned by
one Pipeline can be read only by PTransforms also owned by that Pipeline. Pipelines can safely be executed concurrently.
Here is a typical example of use:
// Start by defining the options for the pipeline.
PipelineOptions options = PipelineOptionsFactory.create();

// Then create the pipeline. The runner is determined by the options.
Pipeline p = Pipeline.create(options);

// A root PTransform, like TextIO.Read or Create, gets added
// to the Pipeline by being applied:
PCollection<String> lines =
    p.apply(TextIO.read().from("gs://bucket/dir/file*.txt"));

// A Pipeline can have multiple root transforms:
PCollection<String> moreLines =
    p.apply(TextIO.read().from("gs://bucket/other/dir/file*.txt"));
PCollection<String> yetMoreLines =
    p.apply(Create.of("yet", "more", "lines").withCoder(StringUtf8Coder.of()));

// Further PTransforms can be applied, in an arbitrary (acyclic) graph.
// Subsequent PTransforms (and intermediate PCollections etc.) are
// implicitly part of the same Pipeline.
PCollection<String> allLines =
    PCollectionList.of(lines).and(moreLines).and(yetMoreLines)
        .apply(Flatten.pCollections());
// ExtractWords and FormatCounts are user-defined DoFns.
// Count.perElement() produces Long counts.
PCollection<KV<String, Long>> wordCounts =
    allLines
        .apply(ParDo.of(new ExtractWords()))
        .apply(Count.perElement());
PCollection<String> formattedWordCounts =
    wordCounts.apply(ParDo.of(new FormatCounts()));
formattedWordCounts.apply(TextIO.write().to("gs://bucket/dir/counts.txt"));

// PTransforms aren't executed when they're applied; rather, they're
// just added to the Pipeline. Once the whole Pipeline of PTransforms
// is constructed, the Pipeline's PTransforms can be run using a
// PipelineRunner. The default PipelineRunner executes the Pipeline
// directly, sequentially, in this one process, which is useful for
// unit tests and simple experiments:
p.run();
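The deferred-execution behavior described in the closing comment above can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions: `ToyDeferredPipeline`, `executedSteps`, and `example()` are hypothetical names that do not exist in the Beam SDK, and the model is a linear chain of steps rather than a graph.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Toy model only: these names are hypothetical and are not part of the Beam SDK.
class ToyDeferredPipeline {
    private final List<UnaryOperator<List<String>>> steps = new ArrayList<>();
    int executedSteps = 0;

    // Applying a step only records it; nothing is executed here.
    ToyDeferredPipeline apply(UnaryOperator<List<String>> step) {
        steps.add(step);
        return this;
    }

    // Running executes the recorded steps in the order they were applied.
    List<String> run(List<String> input) {
        List<String> current = input;
        for (UnaryOperator<List<String>> step : steps) {
            current = step.apply(current);
            executedSteps++;
        }
        return current;
    }

    // Builds a two-step chain: upper-case every line, then drop empty lines.
    static ToyDeferredPipeline example() {
        return new ToyDeferredPipeline()
            .apply(lines -> lines.stream().map(String::toUpperCase).collect(Collectors.toList()))
            .apply(lines -> lines.stream().filter(s -> !s.isEmpty()).collect(Collectors.toList()));
    }

    public static void main(String[] args) {
        ToyDeferredPipeline p = example();
        System.out.println("executed before run: " + p.executedSteps);
        System.out.println(p.run(Arrays.asList("a", "", "b")));
        System.out.println("executed after run: " + p.executedSteps);
    }
}
```

In this model, as in the example above, constructing the chain has no side effects; work happens only when `run` is called.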
| Modifier and Type | Class and Description |
|---|---|
| static class | Pipeline.PipelineExecutionException |
| static interface | Pipeline.PipelineVisitor: For internal use only; no backwards-compatibility guarantees. |
| Modifier | Constructor and Description |
|---|---|
| protected | Pipeline(PipelineOptions options) |
| Modifier and Type | Method and Description |
|---|---|
| <OutputT extends POutput> OutputT | apply(PTransform<? super PBegin,OutputT> root): Like apply(String, PTransform) but the transform node in the Pipeline graph will be named according to PTransform.getName(). |
| <OutputT extends POutput> OutputT | apply(java.lang.String name, PTransform<? super PBegin,OutputT> root) |
| static <InputT extends PInput,OutputT extends POutput> OutputT | applyTransform(InputT input, PTransform<? super InputT,OutputT> transform): For internal use only; no backwards-compatibility guarantees. |
| static <InputT extends PInput,OutputT extends POutput> OutputT | applyTransform(java.lang.String name, InputT input, PTransform<? super InputT,OutputT> transform): For internal use only; no backwards-compatibility guarantees. |
| PBegin | begin(): Returns a PBegin owned by this Pipeline. |
| static Pipeline | create(): Constructs a pipeline from default PipelineOptions. |
| static Pipeline | create(PipelineOptions options): Constructs a pipeline from the provided PipelineOptions. |
| static Pipeline | forTransformHierarchy(org.apache.beam.sdk.runners.TransformHierarchy transforms, PipelineOptions options) |
| CoderRegistry | getCoderRegistry(): Returns the CoderRegistry that this Pipeline uses. |
| PipelineOptions | getOptions() |
| SchemaRegistry | getSchemaRegistry() |
| void | replaceAll(java.util.List<org.apache.beam.sdk.runners.PTransformOverride> overrides): For internal use only; no backwards-compatibility guarantees. |
| PipelineResult | run(): Runs this Pipeline according to the PipelineOptions used to create the Pipeline via create(PipelineOptions). |
| PipelineResult | run(PipelineOptions options): Runs this Pipeline using the given PipelineOptions, using the runner specified by the options. |
| void | setCoderRegistry(CoderRegistry coderRegistry): Deprecated. This should never be used; every Pipeline has a registry throughout its lifetime. |
| java.lang.String | toString() |
| void | traverseTopologically(Pipeline.PipelineVisitor visitor): For internal use only; no backwards-compatibility guarantees. |
protected Pipeline(PipelineOptions options)
public static Pipeline create()
Constructs a pipeline from default PipelineOptions.
public static Pipeline create(PipelineOptions options)
Constructs a pipeline from the provided PipelineOptions.
public PBegin begin()
Returns a PBegin owned by this Pipeline. This serves as the input of a root PTransform such as Read or Create.
public <OutputT extends POutput> OutputT apply(PTransform<? super PBegin,OutputT> root)
Like apply(String, PTransform) but the transform node in the Pipeline graph
will be named according to PTransform.getName().
See Also: apply(String, PTransform)
public <OutputT extends POutput> OutputT apply(java.lang.String name, PTransform<? super PBegin,OutputT> root)
Adds a root PTransform, such as Read or Create, to this Pipeline.
The node in the Pipeline graph will use the provided name. This name is used
in various places, including the monitoring UI, logging, and to stably identify this node in
the Pipeline graph upon update.
Alias for begin().apply(name, root).
@Internal public static Pipeline forTransformHierarchy(org.apache.beam.sdk.runners.TransformHierarchy transforms, PipelineOptions options)
@Internal public PipelineOptions getOptions()
@Internal public void replaceAll(java.util.List<org.apache.beam.sdk.runners.PTransformOverride> overrides)
Replaces all nodes that match a PTransformOverride in this pipeline. Overrides are
applied in the order they are present within the list.
After all nodes are replaced, ensures that no nodes in the updated graph match any of the overrides.
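The two-phase behavior described above (apply the overrides in list order, then verify that nothing still matches) can be sketched with a stand-alone toy model. This is not Beam's implementation: `ToyOverrides`, `ToyOverride`, and `rename` are hypothetical names, nodes are modeled as plain strings, and throwing `IllegalStateException` in the final check is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Toy model only: these names are hypothetical and are not part of the Beam SDK.
class ToyOverrides {
    // Stand-in for an override: a matcher plus a function producing the replacement node.
    static final class ToyOverride {
        final Predicate<String> matcher;
        final UnaryOperator<String> replacement;

        ToyOverride(Predicate<String> matcher, UnaryOperator<String> replacement) {
            this.matcher = matcher;
            this.replacement = replacement;
        }
    }

    static List<String> replaceAll(List<String> nodes, List<ToyOverride> overrides) {
        List<String> result = new ArrayList<>(nodes);
        // Phase 1: apply each override to every node, in list order.
        for (ToyOverride override : overrides) {
            for (int i = 0; i < result.size(); i++) {
                if (override.matcher.test(result.get(i))) {
                    result.set(i, override.replacement.apply(result.get(i)));
                }
            }
        }
        // Phase 2: no node in the updated graph may still match any override.
        for (ToyOverride override : overrides) {
            for (String node : result) {
                if (override.matcher.test(node)) {
                    throw new IllegalStateException("node still matches an override: " + node);
                }
            }
        }
        return result;
    }

    // Convenience factory: replace nodes named `from` with a node named `to`.
    static ToyOverride rename(String from, String to) {
        return new ToyOverride(from::equals, node -> to);
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("Create", "Read");
        System.out.println(replaceAll(nodes,
            Arrays.asList(rename("Create", "Read"), rename("Read", "RunnerRead"))));
    }
}
```

In this model the order of the list matters: with the two overrides reversed, the second one reintroduces a `Read` node after the first has already run, and the final check rejects the result.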
public PipelineResult run()
Runs this Pipeline according to the PipelineOptions used to create the Pipeline via create(PipelineOptions).
public PipelineResult run(PipelineOptions options)
Runs this Pipeline using the given PipelineOptions, using the runner specified
by the options.
public CoderRegistry getCoderRegistry()
Returns the CoderRegistry that this Pipeline uses.
@Experimental(value=SCHEMAS) public SchemaRegistry getSchemaRegistry()
@Deprecated public void setCoderRegistry(CoderRegistry coderRegistry)
Deprecated. This should never be used; every Pipeline has a registry throughout its lifetime.
@Internal public void traverseTopologically(Pipeline.PipelineVisitor visitor)
Invokes the PipelineVisitor's visitPrimitiveTransform(TransformHierarchy.Node) and
visitValue(PValue, TransformHierarchy.Node) operations on each of this Pipeline's
transform and value nodes, in forward topological order.
Traversal of the Pipeline causes PTransforms and PValues owned by the Pipeline to be marked as finished, at which point they may no
longer be modified.
Typically invoked by PipelineRunner subclasses.
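To make "forward topological order" concrete: every node is visited only after all nodes that produce its inputs. The sketch below demonstrates that ordering on a toy graph using Kahn's algorithm. It is a generic technique swapped in for illustration, not the traversal code Beam uses, and `ToyGraph` with all of its methods is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these names are hypothetical and are not part of the Beam SDK.
class ToyGraph {
    // Maps each node to the nodes that consume its output.
    private final Map<String, List<String>> consumers = new LinkedHashMap<>();
    // Number of producers feeding each node.
    private final Map<String, Integer> inDegree = new LinkedHashMap<>();

    void addNode(String name) {
        consumers.putIfAbsent(name, new ArrayList<>());
        inDegree.putIfAbsent(name, 0);
    }

    void addEdge(String producer, String consumer) {
        addNode(producer);
        addNode(consumer);
        consumers.get(producer).add(consumer);
        inDegree.merge(consumer, 1, Integer::sum);
    }

    // Kahn's algorithm: emit a node only after all of its producers have been emitted.
    List<String> topologicalOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(inDegree);
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : remaining.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String node = ready.remove();
            order.add(node);
            for (String next : consumers.get(node)) {
                if (remaining.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != remaining.size()) {
            throw new IllegalStateException("graph contains a cycle");
        }
        return order;
    }

    // Two roots feeding a flatten step, followed by a count and a write.
    static ToyGraph example() {
        ToyGraph g = new ToyGraph();
        g.addEdge("ReadA", "Flatten");
        g.addEdge("ReadB", "Flatten");
        g.addEdge("Flatten", "Count");
        g.addEdge("Count", "Write");
        return g;
    }

    public static void main(String[] args) {
        System.out.println(example().topologicalOrder());
    }
}
```

For the example graph, both roots are emitted before `Flatten`, which in turn precedes `Count` and `Write`. A visitor invoked in this order never sees a node before the nodes it depends on.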
@Internal public static <InputT extends PInput,OutputT extends POutput> OutputT applyTransform(InputT input, PTransform<? super InputT,OutputT> transform)
Like applyTransform(String, PInput, PTransform) but defaulting to the name provided
by the PTransform.
@Internal public static <InputT extends PInput,OutputT extends POutput> OutputT applyTransform(java.lang.String name, InputT input, PTransform<? super InputT,OutputT> transform)
Applies the given PTransform to this input InputT and returns its OutputT. This uses name to identify this specific application of the transform. This
name is used in various places, including the monitoring UI, logging, and to stably identify
this application node in the Pipeline graph during update.
Each PInput subclass that provides an apply method should delegate to this
method to ensure proper registration with the PipelineRunner.
public java.lang.String toString()
Overrides: toString in class java.lang.Object