Package org.apache.beam.runners.dataflow
Class DataflowRunner
java.lang.Object
org.apache.beam.sdk.PipelineRunner<DataflowPipelineJob>
org.apache.beam.runners.dataflow.DataflowRunner
A
PipelineRunner
that executes the operations in the pipeline by first translating them
to the Dataflow representation using the DataflowPipelineTranslator
and then submitting
them to a Dataflow service for execution.
Permissions
When reading from a Dataflow source or writing to a Dataflow sink using
DataflowRunner
, the Google cloudservices account and the Google compute engine service account
of the GCP project running the Dataflow Job will need access to the corresponding source/sink.
Please see Google Cloud Dataflow Security and Permissions for more details.
DataflowRunner now supports creating job templates using the --templateLocation
option. If this option is set, the runner will generate a template instead of running the
pipeline immediately.
Example:
--runner=DataflowRunner
--templateLocation=gs://your-bucket/templates/my-template
-
Nested Class Summary
Nested ClassesModifier and TypeClassDescriptionstatic class
static class
A markerDoFn
for writing the contents of aPCollection
to a streamingPCollectionView
backend implementation. -
Field Summary
Fields -
Constructor Summary
Constructors -
Method Summary
Modifier and TypeMethodDescriptionprotected org.apache.beam.model.pipeline.v1.RunnerApi.Pipeline
applySdkEnvironmentOverrides
(org.apache.beam.model.pipeline.v1.RunnerApi.Pipeline pipeline, DataflowPipelineOptions options) static DataflowRunner
fromOptions
(PipelineOptions options) Construct a runner from the provided options.Returns the DataflowPipelineTranslator associated with this object.static boolean
hasExperiment
(DataflowPipelineDebugOptions options, String experiment) Returns true if the specified experiment is enabled, handling null experiments.replaceGcsFilesWithLocalFiles
(List<String> filesToStage) Replaces GCS file paths with local file paths by downloading the GCS files locally.protected void
replaceV1Transforms
(Pipeline pipeline) protected org.apache.beam.model.pipeline.v1.RunnerApi.Pipeline
resolveArtifacts
(org.apache.beam.model.pipeline.v1.RunnerApi.Pipeline pipeline) Processes the givenPipeline
, potentially asynchronously, returning a runner-specific type of result.void
setHooks
(DataflowRunnerHooks hooks) Sets callbacks to invoke during execution seeDataflowRunnerHooks
.protected List
<DataflowPackage> stageArtifacts
(org.apache.beam.model.pipeline.v1.RunnerApi.Pipeline pipeline) toString()
Methods inherited from class org.apache.beam.sdk.PipelineRunner
create, run, run
-
Field Details
-
UNSAFELY_ATTEMPT_TO_PROCESS_UNBOUNDED_DATA_IN_BATCH_MODE
Experiment to "unsafely attempt to process unbounded data in batch mode".- See Also:
-
PROJECT_ID_REGEXP
Project IDs must contain lowercase letters, digits, or dashes. IDs must start with a letter and may not end with a dash. This regex isn't exact - this allows for patterns that would be rejected by the service, but this is sufficient for basic validation of project IDs.- See Also:
-
-
Constructor Details
-
DataflowRunner
-
-
Method Details
-
replaceGcsFilesWithLocalFiles
Replaces GCS file paths with local file paths by downloading the GCS files locally. This is useful when files need to be accessed locally before being staged to Dataflow.- Parameters:
filesToStage
- List of file paths that may contain GCS paths (gs://) and local paths- Returns:
- List of local file paths where any GCS paths have been downloaded locally
- Throws:
RuntimeException
- if there are errors copying GCS files locally
-
fromOptions
Construct a runner from the provided options.- Parameters:
options
- Properties that configure the runner.- Returns:
- The newly created runner.
-
applySdkEnvironmentOverrides
protected org.apache.beam.model.pipeline.v1.RunnerApi.Pipeline applySdkEnvironmentOverrides(org.apache.beam.model.pipeline.v1.RunnerApi.Pipeline pipeline, DataflowPipelineOptions options) -
resolveArtifacts
protected org.apache.beam.model.pipeline.v1.RunnerApi.Pipeline resolveArtifacts(org.apache.beam.model.pipeline.v1.RunnerApi.Pipeline pipeline) -
stageArtifacts
protected List<DataflowPackage> stageArtifacts(org.apache.beam.model.pipeline.v1.RunnerApi.Pipeline pipeline) -
run
Description copied from class:PipelineRunner
Processes the givenPipeline
, potentially asynchronously, returning a runner-specific type of result.- Specified by:
run
in classPipelineRunner<DataflowPipelineJob>
-
hasExperiment
Returns true if the specified experiment is enabled, handling null experiments. -
replaceV1Transforms
-
getTranslator
Returns the DataflowPipelineTranslator associated with this object. -
setHooks
Sets callbacks to invoke during execution seeDataflowRunnerHooks
. -
toString
-