Using the Google Cloud Dataflow Runner
The Google Cloud Dataflow Runner uses the Cloud Dataflow managed service. When you run your pipeline with the Cloud Dataflow service, the runner uploads your executable code and dependencies to a Google Cloud Storage bucket and creates a Cloud Dataflow job, which executes your pipeline on managed resources in Google Cloud Platform.
The Cloud Dataflow Runner and service are suitable for large-scale, continuous jobs, and provide:
- a fully managed service
- autoscaling of the number of workers throughout the lifetime of the job
- dynamic work rebalancing
The Beam Capability Matrix documents the supported capabilities of the Cloud Dataflow Runner.
Cloud Dataflow Runner prerequisites and setup
To use the Cloud Dataflow Runner, you must complete the setup in the Before you begin section of the Cloud Dataflow quickstart for your chosen language.
- Select or create a Google Cloud Platform Console project.
- Enable billing for your project.
- Enable the required Google Cloud APIs: Cloud Dataflow, Compute Engine, Stackdriver Logging, Cloud Storage, Cloud Storage JSON, and Cloud Resource Manager. You may need to enable additional APIs (such as BigQuery, Cloud Pub/Sub, or Cloud Datastore) if you use them in your pipeline code.
- Authenticate with Google Cloud Platform.
- Install the Google Cloud SDK.
- Create a Cloud Storage bucket.
Specify your dependency
When using Java, you must specify your dependency on the Cloud Dataflow Runner in your pom.xml.
This section is not applicable to the Beam SDK for Python.
Self executing JAR
This section is not applicable to the Beam SDK for Python.
In some cases, such as starting a pipeline from a scheduler like Apache Airflow, you must have a self-contained application. You can pack a self-executing JAR by explicitly adding the following dependency to the project section of your pom.xml, in addition to the existing dependency shown in the previous section.
Then, add the mainClass name in the Maven JAR plugin.
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-jar-plugin</artifactId>
  <version>${maven-jar-plugin.version}</version>
  <configuration>
    <archive>
      <manifest>
        <addClasspath>true</addClasspath>
        <classpathPrefix>lib/</classpathPrefix>
        <mainClass>YOUR_MAIN_CLASS_NAME</mainClass>
      </manifest>
    </archive>
  </configuration>
</plugin>
After running mvn package -Pdataflow-runner, run ls target. Assuming your artifactId is beam-examples and the version is 1.0.0, you should see the following output.
To run the self-executing JAR on Cloud Dataflow, use the following command.
Pipeline options for the Cloud Dataflow Runner
When executing your pipeline with the Cloud Dataflow Runner (Java or Python), consider these common pipeline options.
Field | Description | Default Value |
---|---|---|
runner | The pipeline runner to use. This option allows you to determine the pipeline runner at runtime. | Set to dataflow or DataflowRunner to run on the Cloud Dataflow Service. |
project | The project ID for your Google Cloud Project. | If not set, defaults to the default project in the current environment. The default project is set via gcloud. |
region | The Google Compute Engine region in which to create the job. | If not set, defaults to the default region in the current environment. The default region is set via gcloud. |
streaming | Whether streaming mode is enabled or disabled; true if enabled. Set to true if running pipelines with unbounded PCollections. | false |
tempLocation (Java), temp_location (Python) | Optional in Java; required in Python. Path for temporary files. Must be a valid Google Cloud Storage URL that begins with gs://. In Java, if set, tempLocation is used as the default value for gcpTempLocation. | No default value. |
gcpTempLocation | Cloud Storage bucket path for temporary files. Must be a valid Cloud Storage URL that begins with gs://. | If not set, defaults to the value of tempLocation, provided that tempLocation is a valid Cloud Storage URL. If tempLocation is not a valid Cloud Storage URL, you must set gcpTempLocation. |
stagingLocation (Java), staging_location (Python) | Optional. Cloud Storage bucket path for staging your binary and any temporary files. Must be a valid Cloud Storage URL that begins with gs://. | If not set, defaults to a staging directory within gcpTempLocation (Java) or temp_location (Python). |
save_main_session | Save the main session state so that pickled functions and classes defined in __main__ (e.g. interactive session) can be unpickled. Some workflows do not need the session state if, for instance, all of their functions/classes are defined in proper modules (not __main__) and the modules are importable in the worker. | false |
sdk_location | Override the default location from where the Beam SDK is downloaded. This value can be a URL, a Cloud Storage path, or a local path to an SDK tarball. Workflow submissions will download or copy the SDK tarball from this location. If set to the string default, a standard SDK location is used. If empty, no SDK is copied. | default |
See the reference documentation for the DataflowPipelineOptions (Java) or PipelineOptions (Python) interface (and any subinterfaces) for additional pipeline configuration options.
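As a rough illustration of how the Python option names in the table map to command-line flags, the sketch below assembles a flag list using only the standard library. The helper name dataflow_flags and its parameters are hypothetical and not part of the Beam SDK; the flag names are taken from the table, and the gs:// check mirrors the requirement stated there.

```python
from typing import List, Optional


def dataflow_flags(
    project: str,
    region: str,
    temp_location: str,
    staging_location: Optional[str] = None,
    streaming: bool = False,
    save_main_session: bool = False,
) -> List[str]:
    """Build pipeline flags for the Cloud Dataflow Runner (hypothetical helper)."""
    # Both locations must be Cloud Storage URLs that begin with gs://.
    for name, value in (
        ("temp_location", temp_location),
        ("staging_location", staging_location),
    ):
        if value is not None and not value.startswith("gs://"):
            raise ValueError(f"{name} must begin with gs://, got {value!r}")

    flags = [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location={temp_location}",
    ]
    # staging_location is optional; when omitted the runner picks its default.
    if staging_location is not None:
        flags.append(f"--staging_location={staging_location}")
    # Boolean options are emitted as bare flags only when enabled.
    if streaming:
        flags.append("--streaming")
    if save_main_session:
        flags.append("--save_main_session")
    return flags
```

Assuming the Beam SDK for Python is installed, a list built this way can be handed to apache_beam.options.pipeline_options.PipelineOptions, for example PipelineOptions(dataflow_flags("my-project", "us-central1", "gs://my-bucket/temp")), where the project, region, and bucket names are placeholders.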
Additional information and caveats
Monitoring your job
While your pipeline executes, you can monitor the job’s progress, view details on execution, and receive updates on the pipeline’s results by using the Dataflow Monitoring Interface or the Dataflow Command-line Interface.
Blocking Execution
To block until your job completes, call waitUntilFinish (Java) or wait_until_finish (Python) on the PipelineResult returned from pipeline.run(). The Cloud Dataflow Runner prints job status updates and console messages while it waits. While the result is connected to the active job, pressing Ctrl+C from the command line does not cancel your job. To cancel the job, use the Dataflow Monitoring Interface or the Dataflow Command-line Interface.
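A minimal Python sketch of the blocking pattern follows. It assumes pipeline is an already constructed Beam pipeline object whose run() returns a PipelineResult; the wrapper name run_and_block is hypothetical and exists only for illustration.

```python
def run_and_block(pipeline):
    """Run the pipeline and block until the job completes.

    `pipeline` is assumed to be a Beam pipeline object; this wrapper is a
    hypothetical helper, not part of the Beam SDK.
    """
    # run() submits the job and returns a PipelineResult.
    result = pipeline.run()
    # Blocks here until the job finishes.
    result.wait_until_finish()
    return result
```

Returning the result object lets the caller inspect the finished job afterwards. Interrupting the wait with Ctrl+C only disconnects the client; as noted above, the job itself keeps running until it is cancelled through the Monitoring Interface or the command-line interface.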
Streaming Execution
If your pipeline uses an unbounded data source or sink, you must set the streaming option to true.
When using streaming execution, keep the following considerations in mind.
- Streaming pipelines do not terminate unless explicitly cancelled by the user. You can cancel your streaming job from the Dataflow Monitoring Interface or with the Dataflow Command-line Interface (gcloud dataflow jobs cancel command).
- Streaming jobs use a Google Compute Engine machine type of n1-standard-2 or higher by default. You must not override this, as n1-standard-2 is the minimum required machine type for running streaming jobs.
- Streaming execution pricing differs from batch execution.
Last updated on 2024/12/20