Using the Google Cloud Dataflow Runner

The Google Cloud Dataflow Runner uses the Cloud Dataflow managed service. When you run your pipeline with the Cloud Dataflow service, the runner uploads your executable code and dependencies to a Google Cloud Storage bucket and creates a Cloud Dataflow job, which executes your pipeline on managed resources in Google Cloud Platform.

The Cloud Dataflow Runner and service are suitable for large scale, continuous jobs, and provide:

  • a fully managed service
  • autoscaling of the number of workers throughout the lifetime of the job
  • dynamic work rebalancing

The Beam Capability Matrix documents the supported capabilities of the Cloud Dataflow Runner.

Cloud Dataflow Runner prerequisites and setup

To use the Cloud Dataflow Runner, you must complete the following setup:

  1. Select or create a Google Cloud Platform Console project.

  2. Enable billing for your project.

  3. Enable required Google Cloud APIs: Cloud Dataflow, Compute Engine, Stackdriver Logging, Cloud Storage, and Cloud Storage JSON. You may need to enable additional APIs (such as BigQuery, Cloud Pub/Sub, or Cloud Datastore) if you use them in your pipeline code.

  4. Install the Google Cloud SDK.

  5. Create a Cloud Storage bucket.

    • In the Google Cloud Platform Console, go to the Cloud Storage browser.
    • Click Create bucket.
    • In the Create bucket dialog, specify the following attributes:
      • Name: A unique bucket name. Do not include sensitive information in the bucket name, as the bucket namespace is global and publicly visible.
      • Storage class: Multi-Regional
      • Location: Choose your desired location
    • Click Create.

For more information, see the Before you begin section of the Cloud Dataflow quickstarts.

Specify your dependency

You must specify your dependency on the Cloud Dataflow Runner in your Maven pom.xml:

<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
  <version>0.5.0</version>
  <scope>runtime</scope>
</dependency>

Authentication

Before running your pipeline, you must authenticate with Google Cloud Platform. Run the following command to get Application Default Credentials.

gcloud auth application-default login

Pipeline options for the Cloud Dataflow Runner

When executing your pipeline with the Cloud Dataflow Runner, set these pipeline options.

runner: The pipeline runner to use. This option allows you to determine the pipeline runner at runtime. Set to dataflow to run on the Cloud Dataflow Service.

project: The project ID for your Google Cloud Project. If not set, defaults to the default project in the current environment. The default project is set via gcloud.

streaming: Whether streaming mode is enabled or disabled; true if enabled. Set to true if running pipelines with unbounded PCollections. Defaults to false.

tempLocation: Optional. Path for temporary files. If set to a valid Google Cloud Storage URL that begins with gs://, tempLocation is used as the default value for gcpTempLocation. No default value.

gcpTempLocation: Cloud Storage bucket path for temporary files. Must be a valid Cloud Storage URL that begins with gs://. If not set, defaults to the value of tempLocation, provided that tempLocation is a valid Cloud Storage URL. If tempLocation is not a valid Cloud Storage URL, you must set gcpTempLocation.

stagingLocation: Optional. Cloud Storage bucket path for staging your binary and any temporary files. Must be a valid Cloud Storage URL that begins with gs://. If not set, defaults to a staging directory within gcpTempLocation.

See the reference documentation for the DataflowPipelineOptions interface (and its subinterfaces) for the complete list of pipeline configuration options.
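
These options can also be set programmatically when constructing your pipeline. The following is a minimal sketch; the project ID my-project-id and the bucket path gs://my-bucket/temp are placeholders, so substitute your own values.

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Create the options object and select the Cloud Dataflow Runner.
DataflowPipelineOptions options =
    PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
options.setRunner(DataflowRunner.class);
options.setProject("my-project-id");            // placeholder project ID
options.setTempLocation("gs://my-bucket/temp"); // placeholder Cloud Storage path

// Build the pipeline against these options.
Pipeline p = Pipeline.create(options);

Alternatively, PipelineOptionsFactory.fromArgs(args) can parse the same options from command-line arguments, which lets you choose the runner and project at runtime.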

Additional information and caveats

Monitoring your job

While your pipeline executes, you can monitor the job’s progress, view details on execution, and receive updates on the pipeline’s results by using the Dataflow Monitoring Interface or the Dataflow Command-line Interface.

Blocking Execution

To connect to your job and block until it completes, call waitUntilFinish on the PipelineResult returned from pipeline.run(). The Cloud Dataflow Runner prints job status updates and console messages while it waits. Note that pressing Ctrl+C from the command line does not cancel your job, even while the result is connected to the active job. To cancel the job, use the Dataflow Monitoring Interface or the Dataflow Command-line Interface.
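
A minimal sketch of blocking execution, assuming a Pipeline p configured as in the earlier example:

import org.apache.beam.sdk.PipelineResult;

// Submit the job to the Cloud Dataflow service.
PipelineResult result = p.run();

// Block until the job finishes; status updates are printed while waiting.
result.waitUntilFinish();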

Streaming Execution

If your pipeline uses an unbounded data source or sink, you must set the streaming option to true.
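
For example, continuing the sketch above, streaming mode can be enabled on the options object before the pipeline is created; this has the same effect as setting the streaming option described earlier.

// Required for pipelines that read from or write to unbounded sources or sinks.
options.setStreaming(true);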