Apache Beam Java SDK Quickstart

This Quickstart will walk you through executing your first Beam pipeline to run WordCount, written using Beam’s Java SDK, on a runner of your choice.

If you’re interested in contributing to the Apache Beam Java codebase, see the Contribution Guide.

Set up your Development Environment

  1. Download and install the Java Development Kit (JDK) version 8. Verify that the JAVA_HOME environment variable is set and points to your JDK installation.

  2. Download and install Apache Maven by following Maven’s installation guide for your specific operating system.

  3. Optional: Install Gradle if you would like to convert your Maven project into Gradle.

Get the WordCount Code

The easiest way to get a copy of the WordCount pipeline is to use the following command to generate a simple Maven project that contains Beam’s WordCount examples and builds against the most recent Beam release:

$ mvn archetype:generate \
      -DarchetypeGroupId=org.apache.beam \
      -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
      -DarchetypeVersion=2.27.0 \
      -DgroupId=org.example \
      -DartifactId=word-count-beam \
      -Dversion="0.1" \
      -Dpackage=org.apache.beam.examples \
      -DinteractiveMode=false
PS> mvn archetype:generate `
 -D archetypeGroupId=org.apache.beam `
 -D archetypeArtifactId=beam-sdks-java-maven-archetypes-examples `
 -D archetypeVersion=2.27.0 `
 -D groupId=org.example `
 -D artifactId=word-count-beam `
 -D version="0.1" `
 -D package=org.apache.beam.examples `
 -D interactiveMode=false

This will create a directory word-count-beam that contains a simple pom.xml and a series of example pipelines that count words in text files.

$ cd word-count-beam/

$ ls
pom.xml	src

$ ls src/main/java/org/apache/beam/examples/
DebuggingWordCount.java	WindowedWordCount.java	common
MinimalWordCount.java	WordCount.java
PS> cd .\word-count-beam

PS> dir

...

Mode                LastWriteTime         Length Name
----                -------------         ------ ----
d-----        7/19/2018  11:00 PM                src
-a----        7/19/2018  11:00 PM          16051 pom.xml

PS> dir .\src\main\java\org\apache\beam\examples

...
Mode                LastWriteTime         Length Name
----                -------------         ------ ----
d-----        7/19/2018  11:00 PM                common
d-----        7/19/2018  11:00 PM                complete
d-----        7/19/2018  11:00 PM                subprocess
-a----        7/19/2018  11:00 PM           7073 DebuggingWordCount.java
-a----        7/19/2018  11:00 PM           5945 MinimalWordCount.java
-a----        7/19/2018  11:00 PM           9490 WindowedWordCount.java
-a----        7/19/2018  11:00 PM           7662 WordCount.java

For a detailed introduction to the Beam concepts used in these examples, see the WordCount Example Walkthrough. Here, we’ll just focus on executing WordCount.java.

Optional: Convert from Maven to Gradle Project

Ensure you are in the same directory as the pom.xml file generated from the previous step. Automatically convert your project from Maven to Gradle by running:

$ gradle init

After you have converted the project to Gradle:

  1. Edit the generated build.gradle file by adding mavenCentral() under repositories:
    repositories {
        mavenCentral()
        maven {
            url = uri('https://repository.apache.org/content/repositories/snapshots/')
        }
    
        maven {
            url = uri('http://repo.maven.apache.org/maven2')
        }
    }
  2. Add the following task in build.gradle to allow you to execute pipelines with Gradle:
    task execute (type:JavaExec) {
        main = System.getProperty("mainClass")
        classpath = sourceSets.main.runtimeClasspath
        systemProperties System.getProperties()
        args System.getProperty("exec.args", "").split()
    }
  3. Rebuild your project by running:
    $ gradle build

Run WordCount

A single Beam pipeline can run on multiple Beam runners, including the FlinkRunner, SparkRunner, NemoRunner, JetRunner, or DataflowRunner. The DirectRunner is a common runner for getting started, as it runs locally on your machine and requires no specific setup.

After you’ve chosen which runner you’d like to use:

  1. Ensure you’ve done any runner-specific setup.
  2. Build your commandline by:
    1. Specifying a specific runner with --runner=<runner> (defaults to the DirectRunner)
    2. Adding any runner-specific required options
    3. Choosing input files and an output location are accessible on the chosen runner. (For example, you can’t access a local file if you are running the pipeline on an external cluster.)
  3. Run your first WordCount pipeline.

Run WordCount Using Maven

For Unix shells:

$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--inputFile=/path/to/inputfile --output=counts" -Pdirect-runner
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=SparkRunner --inputFile=/path/to/inputfile --output=counts" -Pspark-runner
Make sure you complete the setup steps at /documentation/runners/dataflow/#setup

$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=DataflowRunner --project=<your-gcp-project> \
                  --region=<your-gcp-region> \
                  --gcpTempLocation=gs://<your-gcs-bucket>/tmp \
                  --inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://<your-gcs-bucket>/counts" \
     -Pdataflow-runner
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--inputFile=/path/to/inputfile --output=/tmp/counts --runner=SamzaRunner" -Psamza-runner
$ mvn package -Pnemo-runner && java -cp target/word-count-beam-bundled-0.1.jar org.apache.beam.examples.WordCount \
     --runner=NemoRunner --inputFile=`pwd`/pom.xml --output=counts
$ mvn package -Pjet-runner
$ java -cp target/word-count-beam-bundled-0.1.jar org.apache.beam.examples.WordCount \
     --runner=JetRunner --jetLocalMode=3 --inputFile=`pwd`/pom.xml --output=counts

For Windows PowerShell:

PS> mvn compile exec:java -D exec.mainClass=org.apache.beam.examples.WordCount `
 -D exec.args="--inputFile=/path/to/inputfile --output=counts" -P direct-runner
PS> mvn compile exec:java -D exec.mainClass=org.apache.beam.examples.WordCount `
 -D exec.args="--runner=SparkRunner --inputFile=/path/to/inputfile --output=counts" -P spark-runner
Make sure you complete the setup steps at /documentation/runners/dataflow/#setup

PS> mvn compile exec:java -D exec.mainClass=org.apache.beam.examples.WordCount `
 -D exec.args="--runner=DataflowRunner --project=<your-gcp-project> `
               --region=<your-gcp-region> \
               --gcpTempLocation=gs://<your-gcs-bucket>/tmp `
               --inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://<your-gcs-bucket>/counts" `
 -P dataflow-runner
PS> mvn compile exec:java -D exec.mainClass=org.apache.beam.examples.WordCount `
     -D exec.args="--inputFile=/path/to/inputfile --output=/tmp/counts --runner=SamzaRunner" -P samza-runner
PS> mvn package -P nemo-runner -DskipTests
PS> java -cp target/word-count-beam-bundled-0.1.jar org.apache.beam.examples.WordCount `
      --runner=NemoRunner --inputFile=`pwd`/pom.xml --output=counts
PS> mvn package -P jet-runner
PS> java -cp target/word-count-beam-bundled-0.1.jar org.apache.beam.examples.WordCount `
      --runner=JetRunner --jetLocalMode=3 --inputFile=$pwd/pom.xml --output=counts

Run WordCount Using Gradle

For Unix shells (Instructions currently only available for Direct, Spark, and Dataflow):

$ gradle clean execute -DmainClass=org.apache.beam.examples.WordCount \
    -Dexec.args="--inputFile=/path/to/inputfile --output=counts" -Pdirect-runner
We are working on adding the instruction for this runner!
$ gradle clean execute -DmainClass=org.apache.beam.examples.WordCount \
    -Dexec.args="--inputFile=/path/to/inputfile --output=counts" -Pspark-runner
$ gradle clean execute -DmainClass=org.apache.beam.examples.WordCount \
    -Dexec.args="--project=<your-gcp-project> --inputFile=gs://apache-beam-samples/shakespeare/* \
    --output=gs://<your-gcs-bucket>/counts" -Pdataflow-runner
We are working on adding the instruction for this runner!
We are working on adding the instruction for this runner!
We are working on adding the instruction for this runner!

Inspect the results

Once the pipeline has completed, you can view the output. You’ll notice that there may be multiple output files prefixed by count. The exact number of these files is decided by the runner, giving it the flexibility to do efficient, distributed execution.

$ ls counts*
$ ls counts*
$ gsutil ls gs://<your-gcs-bucket>/counts*
$ ls /tmp/counts*
$ ls counts*
$ ls counts*

When you look into the contents of the file, you’ll see that they contain unique words and the number of occurrences of each word. The order of elements within the file may differ because the Beam model does not generally guarantee ordering, again to allow runners to optimize for efficiency.

$ more counts*
api: 9
bundled: 1
old: 4
Apache: 2
The: 1
limitations: 1
Foundation: 1
...
$ more counts*
beam: 27
SF: 1
fat: 1
job: 1
limitations: 1
require: 1
of: 11
profile: 10
...
$ gsutil cat gs://<your-gcs-bucket>/counts*
feature: 15
smother'st: 1
revelry: 1
bashfulness: 1
Bashful: 1
Below: 2
deserves: 32
barrenly: 1
...
$ more /tmp/counts*
api: 7
are: 2
can: 2
com: 14
end: 14
for: 14
has: 2
...
$ more counts*
cluster: 2
handler: 1
plugins: 9
exclusions: 14
finalName: 2
Adds: 2
java: 7
xml: 1
...
$ more counts*
FlinkRunner: 1
cleanupDaemonThreads: 2
sdks: 4
unit: 1
Apache: 3
IO: 2
copyright: 1
governing: 1
overrides: 1
YARN: 1
...

Next Steps

Please don’t hesitate to reach out if you encounter any issues!