Apache Beam Java SDK Quickstart

This Quickstart will walk you through executing your first Beam pipeline to run WordCount, written using Beam’s Java SDK, on a runner of your choice.

Set up your Development Environment

  1. Download and install the Java Development Kit (JDK) version 1.7 or later. Verify that the JAVA_HOME environment variable is set and points to your JDK installation.

  2. Download and install Apache Maven by following Maven’s installation guide for your specific operating system.

Get the WordCount Code

The easiest way to get a copy of the WordCount pipeline is to use the following command to generate a simple Maven project that contains Beam’s WordCount examples and builds against the most recent Beam release:

$ mvn archetype:generate \
      -DarchetypeGroupId=org.apache.beam \
      -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
      -DarchetypeVersion=2.1.0 \
      -DgroupId=org.example \
      -DartifactId=word-count-beam \
      -Dversion="0.1" \
      -Dpackage=org.apache.beam.examples \
      -DinteractiveMode=false

This will create a directory word-count-beam that contains a simple pom.xml and a series of example pipelines that count words in text files.

$ cd word-count-beam/

$ ls
pom.xml	src

$ ls src/main/java/org/apache/beam/examples/
DebuggingWordCount.java	WindowedWordCount.java	common
MinimalWordCount.java	WordCount.java

For a detailed introduction to the Beam concepts used in these examples, see the WordCount Example Walkthrough. Here, we’ll just focus on executing WordCount.java.

Run WordCount

A single Beam pipeline can run on multiple Beam runners, including the ApexRunner, FlinkRunner, SparkRunner or DataflowRunner. The DirectRunner is a common runner for getting started, as it runs locally on your machine and requires no specific setup.

After you’ve chosen which runner you’d like to use:

  1. Ensure you’ve done any runner-specific setup.
  2. Build your commandline by:
    1. Specifying a specific runner with --runner=<runner> (defaults to the DirectRunner)
    2. Adding any runner-specific required options
    3. Choosing input files and an output location are accessible on the chosen runner. (For example, you can’t access a local file if you are running the pipeline on an external cluster.)
  3. Run your first WordCount pipeline.
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--inputFile=pom.xml --output=counts" -Pdirect-runner
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--inputFile=pom.xml --output=counts --runner=ApexRunner" -Papex-runner
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=SparkRunner --inputFile=pom.xml --output=counts" -Pspark-runner
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--runner=DataflowRunner --project=<your-gcp-project> \
                  --gcpTempLocation=gs://<your-gcs-bucket>/tmp \
                  --inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://<your-gcs-bucket>/counts" \
     -Pdataflow-runner

Inspect the results

Once the pipeline has completed, you can view the output. You’ll notice that there may be multiple output files prefixed by count. The exact number of these files is decided by the runner, giving it the flexibility to do efficient, distributed execution.

$ ls counts*
$ ls counts*
$ ls counts*
$ gsutil ls gs://<your-gcs-bucket>/counts*

When you look into the contents of the file, you’ll see that they contain unique words and the number of occurrences of each word. The order of elements within the file may differ because the Beam model does not generally guarantee ordering, again to allow runners to optimize for efficiency.

$ more counts*
api: 9
bundled: 1
old: 4
Apache: 2
The: 1
limitations: 1
Foundation: 1
...
$ cat counts*
BEAM: 1
have: 1
simple: 1
skip: 4
PAssert: 1
...
$ more counts*
beam: 27
SF: 1
fat: 1
job: 1
limitations: 1
require: 1
of: 11
profile: 10
...
$ gsutil cat gs://<your-gcs-bucket>/counts*
feature: 15
smother'st: 1
revelry: 1
bashfulness: 1
Bashful: 1
Below: 2
deserves: 32
barrenly: 1
...

Next Steps

Please don’t hesitate to reach out if you encounter any issues!