The easiest way to get a copy of the WordCount pipeline is to use the following command to generate a simple Maven project that contains Beam’s WordCount examples and builds against the most recent Beam release:
$ mvn archetype:generate \ -DarchetypeGroupId=org.apache.beam \ -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \ -DarchetypeVersion=2.0.0 \ -DgroupId=org.example \ -DartifactId=word-count-beam \ -Dversion="0.1" \ -Dpackage=org.apache.beam.examples \ -DinteractiveMode=false
This will create a directory
word-count-beam that contains a simple
pom.xml and a series of example pipelines that count words in text files.
$ cd word-count-beam/ $ ls pom.xml src $ ls src/main/java/org/apache/beam/examples/ DebuggingWordCount.java WindowedWordCount.java common MinimalWordCount.java WordCount.java
For a detailed introduction to the Beam concepts used in these examples, see the WordCount Example Walkthrough. Here, we’ll just focus on executing
A single Beam pipeline can run on multiple Beam runners, including the ApexRunner, FlinkRunner, SparkRunner or DataflowRunner. The DirectRunner is a common runner for getting started, as it runs locally on your machine and requires no specific setup.
After you’ve chosen which runner you’d like to use:
--runner=<runner>(defaults to the DirectRunner)
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \ -Dexec.args="--inputFile=pom.xml --output=counts" -Pdirect-runner
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \ -Dexec.args="--inputFile=pom.xml --output=counts --runner=ApexRunner" -Papex-runner
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \ -Dexec.args="--runner=SparkRunner --inputFile=pom.xml --output=counts" -Pspark-runner
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \ -Dexec.args="--runner=DataflowRunner --project=<your-gcp-project> \ --gcpTempLocation=gs://<your-gcs-bucket>/tmp \ --inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://<your-gcs-bucket>/counts" \ -Pdataflow-runner
Once the pipeline has completed, you can view the output. You’ll notice that there may be multiple output files prefixed by
count. The exact number of these files is decided by the runner, giving it the flexibility to do efficient, distributed execution.
$ ls counts*
$ ls counts*
$ ls counts*
$ gsutil ls gs://<your-gcs-bucket>/counts*
When you look into the contents of the file, you’ll see that they contain unique words and the number of occurrences of each word. The order of elements within the file may differ because the Beam model does not generally guarantee ordering, again to allow runners to optimize for efficiency.
$ more counts* api: 9 bundled: 1 old: 4 Apache: 2 The: 1 limitations: 1 Foundation: 1 ...
$ cat counts* BEAM: 1 have: 1 simple: 1 skip: 4 PAssert: 1 ...
$ more counts* beam: 27 SF: 1 fat: 1 job: 1 limitations: 1 require: 1 of: 11 profile: 10 ...
$ gsutil cat gs://<your-gcs-bucket>/counts* feature: 15 smother'st: 1 revelry: 1 bashfulness: 1 Bashful: 1 Below: 2 deserves: 32 barrenly: 1 ...
Please don’t hesitate to reach out if you encounter any issues!