Using the Apache Hadoop MapReduce Runner

The Apache Hadoop MapReduce Runner can be used to execute Beam pipelines using Apache Hadoop.

The Beam Capability Matrix documents the currently supported capabilities of the Apache Hadoop MapReduce Runner.

Apache Hadoop MapReduce Runner prerequisites and setup

You need to have an Apache Hadoop environment with either Single Node Setup or Cluster Setup

The Apache Hadoop MapReduce runner currently supports Apache Hadoop version 2.8.1.

You can add a dependency on the latest version of the Apache Hadoop MapReduce runner by adding the following to your pom.xml:


Deploying Apache Hadoop MapReduce with your application

To execute in a local Hadoop environment, use this command:

$ mvn exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
    -Pmapreduce-runner \
    -Dexec.args="--runner=MapReduceRunner \
      --inputFile=/path/to/pom.xml \
      --output=/path/to/counts \
      --fileOutputDir=<directory for intermediate outputs>"

To execute in a Hadoop cluster, package your program along with all dependencies in a fat jar.

If you are following through the Beam Java SDK Quickstart, you can run this command:

$ mvn package -Pflink-runner

For actually running the pipeline you would use this command

$ yarn jar word-count-beam-bundled-0.1.jar \
    org.apache.beam.examples.WordCount \
    --runner=MapReduceRunner \
    --inputFile=/path/to/pom.xml \
      --output=/path/to/counts \
      --fileOutputDir=<directory for intermediate outputs>"

Pipeline options for the Apache Hadoop MapReduce Runner

When executing your pipeline with the Apache Hadoop MapReduce Runner, you should consider the following pipeline options.

Field Description Default Value
runner The pipeline runner to use. This option allows you to determine the pipeline runner at runtime. Set to MapReduceRunner to run using Apache Hadoop MapReduce.
jarClass The jar class of the user Beam program. JarClassInstanceFactory.class
fileOutputDir The directory for output files. "/tmp/mapreduce/"