Using the Apache Hadoop MapReduce Runner

The Apache Hadoop MapReduce Runner can be used to execute Beam pipelines using Apache Hadoop.

The Beam Capability Matrix documents the currently supported capabilities of the Apache Hadoop MapReduce Runner.

Apache Hadoop MapReduce Runner prerequisites and setup

You need to have an Apache Hadoop environment with either Single Node Setup or Cluster Setup

The Apache Hadoop MapReduce runner currently supports Apache Hadoop version 2.8.1.

You can add a dependency on the latest version of the Apache Hadoop MapReduce runner by adding the following to your pom.xml:

<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-mapreduce</artifactId>
  <version>2.55.1</version>
</dependency>

Deploying Apache Hadoop MapReduce with your application

To execute in a local Hadoop environment, use this command:

$ mvn exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
    -Pmapreduce-runner \
    -Dexec.args="--runner=MapReduceRunner \
      --inputFile=/path/to/pom.xml \
      --output=/path/to/counts \
      --fileOutputDir=<directory for intermediate outputs>"

To execute in a Hadoop cluster, package your program along with all dependencies in a fat jar.

If you are following through the Beam Java SDK Quickstart, you can run this command:

$ mvn package -Pflink-runner

For actually running the pipeline you would use this command

$ yarn jar word-count-beam-bundled-0.1.jar \
    org.apache.beam.examples.WordCount \
    --runner=MapReduceRunner \
    --inputFile=/path/to/pom.xml \
      --output=/path/to/counts \
      --fileOutputDir=<directory for intermediate outputs>"

Pipeline options for the Apache Hadoop MapReduce Runner

When executing your pipeline with the Apache Hadoop MapReduce Runner, you should consider the following pipeline options.

Field	Description	Default Value
`runner`	The pipeline runner to use. This option allows you to determine the pipeline runner at runtime.	Set to `MapReduceRunner` to run using Apache Hadoop MapReduce.
`jarClass`	The jar class of the user Beam program.	JarClassInstanceFactory.class
`fileOutputDir`	The directory for output files.	"/tmp/mapreduce/"

Last updated on 2024/04/25

Have you found everything you were looking for?

Was it all useful and clear? Is there anything that you would like to change? Let us know!