Using the Apache Nemo Runner

The Apache Nemo Runner can be used to execute Beam pipelines using Apache Nemo. The Nemo Runner can optimize Beam pipelines with the Nemo compiler through various optimization passes and execute them in a distributed fashion using the Nemo runtime. You can also deploy a self-contained application for local mode or run using resource managers like YARN or Mesos.

The Nemo Runner executes Beam pipelines on top of Apache Nemo, providing:

Batch and streaming pipelines
Fault-tolerance
Integration with YARN and other components of the Apache Hadoop ecosystem
Support for the various optimizations provided by the Nemo optimizer

The Beam Capability Matrix documents the supported capabilities of the Nemo Runner.

Nemo Runner prerequisites and setup

The Nemo Runner can be used simply by adding a dependency on a version of the Nemo runner newer than 0.1 to your pom.xml as follows:

<dependency>
    <groupId>org.apache.nemo</groupId>
    <artifactId>nemo-compiler-frontend-beam</artifactId>
    <version>${nemo.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>${hadoop.version}</version>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
    </exclusions>
</dependency>

Deploying Nemo with your Application

A self-contained application might be easier to manage and allows you to fully use the functionality that Nemo provides. Simply add the dependency shown above and shade the application JAR using the Maven Shade plugin:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <createDependencyReducedPom>false</createDependencyReducedPom>
    <filters>
      <filter>
        <artifact>*:*</artifact>
        <excludes>
          <exclude>META-INF/*.SF</exclude>
          <exclude>META-INF/*.DSA</exclude>
          <exclude>META-INF/*.RSA</exclude>
        </excludes>
      </filter>
    </filters>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <shadedArtifactAttached>true</shadedArtifactAttached>
        <shadedClassifierName>shaded</shadedClassifierName>
        <transformers>
          <transformer
            implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>

After running mvn package, run ls target and you should see the following output (in this example, your artifactId is beam-examples and the version is 1.0.0):

beam-examples-1.0.0-shaded.jar

With this shaded jar, you can use the bin/run_beam.sh shell script as follows:

## MapReduce example
./bin/run_beam.sh \
    -job_id mr_default \
    -user_main org.apache.nemo.examples.beam.WordCount \
    -user_args "`pwd`/examples/resources/test_input_wordcount `pwd`/examples/resources/test_output_wordcount"

To use Nemo using YARN, set the -deploy_mode flag on Nemo to yarn.

More instructions can be seen on the README of the Apache Nemo GitHub.

Pipeline Options for the Nemo Runner

When executing your pipeline with the Nemo Runner, you should consider the following pipeline options:

Field	Description	Default Value
`runner`	The pipeline runner to use. This option allows you to determine the pipeline runner at runtime.	Set to `NemoRunner` to run using Nemo.
`maxBundleSize`	The maximum number of elements in a bundle.	1000
`maxBundleTimeMillis`	The maximum time to wait before finalizing a bundle (in milliseconds).	1000

More options are to be added to the list, to fully support the various options that Nemo supports.

Additional Information and Caveats

Using the Run_beam.sh script

When submitting a Nemo application to the cluster, it is common to use the bin/run_beam.sh script that is provided within the Nemo installation. The script also provides a richer set of options that you can pass on to configure various actions of Nemo. Please refer to the Apache Nemo GitHub README for more information.

Monitoring your job

You can monitor a running Nemo job using the Nemo WebUI. The docs are currently being updated, but more information can be found on the Apache Nemo GitHub README.

Streaming Execution

Add the options -scheduler_impl_class_name org.apache.nemo.runtime.master.scheduler.StreamingScheduler and -optimization_policy org.apache.nemo.compiler.optimizer.policy.StreamingPolicy to set the Nemo Runner to streaming mode. Also, be sure to extend the capacity of the resources in the resources.json, for example:

{
  "type": "Reserved",
  "memory_mb": 2048,
  "capacity": 50000
}

Please refer to the Apache Nemo GitHub README for more information.

Last updated on 2025/07/08

Have you found everything you were looking for?

Was it all useful and clear? Is there anything that you would like to change? Let us know!