Using the Apache Samza Runner

The Apache Samza Runner can be used to execute Beam pipelines using Apache Samza. The Samza Runner executes Beam pipeline in a Samza application and can run locally. The application can further be built into a .tgz file, and deployed to a YARN cluster or Samza standalone cluster with Zookeeper.

The Samza Runner and Samza are suitable for large scale, stateful streaming jobs, and provide:

The Beam Capability Matrix documents the currently supported capabilities of the Samza Runner.

Samza Runner prerequisites and setup

The Samza Runner is built on Samza version greater than 0.14.1.

Specify your dependency

You can specify your dependency on the Samza Runner by adding the following to your pom.xml:

<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-samza</artifactId>
  <version>2.8.0</version>
  <scope>runtime</scope>
</dependency>

<!-- Samza dependencies -->
<dependency>
  <groupId>org.apache.samza</groupId>
  <artifactId>samza-api</artifactId>
  <version>${samza.version}</version>
</dependency>

<dependency>
  <groupId>org.apache.samza</groupId>
  <artifactId>samza-core_2.11</artifactId>
  <version>${samza.version}</version>
</dependency>

<dependency>
  <groupId>org.apache.samza</groupId>
  <artifactId>samza-kafka_2.11</artifactId>
  <version>${samza.version}</version>
  <scope>runtime</scope>
</dependency>

<dependency>
  <groupId>org.apache.samza</groupId>
  <artifactId>samza-kv_2.11</artifactId>
  <version>${samza.version}</version>
  <scope>runtime</scope>
</dependency>

<dependency>
  <groupId>org.apache.samza</groupId>
  <artifactId>samza-kv-rocksdb_2.11</artifactId>
  <version>${samza.version}</version>
  <scope>runtime</scope>
</dependency>
    

Executing a pipeline with Samza Runner

If you run your pipeline locally or deploy it to a standalone cluster with all the jars and resource files, no packaging is required. For example, the following command runs the WordCount example:

$ mvn exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
    -Psamza-runner \
    -Dexec.args="--runner=SamzaRunner \
      --inputFile=/path/to/input \
      --output=/path/to/counts"

To deploy your pipeline to a YARN cluster, here is the instructions of deploying a sample Samza job. First you need to package your application jars and resource files into a .tgz archive file, and make it available to download for Yarn containers. In your config, you need to specify the URI of this TGZ file location:

yarn.package.path=${your_job_tgz_URI}

job.name=${your_job_name}
job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
job.coordinator.system=${job_coordinator_system}
job.default.system=${job_default_system}

For more details on the configuration, see Samza Configuration Reference.

The config file will be passed in by setting the command line arg --configFilePath=/path/to/config.properties. With that, you can run your main class of Beam pipeline in a Yarn Resource Manager, and the Samza Runner will submit a Yarn job under the hood.

Pipeline options for the Samza Runner

When executing your pipeline with the Samza Runner, you can use the following pipeline options.

Field Description Default Value
runner The pipeline runner to use. This option allows you to determine the pipeline runner at runtime. Set to SamzaRunner to run using Samza.
configFilePath The config for Samza using a properties file. empty, i.e. use local execution.
configOverride The config override to set programmatically. empty, i.e. use config file or local execution.
watermarkInterval The interval to check for watermarks in milliseconds. 1000
systemBufferSize The maximum number of messages to buffer for a given system. 5000
maxSourceParallelism The maximum parallelism allowed for any data source. 1
storeBatchGetSize The batch get size limit for the state store. 10000

Monitoring your job

You can monitor your pipeline job using metrics emitted from both Beam and Samza, e.g. Beam source metrics such as elements_read and backlog_elements, and Samza job metrics such as job-healthy and process-envelopes. A complete list of Samza metrics is in Samza Metrics Reference. You can view your job’s metrics via JMX in development, and send the metrics to graphing system such as Graphite. For more details, please see Samza Metrics.

For a running Samza YARN job, you can use YARN web UI to monitor the job status and check logs.