Apache Beam Java SDK quickstart

This quickstart shows you how to run an example pipeline written with the Apache Beam Java SDK, using the Direct Runner. The Direct Runner executes pipelines locally on your machine.

If you’re interested in contributing to the Apache Beam Java codebase, see the Contribution Guide.

Set up your development environment

Use sdkman to install the Java Development Kit (JDK).

# Install sdkman
curl -s "https://get.sdkman.io" | bash

# Install Java 17
sdk install java 17.0.5-tem

You can use either Gradle or Apache Maven to run this quickstart:

# Install Gradle
sdk install gradle

# Install Maven
sdk install maven
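
To confirm that the tools are available on your PATH, you can check their versions:

# Verify the installations
java -version
gradle --version
mvn -version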

Clone the GitHub repository

Clone or download the apache/beam-starter-java GitHub repository and change into the beam-starter-java directory.

git clone https://github.com/apache/beam-starter-java.git
cd beam-starter-java

Run the quickstart

Gradle: To run the quickstart with Gradle, run the following command:

gradle run --args='--inputText=Greetings'

Maven: To run the quickstart with Maven, run the following command:

mvn compile exec:java -Dexec.args=--inputText='Greetings'

The output is similar to the following:

Hello
World!
Greetings

The lines might appear in a different order, because Beam does not guarantee the order in which elements are processed.

Explore the code

The main code file for this quickstart is App.java (GitHub). The code performs the following steps:

  1. Create a Beam pipeline.
  2. Create an initial PCollection.
  3. Apply a transform to the PCollection.
  4. Run the pipeline, using the Direct Runner.
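
Taken together, the pipeline amounts to roughly the following condensed sketch, assembled from the snippets explained below (the actual App.java may organize this into helper methods and a custom Options interface for --inputText):

// Condensed sketch; inputText would come from the --inputText pipeline option.
var options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
var pipeline = Pipeline.create(options);

pipeline
    .apply("Create elements", Create.of(Arrays.asList("Hello", "World!", inputText)))
    .apply("Print elements",
        MapElements.into(TypeDescriptors.strings()).via(x -> {
          System.out.println(x);
          return x;
        }));

pipeline.run().waitUntilFinish();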

Create a pipeline

The code first creates a Pipeline object. The Pipeline object builds up the graph of transformations to be executed.

var options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
var pipeline = Pipeline.create(options);

The PipelineOptions object lets you set various options for the pipeline. The fromArgs method shown in this example parses command-line arguments, which lets you set pipeline options through the command line.
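
Options here is a small interface that extends Beam's PipelineOptions and declares the --inputText flag. As a rough sketch of what such an interface looks like (the annotations come from org.apache.beam.sdk.options; the exact interface in App.java may differ slightly):

public interface Options extends PipelineOptions {
  @Description("Input text to print.")
  @Default.String("My input text")
  String getInputText();

  void setInputText(String value);
}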

Create an initial PCollection

The PCollection abstraction represents a potentially distributed, multi-element data set. A Beam pipeline needs a source of data to populate an initial PCollection. The source can be bounded (with a known, fixed size) or unbounded (with unlimited size).

This example uses the Create.of method to create a PCollection from an in-memory array of strings. The resulting PCollection contains the strings “Hello”, “World!”, and a user-provided input string.

return pipeline
  .apply("Create elements", Create.of(Arrays.asList("Hello", "World!", inputText)))

Apply a transform to the PCollection

Transforms can change, filter, group, analyze, or otherwise process the elements in a PCollection. This example uses the MapElements transform, which maps the elements of a collection into a new collection:

.apply("Print elements",
    MapElements.into(TypeDescriptors.strings()).via(x -> {
      System.out.println(x);
      return x;
    }));

In this example, the mapping function is a lambda that just returns the original value. It also prints the value to System.out as a side effect.
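
In a more typical pipeline, the mapping function would produce a new value instead of passing the input through. For example, a hypothetical step that upper-cases each string could look like this:

.apply("Uppercase elements",
    MapElements.into(TypeDescriptors.strings()).via((String x) -> x.toUpperCase()))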

Run the pipeline

The code shown in the previous sections defines a pipeline, but does not process any data yet. To process data, you run the pipeline:

pipeline.run().waitUntilFinish();
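
run returns a PipelineResult, and waitUntilFinish blocks until execution completes. If you want to inspect the outcome, you can capture the result, roughly like this:

PipelineResult result = pipeline.run();
PipelineResult.State state = result.waitUntilFinish();
System.out.println("Pipeline finished with state: " + state);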

A Beam runner runs a Beam pipeline on a specific platform. This example uses the Direct Runner, which is the default runner if you don’t specify one. The Direct Runner runs the pipeline locally on your machine. It is meant for testing and development, rather than being optimized for efficiency. For more information, see Using the Direct Runner.

For production workloads, you typically use a distributed runner that runs the pipeline on a big data processing system such as Apache Flink, Apache Spark, or Google Cloud Dataflow. These systems support massively parallel processing.
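
You can select a runner at the command line with the --runner pipeline option, without changing the code, as long as that runner's dependencies are on the classpath. For example, to name the Direct Runner explicitly when running with Gradle:

gradle run --args='--inputText=Greetings --runner=DirectRunner'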

Next steps

Please don’t hesitate to reach out if you encounter any issues!