Apache Beam Typescript SDK quickstart

This quickstart shows you how to run an example pipeline written with the Apache Beam Typescript SDK, using the Direct Runner. The Direct Runner executes pipelines locally on your machine.

If you’re interested in contributing to the Apache Beam Typescript codebase, see the Contribution Guide.

On this page:

Set up your development environment

Make sure you have a Node.js development environment installed. If you don’t, you can download and install it from the downloads page.

Due to its extensive use of cross-language transforms, it is recommended that Python 3 and Java be available on the system as well.

Clone the GitHub repository

Clone or download the apache/beam-starter-typescript GitHub repository and change into the beam-starter-typescript directory.

git clone https://github.com/apache/beam-starter-typescript.git
cd beam-starter-typescript

Install the project dependences

Run the following command to install the project’s dependencies.

npm install

Compile the pipeline

The pipeline is then built with

npm run build

Run the quickstart

Run the following command:

node dist/src/main.js --input_text="Greetings"

The output is similar to the following:

Hello
World!
Greetings

The lines might appear in a different order.

Explore the code

The main code file for this quickstart is app.ts (GitHub). The code performs the following steps:

  1. Define a Beam pipeline that.
  1. Run the pipeline, using the Direct Runner.

Create a pipeline

A Pipeline is simply a callable that takes a single root object. The Pipeline function builds up the graph of transformations to be executed.

Create an initial PCollection

The PCollection abstraction represents a potentially distributed, multi-element data set. A Beam pipeline needs a source of data to populate an initial PCollection. The source can be bounded (with a known, fixed size) or unbounded (with unlimited size).

This example uses the Create method to create a PCollection from an in-memory array of strings. The resulting PCollection contains the strings “Hello”, “World!”, and a user-provided input string.

root.apply(beam.create(["Hello", "World!", input_text]))

Apply a transform to the PCollection

Transforms can change, filter, group, analyze, or otherwise process the elements in a PCollection. This example uses the Map transform, which maps the elements of a collection into a new collection:

.map(printAndReturn);

For convenience, PColletion has a map method, but more generally transforms are applied with .apply(someTransform()).

Run the pipeline

To run the pipeline, a runner is created (possibly with some options)

createRunner(options)

and then its run method is invoked on the pipeline callable created above.

.run(createPipeline(...));

A Beam runner runs a Beam pipeline on a specific platform. If you don’t specify a runner, the Direct Runner is the default. The Direct Runner runs the pipeline locally on your machine. It is meant for testing and development, rather than being optimized for efficiency. For more information, see Using the Direct Runner.

For production workloads, you typically use a distributed runner that runs the pipeline on a big data processing system such as Apache Flink, Apache Spark, or Google Cloud Dataflow. These systems support massively parallel processing. Different runners can be requested via the runner property on options, e.g. createRunner({runner: "dataflow"}) or createRunner({runner: "flink"}). In this example this value can be passed in via the command line as --runner=..., e.g. to run on Dataflow one would write

node dist/src/main.js \
    --runner=dataflow \
    --project=${PROJECT_ID} \
    --tempLocation=gs://${GCS_BUCKET}/wordcount-js/temp --region=${REGION}

Next Steps

Please don’t hesitate to reach out if you encounter any issues!