Apache Beam Python SDK quickstart

This quickstart shows you how to run an example pipeline written with the Apache Beam Python SDK, using the Direct Runner. The Direct Runner executes pipelines locally on your machine.

If you’re interested in contributing to the Apache Beam Python codebase, see the Contribution Guide.

Set up your development environment

Apache Beam aims to work on released Python versions that have not yet reached end of life, but it may take a few releases until Apache Beam fully supports the most recently released Python minor version.

The minimum required Python version is listed in the Meta section of the apache-beam project page under Requires. All supported Python versions appear in the Classifiers section at the bottom of the page, under Programming Language.

Check your Python version by running:

python3 --version
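
The same check can also be made from inside Python, for example in a setup script. The following is a minimal sketch, not part of the starter project: meets_minimum is a hypothetical helper, and the (3, 9) floor is a placeholder assumption rather than Beam's actual requirement. Take the real minimum from the apache-beam project page.

```python
import sys


def meets_minimum(version, minimum):
    """Return True if the (major, minor) part of `version` is at least `minimum`."""
    # Tuples compare element by element, so (3, 11) >= (3, 9) is True.
    return tuple(version[:2]) >= tuple(minimum)


# Placeholder floor for illustration only; look up the real value on the project page.
ASSUMED_MINIMUM = (3, 9)

if __name__ == "__main__":
    ok = meets_minimum(sys.version_info, ASSUMED_MINIMUM)
    status = "meets" if ok else "is below"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} {status} "
          f"the assumed minimum {ASSUMED_MINIMUM}")
```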

If you don’t have a Python interpreter, you can download and install it from the Python downloads page.

If you need to install a different version of Python in addition to the version that you already have, you can find some recommendations in our Developer Wiki.

Clone the GitHub repository

Clone or download the apache/beam-starter-python GitHub repository and change into the beam-starter-python directory.

git clone https://github.com/apache/beam-starter-python.git
cd beam-starter-python

Create and activate a virtual environment

A virtual environment is a directory tree containing its own Python distribution. We recommend using a virtual environment so that all dependencies of your project are installed in an isolated and self-contained environment. To set up a virtual environment, run the following commands:

# Create a new Python virtual environment.
python3 -m venv env

# Activate the virtual environment.
source env/bin/activate
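
To confirm that the activation took effect, you can ask the interpreter itself. The sketch below relies on the fact that, inside an environment created by venv, sys.prefix points at the environment while sys.base_prefix still points at the base installation. The helper name in_virtualenv is made up for this example.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-created environment."""
    # Outside a virtual environment the two prefixes are identical.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
```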

If these commands do not work on your platform, see the venv documentation.

Install the project dependencies

Run the following command to install the project’s dependencies from the requirements.txt file:

pip install -e .

Run the quickstart

Run the following command:

python main.py --input-text="Greetings"

The output is similar to the following:

Hello
World!
Greetings

The lines might appear in a different order.

Run the following command to deactivate the virtual environment:

deactivate

Explore the code

The main code file for this quickstart is app.py (GitHub). The code performs the following steps:

Create a Beam pipeline.
Create an initial PCollection.
Apply a transform to the PCollection.
Run the pipeline, using the Direct Runner.

Create a pipeline

The code first creates a Pipeline object. The Pipeline object builds up the graph of transformations to be executed.

with beam.Pipeline(options=beam_options) as pipeline:

The beam_options variable shown here is a PipelineOptions object, which is used to set options for the pipeline. For more information, see Configuring pipeline options.

Create an initial PCollection

The PCollection abstraction represents a potentially distributed, multi-element data set. A Beam pipeline needs a source of data to populate an initial PCollection. The source can be bounded (with a known, fixed size) or unbounded (with unlimited size).

This example uses the Create transform to create a PCollection from an in-memory list of strings. The resulting PCollection contains the strings “Hello”, “World!”, and a user-provided input string.

pipeline
| "Create elements" >> beam.Create(["Hello", "World!", input_text])

Note: The pipe operator | is used to chain transforms.

Apply a transform to the PCollection

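Transforms can change, filter, group, analyze, or otherwise process the elements in a PCollection. This example uses the Map transform, which applies a function to each element of a collection and produces a new collection from the results. Here the function is print, so each element is written to standard output: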

| "Print elements" >> beam.Map(print)

Run the pipeline

To run the pipeline, you can call the Pipeline.run method:

pipeline.run().wait_until_finish()

However, if you enclose the Pipeline object in a with statement, the run method is invoked automatically when the block exits.

with beam.Pipeline(options=beam_options) as pipeline:
    # ...
    # run() is called automatically

A Beam runner runs a Beam pipeline on a specific platform. If you don’t specify a runner, the Direct Runner is the default. The Direct Runner runs the pipeline locally on your machine. It is meant for testing and development, rather than being optimized for efficiency. For more information, see Using the Direct Runner.

For production workloads, you typically use a distributed runner that runs the pipeline on a big data processing system such as Apache Flink, Apache Spark, or Google Cloud Dataflow. These systems support massively parallel processing.

Next steps

Learn more about the Beam SDK for Python and look through the Python SDK API reference.
Take a self-paced tour through our Learning Resources.
Dive into some of our favorite Videos and Podcasts.
Join the Beam users@ mailing list.

Please don’t hesitate to reach out if you encounter any issues!

Last updated on 2024/07/26
