This guide shows you how to set up your Python development environment, get the Apache Beam SDK for Python, and run an example pipeline.
The Beam SDK for Python requires Python version 2.7.x. Check that you have version 2.7.x by running:
python --version
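As a rough sketch, the 2.7.x requirement can also be checked from inside the interpreter. The helper name `is_supported` and the constant `REQUIRED` are illustrative and not part of the Beam SDK.

```python
import sys

# Major and minor version required by this guide (2.7.x).
REQUIRED = (2, 7)


def is_supported(version_info, required=REQUIRED):
    """Return True when the major and minor version match the required pair."""
    return tuple(version_info[:2]) == required


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if is_supported(sys.version_info):
        print("Python {} matches the 2.7.x requirement".format(current))
    else:
        print("Python {} does not match the 2.7.x requirement".format(current))
```

Run this with the same `python` executable you plan to use for Beam, since a machine may have several interpreters installed.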
Install pip, Python’s package manager. Check that you have version 7.0.0 or newer by running:
pip --version
It is recommended that you install a Python virtual environment for initial experiments. If you do not have virtualenv version 13.1.0 or newer, install it by running:
pip install --upgrade virtualenv
If you do not want to use a Python virtual environment (not recommended), ensure setuptools is installed on your machine. If you do not have setuptools version 17.1 or newer, install it by running:
pip install --upgrade setuptools
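The minimum versions named above (pip 7.0.0, virtualenv 13.1.0, setuptools 17.1) can be compared against what a tool reports with a small helper. This is a minimal sketch that assumes plain dotted-integer version strings. The names `MINIMUM_VERSIONS`, `parse_version`, and `meets_minimum` are hypothetical and do not come from pip, virtualenv, or Beam.

```python
# Minimum versions stated in this guide.
MINIMUM_VERSIONS = {
    "pip": "7.0.0",
    "virtualenv": "13.1.0",
    "setuptools": "17.1",
}


def parse_version(text, width=3):
    """Turn '17.1' into (17, 1, 0); only plain dotted integers are handled."""
    parts = [int(part) for part in text.strip().split(".")]
    # Pad short versions with zeros so '17.1' and '17.1.0' compare equal.
    parts += [0] * (width - len(parts))
    return tuple(parts)


def meets_minimum(tool, installed):
    """Return True when the installed version is at least the guide's minimum."""
    return parse_version(installed) >= parse_version(MINIMUM_VERSIONS[tool])


if __name__ == "__main__":
    # Example: the version string would come from `pip --version` output.
    print(meets_minimum("pip", "9.0.1"))
```

Versions with suffixes such as `rc1` or `.post2` are outside what this sketch handles.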
A virtual environment is a directory tree containing its own Python distribution. To create a virtual environment, create a directory and run:
virtualenv /path/to/directory
A virtual environment needs to be activated for each shell that is to use it. Activating it sets some environment variables that point to the virtual environment’s directories.
To activate a virtual environment in Bash, run:
. /path/to/directory/bin/activate
That is, source the script bin/activate under the virtual environment directory you created.
For instructions using other shells, see the virtualenv documentation.
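To confirm that activation took effect in the current shell, the interpreter can be asked where it is running from. The following is a best-effort sketch. The function name `in_virtual_environment` is illustrative, and the check relies on two conventions: older virtualenv releases set `sys.real_prefix`, while newer environments make `sys.prefix` differ from `sys.base_prefix`. The activate script also exports the `VIRTUAL_ENV` environment variable.

```python
import os
import sys


def in_virtual_environment():
    """Best-effort check for whether this interpreter runs inside a virtual environment."""
    # Older virtualenv releases record the original prefix in sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # Newer environments expose sys.base_prefix, which differs from sys.prefix.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


if __name__ == "__main__":
    print("Interpreter prefix: {}".format(sys.prefix))
    print("VIRTUAL_ENV: {}".format(os.environ.get("VIRTUAL_ENV", "<not set>")))
    print("Inside a virtual environment: {}".format(in_virtual_environment()))
```

If the last line prints `False` after activation, the `python` on your `PATH` is probably not the one inside the environment directory.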
Install the latest Python SDK from PyPI:
pip install apache-beam
The above installation does not install the extra dependencies needed for features such as the Google Cloud Dataflow runner. The extra packages required for different features are listed below. You can install multiple extras at once with a command of the form:
pip install apache-beam[feature1,feature2]
pip install apache-beam[gcp]
pip install apache-beam[test]
pip install apache-beam[docs]
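When installation is scripted, the requirement string for one or more extras can be assembled programmatically. This is a small sketch. The helper names `extras_requirement` and `pip_install_command` are hypothetical, and the command list is only built here, not executed.

```python
import sys


def extras_requirement(package, extras):
    """Build a requirement such as 'apache-beam[gcp,test]' with no spaces in the brackets."""
    if not extras:
        return package
    return "{}[{}]".format(package, ",".join(extras))


def pip_install_command(package, extras):
    """Return an argument list that runs pip with the current interpreter."""
    return [sys.executable, "-m", "pip", "install", extras_requirement(package, extras)]


if __name__ == "__main__":
    # Pass the list to subprocess.check_call to actually run the installation.
    print(pip_install_command("apache-beam", ["gcp", "test"]))
```

Passing the arguments as a list avoids shell quoting issues with the square brackets.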
The Apache Beam examples directory has many examples. All examples can be run locally by passing the required arguments described in the example script.
For example, to run the wordcount example locally, run:
python -m apache_beam.examples.wordcount --input <PATH_TO_INPUT_FILE> --output counts
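To give a sense of what the wordcount example computes, here is a plain-Python sketch of the same idea: split each line into words and count the occurrences. It does not use Beam and is not the example's actual source. The tokenization pattern and the `word: count` output format are assumptions made for illustration, as are the function names.

```python
import re
from collections import Counter

# Assumed tokenization: runs of letters and apostrophes count as words.
WORD_PATTERN = re.compile(r"[A-Za-z']+")


def count_words(lines):
    """Count how often each word appears across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(WORD_PATTERN.findall(line))
    return counts


def format_counts(counts):
    """Render counts as sorted 'word: count' strings, one per word."""
    return ["{}: {}".format(word, total) for word, total in sorted(counts.items())]


if __name__ == "__main__":
    for row in format_counts(count_words(["to be or not to be"])):
        print(row)
```

A Beam pipeline expresses the same steps as transforms over a distributed collection, which is what allows the identical example to run locally or on a remote runner.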
To run the wordcount example on the Google Cloud Dataflow runner:
# As part of the initial setup, install Google Cloud Platform specific extra components.
pip install apache-beam[gcp]
python -m apache_beam.examples.wordcount --input gs://dataflow-samples/shakespeare/kinglear.txt \
                                         --output gs://<your-gcs-bucket>/counts \
                                         --runner DataflowRunner \
                                         --project your-gcp-project \
                                         --temp_location gs://<your-gcs-bucket>/tmp/
Please don’t hesitate to reach out if you encounter any issues!