Apache Beam Contribution Guide

The Apache Beam community welcomes contributions from anyone!

There are lots of opportunities:

Most importantly, if you have an idea of how to contribute, then do it!

For a list of open starter tasks, check https://s.apache.org/beam-starter-tasks.

Contributing code

Discussions about contributing code to Beam happen on the dev@ mailing list. Introduce yourself!

Questions can be asked on the #beam channel of the ASF Slack. Introduce yourself!

Coding happens at https://github.com/apache/beam. To contribute, follow the usual GitHub process: fork the repo, make your changes, open a pull request, and @mention a reviewer. If you have more than one commit in your change, you may be asked to rebase and squash the commits. If you are unfamiliar with this workflow, GitHub maintains helpful guides on it.
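As a rough illustration of the squash step, the sketch below builds a throwaway repository, makes two commits on a branch, and collapses them into one with git reset --soft. The function name, branch name, identity, and commit messages are placeholders chosen for this example, not part of the Beam workflow, and this is only one of several ways to squash.

```shell
# Squash two commits on a branch into one, inside a temporary repository.
# Everything named here (squash_demo, my-fix, the identity) is illustrative.
squash_demo() (
  repo=$(mktemp -d) && cd "$repo" || exit 1
  git init -q
  git config user.name "Example Contributor"
  git config user.email "contributor@example.com"

  echo "base" > file.txt
  git add file.txt
  git commit -q -m "Base commit"
  # Name of the initial branch (master or main, depending on git settings).
  base=$(git rev-parse --abbrev-ref HEAD)

  git checkout -q -b my-fix
  echo "first attempt" >> file.txt
  git commit -q -a -m "First attempt"
  echo "review fixes" >> file.txt
  git commit -q -a -m "Address review comments"

  # Move the branch pointer back to the base but keep all changes staged,
  # then record them as a single commit.
  git reset -q --soft "$base"
  git commit -q -m "[BEAM-XXX] Fixes bug in ApproximateQuantiles"

  # Count the commits on my-fix that are not on the base branch.
  count=$(git rev-list --count "$base"..HEAD)
  cd / && rm -rf "$repo"
  echo "$count"
)

squash_demo
```

On a real pull request branch, the squashed history then has to be force-pushed to your fork (for example with git push --force-with-lease). An interactive rebase (git rebase -i) is a common alternative when you want to keep some commits separate.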

If your change is large or it is your first change, it is a good idea to discuss it on the dev@ mailing list.

For large changes, you may be asked to create a design doc (template, examples).

Documentation happens at https://github.com/apache/beam-site and contributions are welcome.

Large contributions require a signed Individual Contributor License Agreement (ICLA) to the Apache Software Foundation (ASF).

If you are contributing a PTransform to Beam, we have an extensive PTransform Style Guide.

Building & Testing

We use the Gradle Build Tool.

You do not need to install Gradle, but you do need a Java SDK installed. You can develop on Linux, macOS, or Microsoft Windows. There have been issues noted when developing using Windows; feel free to contribute fixes to make it easier.

Familiarize yourself with the project structure. At the root of the git repository, run:

$ ./gradlew projects

Run the entire set of tests with:

$ ./gradlew check

You can limit testing to a particular module. Gradle will build just the necessary things to run those tests. For example:

$ ./gradlew -p sdks/go check
$ ./gradlew -p sdks/java/io/cassandra check
$ ./gradlew -p runners/flink check
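If you regularly touch several modules, a small wrapper can run the per-module checks in sequence and stop at the first failure. The sketch below is only a convenience idea: check_modules and the GRADLEW override variable are hypothetical names introduced here, and the function assumes it is run from the root of the git repository.

```shell
# Run "check" for each module path given as an argument.
# GRADLEW is a hypothetical override; it defaults to ./gradlew and can be set
# to "echo ./gradlew" to print the commands instead of running them.
check_modules() {
  for module in "$@"; do
    # Left unquoted on purpose so "echo ./gradlew" splits into two words.
    ${GRADLEW:-./gradlew} -p "$module" check || return 1
  done
}

# Dry run: print the commands that would be executed.
GRADLEW="echo ./gradlew"
check_modules sdks/go sdks/java/io/cassandra runners/flink
```

Unset GRADLEW (or leave it undefined) to run the real builds.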

Examine the available tasks in a project. For the default set of tasks, use:

$ ./gradlew tasks

For a given module, use:

$ ./gradlew -p sdks/java/io/cassandra tasks

For an exhaustive list of tasks, use:

$ ./gradlew tasks --all

We run integration and performance tests using Jenkins. The job definitions are available in the Beam GitHub repository.

Developing with an IDE

Generate an IDEA project .ipr file with:

$ ./gradlew idea

Pull requests

When your change is ready to be reviewed and merged, create a pull request. Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue. This will automatically link the pull request to the issue.
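A quick local check of the title format can catch a missing issue tag before you open the pull request. The sketch below is an assumption about what "well formed" means: it accepts a title that starts with [BEAM-<digits>] followed by a space and a description. The function name is hypothetical, the issue number is made up, and this is not a check that the project itself runs.

```shell
# Succeeds if the title looks like "[BEAM-<digits>] <description>".
# The pattern is an illustrative assumption, not an official Beam rule.
is_valid_pr_title() {
  printf '%s\n' "$1" | grep -Eq '^\[BEAM-[0-9]+\] .+'
}

# 1234 is a placeholder issue number.
if is_valid_pr_title "[BEAM-1234] Fixes bug in ApproximateQuantiles"; then
  echo "title ok"
else
  echo "title should start with [BEAM-<issue number>]"
fi
```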

Pull requests can only be merged by a Beam committer. To find a committer for your area, look for similar code merges or ask on dev@beam.apache.org.

Use @mention in the pull request to notify the reviewer.

The pull request and any changes pushed to it will trigger precommit jobs. If a test fails and appears unrelated to your change, you can cause tests to be re-run by adding a single-line comment on your PR:

 retest this please

There are other trigger phrases for post-commit tests found in .testinfra/jenkins, but use these sparingly because postcommit tests consume shared development resources.

Developing with the Python SDK

Gradle can build and test the Python SDK. Because the Jenkins jobs rely on it, the Gradle build for Python must be kept working.

You can also use the Python toolchain directly instead of having Gradle orchestrate it, which may be faster; the choice is yours. If you do use the Python tools directly, we recommend setting up a virtual environment before testing your code.

If you update any of the cythonized files in the Python SDK, you must install the cython package before running the following commands so that your changes are properly tested.

The following commands should be run in the sdks/python directory. They install the Beam Python SDK from source, including the test and gcp dependencies.

On macOS/Linux:

$ virtualenv env
$ . ./env/bin/activate
(env) $ pip install -e .[gcp,test]

On Windows:

> c:\Python27\python.exe -m virtualenv env
> env\Scripts\activate
(env) > pip install -e .[gcp,test]

The following command runs all Python tests. The nose dependency is installed by the [test] extra in the pip install step.

(env) $ python setup.py nosetests

You can use the following command to run a single test method.

(env) $ python setup.py nosetests --tests <module>:<test class>.<test method>

For example:

(env) $ python setup.py nosetests --tests apache_beam.io.textio_test:TextSourceTest.test_progress

You can deactivate the virtualenv when done.

(env) $ deactivate

To check just for Python lint errors, run the following command.

$ ../../gradlew lint

Or use tox commands to run the lint tasks:

$ tox -e py27-lint    # For python 2.7
$ tox -e py3-lint     # For python 3
$ tox -e py27-lint3   # For python 2-3 compatibility

Remote testing

This step is only required for testing SDK code changes remotely (that is, not using the DirectRunner). To do this, you must build the Beam tarball. From the root of the git repository, run:

$ cd sdks/python/
$ python setup.py sdist

Pass the --sdk_location flag to use the newly built version. For example:

$ python setup.py sdist > /dev/null && \
    python -m apache_beam.examples.wordcount ... \
        --sdk_location dist/apache-beam-2.5.0.dev0.tar.gz
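The tarball name changes with every version bump, so hard-coding it can go stale. One option, sketched below, is to pick the most recently modified tarball in dist/. The helper name latest_sdk_tarball is hypothetical, and the sketch assumes the tarballs follow the apache-beam-*.tar.gz naming shown above and that file names contain no newlines.

```shell
# Print the most recently modified Beam source tarball in a directory
# (defaults to ./dist). Prints nothing if no tarball is found.
latest_sdk_tarball() {
  ls -t "${1:-dist}"/apache-beam-*.tar.gz 2>/dev/null | head -n 1
}

# Example: show which tarball would be passed to --sdk_location.
echo "sdk_location: $(latest_sdk_tarball)"
```

The result can then be passed along, for example as --sdk_location "$(latest_sdk_tarball)".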

Contributing to the website

The Beam website is in the Beam Site GitHub mirror repository, in the asf-site branch (not master). The README there explains how to modify different parts of the site. The GitHub workflow is the same: make your change and open a pull request.

Issues are tracked in the website component in JIRA.

Works in progress

A great way to contribute is to join an existing effort. There are many works in progress, some on branches because they are very incomplete.

Portability Framework

The primary Beam vision: Any SDK on any runner. This is a cross-cutting effort across Java, Python, and Go, and every Beam runner.

Apache Spark 2.0 Runner

JStorm Runner

MapReduce Runner

Tez Runner


Python 3 Support

Work is in progress to add Python 3 support to Beam. The current goal is to make the Beam codebase compatible with both Python 2.7 and Python 3.4.

Contributions are welcome! If you are interested in helping, you can select a subpackage to port and assign yourself the corresponding issue. Comment on the issue if you cannot assign it yourself. When submitting a new PR, please tag @RobbeSneyders, @aaltay, and @tvalentyn.

Next Java LTS version support (Java 11 / 18.9)

Work to support the next LTS release of Java is in progress. For more details about the scope and the various tasks, please see the JIRA ticket.

IO Performance Testing

We are also working on writing Performance Tests for IOs and developing a Performance Testing Framework for them. Contributions are welcome in the following areas:

See the documentation and the initial proposal (for file-based tests).

If you’re willing to help in this area, tag the following people in PRs: @chamikaramj, @DariuszAniszewski, @lgajowy, @szewi, @kkucharc.

Euphoria Java 8 DSL

Euphoria is an easy-to-use Java 8 DSL for the Beam Java SDK. It provides a high-level abstraction of Beam transformations that is easy to both read and write, and it can be used as a complement to existing Beam pipelines (convertible back and forth). You can get a glimpse of the API in the WordCount example.

Improving the contributor experience

This effort aims to make it easier to write code, run tests, and release. Current work includes investigating Docker for Jenkins builds, automating the release process, and improving the reliability of tests.

Ideas and help are welcome! Contact: Alan Myrvold, Mark Liu, Yifan Zou.

Beam SQL

Beam SQL has lots of areas to contribute to: support for new operators, new connectors, performance measurement and improvement, fuller specification and testing, etc.

Add benchmarks to continuous integration

Run Nexmark benchmark queries after each commit for the Spark, Flink, and Direct runners, and export response times to performance dashboards.

Extract metrics in a runner agnostic way

Metrics are pushed by the runners to configurable sinks (an HTTP REST sink is available). This is already enabled in the Flink and Spark runners. Work is in progress for Dataflow.

Stale pull requests

The community will close stale pull requests in order to keep the project healthy. A pull request becomes stale after its author fails to respond to actionable comments for 60 days. The author of a closed pull request is welcome to reopen it in the future. The associated JIRA issues will be unassigned from the author but will stay open.