Apache Beam Contribution Guide
The Apache Beam community welcomes contributions from anyone!
There are lots of opportunities:
- ask or answer questions on email@example.com or stackoverflow
- review proposed design ideas on firstname.lastname@example.org
- improve the documentation
- contribute bug reports
- write new examples
- add new user-facing libraries (new statistical libraries, new IO connectors, etc)
- improve your favorite language SDK (Java, Python, Go, etc)
- improve specific runners (Apache Apex, Apache Flink, Apache Spark, Google Cloud Dataflow, etc)
- work on the core programming model (what is a Beam pipeline and how does it run?)
- improve the developer experience on Windows
Most importantly, if you have an idea of how to contribute, then do it!
For a list of open starter tasks, check https://s.apache.org/beam-starter-tasks.
Discussions about contributing code to Beam happen on the dev@ mailing list. Introduce yourself!
Questions can be asked on the #beam channel of the ASF Slack. Introduce yourself!
Coding happens at https://github.com/apache/beam. To contribute, follow the usual GitHub process: fork the repo, make your changes, open a pull request, and @mention a reviewer. If you have more than one commit in your change, you may be asked to rebase and squash the commits. If you are unfamiliar with this workflow, GitHub maintains helpful guides.
If your change is large or it is your first change, it is a good idea to discuss it on the dev@ mailing list.
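The rebase-and-squash request mentioned above can be rehearsed safely in a throwaway repository before touching a real branch. The sketch below is an illustration only: the temporary directory, file name, identity, and commit messages are made up, and it uses `git reset --soft` as one simple way to fold several commits into one (an interactive `git rebase -i` is a common alternative).

```shell
# Rehearse squashing two commits into one in a throwaway repository.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
# Identity used only for this scratch repository.
git config user.name "Example Contributor"
git config user.email "contributor@example.invalid"

echo "base" > notes.txt
git add notes.txt
git commit -q -m "Initial commit"
base=$(git rev-parse HEAD)

# Two work-in-progress commits standing in for a multi-commit change.
echo "first edit" >> notes.txt
git commit -q -am "WIP: first edit"
echo "second edit" >> notes.txt
git commit -q -am "WIP: second edit"

# Move the branch pointer back to the base while keeping all changes staged,
# then record them as a single commit.
git reset -q --soft "$base"
git commit -q -m "[BEAM-XXX] Describe the whole change"

commits_after_base=$(git rev-list --count "$base"..HEAD)
echo "commits on top of base: $commits_after_base"
```

On a real branch that has already been pushed, the squashed history usually has to be published with a forced push such as `git push --force-with-lease`.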
Documentation happens at https://github.com/apache/beam-site and contributions are welcome.
Large contributions require a signed Individual Contributor License Agreement (ICLA) to the Apache Software Foundation (ASF).
If you are contributing a PTransform to Beam, we have an extensive PTransform Style Guide.
Building & Testing
We use the Gradle Build Tool.
You do not need to install Gradle, but you do need a Java SDK installed. You can develop on Linux, macOS, or Microsoft Windows. There have been issues noted when developing using Windows; feel free to contribute fixes to make it easier.
Familiarize yourself with the project structure. At the root of the git repository, run:
$ ./gradlew projects
Run the entire set of tests with:
$ ./gradlew check
You can limit testing to a particular module. Gradle will build just the necessary things to run those tests. For example:
$ ./gradlew -p sdks/go check
$ ./gradlew -p sdks/java/io/cassandra check
$ ./gradlew -p runners/flink check
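When a change touches several modules, the per-module invocation can be generated in a loop. This is a minimal sketch, assuming it is used from the root of the repository; the function name `print_check_commands` is hypothetical, and it only prints the commands so they can be reviewed before being run.

```shell
# Hypothetical helper: print one "./gradlew -p <module> check" line per module.
print_check_commands() {
  for module in "$@"; do
    printf './gradlew -p %s check\n' "$module"
  done
}

# Example: list the check commands for two modules.
print_check_commands sdks/go runners/flink
```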
Examine the available tasks in a project. For the default set of tasks, use:
$ ./gradlew tasks
For a given module, use:
$ ./gradlew -p sdks/java/io/cassandra tasks
For an exhaustive list of tasks, use:
$ ./gradlew tasks --all
Developing with an IDE
Generate an IDEA project .ipr file with:
$ ./gradlew idea
When your change is ready to be reviewed and merged, create a pull request.
Format the pull request title like
[BEAM-XXX] Fixes bug in ApproximateQuantiles,
where you replace BEAM-XXX with the appropriate JIRA issue.
This will automatically link the pull request to the issue.
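A title can be checked against this convention before the pull request is opened. The sketch below is a minimal illustration: the function name `check_pr_title` is hypothetical, and the pattern (a `[BEAM-<number>]` prefix, a space, then a non-empty summary) is an assumption drawn from the example above, not an official project rule.

```shell
# Hypothetical helper: succeeds when the title starts with "[BEAM-<digits>] "
# followed by a non-empty summary.
check_pr_title() {
  printf '%s\n' "$1" | grep -Eq '^\[BEAM-[0-9]+\] [^[:space:]].*$'
}

if check_pr_title "[BEAM-1234] Fixes bug in ApproximateQuantiles"; then
  echo "title ok"
else
  echo "title should look like: [BEAM-XXX] Short summary"
fi
```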
Use @mention in the pull request to notify the reviewer.
The pull request and any changes pushed to it will trigger precommit jobs. If a test fails and appears unrelated to your change, you can cause the tests to be re-run by adding a single-line comment on your PR:
retest this please
Other trigger phrases for postcommit tests are listed in .test-infra/jenkins, but use them sparingly because postcommit tests consume shared development resources.
Developing with the Python SDK
Gradle can build and test the Python SDK and is used by the Jenkins jobs, so the Gradle build needs to be maintained.
You can also use the Python toolchain directly instead of having Gradle orchestrate it, which may be faster for you; the choice is yours. If you use Python tools directly, we recommend setting up a virtual environment before testing your code.
If you update any of the cythonized files in the Python SDK, you must install the cython package before running the following commands to properly test your code.
The following commands should be run in the
This installs the Beam Python SDK from source in editable mode and includes the test and gcp dependencies.
$ virtualenv env
$ . ./env/bin/activate
(env) $ pip install -e .[gcp,test]
> c:\Python27\python.exe -m virtualenv env
> env\Scripts\activate
(env) > pip install -e .[gcp,test]
This command runs all Python tests. The nose dependency is installed by [test] in pip install.
(env) $ python setup.py nosetests
You can use the following command to run a single test method.
(env) $ python setup.py nosetests --tests <module>:<test class>.<test method>
For example:
(env) $ python setup.py nosetests --tests apache_beam.io.textio_test:TextSourceTest.test_progress
You can deactivate the virtualenv when done.
(env) $ deactivate
$
To check just for Python lint errors, run the following command.
$ ../../gradlew lint
tox commands to run the lint tasks:
$ tox -e py27-lint   # For Python 2.7
$ tox -e py3-lint    # For Python 3
$ tox -e py27-lint3  # For Python 2-3 compatibility
This step is only required for testing SDK code changes remotely (that is, not with the DirectRunner). To do this, you must build the Beam tarball. From the root of the git repository, run:
$ cd sdks/python/
$ python setup.py sdist
Pass the --sdk_location flag to use the newly built version. For example:
$ python setup.py sdist > /dev/null && \
    python -m apache_beam.examples.wordcount ... \
    --sdk_location dist/apache-beam-2.5.0.dev0.tar.gz
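Because the tarball name includes the version, it changes over time. The sketch below picks the most recently built tarball rather than hard-coding the name. It is a minimal illustration: the function name `latest_sdk_tarball` is hypothetical, and it assumes the tarballs match `dist/apache-beam-*.tar.gz` as in the example above and that the paths contain no newlines.

```shell
# Hypothetical helper: print the newest apache-beam tarball in the given
# directory (default: dist), or fail if there is none.
latest_sdk_tarball() {
  dir=${1:-dist}
  newest=$(ls -t "$dir"/apache-beam-*.tar.gz 2>/dev/null | head -n 1)
  [ -n "$newest" ] || return 1
  printf '%s\n' "$newest"
}

if tarball=$(latest_sdk_tarball dist); then
  echo "use: --sdk_location $tarball"
else
  echo "no tarball found; run 'python setup.py sdist' first"
fi
```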
Contributing to the website
The Beam website is in the Beam Site GitHub
mirror repository in the
explains how to modify different parts of the site. The GitHub workflow is the
same - make your change and open a pull request.
Issues are tracked in the website component in JIRA.
Works in progress
A great way to contribute is to join an existing effort. There are many works in progress, some on branches because they are very incomplete.
The primary Beam vision: Any SDK on any runner. This is a cross-cutting effort across Java, Python, and Go, and every Beam runner.
Apache Spark 2.0 Runner
Python 3 Support
Work is in progress to add Python 3 support to Beam. The current goal is to make the Beam codebase compatible with both Python 2.7 and Python 3.4.
Contributions are welcome! If you are interested in helping, you can select a subpackage to port and assign yourself the corresponding issue. Comment on the issue if you cannot assign it yourself. When submitting a new PR, please tag @RobbeSneyders, @aaltay, and @tvalentyn.
Next Java LTS version support (Java 11 / 18.9)
Work to support the next LTS release of Java is in progress. For more details about the scope and info on the various tasks please see the JIRA ticket.
IO Performance Testing
We are also working on writing Performance Tests for IOs and developing a Performance Testing Framework for them. Contributions are welcome in the following areas:
- developing more IO Performance Tests (IOITs)
- providing the necessary Kubernetes infrastructure (e.g. for databases or filesystems to be used in tests)
- running Performance Tests on runners other than Dataflow and Direct
- improving the existing Performance Testing Framework and its documentation
Euphoria Java 8 DSL
An easy-to-use Java 8 DSL for the Beam Java SDK. It provides a high-level abstraction of Beam transformations that is easy to both read and write, and it can be used as a complement to existing Beam pipelines (convertible back and forth). You can get a glimpse of the API in the WordCount example.
Improving the contributor experience
Making it easier to write code, run tests, and release. Investigating using Docker for Jenkins builds, automating the release process, and improving the reliability of tests.
Beam SQL has lots of areas to contribute to: support for new operators, new connectors, performance measurement and improvement, fuller specification and testing, etc.
Add benchmarks to continuous integration
Run Nexmark benchmark queries after each commit for the Spark, Flink, and Direct runners, and export response times to performance dashboards.
Extract metrics in a runner agnostic way
Metrics are pushed by the runners to configurable sinks (an HTTP REST sink is available). This is already enabled in the Flink and Spark runners. Work is in progress for Dataflow.
Stale pull requests
The community will close stale pull requests in order to keep the project healthy. A pull request becomes stale after its author fails to respond to actionable comments for 60 days. The author of a closed pull request is welcome to reopen it in the future. The associated JIRAs will be unassigned from the author but will stay open.