Apache Beam Contribution Guide
The Apache Beam community welcomes contributions from anyone!
There are lots of opportunities:
- ask or answer questions on firstname.lastname@example.org or Stack Overflow
- review proposed design ideas on email@example.com
- improve the documentation
- contribute bug reports
- write new examples
- add new user-facing libraries (new statistical libraries, new IO connectors, etc.)
- improve your favorite language SDK (Java, Python, Go, etc.)
- improve specific runners (Apache Apex, Apache Flink, Apache Spark, Google Cloud Dataflow, etc.)
- work on the core programming model (what is a Beam pipeline and how does it run?)
- improve the developer experience on Windows
Most importantly, if you have an idea of how to contribute, then do it!
For a list of open starter tasks, check https://s.apache.org/beam-starter-tasks.
Anyone can access the Beam issue tracker (JIRA) and browse issues. Anyone can register an account and log in to create issues or add comments. Only contributors can be assigned issues. If you want to be assigned issues, a PMC member can add you to the project contributor group. Email the dev@ mailing list to ask to be added as a contributor in the Beam issue tracker.
Discussions about contributing code to Beam happen on the dev@ mailing list. Introduce yourself!
Questions can be asked on the #beam channel of the ASF Slack. Introduce yourself!
Coding happens at https://github.com/apache/beam. To contribute, follow the usual GitHub process: fork the repo, make your changes, open a pull request, and @mention a reviewer. If your change has more than one commit, you may be asked to rebase and squash the commits. If you are unfamiliar with this workflow, GitHub maintains helpful guides.
If your change is large or it is your first change, it is a good idea to discuss it on the dev@ mailing list.
Documentation happens at https://github.com/apache/beam-site and contributions are welcome.
Large contributions require a signed Individual Contributor License Agreement (ICLA) to the Apache Software Foundation (ASF).
If you are contributing a PTransform to Beam, we have an extensive PTransform Style Guide.
Building & Testing
We use the Gradle Build Tool.
You do not need to install Gradle, but you do need a Java SDK installed. You can develop on Linux, macOS, or Microsoft Windows. There have been issues noted when developing using Windows; feel free to contribute fixes to make it easier.
Familiarize yourself with the project structure. At the root of the git repository, run:
$ ./gradlew projects
Run the entire set of tests with:
$ ./gradlew check
You can limit testing to a particular module. Gradle will build just the necessary things to run those tests. For example:
$ ./gradlew -p sdks/go check
$ ./gradlew -p sdks/java/io/cassandra check
$ ./gradlew -p runners/flink check
Examine the available tasks in a project. For the default set of tasks, use:
$ ./gradlew tasks
For a given module, use:
$ ./gradlew -p sdks/java/io/cassandra tasks
For an exhaustive list of tasks, use:
$ ./gradlew tasks --all
You might get an OutOfMemoryError during the Gradle build. If you have more memory available, you can try to increase the memory allocation of the Gradle JVM. Otherwise, disabling parallel test execution reduces memory consumption. In the root of the Beam source, edit the gradle.properties file and add or modify the following lines:
org.gradle.parallel=false
org.gradle.jvmargs=-Xmx2g -XX:MaxPermSize=512m
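If you change these settings often, the edit to gradle.properties can be scripted. The following is a minimal sketch, not part of the Beam tooling: the helper name set_properties is hypothetical, and it assumes simple key=value lines without line continuations.

```python
def set_properties(text, updates):
    """Return properties text with each key in `updates` set to its value.

    Hypothetical helper: an existing `key=value` line is replaced in place,
    and keys that are not present yet are appended at the end.
    Assumes simple one-line entries (no backslash continuations).
    """
    remaining = dict(updates)
    lines = []
    for line in text.splitlines():
        key = line.split("=", 1)[0].strip()
        is_entry = "=" in line and not line.lstrip().startswith("#")
        if is_entry and key in remaining:
            lines.append("%s=%s" % (key, remaining.pop(key)))
        else:
            lines.append(line)
    for key, value in remaining.items():
        lines.append("%s=%s" % (key, value))
    return "\n".join(lines) + "\n"


# The two settings suggested above for reducing memory consumption.
LOW_MEMORY_SETTINGS = {
    "org.gradle.parallel": "false",
    "org.gradle.jvmargs": "-Xmx2g -XX:MaxPermSize=512m",
}
```

To apply it, read gradle.properties, pass its contents and LOW_MEMORY_SETTINGS to set_properties, and write the result back.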
When your change is ready to be reviewed and merged, create a pull request.
Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue.
This will automatically link the pull request to the issue.
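As an illustration of the title convention, the sketch below checks a title and extracts the JIRA key. The regular expression is an assumption about what a well-formed title looks like (a bracketed BEAM-number, a space, then a summary), not the rule used by the actual linking integration, and BEAM-1234 is a made-up issue number.

```python
import re

# Assumed shape of a conforming title: "[BEAM-<digits>] <summary>".
TITLE_PATTERN = re.compile(r"^\[(BEAM-\d+)\] \S.*$")


def jira_issue_from_title(title):
    """Return the JIRA key (e.g. 'BEAM-1234') if the title matches, else None."""
    match = TITLE_PATTERN.match(title)
    return match.group(1) if match else None


if __name__ == "__main__":
    # BEAM-1234 is a hypothetical issue number used only for illustration.
    print(jira_issue_from_title("[BEAM-1234] Fixes bug in ApproximateQuantiles"))
    print(jira_issue_from_title("Fixes bug in ApproximateQuantiles"))
```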
Use @mention in the pull request to notify the reviewer.
The pull request and any changes pushed to it will trigger pre-commit jobs. If a test fails and appears unrelated to your change, you can cause the tests to be re-run by adding a single-line comment on your PR:
retest this please
Other trigger phrases, for post-commit tests, can be found in .test-infra/jenkins. Use these sparingly because post-commit tests consume shared development resources.
Developing with the Python SDK
Gradle can build and test the Python SDK and is used by the Jenkins jobs, so the Gradle build needs to be maintained.
You can also use the Python toolchain directly instead of having Gradle orchestrate it, which may be faster; the choice is yours. If you use Python tools directly, we recommend setting up a virtual environment before testing your code.
If you update any of the cythonized files in the Python SDK, you must install the cython package before running the following commands to properly test your code.
The following commands should be run in the
This installs Python from source and includes the test and gcp dependencies.
On Linux or macOS:
$ virtualenv env
$ . ./env/bin/activate
(env) $ pip install -e .[gcp,test]
On Windows:
> c:\Python27\python.exe -m virtualenv env
> env\Scripts\activate
(env) > pip install -e .[gcp,test]
This command runs all Python tests. The nose dependency is installed by [test] in pip install.
(env) $ python setup.py nosetests
Use the following command to run a single test method.
(env) $ python setup.py nosetests --tests <module>:<test class>.<test method>
For example:
(env) $ python setup.py nosetests --tests apache_beam.io.textio_test:TextSourceTest.test_progress
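The <module>:<test class>.<test method> form can also be assembled programmatically, for example in a small wrapper script. This is a sketch only; nose_test_target and nosetests_command are hypothetical helper names, not part of Beam or nose.

```python
def nose_test_target(module, test_class=None, test_method=None):
    """Build a nose test name of the form module[:Class[.method]]."""
    target = module
    if test_class:
        target += ":" + test_class
        if test_method:
            target += "." + test_method
    return target


def nosetests_command(module, test_class=None, test_method=None):
    """Return the argument list for running the selected tests via setup.py."""
    target = nose_test_target(module, test_class, test_method)
    return ["python", "setup.py", "nosetests", "--tests", target]
```

The returned list can be passed to subprocess.check_call from inside the activated virtual environment.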
You can deactivate the virtualenv when done.
(env) $ deactivate
$
To check just for Python lint errors, run the following command.
$ ../../gradlew lint
Alternatively, use tox to run the lint tasks:
$ tox -e py27-lint    # For Python 2.7
$ tox -e py3-lint     # For Python 3
$ tox -e py27-lint3   # For Python 2-3 compatibility
This step is only required for testing SDK code changes remotely (not using the DirectRunner). To do this, you must build the Beam tarball. From the root of the git repository, run:
$ cd sdks/python/
$ python setup.py sdist
Pass the --sdk_location flag to use the newly built version. For example:
$ python setup.py sdist > /dev/null && \
    python -m apache_beam.examples.wordcount ... \
    --sdk_location dist/apache-beam-2.5.0.dev0.tar.gz
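Because the tarball name changes with the SDK version, a launcher script may prefer to look the file up rather than hard-code it. The sketch below assumes the tarball is in a dist directory and is named apache-beam-*.tar.gz, as in the example above. The helper names are hypothetical.

```python
import glob
import os


def newest_sdk_tarball(dist_dir="dist"):
    """Return the most recently modified apache-beam-*.tar.gz in dist_dir, or None."""
    candidates = glob.glob(os.path.join(dist_dir, "apache-beam-*.tar.gz"))
    if not candidates:
        return None
    return max(candidates, key=os.path.getmtime)


def sdk_location_args(dist_dir="dist"):
    """Return ['--sdk_location', <tarball>] for appending to pipeline arguments."""
    tarball = newest_sdk_tarball(dist_dir)
    if tarball is None:
        raise IOError(
            "no SDK tarball found in %s; run 'python setup.py sdist' first" % dist_dir)
    return ["--sdk_location", tarball]
```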
Currently this is a manual process. Tracking bug for automating this: BEAM-4790.
For each file to be reviewed, look for an OWNERS file in its directory. Pick a single reviewer from that file. If the directory doesn’t contain an OWNERS file, go up a directory. Keep going until you find one. Try to limit the number of reviewers to 2 per PR if possible, to minimize reviewer load.
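The upward search for an OWNERS file can be sketched as follows. This only illustrates the manual procedure described above and is not a tool that ships with Beam. The function name find_owners_file is hypothetical.

```python
import os


def find_owners_file(file_path, repo_root):
    """Return the path of the nearest OWNERS file at or above file_path's directory.

    The search starts in the directory containing file_path and moves up one
    directory at a time, stopping at repo_root. Returns None if nothing is found.
    """
    repo_root = os.path.abspath(repo_root)
    directory = os.path.dirname(os.path.abspath(file_path))
    while True:
        candidate = os.path.join(directory, "OWNERS")
        if os.path.isfile(candidate):
            return candidate
        if directory == repo_root:
            return None
        parent = os.path.dirname(directory)
        if parent == directory:
            # Reached the filesystem root without passing through repo_root.
            return None
        directory = parent
```

Picking a reviewer from the file that is found remains a manual step.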
Adding yourself as a reviewer
Find the deepest sub-directory that contains the files you want to be a reviewer for and add your GitHub username under reviewers in the OWNERS file (create a new OWNERS file if necessary).
The Beam project currently only uses the reviewers key in OWNERS and no other features, as reviewer selection is still a manual process.
Contributing to the website
The Beam website is in the /website directory of the repo. The README there explains how to modify different parts of the site. The GitHub workflow is the same: make your change and open a pull request.
Issues are tracked in the website component in JIRA.
Works in progress
A great way to contribute is to join an existing effort. There are many works in progress, some on branches because they are very incomplete.
The primary Beam vision: Any SDK on any runner. This is a cross-cutting effort across Java, Python, and Go, and every Beam runner.
Apache Spark 2.0 Runner
Python 3 Support
Work is in progress to add Python 3 support to Beam. The current goal is to make the Beam codebase compatible with both Python 2.7 and Python 3.4.
Contributions are welcome! If you are interested in helping, you can select a subpackage to port and assign yourself the corresponding issue. Comment on the issue if you cannot assign it yourself. When submitting a new PR, please tag @RobbeSneyders, @aaltay, and @tvalentyn.
Next Java LTS version support (Java 11 / 18.9)
Work to support the next LTS release of Java is in progress. For more details about the scope and info on the various tasks please see the JIRA ticket.
IO Performance Testing
We are also working on writing Performance Tests for IOs and developing a Performance Testing Framework for them. Contributions are welcome in the following areas:
- developing more IO Performance Tests (IOITs)
- providing the necessary Kubernetes infrastructure (e.g. for databases or filesystems to be used in tests)
- running Performance Tests on runners other than Dataflow and Direct
- improving the existing Performance Testing Framework and its documentation
Euphoria Java 8 DSL
Euphoria is an easy-to-use Java 8 DSL for the Beam Java SDK. It provides a high-level abstraction of Beam transformations that is easy to both read and write. It can be used as a complement to existing Beam pipelines (convertible back and forth). For a glimpse of the API, see the WordCount example.
Improving the contributor experience
This effort aims to make it easier to write code, run tests, and release. It includes investigating Docker for Jenkins builds, automating the release process, and improving the reliability of tests.
Beam SQL has lots of areas to contribute: support for new operators, new connectors, performance measurement and improvement, fuller specification and testing, etc.
Add benchmarks to continuous integration
Run Nexmark benchmark queries after each commit for the Spark, Flink, and Direct runners, and export response times to performance dashboards.
Extract metrics in a runner agnostic way
Metrics are pushed by the runners to configurable sinks (an HTTP REST sink is available). This is already enabled in the Flink and Spark runners. Work is in progress for Dataflow.
Stale pull requests
The community will close stale pull requests in order to keep the project healthy. A pull request becomes stale after its author fails to respond to actionable comments for 60 days. The author of a closed pull request is welcome to reopen it in the future. The associated JIRAs will be unassigned from the author but will stay open.
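The 60-day rule can be expressed as a small check. This is a sketch under stated assumptions: it treats a pull request as stale once 60 full days have passed since the oldest unanswered actionable comment, and whether a comment counts as actionable is left to human judgment. The function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta

# Threshold taken from the policy above.
STALE_AFTER = timedelta(days=60)


def is_stale(last_actionable_comment, last_author_response, now):
    """Return True if the author has left an actionable comment unanswered for 60 days.

    last_actionable_comment: datetime of the oldest unanswered actionable comment.
    last_author_response: datetime of the author's latest response, or None.
    now: the datetime at which staleness is evaluated.
    """
    if last_author_response is not None and last_author_response >= last_actionable_comment:
        # The author has already responded to the comment.
        return False
    return now - last_actionable_comment >= STALE_AFTER
```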