Beam Capability Matrix

Apache Beam provides a portable API layer for building sophisticated data-parallel processing pipelines that can be executed on a variety of execution engines, or runners. The core concepts of this layer are based on the Beam Model (formerly referred to as the Dataflow Model) and are implemented to varying degrees by each Beam runner. To help clarify the capabilities of individual runners, we’ve created the capability matrix below.

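Individual capabilities have been grouped by their corresponding What / Where / When / How question: What results are being calculated? Where in event time? When in processing time? How do refinements of results relate?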

For more details on the What / Where / When / How breakdown of concepts, we recommend reading through the Streaming 102 post on O’Reilly Radar.

In the future, we intend to add further tables beyond the current set, covering things like runtime characteristics (e.g. at-least-once vs. exactly-once), performance, etc.

What is being computed? | Beam Model | Google Cloud Dataflow | Apache Flink | Apache Spark | Apache Apex | Apache Gearpump | Apache Hadoop MapReduce | JStorm | IBM Streams
ParDo
GroupByKey: ~
Flatten
Combine
Composite Transforms: ~ ~ ~ ~ ~ ~
Side Inputs
Source API: ~
Splittable DoFn: ~ ~
Metrics: ~ ~ ~ ~ ~ ~ ~
Stateful Processing: ~ ~ ~ ~ ~ ~
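
As a rough illustration of what the core transforms in this table compute, here is a minimal plain-Python sketch of ParDo, GroupByKey, Combine, and Flatten over in-memory lists. It is not the Beam API: the function names `par_do`, `group_by_key`, `combine_per_key`, and `flatten` are hypothetical stand-ins, and a real runner would distribute this work rather than loop over a list.

```python
from collections import defaultdict
from functools import reduce


def par_do(elements, fn):
    """Apply fn to every element; fn may yield zero or more outputs."""
    return [out for element in elements for out in fn(element)]


def group_by_key(pairs):
    """Collect all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def combine_per_key(pairs, combine_fn):
    """Reduce each key's values with an associative, commutative function."""
    grouped = group_by_key(pairs)
    return {key: reduce(combine_fn, values) for key, values in grouped.items()}


def flatten(*collections):
    """Merge several collections into one."""
    return [element for collection in collections for element in collection]


# Word count: ParDo emits (word, 1) pairs, Combine sums them per key.
lines = flatten(["a b"], ["b c b"])
pairs = par_do(lines, lambda line: ((word, 1) for word in line.split()))
counts = combine_per_key(pairs, lambda x, y: x + y)
```

In this sketch a composite transform would simply be a function that calls several of these helpers in sequence, which is why that row depends so heavily on how each runner surfaces pipeline structure.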
Where in event time? | Beam Model | Google Cloud Dataflow | Apache Flink | Apache Spark | Apache Apex | Apache Gearpump | Apache Hadoop MapReduce | JStorm | IBM Streams
Global windows
Fixed windows
Sliding windows
Session windows
Custom windows
Custom merging windows
Timestamp control
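
To make the windowing rows above more concrete, the following plain-Python sketch shows one way fixed, sliding, and session windows can be assigned from event timestamps. It is an illustration under stated assumptions, not Beam's implementation: the function names are hypothetical, timestamps are plain integers, and windows are half-open `(start, end)` tuples.

```python
def fixed_window(timestamp, size):
    """Return the single fixed window [start, start + size) containing timestamp."""
    start = timestamp - (timestamp % size)
    return (start, start + size)


def sliding_windows(timestamp, size, period):
    """Return every window of length `size`, starting on a multiple of
    `period`, that contains timestamp (latest window first)."""
    windows = []
    start = timestamp - (timestamp % period)
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= period
    return windows


def session_windows(timestamps, gap):
    """Give each element a window [t, t + gap) and merge windows that overlap.
    This is the kind of merging that the merging-window rows refer to."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t < sessions[-1][1]:
            # Overlaps the current session: extend it.
            sessions[-1] = (sessions[-1][0], max(sessions[-1][1], t + gap))
        else:
            sessions.append((t, t + gap))
    return sessions


one_minute = fixed_window(125, 60)
overlapping = sliding_windows(125, size=60, period=30)
sessions = session_windows([10, 12, 30, 31], gap=5)
```

Note the difference in shape: a fixed window assigns each element to exactly one window, a sliding window assigns it to `size / period` windows, and a session window cannot be finalized per element because later elements may extend it.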
When in processing time? | Beam Model | Google Cloud Dataflow | Apache Flink | Apache Spark | Apache Apex | Apache Gearpump | Apache Hadoop MapReduce | JStorm | IBM Streams
Configurable triggering
Event-time triggers
Processing-time triggers
Count triggers
[Meta]data driven triggers: ✕ (BEAM-101)
Composite triggers
Allowed lateness
Timers: ~ ~ ~ ~
How do refinements relate? | Beam Model | Google Cloud Dataflow | Apache Flink | Apache Spark | Apache Apex | Apache Gearpump | Apache Hadoop MapReduce | JStorm | IBM Streams
Discarding
Accumulating
Accumulating & Retracting: ✕ (BEAM-91)
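
The difference between the discarding and accumulating modes listed above is easiest to see with a trigger that fires more than once per window. The sketch below is plain Python rather than the Beam API; `fire_panes` is a hypothetical helper that models a simple count trigger over the elements of a single window. The accumulating and retracting mode is left out, since it is marked as unsupported above.

```python
def fire_panes(values, every, mode):
    """Emit one pane each time `every` new elements have arrived.

    mode "discarding":   a pane holds only the elements since the last firing.
    mode "accumulating": a pane holds every element seen so far in the window.
    Elements left over after the last firing are not emitted in this sketch.
    """
    if mode not in ("discarding", "accumulating"):
        raise ValueError(f"unknown accumulation mode: {mode}")
    panes = []
    since_last_firing = []
    seen_so_far = []
    for value in values:
        since_last_firing.append(value)
        seen_so_far.append(value)
        if len(since_last_firing) == every:
            if mode == "discarding":
                panes.append(list(since_last_firing))
            else:
                panes.append(list(seen_so_far))
            since_last_firing.clear()
    return panes


window_elements = [3, 4, 5, 6]
discarding_sums = [sum(p) for p in fire_panes(window_elements, 2, "discarding")]
accumulating_sums = [sum(p) for p in fire_panes(window_elements, 2, "accumulating")]
```

With discarding panes a downstream consumer has to add the partial sums together to get the window total, while with accumulating panes the latest pane already is the total and earlier panes should be overwritten.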