Beam Capability Matrix

Apache Beam provides a portable API layer for building sophisticated data-parallel processing pipelines that may be executed across a diversity of execution engines, or runners. The core concepts of this layer are based upon the Beam Model (formerly referred to as the Dataflow Model), and implemented to varying degrees in each Beam runner. To help clarify the capabilities of individual runners, we’ve created the capability matrix below.

Individual capabilities have been grouped by their corresponding What / Where / When / How question:

For more details on the What / Where / When / How breakdown of concepts, we recommend reading through the Streaming 102 post on O’Reilly Radar.

Note that in the future, we intend to add additional tables beyond the current set, for things like runtime characteristics (e.g. at-least-once vs exactly-once), performance, etc.

How to read the tables
Tools we are comparing
PropertiesDoes this tool have this property?Yes/Partially/No/Unverified
What do those signs mean?
Yes
~
Partially
?
Unverified
No

What is being computed?

ParDo
GroupByKey
Flatten
Combine
Composite Transforms
Side Inputs
Source API
Metrics
Stateful Processing
Google Cloud DataflowApache FlinkApache Spark (RDD/DStream based)Apache Spark Structured Streaming (Dataset based)Apache SamzaApache NemoHazelcast JetTwister2Python Direct FnRunnerGo Direct Runner
~
~
~
~
~
~
~
~
~
~
~
~
~
~
~
~
~
~
~
~
~
~
~
~
~
~
SEE DETAILS AND FULL VERSION HERE.

Bounded Splittable DoFn Support Status

Base
Side Inputs
Splittable DoFn Initiated Checkpointing
Dynamic Splitting
Bundle Finalization
Google Cloud DataflowApache FlinkApache Spark (RDD/DStream based)Apache Spark Structured Streaming (Dataset based)Apache SamzaApache NemoHazelcast JetTwister2Python Direct FnRunnerGo Direct Runner
~
~
~
~
~
~
~
~
SEE DETAILS AND FULL VERSION HERE.

Unbounded Splittable DoFn Support Status

Base
Side Inputs
Splittable DoFn Initiated Checkpointing
Dynamic Splitting
Bundle Finalization
Google Cloud DataflowApache FlinkApache Spark (RDD/DStream based)Apache Spark Structured Streaming (Dataset based)Apache SamzaApache NemoHazelcast JetTwister2Python Direct FnRunnerGo Direct Runner
~
~
~
SEE DETAILS AND FULL VERSION HERE.

Where in event time?

Global windows
Fixed windows
Sliding windows
Session windows
Custom windows
Custom merging windows
Timestamp control
Google Cloud DataflowApache FlinkApache Spark (RDD/DStream based)Apache Spark Structured Streaming (Dataset based)Apache SamzaApache NemoHazelcast JetTwister2Python Direct FnRunnerGo Direct Runner
~
~
~
~
~
~
~
SEE DETAILS AND FULL VERSION HERE.

When in processing time?

Configurable triggering
Event-time triggers
Processing-time triggers
Count triggers
Composite triggers
Allowed lateness
Timers
Google Cloud DataflowApache FlinkApache Spark (RDD/DStream based)Apache Spark Structured Streaming (Dataset based)Apache SamzaApache NemoHazelcast JetTwister2Python Direct FnRunnerGo Direct Runner
~
~
~
~
~
~
~
~
~
~
~
~
~
SEE DETAILS AND FULL VERSION HERE.

How do refinements relate?

Discarding
Accumulating
Google Cloud DataflowApache FlinkApache Spark (RDD/DStream based)Apache Spark Structured Streaming (Dataset based)Apache SamzaApache NemoHazelcast JetTwister2Python Direct FnRunnerGo Direct Runner
~
SEE DETAILS AND FULL VERSION HERE.

Additional common features not yet part of the Beam model

Drain
Checkpoint
Key-ordered delivery
Google Cloud DataflowApache FlinkApache Spark (RDD/DStream based)Apache Spark Structured Streaming (Dataset based)Apache SamzaApache NemoHazelcast JetTwister2Python Direct FnRunnerGo Direct Runner
~
~
~
~
~
~
~
?
?
?
~
?
?
?
SEE DETAILS AND FULL VERSION HERE.