What is being computed?
ParDo |
---|
GroupByKey |
Flatten |
Combine |
Composite Transforms |
Side Inputs |
Source API |
Metrics |
Stateful Processing |
Google Cloud Dataflow | Apache Flink | Apache Spark (RDD/DStream based) | Apache Spark Structured Streaming (Dataset based) | Apache Samza | Apache Nemo | Hazelcast Jet | Twister2 | Python Direct FnRunner | Go Direct Runner |
---|
Yes : fully supported Batch mode uses large bundle sizes. Streaming uses smaller bundle sizes. | Yes : fully supported ParDo itself, as per-element transformation with UDFs, is fully supported by Flink for both batch and streaming. | Yes : fully supported ParDo applies per-element transformations as Spark FlatMapFunction. | Partially : fully supported in batch mode ParDo applies per-element transformations as Spark FlatMapFunction. | Yes : fully supported Supported with per-element transformation. | Yes : fully supported | Yes : fully supported | Yes : fully supported | ||
Yes : fully supported | Yes : fully supported Uses Flink's keyBy for key grouping. When grouping by window in streaming (creating the panes) the Flink runner uses the Beam code. This guarantees support for all windowing and triggering mechanisms. | Partially : fully supported in batch mode Using Spark's <tt>groupByKey</tt>. GroupByKey with multiple trigger firings in streaming mode is a work in progress. | Partially : fully supported in batch mode Using Spark's <tt>groupByKey</tt>. | Yes : fully supported Uses Samza's partitionBy for key grouping and Beam's logic for window aggregation and triggering. | Yes : fully supported | Yes : fully supported | Yes : fully supported | ||
Yes : fully supported | Yes : fully supported | Yes : fully supported | Partially : fully supported in batch mode Some corner cases like flatten on empty collections are not yet supported. | Yes : fully supported | Yes : fully supported | Yes : fully supported | Yes : fully supported | ||
Yes : efficient execution | Yes : fully supported Uses a combiner for pre-aggregation for batch and streaming. | Yes : fully supported Using Spark's <tt>combineByKey</tt> and <tt>aggregate</tt> functions. | Partially : fully supported in batch mode Using Spark's <tt>Aggregator</tt> and agg function | Yes : fully supported Use combiner for efficient pre-aggregation. | Yes : fully supported Batch mode uses pre-aggregation | Yes : fully supported Batch mode uses pre-aggregation | Yes : fully supported | ||
Partially : supported via inlining Currently composite transformations are inlined during execution. The structure is later recreated from the names, but other transform level information (if added to the model) will be lost. | Partially : supported via inlining | Partially : supported via inlining | Partially : supported via inlining only in batch mode | Partially : supported via inlining | Yes : fully supported | Partially : supported via inlining | Partially : supported via inlining | ||
Yes : some size restrictions in streaming Batch mode supports a distributed implementation, but streaming mode may force some size restrictions. Neither mode is able to push lookups directly up into key-based sources. | Yes : some size restrictions in streaming Batch mode supports a distributed implementation, but streaming mode may force some size restrictions. Neither mode is able to push lookups directly up into key-based sources. | Yes : fully supported Using Spark's broadcast variables. In streaming mode, side inputs may update but only between micro-batches. | Partially : fully supported in batch mode Using Spark's broadcast variables. | Yes : fully supported Uses Samza's broadcast operator to distribute the side inputs. | Yes : fully supported | Partially : with restrictions Supported only when the side input source is bounded and windowing uses global window | Yes : fully supported | ||
Yes : fully supported Support includes autotuning features (https://cloud.google.com/dataflow/service/dataflow-service-desc#autotuning-features). | Yes : fully supported | Yes : fully supported | Partially : bounded source only Using Spark's DatasourceV2 API in microbatch mode (Continuous streaming mode is tagged experimental in spark and does not support aggregation). | Yes : fully supported | Yes : fully supported | Yes : fully supported | Yes : fully supported | ||
Partially Gauge metrics are not supported. All other metric types are supported. | Partially : All metrics types are supported. Only attempted values are supported. No committed values for metrics. | Partially : All metric types are supported. Only attempted values are supported. No committed values for metrics. | Partially : All metric types are supported in batch mode. Only attempted values are supported. No committed values for metrics. | Partially : Counter and Gauge are supported. Only attempted values are supported. No committed values for metrics. | No : not implemented | Partially : All metrics types supported, both in batching and streaming mode. Doesn't differentiate between committed and attempted values. | No : not implemented | ||
Partially : non-merging windows State is supported for non-merging windows. SetState and MapState are not yet supported. | Partially : non-merging windows State is supported for non-merging windows. SetState and MapState are not yet supported. | Partially : full support in batch mode | No : not implemented | Partially : non-merging windows States are backed up by either rocksDb KV store or in-memory hash map, and persist using changelog. | No : not implemented | Partially : non-merging windows | No : not implemented |
Last updated on 2024/10/14
Have you found everything you were looking for?
Was it all useful and clear? Is there anything that you would like to change? Let us know!