Class BoundedDatasetFactory
java.lang.Object
org.apache.beam.runners.spark.structuredstreaming.io.BoundedDatasetFactory
Method Summary

static <T> org.apache.spark.sql.Dataset<WindowedValue<T>> createDatasetFromRDD(org.apache.spark.sql.SparkSession session, BoundedSource<T> source, Supplier<PipelineOptions> options, org.apache.spark.sql.Encoder<WindowedValue<T>> encoder)

static <T> org.apache.spark.sql.Dataset<WindowedValue<T>> createDatasetFromRows(org.apache.spark.sql.SparkSession session, BoundedSource<T> source, Supplier<PipelineOptions> options, org.apache.spark.sql.Encoder<WindowedValue<T>> encoder)
Method Details
createDatasetFromRows

public static <T> org.apache.spark.sql.Dataset<WindowedValue<T>> createDatasetFromRows(org.apache.spark.sql.SparkSession session, BoundedSource<T> source, Supplier<PipelineOptions> options, org.apache.spark.sql.Encoder<WindowedValue<T>> encoder)

Create a Dataset for a BoundedSource via a Spark Table. Unfortunately, tables are expected to return InternalRows, which requires serialization. This currently makes the approach significantly less performant than creating a dataset from an RDD.
createDatasetFromRDD

public static <T> org.apache.spark.sql.Dataset<WindowedValue<T>> createDatasetFromRDD(org.apache.spark.sql.SparkSession session, BoundedSource<T> source, Supplier<PipelineOptions> options, org.apache.spark.sql.Encoder<WindowedValue<T>> encoder)

Create a Dataset for a BoundedSource via a Spark RDD. This is currently the most efficient approach, as it avoids any serialization overhead.
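As a rough illustration, the two factory methods might be called as below. This is a non-runnable sketch: the source and encoder placeholders (`mySource`, `myEncoder`) and the exact import paths for `WindowedValue` and `Supplier` are assumptions, not part of this class's documented API; in practice the encoder would come from the runner's encoder helpers.

```java
import java.util.function.Supplier;

import org.apache.beam.sdk.io.BoundedSource;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.util.WindowedValue;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.SparkSession;

public class BoundedDatasetFactorySketch {
  public static void main(String[] args) {
    SparkSession session = SparkSession.builder()
        .master("local[*]")
        .appName("bounded-dataset-factory-sketch")
        .getOrCreate();

    // Placeholders (assumptions): a real BoundedSource<String> would come
    // from a Beam IO, and a real Encoder<WindowedValue<String>> from the
    // Spark runner's encoder helpers.
    BoundedSource<String> mySource = null;
    Encoder<WindowedValue<String>> myEncoder = null;
    Supplier<PipelineOptions> options = PipelineOptionsFactory::create;

    // RDD-based path: currently the most efficient, since it avoids the
    // InternalRow serialization overhead.
    Dataset<WindowedValue<String>> viaRdd =
        BoundedDatasetFactory.createDatasetFromRDD(session, mySource, options, myEncoder);

    // Table-based path: goes through a Spark Table, which must produce
    // InternalRows and therefore pays a serialization cost.
    Dataset<WindowedValue<String>> viaRows =
        BoundedDatasetFactory.createDatasetFromRows(session, mySource, options, myEncoder);
  }
}
```

Either way, the resulting Dataset carries `WindowedValue<T>` elements encoded with the supplied Encoder.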