public class BoundedDatasetFactory
extends java.lang.Object
| Modifier and Type | Method and Description |
|---|---|
| `static <T> org.apache.spark.sql.Dataset<org.apache.beam.sdk.util.WindowedValue<T>>` | `createDatasetFromRDD(org.apache.spark.sql.SparkSession session, BoundedSource<T> source, java.util.function.Supplier<PipelineOptions> options, org.apache.spark.sql.Encoder<org.apache.beam.sdk.util.WindowedValue<T>> encoder)` Creates a Dataset for a BoundedSource via a Spark RDD. |
| `static <T> org.apache.spark.sql.Dataset<org.apache.beam.sdk.util.WindowedValue<T>>` | `createDatasetFromRows(org.apache.spark.sql.SparkSession session, BoundedSource<T> source, java.util.function.Supplier<PipelineOptions> options, org.apache.spark.sql.Encoder<org.apache.beam.sdk.util.WindowedValue<T>> encoder)` Creates a Dataset for a BoundedSource via a Spark Table. |
public static <T> org.apache.spark.sql.Dataset<org.apache.beam.sdk.util.WindowedValue<T>> createDatasetFromRows(org.apache.spark.sql.SparkSession session,
        BoundedSource<T> source,
        java.util.function.Supplier<PipelineOptions> options,
        org.apache.spark.sql.Encoder<org.apache.beam.sdk.util.WindowedValue<T>> encoder)
Creates a Dataset for a BoundedSource via a Spark Table.
Unfortunately, tables are expected to return InternalRows, which requires serialization. For the time being, this makes this approach significantly less performant than creating a dataset from an RDD.
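A minimal usage sketch for the table-based path, assuming the caller already holds a `BoundedSource` and a matching Spark `Encoder` for `WindowedValue<T>` (in practice these come from the runner's internals; here they are simply taken as parameters). The `readBoundedViaRows` wrapper and the import path of `BoundedDatasetFactory` are assumptions, not part of the documented API:

```java
import java.util.function.Supplier;
import org.apache.beam.sdk.io.BoundedSource;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.util.WindowedValue;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.SparkSession;
// Assumed package for BoundedDatasetFactory; verify against your Beam version.
import org.apache.beam.runners.spark.structuredstreaming.translation.batch.BoundedDatasetFactory;

public class CreateFromRowsExample {

  // Hypothetical wrapper: reads a bounded source through the table-based path.
  static <T> Dataset<WindowedValue<T>> readBoundedViaRows(
      SparkSession session,
      BoundedSource<T> source,
      Encoder<WindowedValue<T>> encoder) {
    // Options are passed as a Supplier per the signature; default options are
    // used here purely for illustration.
    Supplier<PipelineOptions> options = PipelineOptionsFactory::create;
    // Table-based path: elements pass through Spark's InternalRow representation,
    // which costs an extra serialization round-trip (see the note above).
    return BoundedDatasetFactory.createDatasetFromRows(session, source, options, encoder);
  }
}
```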
public static <T> org.apache.spark.sql.Dataset<org.apache.beam.sdk.util.WindowedValue<T>> createDatasetFromRDD(org.apache.spark.sql.SparkSession session,
        BoundedSource<T> source,
        java.util.function.Supplier<PipelineOptions> options,
        org.apache.spark.sql.Encoder<org.apache.beam.sdk.util.WindowedValue<T>> encoder)
Creates a Dataset for a BoundedSource via a Spark RDD.
This is currently the most efficient approach, as it avoids any serialization overhead.
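A corresponding sketch for the RDD-based path, under the same assumptions as above; `countBoundedViaRdd` is a hypothetical wrapper that materializes the dataset to show an end-to-end call:

```java
import java.util.function.Supplier;
import org.apache.beam.sdk.io.BoundedSource;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.util.WindowedValue;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.SparkSession;
// Assumed package for BoundedDatasetFactory; verify against your Beam version.
import org.apache.beam.runners.spark.structuredstreaming.translation.batch.BoundedDatasetFactory;

public class CreateFromRddExample {

  // Hypothetical wrapper: reads via the RDD-based path and materializes a count.
  static <T> long countBoundedViaRdd(
      SparkSession session,
      BoundedSource<T> source,
      Encoder<WindowedValue<T>> encoder) {
    Supplier<PipelineOptions> options = PipelineOptionsFactory::create;
    // RDD-based path: skips the InternalRow conversion of the table-based path,
    // avoiding the serialization overhead noted above.
    Dataset<WindowedValue<T>> dataset =
        BoundedDatasetFactory.createDatasetFromRDD(session, source, options, encoder);
    return dataset.count();
  }
}
```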