public class BoundedDatasetFactory
extends java.lang.Object
| Modifier and Type | Method and Description |
|---|---|
| `static <T> org.apache.spark.sql.Dataset<org.apache.beam.sdk.util.WindowedValue<T>>` | `createDatasetFromRDD(org.apache.spark.sql.SparkSession session, BoundedSource<T> source, java.util.function.Supplier<PipelineOptions> options, org.apache.spark.sql.Encoder<org.apache.beam.sdk.util.WindowedValue<T>> encoder)` |
| `static <T> org.apache.spark.sql.Dataset<org.apache.beam.sdk.util.WindowedValue<T>>` | `createDatasetFromRows(org.apache.spark.sql.SparkSession session, BoundedSource<T> source, java.util.function.Supplier<PipelineOptions> options, org.apache.spark.sql.Encoder<org.apache.beam.sdk.util.WindowedValue<T>> encoder)` |
public static <T> org.apache.spark.sql.Dataset<org.apache.beam.sdk.util.WindowedValue<T>> createDatasetFromRows(org.apache.spark.sql.SparkSession session,
                                                                                                                BoundedSource<T> source,
                                                                                                                java.util.function.Supplier<PipelineOptions> options,
                                                                                                                org.apache.spark.sql.Encoder<org.apache.beam.sdk.util.WindowedValue<T>> encoder)
Creates a Dataset for a BoundedSource via a Spark Table.
 Unfortunately, tables are expected to return an InternalRow, which requires serialization.
 For the time being, this makes this approach significantly less performant than creating a
 Dataset from an RDD.
public static <T> org.apache.spark.sql.Dataset<org.apache.beam.sdk.util.WindowedValue<T>> createDatasetFromRDD(org.apache.spark.sql.SparkSession session,
                                                                                                               BoundedSource<T> source,
                                                                                                               java.util.function.Supplier<PipelineOptions> options,
                                                                                                               org.apache.spark.sql.Encoder<org.apache.beam.sdk.util.WindowedValue<T>> encoder)
Creates a Dataset for a BoundedSource via a Spark RDD.
 This is currently the most efficient approach, as it avoids any serialization overhead.
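A minimal usage sketch of the preferred RDD-backed path. The `source` and `encoder` placeholders are assumptions: obtaining a concrete `BoundedSource` (e.g. from an IO connector) and building an `Encoder<WindowedValue<T>>` depend on runner internals not shown on this page, so they are left as null placeholders with comments rather than filled in.

```java
import java.util.function.Supplier;

import org.apache.beam.sdk.io.BoundedSource;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.util.WindowedValue;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.SparkSession;

public class BoundedDatasetFactoryExample {
  public static void main(String[] args) {
    SparkSession session = SparkSession.builder()
        .master("local[2]")
        .appName("bounded-dataset-factory-example")
        .getOrCreate();

    // Options are passed as a Supplier so they can be materialized lazily,
    // e.g. when the source is read on executors.
    Supplier<PipelineOptions> options = PipelineOptionsFactory::create;

    // Placeholders: supply a real BoundedSource (e.g. from an IO connector)
    // and a matching Spark Encoder for WindowedValue<String>.
    BoundedSource<String> source = null;   // assumption: provided by the caller
    Encoder<WindowedValue<String>> encoder = null;  // assumption: runner-specific

    // RDD-backed path: avoids the InternalRow serialization that the
    // table-based createDatasetFromRows variant incurs.
    Dataset<WindowedValue<String>> dataset =
        BoundedDatasetFactory.createDatasetFromRDD(session, source, options, encoder);
    dataset.show();

    session.stop();
  }
}
```

With a real source and encoder in place, swapping `createDatasetFromRDD` for `createDatasetFromRows` yields the same Dataset contents via the table path, at the cost of the serialization overhead described above.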