In this post, you’ll learn about the current state of support for input streaming connectors in Apache Beam. For more context, you’ll also learn about the corresponding state of support in Apache Spark.
With batch processing, you might load data from any source, including a database system. Even if there are no specific SDKs available for those database systems, you can often resort to using a JDBC driver. With streaming, implementing a proper data pipeline is arguably more challenging as generally fewer source types are available. For that reason, this article particularly focuses on the streaming use case.
Connectors for Java
Beam has an official Java SDK and has several execution engines, called runners. In most cases it is fairly easy to transfer existing Beam pipelines written in Java or Scala to a Spark environment by using the Spark Runner.
Spark is written in Scala and has a Java API. Spark’s source code compiles to Java bytecode and the binaries are run by a Java Virtual Machine. Scala code is interoperable with Java and therefore has native compatibility with Java libraries (and vice versa).
Spark offers two approaches to streaming: Discretized Streaming (or DStreams) and Structured Streaming. DStreams are a basic abstraction that represents a continuous series of Resilient Distributed Datasets (or RDDs). Structured Streaming was introduced more recently (the alpha release came with Spark 2.1.0) and is based on a model where live data is continuously appended to a table structure.
Spark Structured Streaming supports file sources (local filesystems and HDFS-compatible systems like Cloud Storage or S3) and Kafka as streaming inputs. Spark maintains built-in connectors for DStreams aimed at third-party services, such as Kafka or Flume, while other connectors are available through linking external dependencies, as shown in the table below.
Below are the main streaming input connectors for available for Beam and Spark DStreams in Java:
|Apache Beam||Apache Spark DStreams|
(Spark treats most Unix systems as HDFS-compatible, but the location should be accessible from all nodes)
|FileIO + HadoopFileSystemOptions||HdfsUtils|
|Object Stores||Cloud Storage
|FileIO + GcsOptions||hadoopConfiguration and textFileStream|
|FileIO + S3Options|
|Cloud Pub/Sub||PubsubIO||spark-streaming-pubsub from Apache Bahir|
|Other||Custom receivers||Read Transforms||receiverStream|
Connectors for Python
Beam has an official Python SDK that currently supports a subset of the streaming features available in the Java SDK. Active development is underway to bridge the gap between the featuresets in the two SDKs. Currently for Python, the Direct Runner and Dataflow Runner are supported, and several streaming options were introduced in beta in version 2.5.0.
Spark also has a Python SDK called PySpark. As mentioned earlier, Scala code compiles to a bytecode that is executed by the JVM. PySpark uses Py4J, a library that enables Python programs to interact with the JVM and therefore access Java libraries, interact with Java objects, and register callbacks from Java. This allows PySpark to access native Spark objects like RDDs. Spark Structured Streaming supports file sources (local filesystems and HDFS-compatible systems like Cloud Storage or S3) and Kafka as streaming inputs.
Below are the main streaming input connectors for available for Beam and Spark DStreams in Python:
|Apache Beam||Apache Spark DStreams|
|HDFS||io.hadoopfilesystem||hadoopConfiguration (Access through
|Object stores||Google Cloud Storage||io.gcp.gcsio||textFileStream|
|Other||Custom receivers||BoundedSource and RangeTracker||N/A|
Connectors for other languages
Since Scala code is interoperable with Java and therefore has native compatibility with Java libraries (and vice versa), you can use the same Java connectors described above in your Scala programs. Apache Beam also has a Scala API open-sourced by Spotify.
A Go SDK for Apache Beam is under active development. It is currently experimental and is not recommended for production. Spark does not have an official Go SDK.
We hope this article inspired you to try new and interesting ways of connecting streaming sources to your Beam pipelines!
Check out the following links for further information: