In this post, you’ll learn about the current state of support for input streaming connectors in Apache Beam. For more context, you’ll also learn about the corresponding state of support in Apache Spark.

With batch processing, you might load data from any source, including a database system. Even if there are no specific SDKs available for those database systems, you can often resort to using a JDBC driver. With streaming, implementing a proper data pipeline is arguably more challenging as generally fewer source types are available. For that reason, this article particularly focuses on the streaming use case.

Connectors for Java

Beam has an official Java SDK and has several execution engines, called runners. In most cases it is fairly easy to transfer existing Beam pipelines written in Java or Scala to a Spark environment by using the Spark Runner.

Spark is written in Scala and has a Java API. Spark’s source code compiles to Java bytecode and the binaries are run by a Java Virtual Machine. Scala code is interoperable with Java and therefore has native compatibility with Java libraries (and vice versa).

Spark offers two approaches to streaming: Discretized Streaming (or DStreams) and Structured Streaming. DStreams are a basic abstraction that represents a continuous series of Resilient Distributed Datasets (or RDDs). Structured Streaming was introduced more recently (the alpha release came with Spark 2.1.0) and is based on a model where live data is continuously appended to a table structure.
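To make the difference between the two models concrete, here is a minimal sketch in plain Python. It does not use Spark at all. The function names `discretize` and `append_to_table` are hypothetical and exist only for this illustration. The first treats a stream as a series of small batches (each list stands in for one RDD), and the second treats every new record as a row appended to a growing table.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def discretize(events: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """DStream-style view: yield the stream as a series of small batches.

    Each yielded list stands in for one RDD. Real DStreams batch by time
    interval; a fixed record count is used here to keep the sketch simple.
    """
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def append_to_table(table: List[dict], events: Iterable[str]) -> List[dict]:
    """Structured Streaming-style view: append each new record as a row."""
    for event in events:
        table.append({"row": len(table), "value": event})
    return table


events = ["a", "b", "c", "d", "e"]

batches = list(discretize(events, batch_size=2))
print(batches)  # [['a', 'b'], ['c', 'd'], ['e']]

table = append_to_table([], events[:3])     # first three records arrive
table = append_to_table(table, events[3:])  # two more arrive later
print(len(table))  # 5
```

In the batch view, a computation runs once per batch. In the table view, a query is conceptually re-evaluated over the whole table as rows arrive.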

Spark Structured Streaming supports file sources (local filesystems and HDFS-compatible systems like Cloud Storage or S3) and Kafka as streaming inputs. Spark maintains built-in connectors for DStreams aimed at third-party services, such as Kafka or Flume, while other connectors are available through linking external dependencies, as shown in the table below.

Below are the main streaming input connectors available for Beam and Spark DStreams in Java:

| Category | Source | Apache Beam | Apache Spark DStreams |
|---|---|---|---|
| File Systems | Local (Using the file:// URI) | TextIO | textFileStream (Spark treats most Unix systems as HDFS-compatible, but the location should be accessible from all nodes) |
| File Systems | HDFS (Using the hdfs:// URI) | FileIO + HadoopFileSystemOptions | HdfsUtils |
| Object Stores | Cloud Storage (Using the gs:// URI) | FileIO + GcsOptions | hadoopConfiguration and textFileStream |
| Object Stores | S3 (Using the s3:// URI) | FileIO + S3Options | hadoopConfiguration and textFileStream |
| Messaging Queues | Kafka | KafkaIO | spark-streaming-kafka |
| Messaging Queues | Kinesis | KinesisIO | spark-streaming-kinesis |
| Messaging Queues | Cloud Pub/Sub | PubsubIO | spark-streaming-pubsub from Apache Bahir |
| Other | Custom receivers | Read Transforms | receiverStream |

Connectors for Python

Beam has an official Python SDK that currently supports a subset of the streaming features available in the Java SDK. Active development is underway to bridge the gap between the feature sets of the two SDKs. Currently, the Python SDK supports the Direct Runner and the Dataflow Runner, and several streaming options were introduced in beta in version 2.5.0.

Spark also has a Python SDK called PySpark. As mentioned earlier, Scala code compiles to bytecode that is executed by the JVM. PySpark uses Py4J, a library that enables Python programs to interact with the JVM and therefore access Java libraries, interact with Java objects, and register callbacks from Java. This allows PySpark to access native Spark objects like RDDs. As in Java, Spark Structured Streaming supports file sources (local filesystems and HDFS-compatible systems like Cloud Storage or S3) and Kafka as streaming inputs.
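The Py4J bridge described above is visible in user code as `sc._jsc`, the handle to the underlying Java SparkContext. The following is a minimal sketch of how that handle can be used to set Hadoop options before opening a file-based DStream. It assumes PySpark is installed. The helper names `apply_hadoop_conf` and `stream_text_files`, the application name, and the `hdfs://namenode:8020` address are hypothetical placeholders rather than part of any Spark API.

```python
from typing import Dict


def apply_hadoop_conf(sc, settings: Dict[str, str]) -> None:
    """Copy settings into the JVM-side Hadoop configuration of a SparkContext.

    `sc._jsc` is the Py4J handle to the underlying JavaSparkContext, so the
    calls below are executed on the Java object.
    """
    hadoop_conf = sc._jsc.hadoopConfiguration()
    for key, value in settings.items():
        hadoop_conf.set(key, value)


def stream_text_files(directory: str, batch_seconds: int = 10) -> None:
    """Print lines from new files that appear in `directory`.

    Blocks until the streaming context is stopped.
    """
    # Imported here so the helper above can be used without Spark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="TextFileStreamSketch")
    # Hypothetical setting; replace with whatever your file system needs.
    apply_hadoop_conf(sc, {"fs.defaultFS": "hdfs://namenode:8020"})

    ssc = StreamingContext(sc, batch_seconds)
    lines = ssc.textFileStream(directory)  # DStream of lines from new files
    lines.pprint()

    ssc.start()
    ssc.awaitTermination()
```

Calling `stream_text_files("hdfs://namenode:8020/incoming")` (again, a placeholder path) would start the job and print each batch of newly arrived lines.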

Below are the main streaming input connectors available for Beam and Spark DStreams in Python:

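| Category | Source | Apache Beam | Apache Spark DStreams |
|---|---|---|---|
| File Systems | Local | io.textio | textFileStream |
| File Systems | HDFS | io.hadoopfilesystem | hadoopConfiguration (Access through sc._jsc with Py4J) and textFileStream |
| Object Stores | Google Cloud Storage | io.gcp.gcsio | textFileStream |
| Object Stores | S3 | N/A | textFileStream |
| Messaging Queues | Kafka | N/A | KafkaUtils |
| Messaging Queues | Kinesis | N/A | KinesisUtils |
| Messaging Queues | Cloud Pub/Sub | io.gcp.pubsub | N/A |
| Other | Custom receivers | BoundedSource and RangeTracker | N/A |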
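As an illustration of the `io.gcp.pubsub` entry in the table above, here is a minimal sketch of a streaming Beam pipeline in Python that reads from a Cloud Pub/Sub subscription. It assumes the `apache-beam[gcp]` package is installed and that the subscription already exists. The helper names `subscription_path` and `run` are hypothetical. The pipeline itself uses `beam.io.ReadFromPubSub` from the Beam Python SDK.

```python
def subscription_path(project: str, subscription: str) -> str:
    """Build the full subscription name that ReadFromPubSub expects."""
    return f"projects/{project}/subscriptions/{subscription}"


def run(project: str, subscription: str) -> None:
    """Read messages from Pub/Sub and print them. Blocks until cancelled."""
    # Imported here so the helper above can be used without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import (
        PipelineOptions,
        StandardOptions,
    )

    options = PipelineOptions()
    # Pub/Sub is an unbounded source, so the pipeline must run in streaming mode.
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadMessages" >> beam.io.ReadFromPubSub(
                subscription=subscription_path(project, subscription))
            # Message payloads arrive as bytes; decode them to text.
            | "Decode" >> beam.Map(lambda data: data.decode("utf-8"))
            | "Print" >> beam.Map(print)
        )
```

Calling `run("my-project", "my-subscription")` (both names are placeholders) would start the pipeline on the default runner. Selecting the Dataflow Runner would require additional pipeline options that are not shown here.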

Connectors for other languages

Scala

Since Scala code is interoperable with Java and therefore has native compatibility with Java libraries (and vice versa), you can use the same Java connectors described above in your Scala programs. Apache Beam also has a Scala API open-sourced by Spotify.

Go

A Go SDK for Apache Beam is under active development. It is currently experimental and is not recommended for production. Spark does not have an official Go SDK.

R

Apache Beam does not have an official R SDK. Spark Structured Streaming is supported by an R SDK, but only for file sources as a streaming input.

Next steps

We hope this article inspired you to try new and interesting ways of connecting streaming sources to your Beam pipelines!

Check out the following links for further information:

  • See a full list of all built-in and in-progress I/O Transforms for Apache Beam.
  • Learn about some Apache Beam mobile gaming pipeline examples.