A review of input streaming connectors

blog

2018/08/20

A review of input streaming connectors

Leonid Kuligin [@lkulighin] & Julien Phalip [@julienphalip]

In this post, you’ll learn about the current state of support for input streaming connectors in Apache Beam. For more context, you’ll also learn about the corresponding state of support in Apache Spark.

With batch processing, you might load data from any source, including a database system. Even if there are no specific SDKs available for those database systems, you can often resort to using a JDBC driver. With streaming, implementing a proper data pipeline is arguably more challenging as generally fewer source types are available. For that reason, this article particularly focuses on the streaming use case.

Connectors for Java

Beam has an official Java SDK and has several execution engines, called runners. In most cases it is fairly easy to transfer existing Beam pipelines written in Java or Scala to a Spark environment by using the Spark Runner.

Spark is written in Scala and has a Java API. Spark’s source code compiles to Java bytecode and the binaries are run by a Java Virtual Machine. Scala code is interoperable with Java and therefore has native compatibility with Java libraries (and vice versa).

Spark offers two approaches to streaming: Discretized Streaming (or DStreams) and Structured Streaming. DStreams are a basic abstraction that represents a continuous series of Resilient Distributed Datasets (or RDDs). Structured Streaming was introduced more recently (the alpha release came with Spark 2.1.0) and is based on a model where live data is continuously appended to a table structure.

Spark Structured Streaming supports file sources (local filesystems and HDFS-compatible systems like Cloud Storage or S3) and Kafka as streaming inputs. Spark maintains built-in connectors for DStreams aimed at third-party services, such as Kafka or Flume, while other connectors are available through linking external dependencies, as shown in the table below.

Below are the main streaming input connectors for available for Beam and Spark DStreams in Java:

		Apache Beam	Apache Spark DStreams
File Systems	Local (Using the `file://` URI)	TextIO	textFileStream (Spark treats most Unix systems as HDFS-compatible, but the location should be accessible from all nodes)
File Systems	HDFS (Using the `hdfs://` URI)	FileIO + HadoopFileSystemOptions	HdfsUtils
Object Stores	Cloud Storage (Using the `gs://` URI)	FileIO + GcsOptions	hadoopConfiguration and textFileStream
Object Stores	S3 (Using the `s3://` URI)	FileIO + S3Options	hadoopConfiguration and textFileStream
Messaging Queues	Kafka	KafkaIO	spark-streaming-kafka
	Kinesis	KinesisIO	spark-streaming-kinesis
	Cloud Pub/Sub	PubsubIO	spark-streaming-pubsub from Apache Bahir
Other	Custom receivers	Read Transforms	receiverStream

Connectors for Python

Beam has an official Python SDK that currently supports a subset of the streaming features available in the Java SDK. Active development is underway to bridge the gap between the featuresets in the two SDKs. Currently for Python, the Direct Runner and Dataflow Runner are supported, and several streaming options were introduced in beta in version 2.5.0.

Spark also has a Python SDK called PySpark. As mentioned earlier, Scala code compiles to a bytecode that is executed by the JVM. PySpark uses Py4J, a library that enables Python programs to interact with the JVM and therefore access Java libraries, interact with Java objects, and register callbacks from Java. This allows PySpark to access native Spark objects like RDDs. Spark Structured Streaming supports file sources (local filesystems and HDFS-compatible systems like Cloud Storage or S3) and Kafka as streaming inputs.

Below are the main streaming input connectors for available for Beam and Spark DStreams in Python:

		Apache Beam	Apache Spark DStreams
File Systems	Local	io.textio	textFileStream
File Systems	HDFS	io.hadoopfilesystem	hadoopConfiguration (Access through `sc._jsc` with Py4J) and textFileStream
Object stores	Google Cloud Storage	io.gcp.gcsio	textFileStream
Object stores	S3	N/A	textFileStream
Messaging Queues	Kafka	N/A	KafkaUtils
	Kinesis	N/A	KinesisUtils
	Cloud Pub/Sub	io.gcp.pubsub	N/A
Other	Custom receivers	BoundedSource and RangeTracker	N/A

Connectors for other languages

Scala

Since Scala code is interoperable with Java and therefore has native compatibility with Java libraries (and vice versa), you can use the same Java connectors described above in your Scala programs. Apache Beam also has a Scala API open-sourced by Spotify.

Go

A Go SDK for Apache Beam is under active development. It is currently experimental and is not recommended for production. Spark does not have an official Go SDK.

R

Apache Beam does not have an official R SDK. Spark Structured Streaming is supported by an R SDK, but only for file sources as a streaming input.

Next steps

We hope this article inspired you to try new and interesting ways of connecting streaming sources to your Beam pipelines!

Check out the following links for further information:

See a full list of all built-in and in-progress I/O Transforms for Apache Beam.
Learn about some Apache Beam mobile gaming pipeline examples.

Connectors for Java

Connectors for Python

Connectors for other languages

Scala

Go

R

Next steps

Latest from the blog