Splittable DoFn in Apache Beam is Ready to Use

We are pleased to announce that Splittable DoFn (SDF) is ready for use in the Beam Python, Java, and Go SDKs for versions 2.25.0 and later.

In 2017, the Splittable DoFn blog post proposed building the Splittable DoFn APIs as the new recommended way to write I/O connectors. Splittable DoFn is a generalization of DoFn that gives it the core capabilities of Source while retaining DoFn’s syntax, flexibility, modularity, and ease of coding. As a result, complex I/O connectors become much easier to develop, with simpler and more reusable code.

SDF has three advantages over the existing UnboundedSource and BoundedSource:

  • SDF provides a unified set of APIs to handle both unbounded and bounded cases.
  • SDF enables reading from source descriptors dynamically.
    • Taking KafkaIO as an example: with the UnboundedSource/BoundedSource API, you must specify the topic and partition you want to read from at pipeline construction time. UnboundedSource/BoundedSource has no way to accept topics and partitions as inputs at execution time. With SDF, this capability is built in.
  • SDF can be placed at any node in a pipeline and still be split.
    • UnboundedSource/BoundedSource has to be the root node of the pipeline to gain the performance benefits of splitting, which rules out many real-world use cases. SDF has no such limitation.
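
The last two points can be illustrated with a small sketch. The code below is a toy model in plain Python, not the Beam API: the names ToyOffsetTracker, OffsetRange, and process, and the descriptor string "topic-a/partition-0", are hypothetical and exist only for illustration. It assumes an offset-based restriction and mirrors the general idea of a tracker whose try_claim call guards each unit of work and whose try_split call hands off the unclaimed remainder. The source descriptor arrives as an ordinary input element at execution time, and the work is split while processing is already under way.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass(frozen=True)
class OffsetRange:
    """Half-open range [start, stop) of offsets to process."""
    start: int
    stop: int


class ToyOffsetTracker:
    """Hypothetical, simplified stand-in for an offset restriction tracker."""

    def __init__(self, restriction: OffsetRange):
        self._range = restriction
        self._last_claimed: Optional[int] = None

    def current_restriction(self) -> OffsetRange:
        return self._range

    def try_claim(self, position: int) -> bool:
        """Claim `position`; return False once it is outside the range."""
        if position >= self._range.stop:
            return False
        self._last_claimed = position
        return True

    def try_split(self, fraction: float) -> Optional[OffsetRange]:
        """Shrink this tracker's range and return the residual, if any."""
        next_pos = (self._range.start if self._last_claimed is None
                    else self._last_claimed + 1)
        remaining = self._range.stop - next_pos
        split_at = next_pos + int(remaining * fraction)
        if split_at >= self._range.stop:
            return None  # nothing left to hand off
        residual = OffsetRange(split_at, self._range.stop)
        self._range = OffsetRange(self._range.start, split_at)
        return residual


def process(descriptor: str, tracker: ToyOffsetTracker) -> Iterator[str]:
    """Emit one record per claimed offset of the given source descriptor."""
    position = tracker.current_restriction().start
    while tracker.try_claim(position):
        yield f"{descriptor}:{position}"
        position += 1


# The descriptor is plain input data, known only at execution time.
descriptor = "topic-a/partition-0"  # hypothetical name
tracker = ToyOffsetTracker(OffsetRange(0, 10))
reader = process(descriptor, tracker)

first = [next(reader) for _ in range(3)]  # offsets 0, 1, 2 are claimed
residual = tracker.try_split(0.5)         # split the unclaimed remainder
rest = list(reader)                       # finish the shrunken primary range

# The residual could now be processed elsewhere, for example by another worker.
other = list(process(descriptor, ToyOffsetTracker(residual)))
print(first + rest + other)
```

In this sketch every offset is emitted exactly once across the primary and the residual, which is the property a real runner relies on when it splits a restriction dynamically.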

Now that SDF is ready to use with all the improvements mentioned above, it is the recommended way to build new I/O connectors. Try building your own Splittable DoFn by following the programming guide. The Beam SDKs provide many common utility classes, such as common types of RestrictionTracker and WatermarkEstimator, to help you get started easily. For existing I/O connectors, we have wrapped the UnboundedSource and BoundedSource implementations in Splittable DoFns, but we still encourage developers to convert UnboundedSource/BoundedSource connectors into native Splittable DoFn implementations to gain more performance benefits.

Many thanks to every contributor who brought this highly anticipated design into the data processing world. We are excited to see users benefit from SDF.

Below are some real-world SDF examples for you to explore.

Real-world Splittable DoFn examples

Java Examples

Python Examples

Go Examples

  • textio.ReadSdf implements reading from text files using a splittable DoFn.