
Authoring I/O Transforms - Overview

A guide for users who need to connect to a data store that isn’t supported by the Built-in I/O Transforms

Introduction

This guide covers how to implement I/O transforms in the Beam model. Beam pipelines use these read and write transforms to import data for processing, and write data to a store.

Reading and writing data in Beam is a parallel task, and using ParDo, GroupByKey, and the other built-in transforms is usually sufficient. Only rarely will you need the more specialized Source and Sink classes for specific features. Changes currently in progress (SplittableDoFn, BEAM-65) will eventually make Source unnecessary.

As you work on your I/O transform, be aware that the Beam community is eager to help anyone building a new I/O transform, and that there are many existing examples and helper classes to draw on.

Suggested steps for implementers

  1. Check out this guide and come up with your design. If you’d like, you can email the Beam dev mailing list with any questions you might have; it’s also a good place to check whether anyone else is already working on the same I/O transform.
  2. If you are planning to contribute your I/O transform to the Beam community, you’ll be going through the normal Beam contribution life cycle - see the Apache Beam Contribution Guide for more details.
  3. As you work on your I/O transform, see the PTransform Style Guide for guidance that applies specifically to writing I/O transforms.

Read transforms

Read transforms take data from outside of the Beam pipeline and produce PCollections of data.

For data stores or file types where the data can be read in parallel, you can think of the process as a mini-pipeline. This often consists of two steps:

  1. Splitting the data into parts to be read in parallel
  2. Reading from each of those parts

Each of those steps will be a ParDo, with a GroupByKey in between. The GroupByKey is an implementation detail, but for most runners it allows the runner to use different numbers of workers for the two steps: determining how to split the data into chunks to be read, and reading those chunks, which often benefits from more workers.

The GroupByKey will also allow Dynamic Work Rebalancing to occur (on supported runners).
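To make that shape concrete, here is a minimal sketch of such a composite read transform in the Java SDK. MyStoreRead, MyStoreClient, splitQuery, and read are hypothetical stand-ins for your own transform and client library, and Reshuffle.viaRandomKey() is used as one common way to introduce the GroupByKey-based redistribution described above.

    import java.util.List;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.PTransform;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.Reshuffle;
    import org.apache.beam.sdk.values.PBegin;
    import org.apache.beam.sdk.values.PCollection;

    // Hypothetical composite read transform for a fictional "MyStore".
    public class MyStoreRead extends PTransform<PBegin, PCollection<String>> {

      private final String query;

      public MyStoreRead(String query) {
        this.query = query;
      }

      @Override
      public PCollection<String> expand(PBegin input) {
        return input
            // Seed the mini-pipeline with a single element describing what to read.
            .apply("Seed", Create.of(query))
            // Step 1: split the read into parts that can be read independently.
            .apply("Split", ParDo.of(new DoFn<String, String>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                List<String> shards = MyStoreClient.splitQuery(c.element()); // hypothetical call
                for (String shard : shards) {
                  c.output(shard);
                }
              }
            }))
            // The GroupByKey implementation detail: redistributes the shards so the runner
            // can scale the split and read steps independently and, on supported runners,
            // apply dynamic work rebalancing between them.
            .apply("Reshuffle", Reshuffle.viaRandomKey())
            // Step 2: read each part and emit its records.
            .apply("Read", ParDo.of(new DoFn<String, String>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                for (String record : MyStoreClient.read(c.element())) { // hypothetical call
                  c.output(record);
                }
              }
            }));
      }
    }

Whether the redistribution is introduced with a Reshuffle or with an explicit key assignment followed by a GroupByKey is up to you; the important part is that the split step and the read step are separated by a runner-visible shuffle.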

Here are some examples of read transform implementations that use the “reading as a mini-pipeline” model when data can be read in parallel:

For data stores or files where reading cannot occur in parallel, reading is a simple task that can be accomplished with a single ParDo+GroupByKey. For example:
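Here is a rough sketch of such a sequential read: a single seed element followed by one ParDo that does all the reading. MyStoreClient.readAll and the "my-resource-id" seed are hypothetical, and the same imports and a Pipeline object named pipeline are assumed.

    // Hypothetical sequential read: one ParDo reads every record from a single seed element.
    PCollection<String> records =
        pipeline
            .apply("Seed", Create.of("my-resource-id"))
            .apply("ReadAll", ParDo.of(new DoFn<String, String>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                for (String record : MyStoreClient.readAll(c.element())) { // hypothetical call
                  c.output(record);
                }
              }
            }));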

When to implement using the Source API

The above discussion is in terms of ParDos - this is because Sources have proven to be tricky to implement. At this point in time, the recommendation is to use Source only if ParDo doesn’t meet your needs. A class derived from FileBasedSource is often the best option when reading from files.

If you’re trying to decide on whether or not to use Source, feel free to email the Beam dev mailing list and we can discuss the specific pros and cons of your case.

In some cases implementing a Source may still be necessary, or may result in better performance; for example, a BoundedSource can report its estimated size and read progress to the runner, which a plain ParDo cannot.

Write transforms

Write transforms are responsible for taking the contents of a PCollection and transferring that data outside of the Beam pipeline.

Write transforms can usually be implemented as a single ParDo that writes each record it receives to the data store.
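As a sketch of that pattern, the DoFn below (imports as in the read sketches above) opens a client per bundle, writes each record it receives, and flushes when the bundle finishes; MyStoreClient and its connect, write, flush, and close calls are hypothetical stand-ins for your data store's client library.

    // Hypothetical write DoFn: one client per bundle, flushed when the bundle finishes.
    class WriteToMyStoreFn extends DoFn<String, Void> {
      private transient MyStoreClient client; // hypothetical, non-serializable client

      @StartBundle
      public void startBundle() {
        client = MyStoreClient.connect("my-store-endpoint"); // hypothetical endpoint
      }

      @ProcessElement
      public void processElement(ProcessContext c) {
        client.write(c.element());
      }

      @FinishBundle
      public void finishBundle() throws Exception {
        client.flush();
        client.close();
      }
    }

    // Applying it to a PCollection<String> of records:
    // records.apply("WriteToMyStore", ParDo.of(new WriteToMyStoreFn()));

Depending on your client library, you might instead hold the connection open across bundles with @Setup and @Teardown, and use @StartBundle/@FinishBundle only for batching and flushing.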

TODO: this section needs further explanation.

When to implement using the Sink API

You are strongly discouraged from using the Sink class unless you are creating a FileBasedSink. Most of the time, a simple ParDo is all that’s necessary. If you think you have a case that is only possible using a Sink, please email the Beam dev mailing list.

Next steps

This guide is still in progress. There is an open issue to finish the guide: BEAM-1025.