Design Your Pipeline

This page helps you design your Apache Beam pipeline. It includes information about how to determine your pipeline’s structure, how to choose which transforms to apply to your data, and how to determine your input and output methods.

Before reading this section, it is recommended that you become familiar with the information in the Beam programming guide.

What to consider when designing your pipeline

When designing your Beam pipeline, consider a few basic questions:

A basic pipeline

The simplest pipelines represent a linear flow of operations, as shown in Figure 1 below:

A linear pipeline.

Figure 1: A linear pipeline.

However, your pipeline can be significantly more complex. A pipeline represents a Directed Acyclic Graph of steps. It can have multiple input sources, multiple output sinks, and its operations (transforms) can output multiple PCollections. The following examples show some of the different shapes your pipeline can take.

Branching PCollections

It’s important to understand that transforms do not consume PCollections; instead, they consider each individual element of a PCollection and create a new PCollection as output. This way, you can do different things to different elements in the same PCollection.

Multiple transforms process the same PCollection

You can use the same PCollection as input for multiple transforms without consuming the input or altering it.

The pipeline illustrated in Figure 2 below reads its input, first names (Strings), from a single source, a database table, and creates a PCollection of table rows. Then, the pipeline applies multiple transforms to the same PCollection. Transform A extracts all the names in that PCollection that start with the letter ‘A’, and Transform B extracts all the names in that PCollection that start with the letter ‘B’. Both transforms A and B have the same input PCollection.

A pipeline with multiple transforms. Note that the PCollection of table rows is processed by two transforms.

Figure 2: A pipeline with multiple transforms. Note that the PCollection of the database table rows is processed by two transforms.

A single transform that uses side outputs

Another way to branch a pipeline is to have a single transform output to multiple PCollections by using side outputs. Transforms that use side outputs, process each element of the input once, and allow you to output to zero or more PCollections.

Figure 3 below illustrates the same example described above, but with one transform that uses a side output; Names that start with ‘A’ are added to the output PCollection, and names that start with ‘B’ are added to the side output PCollection.

A pipeline with a transform that outputs multiple PCollections.

Figure 3: A pipeline with a transform that outputs multiple PCollections.

The pipeline in Figure 2 contains two transforms that process the elements in the same input PCollection. One transform uses the following logic pattern:

if (starts with 'A') { outputToPCollectionA }

while the other transform uses:

if (starts with 'B') { outputToPCollectionB }

Because each transform reads the entire input PCollection, each element in the input PCollection is processed twice.

The pipeline in Figure 3 performs the same operation in a different way - with only one transform that uses the logic

if (starts with 'A') { outputToPCollectionA } else if (starts with 'B') { outputToPCollectionB }

where each element in the input PCollection is processed once.

You can use either mechanism to produce multiple output PCollections. However, using side outputs makes more sense if the transform’s computation per element is time-consuming.

Merging PCollections

Often, after you’ve branched your PCollection into multiple PCollections via multiple transforms, you’ll want to merge some or all of those resulting PCollections back together. You can do so by using one of the following:

The example depicted in Figure 4 below is a continuation of the example illustrated in Figure 2 in the section above. After branching into two PCollections, one with names that begin with ‘A’ and one with names that begin with ‘B’, the pipeline merges the two together into a single PCollection that now contains all names that begin with either ‘A’ or ‘B’. Here, it makes sense to use Flatten because the PCollections being merged both contain the same type.

Part of a pipeline that merges multiple PCollections.

Figure 4: Part of a pipeline that merges multiple PCollections.

Multiple sources

Your pipeline can read its input from one or more sources. If your pipeline reads from multiple sources and the data from those sources is related, it can be useful to join the inputs together. In the example illustrated in Figure 5 below, the pipeline reads names and addresses from a database table, and names and order numbers from a text file. The pipeline then uses CoGroupByKey to join this information, where the key is the name; the resulting PCollection contains all the combinations of names, addresses, and orders.

A pipeline with multiple input sources.

Figure 5: A pipeline with multiple input sources.

What’s next