Basics of the Beam model

Suppose you have a data processing engine that can pretty easily process graphs of operations. You want to integrate it with the Beam ecosystem to get access to other languages, great event time processing, and a library of connectors. You need to know the core vocabulary:

These concepts may be very similar to your processing engine’s concepts. Since Beam’s design is for cross-language operation and reusable libraries of transforms, there are some special features worth highlighting.

Pipeline

A pipeline in Beam is a graph of PTransforms operating on PCollections. A pipeline is constructed by a user in their SDK of choice, and makes its way to your runner either via the SDK directly or via the Runner API’s RPC interfaces.

PTransforms

A PTransform represents a data processing operation, or a step, in your pipeline. A PTransform can be applied to one or more PCollection objects as input which performs some processing on the elements of that PCollection and produces zero or more output PCollection objects.

PCollections

A PCollection is an unordered bag of elements. Your runner will be responsible for storing these elements. There are some major aspects of a PCollection to note:

Bounded vs Unbounded

A PCollection may be bounded or unbounded.

These derive from the intuitions of batch and stream processing, but the two are unified in Beam and bounded and unbounded PCollections can coexist in the same pipeline. If your runner can only support bounded PCollections, you’ll need to reject pipelines that contain unbounded PCollections. If your runner is only really targeting streams, there are adapters in our support code to convert everything to APIs targeting unbounded data.

Timestamps

Every element in a PCollection has a timestamp associated with it.

When you execute a primitive connector to some storage system, that connector is responsible for providing initial timestamps. Your runner will need to propagate and aggregate timestamps. If the timestamp is not important, as with certain batch processing jobs where elements do not denote events, they will be the minimum representable timestamp, often referred to colloquially as “negative infinity”.

Watermarks

Every PCollection has to have a watermark that estimates how complete the PCollection is.

The watermark is a guess that “we’ll never see an element with an earlier timestamp”. Sources of data are responsible for producing a watermark. Your runner needs to implement watermark propagation as PCollections are processed, merged, and partitioned.

The contents of a PCollection are complete when a watermark advances to “infinity”. In this manner, you may discover that an unbounded PCollection is finite.

Windowed elements

Every element in a PCollection resides in a window. No element resides in multiple windows (two elements can be equal except for their window, but they are not the same).

When elements are read from the outside world they arrive in the global window. When they are written to the outside world, they are effectively placed back into the global window (any writing transform that doesn’t take this perspective probably risks data loss).

A window has a maximum timestamp, and when the watermark exceeds this plus user-specified allowed lateness the window is expired. All data related to an expired window may be discarded at any time.

Coder

Every PCollection has a coder, a specification of the binary format of the elements.

In Beam, the user’s pipeline may be written in a language other than the language of the runner. There is no expectation that the runner can actually deserialize user data. So the Beam model operates principally on encoded data - “just bytes”. Each PCollection has a declared encoding for its elements, called a coder. A coder has a URN that identifies the encoding, and may have additional sub-coders (for example, a coder for lists may contain a coder for the elements of the list). Language-specific serialization techniques can, and frequently are used, but there are a few key formats - such as key-value pairs and timestamps - that are common so your runner can understand them.

Windowing Strategy

Every PCollection has a windowing strategy, a specification of essential information for grouping and triggering operations.

The details will be discussed below when we discuss the Window primitive, which sets up the windowing strategy, and GroupByKey primitive, which has behavior governed by the windowing strategy.

User-Defined Functions (UDFs)

Beam has seven varieties of user-defined function (UDF). A Beam pipeline may contain UDFs written in a language other than your runner, or even multiple languages in the same pipeline (see the Runner API) so the definitions are language-independent (see the Fn API).

The UDFs of Beam are:

The various types of user-defined functions will be described further alongside the PTransforms that use them.

Runner

The term “runner” is used for a couple of things. It generally refers to the software that takes a Beam pipeline and executes it somehow. Often, this is the translation code that you write. It usually also includes some customized operators for your data processing engine, and is sometimes used to refer to the full stack.

A runner has just a single method run(Pipeline). From here on, I will often use code font for proper nouns in our APIs, whether or not the identifiers match across all SDKs.

The run(Pipeline) method should be asynchronous and results in a PipelineResult which generally will be a job descriptor for your data processing engine, providing methods for checking its status, canceling it, and waiting for it to terminate.