Apache Beam SDK for Python

Apache Beam provides a simple, powerful programming model for building both batch and streaming parallel data processing pipelines.

The Apache Beam SDK for Python provides access to Apache Beam capabilities from the Python programming language.

Status

The SDK is still early in its development, and significant changes should be expected before the first stable version.

Overview

The key concepts in this programming model are:

  • PCollection: represents a collection of data, which could be bounded or unbounded in size.
  • PTransform: represents a computation that transforms input PCollections into output PCollections.
  • Pipeline: manages a directed acyclic graph of PTransforms and PCollections that is ready for execution.
  • PipelineRunner: specifies where and how the pipeline should execute.
  • Read: read from an external source.
  • Write: write to an external data sink.

Typical usage

At the top of your source file:

import apache_beam as beam

After this import statement

Examples

The examples subdirectory contains sample code.

Subpackages