apache_beam package

Subpackages

Submodules

apache_beam.error module

Python Dataflow error classes.

exception apache_beam.error.BeamError[source]

Bases: exceptions.Exception

Base class for all Beam errors.

exception apache_beam.error.PValueError[source]

Bases: apache_beam.error.BeamError

An error related to a PValue object (e.g. value is not computed).

exception apache_beam.error.PipelineError[source]

Bases: apache_beam.error.BeamError

An error in the pipeline object (e.g. a PValue not linked to it).

exception apache_beam.error.RunnerError[source]

Bases: apache_beam.error.BeamError

An error related to a Runner object (e.g. cannot find a runner to run).

exception apache_beam.error.RuntimeValueProviderError[source]

Bases: exceptions.RuntimeError

An error related to a ValueProvider object raised during runtime.

exception apache_beam.error.SideInputError[source]

Bases: apache_beam.error.BeamError

An error related to a side input to a parallel Do operation.

exception apache_beam.error.TransformError[source]

Bases: apache_beam.error.BeamError

An error related to a PTransform object.
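
All of the classes above except RuntimeValueProviderError derive from BeamError, so a single handler can catch the whole family. A minimal sketch:

import apache_beam as beam
from apache_beam.error import BeamError

try:
    p = beam.Pipeline('DirectRunner')
    p | 'Create' >> beam.Create([1, 2, 3])
    p.run()
except BeamError as e:
    # PValueError, PipelineError, RunnerError, SideInputError and
    # TransformError are all subclasses of BeamError.
    print('Beam error: %s' % e)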

apache_beam.pipeline module

Pipeline, the top-level Dataflow object.

A pipeline holds a DAG of data transforms. Conceptually the nodes of the DAG are transforms (PTransform objects) and the edges are values (mostly PCollection objects). The transforms take as inputs one or more PValues and output one or more PValues.

The pipeline offers functionality to traverse the graph. The actual operation to be executed for each node visited is specified through a runner object.

Typical usage:

# Create a pipeline object using a local runner for execution.
p = beam.Pipeline('DirectRunner')

# Add to the pipeline a "Create" transform. When executed this transform
# will produce a PCollection object with the specified values.
pcoll = p | 'Create' >> beam.Create([1, 2, 3])

# Another transform could be applied to pcoll, e.g., writing to a text file.
# For other transforms, refer to the transforms/ directory.
pcoll | 'Write' >> beam.io.WriteToText('./output')

# run() will execute the DAG stored in the pipeline. The execution of the
# nodes visited is done using the specified local runner.
p.run()

class apache_beam.pipeline.Pipeline(runner=None, options=None, argv=None)[source]

Bases: object

A pipeline object that manages a DAG of PValues and their PTransforms.

Conceptually the PValues are the DAG’s nodes and the PTransforms computing the PValues are the edges.

All the transforms applied to the pipeline must have distinct full labels. If the same transform instance needs to be applied more than once, the right shift operator should be used to designate new names (e.g. input | 'label' >> my_transform).
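
For example, the same transform instance can be reused under two distinct labels (a sketch; names are illustrative and p is an existing Pipeline):

double = beam.Map(lambda x: x * 2)
nums = p | 'Create' >> beam.Create([1, 2, 3])
once = nums | 'DoubleOnce' >> double
twice = once | 'DoubleAgain' >> double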

apply(transform, pvalueish=None, label=None)[source]

Applies a custom transform using the pvalueish specified.

Parameters:
  • transform – the PTransform to apply.
  • pvalueish – the input for the PTransform (typically a PCollection).
  • label – label of the PTransform.
Raises:
  • TypeError – if the transform object extracted from the argument list is not a PTransform.
  • RuntimeError – if the transform object was already applied to this pipeline and needs to be cloned in order to apply again.
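
The pipe operator used throughout these examples is shorthand that ends up calling apply(). A sketch of the explicit form, assuming a pipeline p already exists:

nums = p | 'Create' >> beam.Create([1, 2, 3])
# Roughly equivalent to: nums | 'Double' >> beam.Map(lambda x: x * 2)
doubled = p.apply(beam.Map(lambda x: x * 2), pvalueish=nums, label='Double')
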
static from_runner_api(proto, runner, options)[source]

For internal use only; no backwards-compatibility guarantees.

options
run(test_runner_api=True)[source]

Runs the pipeline. Returns whatever our runner returns after running.
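
With most runners the returned result object can be used to block until execution completes. A sketch, assuming the result exposes wait_until_finish():

result = p.run()
# Block until the runner reports completion (available on the
# PipelineResult returned by most runners).
result.wait_until_finish()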

to_runner_api()[source]

For internal use only; no backwards-compatibility guarantees.

visit(visitor)[source]

Visits depth-first every node of a pipeline’s DAG.

Runner-internal implementation detail; no backwards-compatibility guarantees.

Parameters:

visitor – PipelineVisitor object whose callbacks will be called for each node visited. See PipelineVisitor comments.

Raises:
  • TypeError – if node is specified and is not a PValue.
  • pipeline.PipelineError – if node is specified and does not belong to this pipeline instance.
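
A minimal visitor sketch; visit_transform is one of the PipelineVisitor callbacks (see the PipelineVisitor comments for the authoritative interface), and the counting logic here is purely illustrative:

from apache_beam.pipeline import PipelineVisitor

class CountingVisitor(PipelineVisitor):
    """Counts transform nodes as they are visited (illustrative only)."""

    def __init__(self):
        self.num_transforms = 0

    def visit_transform(self, transform_node):
        self.num_transforms += 1

visitor = CountingVisitor()
p.visit(visitor)
print('visited %d transform nodes' % visitor.num_transforms)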

apache_beam.pvalue module

PValue, PCollection: one node of a dataflow graph.

A node of a dataflow processing graph is a PValue. Currently, there is only one type: PCollection (a potentially very large set of arbitrary values). Once created, a PValue belongs to a pipeline and has an associated transform (of type PTransform), which describes how the value will be produced when the pipeline gets executed.

class apache_beam.pvalue.PCollection(pipeline, tag=None, element_type=None)[source]

Bases: apache_beam.pvalue.PValue

A container of multiple values (potentially huge).

Dataflow users should not construct PCollection objects directly in their pipelines.

static from_runner_api(proto, context)[source]
to_runner_api(context)[source]
windowing
class apache_beam.pvalue.TaggedOutput(tag, value)[source]

Bases: object

An object representing a tagged value.

ParDo, Map, and FlatMap transforms can emit values on multiple outputs which are distinguished by string tags. The DoFn will return plain values if it wants to emit on the main output and TaggedOutput objects if it wants to emit a value on a specific tagged output.
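
For example, a DoFn might send odd numbers to a tagged output named 'odd' while even numbers go to the main output; with_outputs() on the ParDo exposes both result PCollections (a sketch; nums is an illustrative integer PCollection):

import apache_beam as beam
from apache_beam import pvalue

class SplitEvenOdd(beam.DoFn):
    def process(self, element):
        if element % 2 == 0:
            yield element  # plain value: goes to the main output
        else:
            yield pvalue.TaggedOutput('odd', element)

results = nums | beam.ParDo(SplitEvenOdd()).with_outputs('odd', main='even')
evens, odds = results.even, results.odd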

class apache_beam.pvalue.AsSingleton(pcoll, default_value=<object object>)[source]

Bases: apache_beam.pvalue.AsSideInput

Marker specifying that an entire PCollection is to be used as a side input.

When a PCollection is supplied as a side input to a PTransform, it is necessary to indicate whether the entire PCollection should be made available as a PTransform side argument (in the form of an iterable), or whether just one value should be pulled from the PCollection and supplied as the side argument (as an ordinary value).

Wrapping a PCollection side input argument to a PTransform in this container (e.g. data.apply('label', MyPTransform(), AsSingleton(my_side_input))) selects the latter behavior.

The input PCollection must contain exactly one value per window, unless a default is given, in which case it may be empty.

element_type
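
For instance, a one-element PCollection can be passed to a transform as an ordinary value (a sketch; names are illustrative):

from apache_beam import pvalue

threshold = p | 'MakeThreshold' >> beam.Create([10])
# The side input arrives as the plain value 10, not as an iterable.
big = nums | 'KeepBig' >> beam.Filter(
    lambda x, limit: x > limit, limit=pvalue.AsSingleton(threshold))
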
class apache_beam.pvalue.AsIter(pcoll)[source]

Bases: apache_beam.pvalue.AsSideInput

Marker specifying that an entire PCollection is to be used as a side input.

When a PCollection is supplied as a side input to a PTransform, it is necessary to indicate whether the entire PCollection should be made available as a PTransform side argument (in the form of an iterable), or whether just one value should be pulled from the PCollection and supplied as the side argument (as an ordinary value).

Wrapping a PCollection side input argument to a PTransform in this container (e.g. data.apply('label', MyPTransform(), AsIter(my_side_input))) selects the former behavior.

element_type
class apache_beam.pvalue.AsList(pcoll)[source]

Bases: apache_beam.pvalue.AsSideInput

Marker specifying that an entire PCollection is to be used as a side input.

Intended for use in side-argument specification—the same places where AsSingleton and AsIter are used, but forces materialization of this PCollection as a list.

Parameters: pcoll – Input pcollection.
Returns: An AsList-wrapper around a PCollection whose one element is a list containing all elements in pcoll.
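
AsIter and AsList are used the same way; AsList simply forces materialization as a list up front. A sketch with AsList supplying an extra argument (names are illustrative):

stopwords = p | 'MakeStopwords' >> beam.Create(['a', 'the', 'of'])
# Each call receives the full stopword list as its second argument.
kept = words | 'DropStopwords' >> beam.Filter(
    lambda word, stops: word not in stops, stops=pvalue.AsList(stopwords))
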
class apache_beam.pvalue.AsDict(pcoll)[source]

Bases: apache_beam.pvalue.AsSideInput

Marker specifying a PCollection to be used as an indexable side input.

Intended for use in side-argument specification—the same places where AsSingleton and AsIter are used, but returns an interface that allows key lookup.

Parameters: pcoll – Input pcollection. All elements should be key-value pairs (i.e. 2-tuples) with unique keys.
Returns: An AsDict-wrapper around a PCollection whose one element is a dict with entries for uniquely-keyed pairs in pcoll.
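
A sketch of key lookup through an AsDict side input (names are illustrative; amounts holds (currency, amount) pairs):

rates = p | 'MakeRates' >> beam.Create([('usd', 1.0), ('eur', 1.1)])
converted = amounts | 'Convert' >> beam.Map(
    lambda pair, r: pair[1] * r[pair[0]], r=pvalue.AsDict(rates))
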
class apache_beam.pvalue.EmptySideInput[source]

Bases: object

Value indicating when a singleton side input was empty.

If a PCollection was furnished as a singleton side input to a PTransform, and that PCollection was empty, then this value is supplied to the DoFn in the place where a value from a non-empty PCollection would have gone. This alerts the DoFn that the side input PCollection was empty. Users may want to check whether side input values are EmptySideInput, but they will very likely never want to create new instances of this class themselves.
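
A side-input-consuming function can therefore guard against the empty case (a sketch):

from apache_beam.pvalue import EmptySideInput

def scale(x, factor):
    # factor is an EmptySideInput instance when the singleton side-input
    # PCollection was empty and AsSingleton was given no default.
    if isinstance(factor, EmptySideInput):
        return x
    return x * factor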

apache_beam.version module

Apache Beam SDK version information and utilities.

Module contents

Apache Beam SDK for Python.

Apache Beam <https://beam.apache.org/> provides a simple, powerful programming model for building both batch and streaming parallel data processing pipelines.

The Apache Beam SDK for Python provides access to Apache Beam capabilities from the Python programming language.

Status

The SDK is still early in its development, and significant changes should be expected before the first stable version.

Overview

The key concepts in this programming model are

  • PCollection: represents a collection of data, which could be bounded or unbounded in size.
  • PTransform: represents a computation that transforms input PCollections into output PCollections.
  • Pipeline: manages a directed acyclic graph of PTransforms and PCollections that is ready for execution.
  • Runner: specifies where and how the Pipeline should execute.
  • Reading and Writing Data: your pipeline can read from an external source and write to an external data sink.

Typical usage

At the top of your source file:

import apache_beam as beam

After this import statement:

  • transform classes are available as beam.FlatMap, beam.GroupByKey, etc.
  • the Pipeline class is available as beam.Pipeline
  • text read/write transforms are available as beam.io.ReadFromText and beam.io.WriteToText
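
Putting these together, a short word-count-style pipeline might look like the sketch below (file paths are placeholders):

import apache_beam as beam

p = beam.Pipeline('DirectRunner')
(p
 | 'Read' >> beam.io.ReadFromText('./input.txt')
 | 'Split' >> beam.FlatMap(lambda line: line.split())
 | 'Pair' >> beam.Map(lambda word: (word, 1))
 | 'Group' >> beam.GroupByKey()
 | 'Count' >> beam.Map(lambda kv: (kv[0], sum(kv[1])))
 | 'Write' >> beam.io.WriteToText('./counts'))
p.run()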

Examples

The examples subdirectory has some examples.