apache_beam.pvalue module

PValue, PCollection: one node of a dataflow graph.

A node of a dataflow processing graph is a PValue. Currently, there is only one type: PCollection (a potentially very large set of arbitrary values). Once created, a PValue belongs to a pipeline and has an associated transform (of type PTransform), which describes how the value will be produced when the pipeline gets executed.

class apache_beam.pvalue.PCollection(pipeline, tag=None, element_type=None, windowing=None, is_bounded=True)[source]

Bases: apache_beam.pvalue.PValue, typing.Generic

A container for multiple values (potentially huge).

Dataflow users should not construct PCollection objects directly in their pipelines.

Initializes a PValue with all arguments hidden behind keyword arguments.

Parameters:
  • pipeline – Pipeline object for this PValue.
  • tag – Tag of this PValue.
  • element_type – The type of this PValue.
windowing
static from_(pcoll, is_bounded=None)[source]

Create a PCollection, using another PCollection as a starting point.

Transfers relevant attributes.

to_runner_api(context)[source]
static from_runner_api(proto, context)[source]
class apache_beam.pvalue.TaggedOutput(tag, value)[source]

Bases: object

An object representing a tagged value.

ParDo, Map, and FlatMap transforms can emit values on multiple outputs which are distinguished by string tags. The DoFn will return plain values if it wants to emit on the main output and TaggedOutput objects if it wants to emit a value on a specific tagged output.

class apache_beam.pvalue.AsSideInput(pcoll)[source]

Bases: object

Marker specifying that a PCollection will be used as a side input.

When a PCollection is supplied as a side input to a PTransform, it is necessary to indicate how the PCollection should be made available as a PTransform side argument (e.g. in the form of an iterable, mapping, or single value). This class is the superclass of all the various options, and should not be instantiated directly. (See instead AsSingleton, AsIter, etc.)

element_type
to_runner_api(context)[source]
static from_runner_api(proto, context)[source]
requires_keyed_input()[source]
class apache_beam.pvalue.AsSingleton(pcoll, default_value=<object object>)[source]

Bases: apache_beam.pvalue.AsSideInput

Marker specifying that an entire PCollection is to be used as a side input.

When a PCollection is supplied as a side input to a PTransform, it is necessary to indicate whether the entire PCollection should be made available as a PTransform side argument (in the form of an iterable), or whether just one value should be pulled from the PCollection and supplied as the side argument (as an ordinary value).

Wrapping a PCollection side input argument to a PTransform in this container (e.g., data.apply('label', MyPTransform(), AsSingleton(my_side_input))) selects the latter behavior.

The input PCollection must contain exactly one value per window, unless a default is given, in which case it may be empty.

element_type
class apache_beam.pvalue.AsIter(pcoll)[source]

Bases: apache_beam.pvalue.AsSideInput

Marker specifying that an entire PCollection is to be used as a side input.

When a PCollection is supplied as a side input to a PTransform, it is necessary to indicate whether the entire PCollection should be made available as a PTransform side argument (in the form of an iterable), or whether just one value should be pulled from the PCollection and supplied as the side argument (as an ordinary value).

Wrapping a PCollection side input argument to a PTransform in this container (e.g., data.apply('label', MyPTransform(), AsIter(my_side_input))) selects the former behavior.

element_type
class apache_beam.pvalue.AsList(pcoll)[source]

Bases: apache_beam.pvalue.AsSideInput

Marker specifying that an entire PCollection is to be used as a side input.

Intended for use in side-argument specification—the same places where AsSingleton and AsIter are used, but forces materialization of this PCollection as a list.

Parameters: pcoll – Input PCollection.
Returns: An AsList wrapper around a PCollection whose one element is a list containing all elements in pcoll.
class apache_beam.pvalue.AsDict(pcoll)[source]

Bases: apache_beam.pvalue.AsSideInput

Marker specifying a PCollection to be used as an indexable side input.

Intended for use in side-argument specification—the same places where AsSingleton and AsIter are used, but returns an interface that allows key lookup.

Parameters: pcoll – Input PCollection. All elements should be key-value pairs (i.e. 2-tuples) with unique keys.
Returns: An AsDict wrapper around a PCollection whose one element is a dict with entries for uniquely-keyed pairs in pcoll.
class apache_beam.pvalue.AsMultiMap(pcoll)[source]

Bases: apache_beam.pvalue.AsSideInput

Marker specifying a PCollection to be used as an indexable side input.

Similar to AsDict, but multiple values may be associated per key, and the keys are fetched lazily rather than all having to fit in memory.

Intended for use in side-argument specification—the same places where AsSingleton and AsIter are used, but returns an interface that allows key lookup.

requires_keyed_input()[source]
class apache_beam.pvalue.EmptySideInput[source]

Bases: object

Value indicating when a singleton side input was empty.

If a PCollection was furnished as a singleton side input to a PTransform, and that PCollection was empty, then this value is supplied to the DoFn in the place where a value from a non-empty PCollection would have gone. This alerts the DoFn that the side input PCollection was empty. Users may want to check whether side input values are EmptySideInput, but they will very likely never want to create new instances of this class themselves.

class apache_beam.pvalue.Row(**kwargs)[source]

Bases: object

A dynamic schema’d row object.

This object's attributes are initialized from the keywords passed into its constructor, e.g. Row(x=3, y=4) will create a Row with two attributes x and y.

More importantly, when a Row object is returned from a Map, FlatMap, or DoFn, type inference is able to deduce the schema of the resulting PCollection, e.g.

pc | beam.Map(lambda x: Row(x=x, y=0.5 * x))

when applied to a PCollection of ints will produce a PCollection with schema (x=int, y=float).

Note that in Beam 2.30.0 and later, Row objects are sensitive to field order. So Row(x=3, y=4) is not considered equal to Row(y=4, x=3).

as_dict()[source]