apache_beam package¶
Subpackages¶
- apache_beam.coders package
- Submodules
- apache_beam.coders.coder_impl module
- apache_beam.coders.coders module
- apache_beam.coders.coders_test_common module
- apache_beam.coders.observable module
- apache_beam.coders.proto2_coder_test_messages_pb2 module
- apache_beam.coders.slow_stream module
- apache_beam.coders.stream module
- apache_beam.coders.typecoders module
- Module contents
- apache_beam.internal package
- apache_beam.io package
- Subpackages
- apache_beam.io.gcp package
- Submodules
- apache_beam.io.avroio module
- apache_beam.io.concat_source module
- apache_beam.io.filebasedsink module
- apache_beam.io.filebasedsource module
- apache_beam.io.filesystem module
- apache_beam.io.filesystems module
- apache_beam.io.iobase module
- apache_beam.io.localfilesystem module
- apache_beam.io.range_trackers module
- apache_beam.io.source_test_utils module
- apache_beam.io.textio module
- apache_beam.io.tfrecordio module
- Module contents
- apache_beam.metrics package
- apache_beam.options package
- apache_beam.portability package
- Subpackages
- apache_beam.portability.api package
- Submodules
- apache_beam.portability.api.beam_fn_api_pb2 module
- apache_beam.portability.api.beam_fn_api_pb2_grpc module
- apache_beam.portability.api.beam_runner_api_pb2 module
- apache_beam.portability.api.beam_runner_api_pb2_grpc module
- apache_beam.portability.api.standard_window_fns_pb2 module
- apache_beam.portability.api.standard_window_fns_pb2_grpc module
- Module contents
- Module contents
- apache_beam.runners package
- Subpackages
- apache_beam.runners.dataflow package
- Submodules
- apache_beam.runners.dataflow.dataflow_metrics module
- apache_beam.runners.dataflow.dataflow_runner module
- apache_beam.runners.dataflow.ptransform_overrides module
- apache_beam.runners.dataflow.test_dataflow_runner module
- Module contents
- apache_beam.runners.direct package
- Submodules
- apache_beam.runners.direct.bundle_factory module
- apache_beam.runners.direct.clock module
- apache_beam.runners.direct.consumer_tracking_pipeline_visitor module
- apache_beam.runners.direct.direct_metrics module
- apache_beam.runners.direct.direct_runner module
- apache_beam.runners.direct.evaluation_context module
- apache_beam.runners.direct.executor module
- apache_beam.runners.direct.helper_transforms module
- apache_beam.runners.direct.transform_evaluator module
- apache_beam.runners.direct.util module
- apache_beam.runners.direct.watermark_manager module
- Module contents
- Submodules
- apache_beam.runners.common module
- apache_beam.runners.pipeline_context module
- apache_beam.runners.runner module
- Module contents
- apache_beam.testing package
- apache_beam.transforms package
- Submodules
- apache_beam.transforms.combiners module
- apache_beam.transforms.core module
- apache_beam.transforms.cy_combiners module
- apache_beam.transforms.display module
- apache_beam.transforms.ptransform module
- apache_beam.transforms.sideinputs module
- apache_beam.transforms.timeutil module
- apache_beam.transforms.trigger module
- apache_beam.transforms.util module
- apache_beam.transforms.window module
- Module contents
- apache_beam.typehints package
- apache_beam.utils package
- Submodules
- apache_beam.utils.annotations module
- apache_beam.utils.counters module
- apache_beam.utils.plugin module
- apache_beam.utils.processes module
- apache_beam.utils.profiler module
- apache_beam.utils.proto_utils module
- apache_beam.utils.retry module
- apache_beam.utils.timestamp module
- apache_beam.utils.urns module
- apache_beam.utils.windowed_value module
- Module contents
Submodules¶
apache_beam.error module¶
Python Dataflow error classes.
exception apache_beam.error.BeamError[source]¶
Bases: exceptions.Exception
Base class for all Beam errors.

exception apache_beam.error.PValueError[source]¶
Bases: apache_beam.error.BeamError
An error related to a PValue object (e.g. value is not computed).

exception apache_beam.error.PipelineError[source]¶
Bases: apache_beam.error.BeamError
An error in the pipeline object (e.g. a PValue not linked to it).

exception apache_beam.error.RunnerError[source]¶
Bases: apache_beam.error.BeamError
An error related to a Runner object (e.g. cannot find a runner to run).

exception apache_beam.error.RuntimeValueProviderError[source]¶
Bases: exceptions.RuntimeError
An error related to a ValueProvider object raised during runtime.

exception apache_beam.error.SideInputError[source]¶
Bases: apache_beam.error.BeamError
An error related to a side input to a parallel Do operation.

exception apache_beam.error.TransformError[source]¶
Bases: apache_beam.error.BeamError
An error related to a PTransform object.
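All of these except RuntimeValueProviderError derive from BeamError, so a single handler can catch the whole family. A minimal sketch (the DirectRunner pipeline is just an arbitrary example):

  import apache_beam as beam
  from apache_beam.error import BeamError

  try:
    with beam.Pipeline('DirectRunner') as p:
      p | 'Create' >> beam.Create([1, 2, 3])
  except BeamError as e:
    # Catches PValueError, PipelineError, RunnerError, SideInputError
    # and TransformError alike, since all subclass BeamError.
    print('Beam error: %s' % e)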
apache_beam.pipeline module¶
Pipeline, the top-level Dataflow object.
A pipeline holds a DAG of data transforms. Conceptually the nodes of the DAG are transforms (PTransform objects) and the edges are values (mostly PCollection objects). The transforms take as inputs one or more PValues and output one or more PValues.
The pipeline offers functionality to traverse the graph. The actual operation to be executed for each node visited is specified through a runner object.
Typical usage:
  # Create a pipeline object using a local runner for execution.
  with beam.Pipeline('DirectRunner') as p:

    # Add to the pipeline a "Create" transform. When executed this
    # transform will produce a PCollection object with the specified values.
    pcoll = p | 'Create' >> beam.Create([1, 2, 3])

    # Another transform could be applied to pcoll, e.g., writing to a text
    # file. For other transforms, refer to the transforms/ directory.
    pcoll | 'Write' >> beam.io.WriteToText('./output')

    # run() will execute the DAG stored in the pipeline. The execution of
    # the nodes visited is done using the specified local runner.
class apache_beam.pipeline.Pipeline(runner=None, options=None, argv=None)[source]¶
Bases: object
A pipeline object that manages a DAG of PValues and their PTransforms.
Conceptually the PValues are the DAG's nodes and the PTransforms computing the PValues are the edges.
All the transforms applied to the pipeline must have distinct full labels. If the same transform instance needs to be applied more than once, the right shift operator should be used to designate new names (e.g. input | 'label' >> my_transform).
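For example, one Map instance can be applied twice as long as each application gets its own label (a minimal sketch):

  import apache_beam as beam

  with beam.Pipeline('DirectRunner') as p:
    double = beam.Map(lambda x: x * 2)
    numbers = p | beam.Create([1, 2, 3])
    # The right shift operator gives each application of the same
    # transform instance a distinct full label.
    once = numbers | 'DoubleOnce' >> double
    twice = once | 'DoubleAgain' >> double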
apply(transform, pvalueish=None, label=None)[source]¶
Applies a custom transform using the pvalueish specified.
Parameters:
- transform – the PTransform to apply.
- pvalueish – the input for the PTransform (typically a PCollection).
- label – label of the PTransform.
Raises:
- TypeError – if the transform object extracted from the argument list is not a PTransform.
- RuntimeError – if the transform object was already applied to this pipeline and needs to be cloned in order to apply again.
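The pipe operator used in the usage example above ultimately routes through this method; a sketch of the explicit spelling, following the signature documented here:

  import apache_beam as beam

  p = beam.Pipeline('DirectRunner')
  # Equivalent to: pcoll = p | 'Create' >> beam.Create([1, 2, 3])
  pcoll = p.apply(beam.Create([1, 2, 3]), label='Create')
  p.run()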
static from_runner_api(proto, runner, options)[source]¶
For internal use only; no backwards-compatibility guarantees.
options¶
replace_all(replacements)[source]¶
Dynamically replaces PTransforms in the currently populated hierarchy.
Currently this only works for replacements where input and output types are exactly the same. TODO: Update this to also work for transform overrides where input and output types are different.
Parameters: replacements – a list of PTransformOverride objects.
run(test_runner_api=True)[source]¶
Runs the pipeline. Returns whatever our runner returns after running.
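A sketch of running a pipeline explicitly; wait_until_finish on the returned result object is an assumption about the runner's result API:

  import apache_beam as beam

  p = beam.Pipeline('DirectRunner')
  p | 'Create' >> beam.Create([1, 2, 3])
  result = p.run()            # returns whatever the runner returns
  result.wait_until_finish()  # assumed PipelineResult method; blocks until done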
visit(visitor)[source]¶
Visits depth-first every node of a pipeline's DAG.
Runner-internal implementation detail; no backwards-compatibility guarantees.
Parameters: visitor – PipelineVisitor object whose callbacks will be called for each node visited. See PipelineVisitor comments.
Raises:
- TypeError – if node is specified and is not a PValue.
- pipeline.PipelineError – if node is specified and does not belong to this pipeline instance.
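A minimal sketch of a visitor, assuming PipelineVisitor's visit_transform callback and the node's full_label attribute:

  import apache_beam as beam
  from apache_beam.pipeline import PipelineVisitor

  class LabelPrinter(PipelineVisitor):
    # Assumed callback: invoked once per transform node visited.
    def visit_transform(self, transform_node):
      print(transform_node.full_label)

  p = beam.Pipeline('DirectRunner')
  p | 'Create' >> beam.Create([1, 2, 3])
  p.visit(LabelPrinter())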
apache_beam.pvalue module¶
PValue, PCollection: one node of a dataflow graph.
A node of a dataflow processing graph is a PValue. Currently, there is only one type: PCollection (a potentially very large set of arbitrary values). Once created, a PValue belongs to a pipeline and has an associated transform (of type PTransform), which describes how the value will be produced when the pipeline gets executed.
class apache_beam.pvalue.PCollection(pipeline, tag=None, element_type=None)[source]¶
Bases: apache_beam.pvalue.PValue
A container of multiple values (potentially huge).
Dataflow users should not construct PCollection objects directly in their pipelines.
windowing¶
class apache_beam.pvalue.TaggedOutput(tag, value)[source]¶
Bases: object
An object representing a tagged value.
ParDo, Map, and FlatMap transforms can emit values on multiple outputs which are distinguished by string tags. The DoFn will return plain values if it wants to emit on the main output and TaggedOutput objects if it wants to emit a value on a specific tagged output.
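For illustration, a DoFn that splits even and odd numbers across a main and a tagged output (a sketch; the tag names are arbitrary):

  import apache_beam as beam
  from apache_beam import pvalue

  class SplitEvenOdd(beam.DoFn):
    def process(self, element):
      if element % 2 == 0:
        yield element  # plain value: goes to the main output
      else:
        yield pvalue.TaggedOutput('odd', element)  # goes to the 'odd' output

  with beam.Pipeline('DirectRunner') as p:
    numbers = p | beam.Create([1, 2, 3, 4])
    results = numbers | beam.ParDo(SplitEvenOdd()).with_outputs('odd', main='even')
    evens = results.even  # main output
    odds = results.odd    # tagged output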
class apache_beam.pvalue.AsSingleton(pcoll, default_value=<object object>)[source]¶
Bases: apache_beam.pvalue.AsSideInput
Marker specifying that an entire PCollection is to be used as a side input.
When a PCollection is supplied as a side input to a PTransform, it is necessary to indicate whether the entire PCollection should be made available as a PTransform side argument (in the form of an iterable), or whether just one value should be pulled from the PCollection and supplied as the side argument (as an ordinary value).
Wrapping a PCollection side input argument to a PTransform in this container (e.g., data.apply('label', MyPTransform(), AsSingleton(my_side_input))) selects the latter behavior.
The input PCollection must contain exactly one value per window, unless a default is given, in which case it may be empty.
element_type¶
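A sketch of AsSingleton: every element of the main PCollection sees the one combined value as an ordinary argument:

  import apache_beam as beam
  from apache_beam import pvalue

  with beam.Pipeline('DirectRunner') as p:
    values = p | beam.Create([1.0, 2.0, 3.0])
    total = values | beam.CombineGlobally(sum)  # exactly one value per window
    fractions = values | beam.Map(
        lambda x, t: x / t, t=pvalue.AsSingleton(total))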
class apache_beam.pvalue.AsIter(pcoll)[source]¶
Bases: apache_beam.pvalue.AsSideInput
Marker specifying that an entire PCollection is to be used as a side input.
When a PCollection is supplied as a side input to a PTransform, it is necessary to indicate whether the entire PCollection should be made available as a PTransform side argument (in the form of an iterable), or whether just one value should be pulled from the PCollection and supplied as the side argument (as an ordinary value).
Wrapping a PCollection side input argument to a PTransform in this container (e.g., data.apply('label', MyPTransform(), AsIter(my_side_input))) selects the former behavior.
element_type¶
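A sketch of AsIter; AsList (below) is analogous but forces materialization of the side input as a list:

  import apache_beam as beam
  from apache_beam import pvalue

  with beam.Pipeline('DirectRunner') as p:
    main = p | 'Main' >> beam.Create([1, 2, 3])
    side = p | 'Side' >> beam.Create([10, 20])
    # Each main element is paired with every element of the iterable side input.
    crossed = main | beam.FlatMap(
        lambda x, side_iter: [(x, s) for s in side_iter],
        side_iter=pvalue.AsIter(side))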
class apache_beam.pvalue.AsList(pcoll)[source]¶
Bases: apache_beam.pvalue.AsSideInput
Marker specifying that an entire PCollection is to be used as a side input.
Intended for use in side-argument specification—the same places where AsSingleton and AsIter are used, but forces materialization of this PCollection as a list.
Parameters: pcoll – Input pcollection.
Returns: An AsList-wrapper around a PCollection whose one element is a list containing all elements in pcoll.
class apache_beam.pvalue.AsDict(pcoll)[source]¶
Bases: apache_beam.pvalue.AsSideInput
Marker specifying a PCollection to be used as an indexable side input.
Intended for use in side-argument specification—the same places where AsSingleton and AsIter are used, but returns an interface that allows key lookup.
Parameters: pcoll – Input pcollection. All elements should be key-value pairs (i.e. 2-tuples) with unique keys.
Returns: An AsDict-wrapper around a PCollection whose one element is a dict with entries for uniquely-keyed pairs in pcoll.
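A sketch of dict-style lookup, assuming a side PCollection of unique key-value pairs:

  import apache_beam as beam
  from apache_beam import pvalue

  with beam.Pipeline('DirectRunner') as p:
    orders = p | 'Orders' >> beam.Create([('apple', 3), ('banana', 5)])
    prices = p | 'Prices' >> beam.Create([('apple', 0.5), ('banana', 0.25)])
    # Look up each order's key in the dict-shaped side input.
    totals = orders | beam.Map(
        lambda order, price_map: (order[0], order[1] * price_map[order[0]]),
        price_map=pvalue.AsDict(prices))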
class apache_beam.pvalue.EmptySideInput[source]¶
Bases: object
Value indicating when a singleton side input was empty.
If a PCollection was furnished as a singleton side input to a PTransform, and that PCollection was empty, then this value is supplied to the DoFn in the place where a value from a non-empty PCollection would have gone. This alerts the DoFn that the side input PCollection was empty. Users may want to check whether side input values are EmptySideInput, but they will very likely never want to create new instances of this class themselves.
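A sketch of the check described above; whether a given runner supplies EmptySideInput rather than raising is an assumption here:

  from apache_beam import pvalue

  def use_side(element, maybe_value):
    # Per the docstring: an empty singleton side input may arrive as an
    # EmptySideInput instance instead of an ordinary value.
    if isinstance(maybe_value, pvalue.EmptySideInput):
      return (element, None)
    return (element, maybe_value)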
apache_beam.version module¶
Apache Beam SDK version information and utilities.
Module contents¶
Apache Beam SDK for Python.
Apache Beam (https://beam.apache.org/) provides a simple, powerful programming model for building both batch and streaming parallel data processing pipelines.
The Apache Beam SDK for Python provides access to Apache Beam capabilities from the Python programming language.
Status¶
The SDK is still early in its development, and significant changes should be expected before the first stable version.
Overview¶
The key concepts in this programming model are:
- PCollection: represents a collection of data, which could be bounded or unbounded in size.
- PTransform: represents a computation that transforms input PCollections into output PCollections.
- Pipeline: manages a directed acyclic graph of PTransforms and PCollections that is ready for execution.
- Runner: specifies where and how the Pipeline should execute.
- Reading and Writing Data: your pipeline can read from an external source and write to an external data sink.
Typical usage¶
At the top of your source file:
import apache_beam as beam
After this import statement:
- transform classes are available as beam.FlatMap, beam.GroupByKey, etc.
- Pipeline class is available as beam.Pipeline
- text read/write transforms are available as beam.io.ReadFromText and beam.io.WriteToText
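Putting these together, a minimal end-to-end sketch (the file paths are hypothetical):

  import apache_beam as beam

  with beam.Pipeline('DirectRunner') as p:
    (p
     | 'Read' >> beam.io.ReadFromText('./input.txt')      # hypothetical path
     | 'Lengths' >> beam.Map(len)
     | 'Write' >> beam.io.WriteToText('./line_lengths'))  # hypothetical path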
Examples
The examples subdirectory has some examples.