apache_beam.runners.runner module

PipelineRunner, an abstract base runner object.

class apache_beam.runners.runner.PipelineRunner[source]

Bases: future.types.newobject.newobject

A runner of a pipeline object.

The base runner provides a run() method for visiting every node in the pipeline’s DAG and executing the transforms computing the PValue in the node.

A custom runner will typically provide implementations for some of the transform methods (ParDo, GroupByKey, Create, etc.). It may also provide a new implementation for clear_pvalue(), which is used to wipe out materialized values in order to reduce footprint.

run(transform, options=None)[source]

Run the given transform or callable with this runner.

Blocks until the pipeline is complete. See also PipelineRunner.run_async.

run_async(transform, options=None)[source]

Run the given transform or callable with this runner.

May return immediately, executing the pipeline in the background. The returned result object can be queried for progress, and wait_until_finish may be called to block until completion.

run_pipeline(pipeline)[source]

Execute the entire pipeline or the sub-DAG reachable from a node.

Runners should override this method.

apply(transform, input)[source]

Runner callback for a pipeline.apply call.

Parameters:
  • transform – the transform to apply.
  • input – transform’s input (typically a PCollection).

A concrete implementation of the Runner class may want to do custom pipeline construction for a given transform. To override the behavior for a transform class Xyz, implement an apply_Xyz method with this same signature.

apply_PTransform(transform, input)[source]
run_transform(transform_node)[source]

Runner callback for a pipeline.run call.

Parameters:transform_node – transform node for the transform to run.

A concrete implementation of the Runner class must implement run_Abc for some class Abc in the method resolution order for every non-composite transform Xyz in the pipeline.

class apache_beam.runners.runner.PipelineState[source]

Bases: future.types.newobject.newobject

State of the Pipeline, as returned by PipelineResult.state.

This is meant to be the union of all the states any runner can put a pipeline in. Currently, it represents the values of the dataflow API JobState enum.

UNKNOWN = 'UNKNOWN'
STARTING = 'STARTING'
STOPPED = 'STOPPED'
RUNNING = 'RUNNING'
DONE = 'DONE'
FAILED = 'FAILED'
CANCELLED = 'CANCELLED'
UPDATED = 'UPDATED'
DRAINING = 'DRAINING'
DRAINED = 'DRAINED'
PENDING = 'PENDING'
CANCELLING = 'CANCELLING'
class apache_beam.runners.runner.PipelineResult(state)[source]

Bases: future.types.newobject.newobject

A PipelineResult provides access to info about a pipeline.

state

Return the current state of the pipeline execution.

wait_until_finish(duration=None)[source]

Waits until the pipeline finishes and returns the final status.

Parameters:

duration (int) – The time to wait (in milliseconds) for job to finish. If it is set to None, it will wait indefinitely until the job is finished.

Raises:
  • IOError – If there is a persistent problem getting job information.
  • NotImplementedError – If the runner does not support this operation.
Returns:

The final state of the pipeline, or None on timeout.

cancel()[source]

Cancels the pipeline execution.

Raises:
  • IOError – If there is a persistent problem getting job information.
  • NotImplementedError – If the runner does not support this operation.
Returns:

The final state of the pipeline.

metrics()[source]

Returns MetricResults object to query metrics from the runner.

Raises:NotImplementedError – If the runner does not support this operation.
aggregated_values(aggregator_or_name)[source]

Return a dict of step names to values of the Aggregator.