apache_beam.runners.direct package

Submodules

apache_beam.runners.direct.bundle_factory module

A factory that creates UncommittedBundles.

class apache_beam.runners.direct.bundle_factory.BundleFactory(stacked)[source]

Bases: object

For internal use only; no backwards-compatibility guarantees.

BundleFactory creates output bundles to be used by transform evaluators.

Parameters:stacked – whether or not to stack the WindowedValues within the bundle in case consecutive ones share the same timestamp and windows. DirectRunnerOptions.direct_runner_use_stacked_bundle controls this option.
create_bundle(output_pcollection)[source]
create_empty_committed_bundle(output_pcollection)[source]

apache_beam.runners.direct.clock module

Clock implementations for real time processing and testing.

class apache_beam.runners.direct.clock.Clock[source]

Bases: object

For internal use only; no backwards-compatibility guarantees.

time()[source]

Returns the number of milliseconds since epoch.

class apache_beam.runners.direct.clock.MockClock(now_in_ms)[source]

Bases: apache_beam.runners.direct.clock.Clock

For internal use only; no backwards-compatibility guarantees.

Mock clock implementation for testing.

advance(duration_in_ms)[source]
set_time(value_in_ms)[source]
time()[source]

apache_beam.runners.direct.consumer_tracking_pipeline_visitor module

ConsumerTrackingPipelineVisitor, a PipelineVisitor object.

class apache_beam.runners.direct.consumer_tracking_pipeline_visitor.ConsumerTrackingPipelineVisitor[source]

Bases: apache_beam.pipeline.PipelineVisitor

For internal use only; no backwards-compatibility guarantees.

Visitor for extracting value-consumer relations from the graph.

Tracks the AppliedPTransforms that consume each PValue in the Pipeline. This is used to schedule consuming PTransforms to consume input after the upstream transform has produced and committed output.

visit_transform(applied_ptransform)[source]

apache_beam.runners.direct.direct_metrics module

DirectRunner implementation of MetricResults. It is in charge not only of responding to queries of current metrics, but also of keeping the common state consistent.

class apache_beam.runners.direct.direct_metrics.DirectMetric(aggregator)[source]

Bases: object

Keeps a consistent state for a single metric.

It keeps track of the metric’s physical and logical updates. It’s thread safe.

commit_logical(bundle, update)[source]
commit_physical(bundle, update)[source]
extract_committed()[source]
extract_latest_attempted()[source]
update_physical(bundle, update)[source]
class apache_beam.runners.direct.direct_metrics.DirectMetrics[source]

Bases: apache_beam.metrics.metric.MetricResults

commit_logical(bundle, updates)[source]
commit_physical(bundle, updates)[source]
query(filter=None)[source]
update_physical(bundle, updates)[source]

apache_beam.runners.direct.direct_runner module

DirectRunner, executing on the local machine.

The DirectRunner is a runner implementation that executes the entire graph of transformations belonging to a pipeline on the local machine.

class apache_beam.runners.direct.direct_runner.DirectRunner[source]

Bases: apache_beam.runners.runner.PipelineRunner

Executes a single pipeline on the local machine.

apply_CombinePerKey(transform, pcoll)[source]
cache
run(pipeline)[source]

Execute the entire pipeline and returns an DirectPipelineResult.

apache_beam.runners.direct.evaluation_context module

EvaluationContext tracks global state, triggers and watermarks.

class apache_beam.runners.direct.evaluation_context.EvaluationContext(pipeline_options, bundle_factory, root_transforms, value_to_consumers, step_names, views)[source]

Bases: object

Evaluation context with the global state information of the pipeline.

The evaluation context for a specific pipeline being executed by the DirectRunner. Contains state shared within the execution across all transforms.

EvaluationContext contains shared state for an execution of the DirectRunner that can be used while evaluating a PTransform. This consists of views into underlying state and watermark implementations, access to read and write side inputs, and constructing counter sets and execution contexts. This includes executing callbacks asynchronously when state changes to the appropriate point (e.g. when a side input is requested and known to be empty).

EvaluationContext also handles results by committing finalizing bundles based on the current global state and updating the global state appropriately. This includes updating the per-(step,key) state, updating global watermarks, and executing any callbacks that can be executed.

append_to_cache(applied_ptransform, tag, elements)[source]
create_bundle(output_pcollection)[source]

Create an uncommitted bundle for the specified PCollection.

create_empty_committed_bundle(output_pcollection)[source]

Create empty bundle useful for triggering evaluation.

extract_fired_timers()[source]
get_aggregator_values(aggregator_or_name)[source]
get_execution_context(applied_ptransform)[source]
get_value_or_schedule_after_output(side_input, task)[source]
handle_result(completed_bundle, completed_timers, result)[source]

Handle the provided result produced after evaluating the input bundle.

Handle the provided TransformResult, produced after evaluating the provided committed bundle (potentially None, if the result of a root PTransform).

The result is the output of running the transform contained in the TransformResult on the contents of the provided bundle.

Parameters:
  • completed_bundle – the bundle that was processed to produce the result.
  • completed_timers – the timers that were delivered to produce the completed_bundle.
  • result – the TransformResult of evaluating the input bundle
Returns:

the committed bundles contained within the handled result.

has_cache
is_done(transform=None)[source]

Checks completion of a step or the pipeline.

Parameters:transform – AppliedPTransform to check for completion.
Returns:True if the step will not produce additional output. If transform is None returns true if all steps are done.
is_root_transform(applied_ptransform)[source]
metrics()[source]
schedule_pending_unblocked_tasks(executor_service)[source]
use_pvalue_cache(cache)[source]

apache_beam.runners.direct.executor module

An executor that schedules and executes applied ptransforms.

class apache_beam.runners.direct.executor.Executor(*args, **kwargs)[source]

Bases: object

For internal use only; no backwards-compatibility guarantees.

await_completion()[source]
start(roots)[source]
class apache_beam.runners.direct.executor.TransformExecutor(transform_evaluator_registry, evaluation_context, input_bundle, applied_transform, completion_callback, transform_evaluation_state)[source]

Bases: apache_beam.runners.direct.executor.CallableTask

For internal use only; no backwards-compatibility guarantees.

TransformExecutor will evaluate a bundle using an applied ptransform.

A CallableTask responsible for constructing a TransformEvaluator and evaluating it on some bundle of input, and registering the result using the completion callback.

call()[source]

apache_beam.runners.direct.helper_transforms module

class apache_beam.runners.direct.helper_transforms.FinishCombine(combine_fn)[source]

Bases: apache_beam.transforms.core.DoFn

Merges partially combined results.

default_type_hints()[source]
process(element)[source]
class apache_beam.runners.direct.helper_transforms.LiftedCombinePerKey(combine_fn, args, kwargs)[source]

Bases: apache_beam.transforms.ptransform.PTransform

An implementation of CombinePerKey that does mapper-side pre-combining.

expand(pcoll)[source]
class apache_beam.runners.direct.helper_transforms.PartialGroupByKeyCombiningValues(combine_fn)[source]

Bases: apache_beam.transforms.core.DoFn

Aggregates values into a per-key-window cache.

As bundles are in-memory-sized, we don’t bother flushing until the very end.

default_type_hints()[source]
finish_bundle()[source]
process(element, window='WindowParam')[source]
start_bundle()[source]

apache_beam.runners.direct.transform_evaluator module

An evaluator of a specific application of a transform.

class apache_beam.runners.direct.transform_evaluator.TransformEvaluatorRegistry(evaluation_context)[source]

Bases: object

For internal use only; no backwards-compatibility guarantees.

Creates instances of TransformEvaluator for the application of a transform.

for_application(applied_ptransform, input_committed_bundle, side_inputs, scoped_metrics_container)[source]

Returns a TransformEvaluator suitable for processing given inputs.

should_execute_serially(applied_ptransform)[source]

Returns True if this applied_ptransform should run one bundle at a time.

Some TransformEvaluators use a global state object to keep track of their global execution state. For example evaluator for _GroupByKeyOnly uses this state as an in memory dictionary to buffer keys.

Serially executed evaluators will act as syncing point in the graph and execution will not move forward until they receive all of their inputs. Once they receive all of their input, they will release the combined output. Their output may consist of multiple bundles as they may divide their output into pieces before releasing.

Parameters:applied_ptransform – Transform to be used for execution.
Returns:True if executor should execute applied_ptransform serially.

apache_beam.runners.direct.transform_result module

The result of evaluating an AppliedPTransform with a TransformEvaluator.

class apache_beam.runners.direct.transform_result.TransformResult(applied_ptransform, uncommitted_output_bundles, state, timer_update, counters, watermark_hold, undeclared_tag_values=None)[source]

Bases: object

For internal use only; no backwards-compatibility guarantees.

The result of evaluating an AppliedPTransform with a TransformEvaluator.

apache_beam.runners.direct.watermark_manager module

Manages watermarks of PCollections and AppliedPTransforms.

class apache_beam.runners.direct.watermark_manager.WatermarkManager(clock, root_transforms, value_to_consumers)[source]

Bases: object

For internal use only; no backwards-compatibility guarantees.

Tracks and updates watermarks for all AppliedPTransforms.

WATERMARK_NEG_INF = Timestamp(-9223372036854.775808)
WATERMARK_POS_INF = Timestamp(9223372036854.775807)
extract_fired_timers()[source]
get_watermarks(applied_ptransform)[source]

Gets the input and output watermarks for an AppliedPTransform.

If the applied_ptransform has not processed any elements, return a watermark with minimum value.

Parameters:applied_ptransform – AppliedPTransform to get the watermarks for.
Returns:A snapshot (TransformWatermarks) of the input watermark and output watermark for the provided transform.
update_watermarks(completed_committed_bundle, applied_ptransform, timer_update, outputs, earliest_hold)[source]

Module contents

Inprocess runner executes pipelines locally in a single process.

Anything in this package not imported here is an internal implementation detail with no backwards-compatibility guarantees.