apache_beam.runners.interactive.pipeline_analyzer module

Analyzes and modifies the pipeline that utilize the PCollection cache.

This module is experimental. No backwards-compatibility guarantees.

class apache_beam.runners.interactive.pipeline_analyzer.PipelineAnalyzer(cache_manager, pipeline_proto, underlying_runner, options=None, desired_cache_labels=None)[source]

Bases: object

Constructor of PipelineAnanlyzer.

Parameters:
  • cache_manager – (CacheManager)
  • pipeline_proto – (Pipeline proto)
  • underlying_runner – (PipelineRunner)
  • options – (PipelineOptions)
  • desired_cache_labels – (Set[str]) a set of labels of the PCollection queried by the user.
pipeline_proto_to_execute()[source]

Returns Pipeline proto to be executed.

top_level_referenced_pcollection_ids()[source]

Returns an array of top level referenced PCollection IDs.

top_level_required_transforms()[source]

Returns a dict mapping ID to proto of top level PTransforms.

caches_used()[source]

Returns a set of PCollection IDs to read from cache.

class apache_beam.runners.interactive.pipeline_analyzer.PipelineInfo(proto)[source]

Bases: object

Provides access to pipeline metadata.

all_pcollections()[source]
leaf_pcollections()[source]
producer(pcoll_id)[source]
cache_label(pcoll_id)[source]

Returns the cache label given the PCollection ID.

class Derivation(inputs, transform_proto, output_tag)[source]

Bases: object

Records derivation info of a PCollection. Helper for PipelineInfo.

Constructor of Derivation.

Parameters:
  • inputs – (Dict[str, Derivation]) maps PCollection names to Derivations.
  • transform_proto – (Transform proto) the producing PTransform.
  • output_tag – (str) local name of the PCollection in analysis.
cache_label()[source]
json()[source]
apache_beam.runners.interactive.pipeline_analyzer.set_proto_map(proto_map, new_value)[source]