apache_beam.runners.interactive.pipeline_instrument module

Module that instruments a given pipeline for interactivity.

For internal use only; no backwards-compatibility guarantees. This module accesses the current interactive environment and analyzes the given pipeline to transform the original pipeline into a one-shot pipeline with interactivity.

class apache_beam.runners.interactive.pipeline_instrument.PipelineInstrument(pipeline, options=None)[source]

Bases: object

A pipeline instrument for a pipeline to be executed by the interactive runner.

This module should never depend on the underlying runner that the interactive runner delegates to. It instruments the original pipeline instance directly by appending or replacing transforms with the help of the cache. It provides interfaces to recover the states of the original pipeline. It is the interactive runner’s responsibility to coordinate supported underlying runners to run the instrumented pipeline and to recover the original pipeline states if needed.

instrumented_pipeline_proto()[source]

Always returns a new instance of the portable instrumented proto.

has_unbounded_source

Checks if a given pipeline has any source that is unbounded.

The function checks the source transform definitions directly instead of the pvalues in the pipeline. Thus, manually setting the is_bounded field of a PCollection or switching streaming mode does not affect this function’s result. The result is deterministic once the source code of a pipeline is defined.
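The idea of checking source definitions rather than PCollection flags can be sketched in plain Python. This is a minimal illustration, not the module's implementation: Node, BoundedFileRead, UnboundedStreamRead, UNBOUNDED_SOURCE_TYPES and has_unbounded_source_sketch are all hypothetical stand-ins, and the assumption is that the set of unbounded source types is known up front.

```python
from dataclasses import dataclass, field
from typing import List


class BoundedFileRead:
    """Hypothetical stand-in for a bounded source transform."""


class UnboundedStreamRead:
    """Hypothetical stand-in for an unbounded source transform."""


# Assumption: the source types treated as unbounded are known up front.
UNBOUNDED_SOURCE_TYPES = (UnboundedStreamRead,)


@dataclass
class Node:
    """Hypothetical pipeline node: a transform plus its child nodes."""
    transform: object
    children: List["Node"] = field(default_factory=list)
    # A flag a user could flip by hand; the check below never reads it.
    is_bounded: bool = True


def has_unbounded_source_sketch(root: Node) -> bool:
    """Returns True if any transform in the tree is a known unbounded source."""
    stack = [root]
    while stack:
        node = stack.pop()
        if isinstance(node.transform, UNBOUNDED_SOURCE_TYPES):
            return True
        stack.extend(node.children)
    return False


# The is_bounded flags are set "wrongly" on purpose to show they are ignored.
batch = Node(object(), [Node(BoundedFileRead(), is_bounded=False)])
streaming = Node(object(), [Node(UnboundedStreamRead(), is_bounded=True)])
```

Because only the transform types are inspected, flipping is_bounded on either stand-in node leaves the result unchanged.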

cacheables

Finds cacheable PCollections from the pipeline.

The function treats the result only as cacheables, since there is no guarantee that a cache-desired PCollection has actually been cached. A PCollection desires caching when it is bound to a user-defined variable in source code. Otherwise, the PCollection is neither reusable nor introspectable, which nullifies the need for a cache.
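The "bound to a user-defined variable" criterion can be illustrated with a small namespace scan. This is a sketch under assumptions: FakePCollection, find_cache_desired and build are hypothetical names, and the real module obtains watched variables from the interactive environment rather than from locals().

```python
class FakePCollection:
    """Hypothetical stand-in for a PCollection."""

    def __init__(self, label):
        self.label = label


def find_cache_desired(namespace):
    """Maps variable names to the stand-in PCollections bound to them."""
    return {
        name: value
        for name, value in namespace.items()
        if isinstance(value, FakePCollection)
    }


def build():
    """Defines a toy 'pipeline' and returns the variables it bound."""
    words = FakePCollection("words")
    # The inner FakePCollection("pairs") is never bound to a name, so no
    # later code can refer to it again and it is not a caching candidate.
    counts = FakePCollection("counts of " + FakePCollection("pairs").label)
    return dict(locals())


desired = find_cache_desired(build())
```

Only words and counts show up in desired; the anonymous intermediate is skipped because nothing can reuse or inspect it.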

pcolls_to_pcoll_id

Returns a dict mapping str(PCollection)s to IDs.

original_pipeline_proto

Returns the portable proto representation of the pipeline before instrumentation.

original_pipeline

Returns a snapshot of the pipeline before instrumentation.

instrument()[source]

Instruments original pipeline with cache.

For a cacheable output PCollection, if no cache exists for its key, do _write_cache(); for a cacheable input PCollection, if a cache exists for its key, do _read_cache(). No instrumentation is applied in any other situation.
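The decision rule above can be sketched as a pure function over cache keys. This is only an illustration of the rule: plan_instrumentation is a hypothetical helper, keys are plain strings, and the actual write and read steps (_write_cache() and _read_cache()) are not modeled.

```python
def plan_instrumentation(cacheable_outputs, cacheable_inputs, existing_cache_keys):
    """Returns (keys_to_write, keys_to_read) following the rule above.

    All three arguments are collections of cache keys (plain strings here).
    """
    cached = set(existing_cache_keys)
    # Cacheable outputs without a cache entry get a write step.
    to_write = [key for key in cacheable_outputs if key not in cached]
    # Cacheable inputs with a cache entry get a read step.
    to_read = [key for key in cacheable_inputs if key in cached]
    return to_write, to_read


to_write, to_read = plan_instrumentation(
    cacheable_outputs=["words", "counts"],
    cacheable_inputs=["lines", "words"],
    existing_cache_keys={"words"},
)
```

Here "counts" is written because it has no cache yet, "words" is read because it does, and "lines" is left untouched.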

Modifies: self._pipeline

cache_key(pcoll)[source]

Gets the identifier of a cacheable PCollection in cache.

If the pcoll is not a cacheable, returns ''. The key is what the pcoll would use as its identifier if it were materialized in the cache; it does not mean that such a cache entry already exists. The pcoll can come from the original user-defined pipeline object or be an equivalent pcoll from a transformed copy of the original pipeline.
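A minimal sketch of this lookup behavior follows. It assumes that equivalent PCollections resolve to the same ID and that keys are looked up by that ID; FakePCollection, cache_key_sketch and the key format "words_v1" are hypothetical and do not reflect the module's real key layout.

```python
class FakePCollection:
    """Hypothetical stand-in for a PCollection carrying a resolved ID."""

    def __init__(self, pcoll_id):
        self.pcoll_id = pcoll_id


def cache_key_sketch(pcoll, keys_by_pcoll_id):
    """Returns the key for pcoll, or '' when pcoll is not a cacheable."""
    # dict.get with a default supplies the '' fallback for non-cacheables.
    return keys_by_pcoll_id.get(pcoll.pcoll_id, "")


# Assumption: only the PCollection with ID "pcoll_1" is a cacheable.
keys_by_pcoll_id = {"pcoll_1": "words_v1"}

original = FakePCollection("pcoll_1")   # from the user-defined pipeline
copied = FakePCollection("pcoll_1")     # equivalent pcoll from a copy
anonymous = FakePCollection("pcoll_2")  # not bound to any variable
```

The original and the copy yield the same key, while the non-cacheable stand-in yields an empty string. Getting a key back says nothing about whether a cache entry exists for it.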

apache_beam.runners.interactive.pipeline_instrument.pin(pipeline, options=None)[source]

Creates a PipelineInstrument for a pipeline and its options, with cache.

apache_beam.runners.interactive.pipeline_instrument.cacheables(pcolls_to_pcoll_id)[source]

Finds cache-desired PCollections from the instrumented pipeline.

The function treats the result only as cacheables, since whether a cache-desired PCollection has been cached depends on whether the pipeline has been executed in the current interactive environment. A PCollection desires caching when it is bound to a user-defined variable in source code. Otherwise, the PCollection is neither reusable nor introspectable, which nullifies the need for a cache. Multiple pipelines might be defined and watched; this function returns PCollections only from the pipelines whose pcolls_to_pcoll_id has been analyzed. The check is not strict because a pcoll_id is not unique across multiple pipelines, so an additional check is needed during instrumentation.

apache_beam.runners.interactive.pipeline_instrument.cacheable_key(pcoll, pcolls_to_pcoll_id, pcoll_version_map=None)[source]

apache_beam.runners.interactive.pipeline_instrument.has_unbounded_source(pipeline)[source]

Checks if a given pipeline has any source that is unbounded.

apache_beam.runners.interactive.pipeline_instrument.pcolls_to_pcoll_id(pipeline, original_context)[source]

Returns a dict mapping PCollection strings to PCollection IDs.

Uses a PipelineVisitor to iterate over every node in the pipeline and records the mapping from PCollections to PCollection IDs. This mapping is used to query cached PCollections.

Returns: (dict from str to str) a dict mapping str(pcoll) to pcoll_id.
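The visitor-based traversal can be sketched without Beam as follows. Everything here is a hypothetical stand-in: FakePCollection, FakeContext, Node and IdRecordingVisitor only mimic the shape of the approach, and the ID format is invented for illustration.

```python
class FakePCollection:
    """Hypothetical stand-in for a PCollection with a readable str()."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        return f"PCollection[{self.name}]"


class FakeContext:
    """Hypothetical ID registry: hands out one stable ID per object."""

    def __init__(self):
        self._ids = {}

    def get_id(self, obj):
        # Reuse the ID if this object was seen before, else assign the next.
        return self._ids.setdefault(id(obj), f"ref_PCollection_{len(self._ids) + 1}")


class Node:
    """Hypothetical pipeline node with output PCollections and children."""

    def __init__(self, outputs=(), children=()):
        self.outputs = list(outputs)
        self.children = list(children)


class IdRecordingVisitor:
    """Walks every node and records str(pcoll) -> pcoll_id."""

    def __init__(self, context):
        self.context = context
        self.mapping = {}

    def visit(self, node):
        for pcoll in node.outputs:
            self.mapping[str(pcoll)] = self.context.get_id(pcoll)
        for child in node.children:
            self.visit(child)


lines = FakePCollection("lines")
words = FakePCollection("words")
root = Node(children=[Node(outputs=[lines]), Node(outputs=[words])])

visitor = IdRecordingVisitor(FakeContext())
visitor.visit(root)
mapping = visitor.mapping
```

Keying the result by str(pcoll) rather than by the object itself means an equivalent PCollection from a copied pipeline can still be looked up, as long as its string form matches.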