apache_beam.runners.interactive.caching.read_cache module

Module to read cache of computed PCollections.

For internal use only; no backward-compatibility guarantees.

class apache_beam.runners.interactive.caching.read_cache.ReadCache(pipeline: org.apache.beam.model.pipeline.v1.beam_runner_api_pb2.Pipeline, context: apache_beam.runners.pipeline_context.PipelineContext, cache_manager: apache_beam.runners.interactive.cache_manager.CacheManager, cacheable: apache_beam.runners.interactive.caching.cacheable.Cacheable)

Bases: object

Class that facilitates reading cache of computed PCollections.

read_cache() → Tuple[str, str]

Reads the cache of the cacheable PCollection and wires the cache into the pipeline proto. Returns the pipeline-scoped ids of the cacheable PCollection and of the cache-reading output PCollection that replaces it.

First, it creates a temporary pipeline instance on top of the existing component_id_map from the context of self._pipeline, so that both pipelines share the context and have no conflicting component ids. Second, it instantiates a _ReadCacheTransform to build the temporary pipeline with a subgraph, under the top-level transforms, that reads the cache of the cacheable PCollection. Third, it copies the components of that subgraph from the temporary pipeline to self._pipeline, skipping components that are not in the temporary pipeline but are present in the component_id_map of self._pipeline. Because to_runner_api generates components for all entries in the component_id_map, the component ids that come from the context shared with self._pipeline need to be ignored. Finally, it replaces the inputs of all transforms that consume the cacheable PCollection with the output PCollection of the _ReadCacheTransform, so that the whole pipeline computes with data from the cache. The pipeline fragment that previously computed the cacheable PCollection is now disconnected from the rest of the pipeline and can be pruned later.
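
The copy and rewiring steps can be illustrated with a minimal sketch. This is not the Beam implementation: it models the pipeline proto as plain dictionaries, and the helper names copy_new_components and replace_consumer_inputs, as well as all transform and PCollection ids, are hypothetical. Skipping ids that already exist in the target is a simplification of the component filtering described above.

```python
from typing import Dict, List

# Simplified stand-in for the transforms of a pipeline proto (an assumption,
# not Beam's data model): transform id -> {"inputs": {...}, "outputs": {...}},
# where each inner dict maps a tag to a PCollection id.
Transforms = Dict[str, Dict[str, Dict[str, str]]]


def copy_new_components(target: Transforms, source: Transforms) -> None:
    """Copies transforms from source into target, skipping ids target already has."""
    for transform_id, transform in source.items():
        if transform_id not in target:
            target[transform_id] = transform


def replace_consumer_inputs(
    transforms: Transforms, old_pcoll_id: str, new_pcoll_id: str
) -> List[str]:
    """Points every consumer of old_pcoll_id at new_pcoll_id.

    Returns the ids of the transforms whose inputs were rewired.
    """
    rewired = []
    for transform_id, transform in transforms.items():
        for tag, pcoll_id in transform["inputs"].items():
            if pcoll_id == old_pcoll_id:
                # Only the value of an existing key changes, so mutating
                # the dict during iteration is safe here.
                transform["inputs"][tag] = new_pcoll_id
                rewired.append(transform_id)
    return rewired


# Main pipeline: Create -> Square -> Print, where "pcoll_2" is cacheable.
pipeline: Transforms = {
    "Create": {"inputs": {}, "outputs": {"out": "pcoll_1"}},
    "Square": {"inputs": {"in": "pcoll_1"}, "outputs": {"out": "pcoll_2"}},
    "Print": {"inputs": {"in": "pcoll_2"}, "outputs": {}},
}

# Temporary pipeline: contains the cache-reading subgraph plus an entry
# that is visible only because the context is shared with the main pipeline.
temp_pipeline: Transforms = {
    "Create": {"inputs": {}, "outputs": {"out": "pcoll_1"}},
    "ReadCache": {"inputs": {}, "outputs": {"out": "pcoll_cache"}},
}

copy_new_components(pipeline, temp_pipeline)
rewired = replace_consumer_inputs(pipeline, "pcoll_2", "pcoll_cache")
```

After these calls, Print consumes pcoll_cache, while Create and Square remain in the dictionary with no consumer of pcoll_2, which corresponds to the disconnected fragment that can be pruned later.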