apache_beam.runners.interactive.cache_manager module

class apache_beam.runners.interactive.cache_manager.CacheManager[source]

Bases: object

Abstract class for caching PCollections.

A PCollection cache is identified by labels, which consist of a prefix (either ‘full’ or ‘sample’) and a cache_label which is a hash of the PCollection derivation.

exists(*labels)[source]

Returns if the PCollection cache exists.

is_latest_version(version, *labels)[source]

Returns if the given version number is the latest.

read(*labels)[source]

Return the PCollection as a list as well as the version number.

Parameters:*labels – List of labels for PCollection instance.
Returns:
A tuple containing a list of items in the
PCollection and the version number.
Return type:Tuple[List[Any], int]

It is possible that the version numbers from read() and_latest_version() are different. This usually means that the cache’s been evicted (thus unavailable => read() returns version = -1), but it had reached version n before eviction.

source(*labels)[source]

Returns a beam.io.Source that reads the PCollection cache.

sink(*labels)[source]

Returns a beam.io.Sink that writes the PCollection cache.

save_pcoder(pcoder, *labels)[source]

Saves pcoder for given PCollection.

Correct reading of PCollection from Cache requires PCoder to be known. This method saves desired PCoder for PCollection that will subsequently be used by sink(…), source(…), and, most importantly, read(…) method. The latter must be able to read a PCollection written by Beam using non-Beam IO.

Parameters:
  • pcoder – A PCoder to be used for reading and writing a PCollection.
  • *labels – List of labels for PCollection instance.
load_pcoder(*labels)[source]

Returns previously saved PCoder for reading and writing PCollection.

cleanup()[source]

Cleans up all the PCollection caches.

class apache_beam.runners.interactive.cache_manager.FileBasedCacheManager(cache_dir=None, cache_format='text')[source]

Bases: apache_beam.runners.interactive.cache_manager.CacheManager

Maps PCollections to local temp files for materialization.

exists(*labels)[source]
save_pcoder(pcoder, *labels)[source]
load_pcoder(*labels)[source]
read(*labels)[source]
source(*labels)[source]
sink(*labels)[source]
cleanup()[source]
class apache_beam.runners.interactive.cache_manager.ReadCache(cache_manager, label)[source]

Bases: apache_beam.transforms.ptransform.PTransform

A PTransform that reads the PCollections from the cache.

expand(pbegin)[source]
class apache_beam.runners.interactive.cache_manager.WriteCache(cache_manager, label, sample=False, sample_size=0)[source]

Bases: apache_beam.transforms.ptransform.PTransform

A PTransform that writes the PCollections to the cache.

expand(pcoll)[source]
class apache_beam.runners.interactive.cache_manager.SafeFastPrimitivesCoder[source]

Bases: apache_beam.coders.coders.Coder

This class add an quote/unquote step to escape special characters.

encode(value)[source]
decode(value)[source]