apache_beam.runners.interactive.cache_manager module

class apache_beam.runners.interactive.cache_manager.CacheManager[source]

Bases: object

Abstract class for caching PCollections.

A PCollection cache is identified by labels, which consist of a prefix (either ‘full’ or ‘sample’) and a cache_label which is a hash of the PCollection derivation.

exists(*labels)[source]

Returns if the PCollection cache exists.

is_latest_version(version, *labels)[source]

Returns if the given version number is the latest.

read(*labels, **args)[source]

Return the PCollection as a list as well as the version number.

Parameters:
  • *labels – List of labels for PCollection instance.

  • **args – Dict of additional arguments. Currently only ‘tail’ as a boolean. When tail is True, will wait and read new elements until the cache is complete.

Returns:

A tuple containing an iterator for the items in the PCollection and the

version number.

It is possible that the version numbers from read() and_latest_version() are different. This usually means that the cache’s been evicted (thus unavailable => read() returns version = -1), but it had reached version n before eviction.

write(value, *labels)[source]

Writes the value to the given cache.

Parameters:
  • value – An encodable (with corresponding PCoder) value

  • *labels – List of labels for PCollection instance

clear(*labels)[source]

Clears the cache entry of the given labels and returns True on success.

Parameters:
  • value – An encodable (with corresponding PCoder) value

  • *labels – List of labels for PCollection instance

source(*labels)[source]

Returns a PTransform that reads the PCollection cache.

sink(labels, is_capture=False)[source]

Returns a PTransform that writes the PCollection cache.

TODO(BEAM-10514): Make sure labels will not be converted into an arbitrarily long file path: e.g., windows has a 260 path limit.

save_pcoder(pcoder, *labels)[source]

Saves pcoder for given PCollection.

Correct reading of PCollection from Cache requires PCoder to be known. This method saves desired PCoder for PCollection that will subsequently be used by sink(…), source(…), and, most importantly, read(…) method. The latter must be able to read a PCollection written by Beam using non-Beam IO.

Parameters:
  • pcoder – A PCoder to be used for reading and writing a PCollection.

  • *labels – List of labels for PCollection instance.

load_pcoder(*labels)[source]

Returns previously saved PCoder for reading and writing PCollection.

cleanup()[source]

Cleans up all the PCollection caches.

size(*labels: str) int[source]

Returns the size of the PCollection on disk in bytes.

class apache_beam.runners.interactive.cache_manager.FileBasedCacheManager(cache_dir=None, cache_format='text')[source]

Bases: CacheManager

Maps PCollections to local temp files for materialization.

size(*labels)[source]
exists(*labels)[source]
save_pcoder(pcoder, *labels)[source]
load_pcoder(*labels)[source]
read(*labels, **args)[source]
write(values, *labels)[source]

Imitates how a WriteCache transform works without running a pipeline.

For testing and cache manager development, not for production usage because the write is not sharded and does not use Beam execution model.

clear(*labels)[source]
source(*labels)[source]
sink(labels, is_capture=False)[source]
raw_sink(labels, is_capture=False)[source]
raw_source(*labels)[source]
cleanup()[source]
class apache_beam.runners.interactive.cache_manager.ReadCache(cache_manager, label)[source]

Bases: PTransform

A PTransform that reads the PCollections from the cache.

expand(pbegin)[source]
class apache_beam.runners.interactive.cache_manager.WriteCache(cache_manager, label, sample=False, sample_size=0, is_capture=False)[source]

Bases: PTransform

A PTransform that writes the PCollections to the cache.

expand(pcoll)[source]
class apache_beam.runners.interactive.cache_manager.SafeFastPrimitivesCoder[source]

Bases: Coder

This class add an quote/unquote step to escape special characters.

encode(value)[source]
decode(value)[source]
class apache_beam.runners.interactive.cache_manager.Base64Coder[source]

Bases: Coder

Used to safely encode arbitrary bytes to textio.

static encode(s, altchars=None)

Encode the bytes-like object s using Base64 and return a bytes object.

Optional altchars should be a byte string of length 2 which specifies an alternative alphabet for the ‘+’ and ‘/’ characters. This allows an application to e.g. generate url or filesystem safe Base64 strings.

static decode(s, altchars=None, validate=False)

Decode the Base64 encoded bytes-like object or ASCII string s.

Optional altchars must be a bytes-like object or ASCII string of length 2 which specifies the alternative alphabet used instead of the ‘+’ and ‘/’ characters.

The result is returned as a bytes object. A binascii.Error is raised if s is incorrectly padded.

If validate is False (the default), characters that are neither in the normal base-64 alphabet nor the alternative alphabet are discarded prior to the padding check. If validate is True, these non-alphabet characters in the input result in a binascii.Error.