apache_beam.runners.interactive.caching.streaming_cache module

class apache_beam.runners.interactive.caching.streaming_cache.StreamingCacheSink(cache_dir, filename, sample_resolution_sec, coder=SafeFastPrimitivesCoder)[source]

Bases: apache_beam.transforms.ptransform.PTransform

A PTransform that writes TestStreamFile(Header|Record)s to file.

This transform takes in an arbitrary element stream and writes the list of TestStream events (as TestStreamFileRecords) to file. When replayed, this produces a best-effort replay of the original job (e.g. some elements may be produced slightly out of order relative to the original stream).

Note that this PTransform is assumed to run only on a single machine where the following assumptions hold: elements arrive in order, and no two transforms write to the same file. This PTransform is assumed to run correctly only with the DirectRunner.

TODO(https://github.com/apache/beam/issues/20002): Generalize this to source/sink types other than file-based ones. Also, generalize to cases where multiple workers may write to the same sink.
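
For illustration only, a minimal sketch of applying this sink on the DirectRunner; the cache directory, filename, and input elements below are hypothetical, not taken from the module:

   import apache_beam as beam
   from apache_beam.runners.interactive.caching.streaming_cache import (
       StreamingCacheSink)

   # Hypothetical cache location and file name; sample_resolution_sec uses
   # the same 0.1s value that StreamingCache defaults to below.
   sink = StreamingCacheSink(
       cache_dir='/tmp/interactive-cache',
       filename='my-pcoll',
       sample_resolution_sec=0.1)

   # The DirectRunner is the default runner for beam.Pipeline().
   with beam.Pipeline() as p:
       _ = p | beam.Create(['a', 'b', 'c']) | sink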

path

Returns the path the sink leads to.

size_in_bytes

Returns the space usage in bytes of the sink.

expand(pcoll)[source]
class apache_beam.runners.interactive.caching.streaming_cache.StreamingCacheSource(cache_dir, labels, is_cache_complete=None, coder=None)[source]

Bases: object

A class that reads and parses TestStreamFile(Header|Record)s.

This source operates in the following way:

  1. Wait for up to timeout_secs for the file to be available.
  2. Read, parse, and emit the entire contents of the file.
  3. Wait for more events to arrive, or until is_cache_complete returns True.
  4. If there are more events, go to step 2.
  5. Otherwise, stop emitting.

This class is used to read from file and send its events to the TestStream via the StreamingCache.Reader.

read(tail)[source]

Reads all TestStreamFile(Header|Record)s from file.

This returns a generator so that all lines can be read from the given file. If tail is True, it will wait until the cache is complete before exiting. Otherwise, it reads the file only once.
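
For illustration, a sketch of reading a previously written cache entry directly; the directory and labels are hypothetical, and the yielded objects are assumed to be the header followed by the records, per the description above:

   from apache_beam.runners.interactive.caching.streaming_cache import (
       StreamingCacheSource)

   # Assumes something (e.g. a StreamingCacheSink) already wrote a file for
   # these labels under this directory; both values are illustrative.
   source = StreamingCacheSource(
       cache_dir='/tmp/interactive-cache',
       labels=['full', 'my-pcoll'])

   # tail=False reads whatever is currently in the file and then returns;
   # tail=True would keep waiting until is_cache_complete reports True.
   for event in source.read(tail=False):
       print(event)  # TestStreamFileHeader first, then TestStreamFileRecords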

class apache_beam.runners.interactive.caching.streaming_cache.StreamingCache(cache_dir, is_cache_complete=None, sample_resolution_sec=0.1, saved_pcoders=None)[source]

Bases: apache_beam.runners.interactive.cache_manager.CacheManager

Abstraction that holds the logic for reading and writing to cache.

size(*labels)[source]
capture_size
capture_paths
capture_keys
exists(*labels)[source]
read(*labels, **args)[source]

Returns a generator to read all records from file.

read_multiple(labels, tail=True)[source]

Returns a generator to read all records from file.

This tails the file until the cache is complete, because it is used by the TestStreamServiceController to read from file during pipeline runtime, which needs to block.
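
A sketch of how this might be called; the cache directory and label pairs are hypothetical, and the shape of labels (one label sequence per cached PCollection) is an assumption based on the signature above. tail=False is used here only to keep the illustration from blocking:

   from apache_beam.runners.interactive.caching.streaming_cache import (
       StreamingCache)

   cache = StreamingCache(cache_dir='/tmp/interactive-cache')  # illustrative

   # Assumed: each entry names one previously cached PCollection.
   events = cache.read_multiple(
       [('full', 'pcoll-a'), ('full', 'pcoll-b')], tail=False)

   for event in events:
       print(event)  # merged TestStream events from both cached streams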

write(values, *labels)[source]

Writes the given values to cache.

clear(*labels)[source]
source(*labels)[source]

Returns the StreamingCache source.

This is beam.Impulse() because unbounded sources are marked with it, and the PipelineInstrument then replaces those markers with a TestStream.
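
A small sketch of this placeholder behavior (the cache directory and labels are hypothetical):

   import apache_beam as beam
   from apache_beam.runners.interactive.caching.streaming_cache import (
       StreamingCache)

   cache = StreamingCache(cache_dir='/tmp/interactive-cache')  # illustrative

   # The returned transform is only a marker; the PipelineInstrument later
   # replaces it with a TestStream that replays the cached events.
   placeholder = cache.source('full', 'my-pcoll')
   assert isinstance(placeholder, beam.Impulse)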

sink(labels, is_capture=False)[source]

Returns a StreamingCacheSink to write elements to file.

Note that this is assumed to work only with the DirectRunner, as the underlying StreamingCacheSink relies on a single machine to guarantee correct element ordering.
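
For illustration, a sketch of wiring this into a pipeline on the DirectRunner; the cache directory, labels, and elements are hypothetical:

   import apache_beam as beam
   from apache_beam.runners.interactive.caching.streaming_cache import (
       StreamingCache)

   cache = StreamingCache(cache_dir='/tmp/interactive-cache')  # illustrative

   # labels identify the cached PCollection; these names are made up.
   cache_sink = cache.sink(('full', 'my-pcoll'))

   with beam.Pipeline() as p:  # DirectRunner by default
       _ = p | beam.Create([1, 2, 3]) | cache_sink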

save_pcoder(pcoder, *labels)[source]
load_pcoder(*labels)[source]
cleanup()[source]
class Reader(headers, readers)[source]

Bases: object

Abstraction that reads from PCollection readers.

This class is an abstraction layer over multiple PCollection readers, used to supply a TestStream service with events.

This class is also responsible for holding the state of the clock, and for injecting clock advancement and watermark advancement events.

read()[source]

Reads records from PCollection readers.
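
A sketch of one plausible way to assemble such a Reader from StreamingCacheSource generators, assuming each generator yields its TestStreamFileHeader before its records (per the StreamingCacheSource description above); all paths and labels are hypothetical:

   from apache_beam.runners.interactive.caching.streaming_cache import (
       StreamingCache, StreamingCacheSource)

   cache_dir = '/tmp/interactive-cache'                      # illustrative
   label_sets = [('full', 'pcoll-a'), ('full', 'pcoll-b')]   # hypothetical

   # One generator per cached PCollection; assume each yields its header
   # first, followed by its records.
   readers = [
       StreamingCacheSource(cache_dir, labels).read(tail=False)
       for labels in label_sets
   ]
   headers = [next(reader) for reader in readers]

   # The Reader merges the per-PCollection streams and injects clock and
   # watermark advancement events as it goes.
   for event in StreamingCache.Reader(headers, readers).read():
       print(event)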