apache_beam.runners.interactive.interactive_beam module

Module of Interactive Beam features that can be used in notebook.

The purpose of the module is to reduce the learning curve of Interactive Beam users, provide a single place for importing and add sugar syntax for all Interactive Beam components. It gives users capability to interact with existing environment/session/context for Interactive Beam and visualize PCollections as bounded dataset. In the meantime, it hides the interactivity implementation from users so that users can focus on developing Beam pipeline without worrying about how hidden states in the interactive session are managed.

Note: If you want backward-compatibility, only invoke interfaces provided by this module in your notebook or application code.

class apache_beam.runners.interactive.interactive_beam.Options[source]

Bases: apache_beam.runners.interactive.options.interactive_options.InteractiveOptions

Options that guide how Interactive Beam works.

enable_capture_replay

Whether replayable source data capture should be replayed for multiple PCollection evaluations and pipeline runs as long as the data captured is still valid.

capturable_sources

Interactive Beam automatically captures data from sources in this set.

capture_duration

The data capture of sources ends as soon as the background caching job has run for this long.

capture_size_limit

The data capture of sources ends as soon as the size (in bytes) of data captured from capturable sources reaches the limit.

display_timestamp_format

The format in which timestamps are displayed.

Default is ‘%Y-%m-%d %H:%M:%S.%f%z’, e.g. 2020-02-01 15:05:06.000015-08:00.

display_timezone

The timezone in which timestamps are displayed.

Defaults to local timezone.

apache_beam.runners.interactive.interactive_beam.watch(watchable)[source]

Monitors a watchable.

This allows Interactive Beam to implicitly pass on the information about the location of your pipeline definition.

Current implementation mainly watches for PCollection variables defined in user code. A watchable can be a dictionary of variable metadata such as locals(), a str name of a module, a module object or an instance of a class. The variable can come from any scope even local variables in a method of a class defined in a module.

Below are all valid:

watch(__main__)  # if import __main__ is already invoked
watch('__main__')  # does not require invoking import __main__ beforehand
watch(self)  # inside a class
watch(SomeInstance())  # an instance of a class
watch(locals())  # inside a function, watching local variables within

If you write a Beam pipeline in the __main__ module directly, since the __main__ module is always watched, you don’t have to instruct Interactive Beam. If your Beam pipeline is defined in some module other than __main__, such as inside a class function or a unit test, you can watch() the scope.

For example:

class Foo(object)
  def run_pipeline(self):
    with beam.Pipeline() as p:
      init_pcoll = p |  'Init Create' >> beam.Create(range(10))
      watch(locals())
    return init_pcoll
init_pcoll = Foo().run_pipeline()

Interactive Beam caches init_pcoll for the first run.

Then you can use:

show(init_pcoll)

To visualize data from init_pcoll once the pipeline is executed.

apache_beam.runners.interactive.interactive_beam.collect(pcoll, include_window_info=False)[source]

Materializes all of the elements from a PCollection into a Dataframe.

For example:

p = beam.Pipeline(InteractiveRunner())
init = p | 'Init' >> beam.Create(range(10))
square = init | 'Square' >> beam.Map(lambda x: x * x)

# Run the pipeline and bring the PCollection into memory as a Dataframe.
in_memory_square = collect(square)
apache_beam.runners.interactive.interactive_beam.evict_captured_data()[source]

Forcefully evicts all captured replayable data.

Once invoked, Interactive Beam will capture new data based on the guidance of options the next time it evaluates/visualizes PCollections or runs pipelines.