apache_beam.dataframe.doctests module

A module that allows running existing pandas doctests with Beam dataframes.

This module hooks into the doctesting framework by providing a custom runner and, in particular, an OutputChecker, as well as providing a fake object for mocking out the pandas module.

The (novel) sequence of events when running a doctest is as follows.

  1. The test invokes pd.DataFrame(…) (or similar); an actual dataframe is computed and stashed, but a Beam deferred dataframe is returned in its place.
  2. Computations are done on these “dataframes,” resulting in new objects, but as these are actually deferred, only expression trees are built. In the background, a mapping of id -> deferred dataframe is stored for each newly created dataframe.
  3. The repr of every dataframe has been overridden, so printing one outputs Dataframe[id]. The aforementioned mapping is used to map this back to the actual dataframe object, which is then computed via Beam, and its (stringified) result is plugged into the actual output for comparison.
  4. The comparison is then done on the sorted lines of the expected and actual values.
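
Step 4 can be illustrated with the standard library alone. The following is a minimal sketch, not the checker this module actually installs: SortedLinesChecker is a hypothetical name, and the sketch assumes that comparing sorted, stripped lines is enough to make the check insensitive to row order.

```python
import doctest


class SortedLinesChecker(doctest.OutputChecker):
    """Hypothetical checker: accepts output whose lines match in any order."""

    def check_output(self, want, got, optionflags):
        # Try the standard doctest comparison first.
        if super().check_output(want, got, optionflags):
            return True

        # Fall back to comparing the sorted, whitespace-stripped lines.
        def normalize(text):
            return sorted(line.strip() for line in text.strip().splitlines())

        return normalize(want) == normalize(got)


checker = SortedLinesChecker()
want = "   a\n0  1\n1  2\n"
got = "   a\n1  2\n0  1\n"  # same rows, different order

same_rows = checker.check_output(want, got, 0)
different_rows = checker.check_output(want, "   a\n0  1\n1  3\n", 0)
```

Here same_rows is true because only the row order differs, while different_rows is false because one row has a different value.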
class apache_beam.dataframe.doctests.TestEnvironment[source]

Bases: object

A class managing the patching (of methods, inputs, and outputs) needed to run and validate tests.

These classes are patched to be able to recognize and retrieve inputs and results, stored in self._inputs and self._all_frames respectively.


Creates a context within which DeferredFrame types are monkey patched to record ids.
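
The idea of a patching context can be sketched with contextlib. This is an illustration under stated assumptions rather than the code used by TestEnvironment: FakeDeferredFrame and record_ids are hypothetical names, and the sketch patches only __repr__ on a single class.

```python
import contextlib


class FakeDeferredFrame:
    """Hypothetical stand-in for a deferred frame type."""

    def __repr__(self):
        return "FakeDeferredFrame()"


@contextlib.contextmanager
def record_ids(frame_type, registry):
    """Temporarily patch frame_type.__repr__ to register each printed frame."""
    original_repr = frame_type.__repr__

    def patched_repr(self):
        # Remember the object so the placeholder can be resolved later.
        registry[id(self)] = self
        return "Dataframe[%d]" % id(self)

    frame_type.__repr__ = patched_repr
    try:
        yield registry
    finally:
        # Always restore the original method, even if the body raised.
        frame_type.__repr__ = original_repr


registry = {}
frame = FakeDeferredFrame()
with record_ids(FakeDeferredFrame, registry):
    inside = repr(frame)
outside = repr(frame)
```

Inside the with block, repr yields the Dataframe[id] placeholder and the frame is recorded in registry; once the block exits, the original repr is back in place.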

class apache_beam.dataframe.doctests.BeamDataframeDoctestRunner(env, use_beam=True, **kwargs)[source]

Bases: doctest.DocTestRunner

A doctest runner suitable for replacing the pd module with one backed by Beam.

run(test, **kwargs)[source]
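
How a runner can substitute the pd name seen by doctest examples is sketched below using only the standard library. This is not the implementation of BeamDataframeDoctestRunner: FakePandasRunner and fake_pd are hypothetical, and the sketch assumes that overriding run and writing into test.globs is an acceptable way to swap the module.

```python
import doctest
import types


class FakePandasRunner(doctest.DocTestRunner):
    """Hypothetical runner that binds `pd` to a stand-in object."""

    def __init__(self, fake_pd, **kwargs):
        super().__init__(**kwargs)
        self._fake_pd = fake_pd

    def run(self, test, **kwargs):
        # Doctest examples resolve names through test.globs.
        test.globs["pd"] = self._fake_pd
        return super().run(test, **kwargs)


# Stand-in "module" whose DataFrame returns a placeholder string.
fake_pd = types.SimpleNamespace(DataFrame=lambda data: "Dataframe[0]")

source = """
>>> pd.DataFrame({'a': [1, 2]})
'Dataframe[0]'
"""
test = doctest.DocTestParser().get_doctest(source, {}, "example", None, 0)
runner = FakePandasRunner(fake_pd, verbose=False)
results = runner.run(test)
```

The example passes even though pandas is never imported, because the only pd the doctest sees is the stand-in supplied by the runner.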
apache_beam.dataframe.doctests.teststring(text, report=True, **runner_kwargs)[source]
apache_beam.dataframe.doctests.testfile(*args, **kwargs)[source]
apache_beam.dataframe.doctests.testmod(*args, **kwargs)[source]