apache_beam.runners.dataflow.internal package


apache_beam.runners.dataflow.internal.apiclient module

For internal use only. No backwards compatibility guarantees.

Dataflow client utility functions.

class apache_beam.runners.dataflow.internal.apiclient.DataflowApplicationClient(options, environment_version)[source]

Bases: object

A Dataflow API client used by application code to create and query jobs.

create_job(*args, **kwargs)

Creates a job described by the workflow proto.

get_job(*args, **kwargs)
get_job_metrics(*args, **kwargs)
list_messages(*args, **kwargs)
modify_job_state(*args, **kwargs)
stage_file(gcs_or_local_path, file_name, stream, mime_type='application/octet-stream')[source]

Stages a file at a GCS or local path with stream-supplied contents.

submit_job_description(*args, **kwargs)
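The behavior of stage_file for a local staging path can be sketched without the real client: it amounts to writing the stream's contents to path/file_name. A minimal illustration (an assumption-laden stand-in, not the real method: the actual client also handles gs:// locations via the GCS API, where the mime_type matters):

```python
import io
import os
import tempfile


def stage_file(gcs_or_local_path, file_name, stream,
               mime_type='application/octet-stream'):
    """Illustrative sketch of the local-path branch of stage_file.

    The real DataflowApplicationClient.stage_file also supports gs://
    paths; mime_type is unused when staging to a local directory.
    """
    target = os.path.join(gcs_or_local_path, file_name)
    with open(target, 'wb') as f:
        f.write(stream.read())
    return target


# Usage: stage an in-memory "package" into a temporary staging directory.
staging_dir = tempfile.mkdtemp()
path = stage_file(staging_dir, 'workflow.tar.gz', io.BytesIO(b'payload'))
```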
class apache_beam.runners.dataflow.internal.apiclient.Environment(packages, options, environment_version)[source]

Bases: object

Wrapper for a dataflow Environment protobuf.

class apache_beam.runners.dataflow.internal.apiclient.Job(options)[source]

Bases: object

Wrapper for a dataflow Job protobuf.

static default_job_name(job_name)[source]
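When no job name is supplied, a default is derived from the current user and a timestamp. A hedged sketch of the idea (the exact prefix, timestamp format, and normalization rules used by the SDK may differ):

```python
import getpass
import re
from datetime import datetime


def default_job_name(job_name=None):
    """Sketch: return job_name if given, otherwise derive a name from
    the current user and a UTC timestamp, normalized to the lowercase
    alphanumeric-plus-dash form Dataflow job names require."""
    if job_name is not None:
        return job_name
    try:
        user = getpass.getuser()
    except Exception:
        user = 'anonymous'
    user = re.sub(r'[^a-z0-9]', '', user.lower()) or 'anonymous'
    stamp = datetime.utcnow().strftime('%m%d%H%M%S')
    return 'beamapp-{}-{}'.format(user, stamp)
```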
class apache_beam.runners.dataflow.internal.apiclient.MetricUpdateTranslators[source]

Bases: object

Translators between accumulators and dataflow metric updates.

static translate_boolean(accumulator, metric_update_proto)[source]
static translate_scalar_counter_float(accumulator, metric_update_proto)[source]
static translate_scalar_counter_int(accumulator, metric_update_proto)[source]
static translate_scalar_mean_float(accumulator, metric_update_proto)[source]
static translate_scalar_mean_int(accumulator, metric_update_proto)[source]
class apache_beam.runners.dataflow.internal.apiclient.Step(step_kind, step_name, additional_properties=None)[source]

Bases: object

Wrapper for a dataflow Step protobuf.

add_property(name, value, with_type=False)[source]
get_output(tag=None)[source]

Returns name if it is one of the outputs or first output if name is None.

Parameters:	tag – tag of the output as a string or None if we want to get the name of the first output.
Returns:	The name of the output associated with the tag or the first output if tag was None.
Raises:	ValueError – if the tag does not exist within outputs.
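The documented get_output contract can be illustrated with a self-contained stand-in (an assumption for illustration only: outputs are kept as an ordered list of tag names, whereas the real Step wraps a protobuf):

```python
class StepSketch(object):
    """Minimal stand-in mirroring the documented get_output contract."""

    def __init__(self, outputs):
        self._outputs = list(outputs)  # ordered output tag names

    def get_output(self, tag=None):
        # None means "give me the first output".
        if tag is None:
            return self._outputs[0]
        if tag not in self._outputs:
            raise ValueError('No output with tag %r' % tag)
        return tag


step = StepSketch(['out', 'side_out'])
```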
apache_beam.runners.dataflow.internal.apiclient.translate_distribution(distribution_update, metric_update_proto)[source]

Translate metrics DistributionUpdate to dataflow distribution update.

apache_beam.runners.dataflow.internal.apiclient.translate_mean(accumulator, metric_update)[source]
apache_beam.runners.dataflow.internal.apiclient.translate_scalar(accumulator, metric_update)[source]
apache_beam.runners.dataflow.internal.apiclient.translate_value(value, metric_update_proto)[source]
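The translate_* helpers above all follow the same pattern: read fields off an accumulator or update object and copy them onto the matching fields of a metric-update proto. A sketch for the distribution case (assumption: plain Python objects stand in for the real DistributionUpdate and proto types, which carry more fields):

```python
from collections import namedtuple

# Stand-in for the real DistributionUpdate type.
DistributionUpdate = namedtuple('DistributionUpdate', 'min max count sum')


class DistributionProto(object):
    """Stand-in for the Dataflow distribution-update proto message."""

    def __init__(self):
        self.min = self.max = self.count = self.sum = None


def translate_distribution(distribution_update, metric_update_proto):
    """Copy a DistributionUpdate's fields onto a proto-like object."""
    metric_update_proto.min = distribution_update.min
    metric_update_proto.max = distribution_update.max
    metric_update_proto.count = distribution_update.count
    metric_update_proto.sum = distribution_update.sum


proto = DistributionProto()
translate_distribution(DistributionUpdate(1, 9, 4, 16), proto)
```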

apache_beam.runners.dataflow.internal.dependency module

Support for installing custom code and required dependencies.

Workflows, with the exception of very simple ones, are organized in multiple modules and packages. Typically, these modules and packages have dependencies on other standard libraries. Dataflow relies on the Python setuptools package to handle these scenarios. For further details please read: https://pythonhosted.org/an_example_pypi_project/setuptools.html

When a runner tries to run a pipeline it will check for the --requirements_file and --setup_file options.

If --setup_file is present, it is assumed that the folder containing the file specified by the option has the typical layout required by setuptools, and the runner will execute 'python setup.py sdist' to produce a source distribution. The resulting tarball (a .tar or .tar.gz file) will be staged at the GCS staging location specified as a job option. When a worker starts it will check for the presence of this file and will run 'easy_install tarball' to install the package on the worker.

If --requirements_file is present, the file specified by the option will be staged in the GCS staging location. When a worker starts it will check for the presence of this file and will run 'pip install -r requirements.txt'. A requirements file can be generated by running 'pip freeze > requirements.txt'. The reason a Dataflow runner does not do this automatically is that quite often only a small fraction of the dependencies listed in a requirements.txt file are actually needed for remote execution, so a one-time manual trimming is desirable.
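The workflow described above can be sketched in Python (assumptions: pip is reachable as 'python -m pip', and the submission command in the trailing comment uses placeholder names for the pipeline script and bucket):

```python
import subprocess
import sys

# Generate requirements.txt from the active environment ('pip freeze'),
# then hand-trim it so workers only install what remote execution needs.
frozen = subprocess.check_output([sys.executable, '-m', 'pip', 'freeze'])
with open('requirements.txt', 'wb') as f:
    f.write(frozen)

# Hypothetical submission; my_pipeline.py and the bucket are placeholders:
#   python my_pipeline.py \
#       --runner DataflowRunner \
#       --setup_file ./setup.py \
#       --requirements_file requirements.txt \
#       --staging_location gs://my-bucket/staging
```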

TODO(silviuc): Staged files should have a job specific prefix. To prevent several jobs in the same project stomping on each other due to a shared staging location.

TODO(silviuc): Should we allow several setup packages?

TODO(silviuc): We should allow customizing the exact command for setup build.


apache_beam.runners.dataflow.internal.dependency.get_required_container_version()[source]

For internal use only; no backwards-compatibility guarantees.

Returns the Google Cloud Dataflow container version for remote execution.

apache_beam.runners.dataflow.internal.dependency.get_sdk_name_and_version()[source]

For internal use only; no backwards-compatibility guarantees.

Returns name and version of SDK reported to Google Cloud Dataflow.

apache_beam.runners.dataflow.internal.dependency.get_sdk_package_name()[source]

For internal use only; no backwards-compatibility guarantees.

Returns the PyPI package name to be staged to Google Cloud Dataflow.

apache_beam.runners.dataflow.internal.dependency.stage_job_resources(options, file_copy=<function _dependency_file_copy>, build_setup_args=None, temp_dir=None, populate_requirements_cache=<function _populate_requirements_cache>)[source]

For internal use only; no backwards-compatibility guarantees.

Creates (if needed) and stages job resources to options.staging_location.

Parameters:	
  • options – Command line options. More specifically the function will expect staging_location, requirements_file, setup_file, and save_main_session options to be present.
  • file_copy – Callable for copying files. The default version will copy from a local file to a GCS location using the gsutil tool available in the Google Cloud SDK package.
  • build_setup_args – A list of command line arguments used to build a setup package. Used only if options.setup_file is not None. Used only for testing.
  • temp_dir – Temporary folder where the resource building can happen. If None then a unique temp directory will be created. Used only for testing.
  • populate_requirements_cache – Callable for populating the requirements cache. Used only for testing.

Returns:	A list of file names (no paths) for the resources staged. All the files are assumed to be staged in options.staging_location.

Raises:	RuntimeError – If files specified are not found or an error is encountered while trying to create the resources (e.g., build a setup package).

apache_beam.runners.dataflow.internal.names module

Various names for properties, transforms, etc.

class apache_beam.runners.dataflow.internal.names.PropertyNames[source]

Bases: object

For internal use only; no backwards-compatibility guarantees.

Property strings as they are expected in the CloudWorkflow protos.

BIGQUERY_CREATE_DISPOSITION = 'create_disposition'
BIGQUERY_EXPORT_FORMAT = 'bigquery_export_format'
BIGQUERY_FLATTEN_RESULTS = 'bigquery_flatten_results'
BIGQUERY_QUERY = 'bigquery_query'
BIGQUERY_USE_LEGACY_SQL = 'bigquery_use_legacy_sql'
BIGQUERY_WRITE_DISPOSITION = 'write_disposition'
DISPLAY_DATA = 'display_data'
ELEMENT = 'element'
ELEMENTS = 'elements'
ENCODING = 'encoding'
FILE_NAME_PREFIX = 'filename_prefix'
FILE_NAME_SUFFIX = 'filename_suffix'
FILE_PATTERN = 'filepattern'
FORMAT = 'format'
INPUTS = 'inputs'
NON_PARALLEL_INPUTS = 'non_parallel_inputs'
NUM_SHARDS = 'num_shards'
OUT = 'out'
OUTPUT = 'output'
OUTPUT_INFO = 'output_info'
OUTPUT_NAME = 'output_name'
PARALLEL_INPUT = 'parallel_input'
PUBSUB_ID_LABEL = 'pubsub_id_label'
PUBSUB_SUBSCRIPTION = 'pubsub_subscription'
PUBSUB_TOPIC = 'pubsub_topic'
SERIALIZED_FN = 'serialized_fn'
SHARD_NAME_TEMPLATE = 'shard_template'
SOURCE_STEP_INPUT = 'custom_source_step_input'
STEP_NAME = 'step_name'
USER_FN = 'user_fn'
USER_NAME = 'user_name'
VALIDATE_SINK = 'validate_sink'
VALIDATE_SOURCE = 'validate_source'
VALUE = 'value'
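These constants serve as dictionary keys when assembling step properties for the CloudWorkflow protos, roughly as follows (a sketch: the constants are mirrored locally rather than imported, since the module offers no compatibility guarantees, and the payload values are hypothetical):

```python
# Mirror a few of the listed constants locally instead of importing the
# internal module, which offers no backwards-compatibility guarantees.
USER_NAME = 'user_name'
OUTPUT_INFO = 'output_info'
OUTPUT_NAME = 'output_name'
ENCODING = 'encoding'

# Hypothetical step-properties payload shaped like the CloudWorkflow protos.
step_properties = {
    USER_NAME: 'MyTransform/ParDo(MyDoFn)',
    OUTPUT_INFO: [{
        OUTPUT_NAME: 'out',
        ENCODING: {'@type': 'kind:bytes'},
    }],
}
```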
class apache_beam.runners.dataflow.internal.names.TransformNames[source]

Bases: object

For internal use only; no backwards-compatibility guarantees.

Transform strings as they are expected in the CloudWorkflow protos.

COLLECTION_TO_SINGLETON = 'CollectionToSingleton'
COMBINE = 'CombineValues'
CREATE_PCOLLECTION = 'CreateCollection'
DO = 'ParallelDo'
FLATTEN = 'Flatten'
GROUP = 'GroupByKey'
READ = 'ParallelRead'
WRITE = 'ParallelWrite'

Module contents