SDK Harness Configuration
Beam allows configuration of the SDK harness to accommodate varying cluster setups. (The options below are for Python, but much of this information should apply to the Java and Go SDKs as well.)
`environment_type` determines where user code will be executed. `environment_config` configures the environment, depending on the value of `environment_type`.

- `DOCKER` (default): User code is executed within a container started on each worker node. This requires Docker to be installed on worker nodes.
- `PROCESS`: User code is executed by processes that are automatically started by the runner on each worker node (see the sketch after this list).
  - `environment_config`: JSON of the form `{"os": "<OS>", "arch": "<ARCHITECTURE>", "command": "<process to execute>", "env": {"<Environment variable 1>": "<ENV_VAL>"}}`. All fields in the JSON are optional except `command`.
  - For `command`, it is recommended to use the bootloader executable, which can be built from source with `./gradlew :sdks:python:container:build` and copied from `sdks/python/container/build/target/launcher/linux_amd64/boot` to worker machines. Note that the Python bootloader assumes Python and the `apache_beam` module are installed on each worker machine.
- `EXTERNAL`: User code will be dispatched to an external service. For example, one can start an external service for Python workers by running `docker run -p=50000:50000 apache/beam_python3.6_sdk --worker_pool` (see the sketch after this list).
  - `environment_config`: Address of the external service, e.g. `localhost:50000`.
  - To access a Dockerized worker pool service from a Mac or Windows client, set the `BEAM_WORKER_POOL_IN_DOCKER_VM` environment variable on the client: `export BEAM_WORKER_POOL_IN_DOCKER_VM=1`.
- `LOOPBACK`: User code is executed within the same process that submitted the pipeline. This option is useful for local testing. However, it is not suitable for a production environment, as it performs work on the machine the job originated from. `environment_config` is not used for the `LOOPBACK` environment (see the sketch after this list).
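
To make the `PROCESS` option concrete, here is a minimal sketch of submitting a Python pipeline to a portable runner with a `PROCESS` environment. The job endpoint address and the bootloader path are illustrative placeholders, not values from this page; substitute the ones for your cluster.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Assumed values: a job service on localhost:8099 and the bootloader
# copied to /opt/apache/beam/boot on every worker machine.
options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
    "--environment_type=PROCESS",
    # Only "command" is required in the environment_config JSON.
    '--environment_config={"command": "/opt/apache/beam/boot"}',
])

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | beam.Create(["hello", "beam"])
     | beam.Map(print))
```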
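For the `EXTERNAL` option, the client only needs the address of the running worker pool. A sketch, assuming the `docker run` command above is already serving on port 50000 and a job service is reachable at a hypothetical `localhost:8099`:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# The worker pool container from the docker run command above is
# assumed to be listening on localhost:50000.
options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
    "--environment_type=EXTERNAL",
    "--environment_config=localhost:50000",
])
```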
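And for local testing with `LOOPBACK`, no `environment_config` is set. The choice of the Flink runner here is only an example:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# LOOPBACK executes user code in the submitting process, so no
# environment_config is needed.
options = PipelineOptions([
    "--runner=FlinkRunner",
    "--environment_type=LOOPBACK",
])
```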
`sdk_worker_parallelism` sets the number of SDK workers that run on each worker node. The default is 1. If set to 0, the value is chosen automatically by the runner based on parameters such as the number of CPU cores on the worker machine. This option is only used for Python pipelines on the Flink and Spark runners.
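For example, a sketch of running two SDK workers per worker node (the runner and environment chosen here are illustrative):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Illustrative: start two SDK harness workers on each worker node.
options = PipelineOptions([
    "--runner=FlinkRunner",
    "--environment_type=LOOPBACK",
    "--sdk_worker_parallelism=2",
])
```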