Managing Python Pipeline Dependencies
Note: This page is only applicable to runners that do remote execution.
When you run your pipeline locally, the packages that your pipeline depends on are available because they are installed on your local machine. However, when you want to run your pipeline remotely, you must make sure these dependencies are available on the remote machines. This tutorial shows you how to make your dependencies available to the remote workers. Each section below refers to a different source that your package may have been installed from.
Note: Remote workers used for pipeline execution typically have a standard Python distribution installation in a Debian-based container image. If your code relies only on standard Python packages, then you probably don’t need to do anything on this page.
PyPI Dependencies
If your pipeline uses public packages from the Python Package Index, make these packages available remotely by performing the following steps:
Note: If your PyPI package depends on a non-Python package (e.g. a package that requires installation on Linux using the apt-get install command), see the PyPI Dependencies with Non-Python Dependencies section instead.
1. Find out which packages are installed on your machine. Run the following command:

   pip freeze > requirements.txt

   This command creates a requirements.txt file that lists all packages that are installed on your machine, regardless of where they were installed from.

2. Edit the requirements.txt file and delete all packages that are not relevant to your code.

3. Run your pipeline with the following command-line option:

   --requirements_file requirements.txt

   The runner will use the requirements.txt file to install your additional dependencies onto the remote workers.
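If you prefer to configure this option in code rather than on the command line, the same flag can be passed to PipelineOptions, as in the following minimal sketch (the one-line pipeline body is only a placeholder):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Equivalent to passing --requirements_file requirements.txt on the command line.
options = PipelineOptions(['--requirements_file=requirements.txt'])

with beam.Pipeline(options=options) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(print)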
NOTE: An alternative to pip freeze is to use a library like pip-tools to compile all the dependencies required for the pipeline from a --requirements_file in which only the top-level dependencies are mentioned.
Custom Containers
You can pass a container image with all the dependencies that are needed for the pipeline instead of requirements.txt. Follow the instructions on how to run a pipeline with Custom Container images.

If you are using a custom container image, we recommend that you install the dependencies from the --requirements_file directly into your image at build time. In this case, you do not need to pass the --requirements_file option at runtime, which reduces the pipeline startup time.

# Add these lines with the path to the requirements.txt to the Dockerfile
COPY <path to requirements.txt> /tmp/requirements.txt
RUN python -m pip install -r /tmp/requirements.txt
Local or non-PyPI Dependencies
If your pipeline uses packages that are not available publicly (e.g. packages that you’ve downloaded from a GitHub repo), make these packages available remotely by performing the following steps:
1. Identify which packages are installed on your machine and are not public. Run the following command:

   pip freeze

   This command lists all packages that are installed on your machine, regardless of where they were installed from.

2. Run your pipeline with the following command-line option:

   --extra_package /path/to/package/package-name

   where package-name is the package's tarball. If you have the setup.py for that package, you can build the tarball with the following command:

   python setup.py sdist

   See the sdist documentation for more details on this command.
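For illustration, the following sketch passes the tarball programmatically and calls a function imported from that local package inside the pipeline; the package, module, function, and tarball names are hypothetical placeholders.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical import from the non-public package built with `python setup.py sdist`.
from my_private_pkg.transforms import normalize_record

options = PipelineOptions([
    # The path and file name of the tarball are placeholders.
    '--extra_package=/path/to/package/my_private_pkg-0.1.0.tar.gz',
])

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create([{'id': 1}, {'id': 2}])
     | beam.Map(normalize_record))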
Multiple File Dependencies
Often, your pipeline code spans multiple files. To run your project remotely, you must group these files as a Python package and specify the package when you run your pipeline. When the remote workers start, they will install your package. To group your files as a Python package and make it available remotely, perform the following steps:
1. Create a setup.py file for your project. The following is a very basic setup.py file.

   import setuptools

   setuptools.setup(
       name='PACKAGE-NAME',
       version='PACKAGE-VERSION',
       install_requires=[],
       packages=setuptools.find_packages(),
   )

2. Structure your project so that the root directory contains the setup.py file, the main workflow file, and a directory with the rest of the files.

   root_dir/
     setup.py
     main.py
     other_files_dir/

   See Juliaset for an example that follows this required project structure.
3. Run your pipeline with the following command-line option:

   --setup_file /path/to/setup.py
Note: If you created a requirements.txt file and your project spans multiple files, you can get rid of the requirements.txt file and instead add all packages contained in requirements.txt to the install_requires field of the setup call (in step 1).
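For example, a consolidated setup.py might look like the following sketch; the pinned packages in install_requires are hypothetical placeholders for whatever your requirements.txt previously contained.

import setuptools

setuptools.setup(
    name='PACKAGE-NAME',
    version='PACKAGE-VERSION',
    # Hypothetical entries that previously lived in requirements.txt.
    install_requires=[
        'numpy==1.26.4',
        'requests>=2.31',
    ],
    packages=setuptools.find_packages(),
)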
Non-Python Dependencies or PyPI Dependencies with Non-Python Dependencies
If your pipeline uses non-Python packages (e.g. packages that require installation using the apt-get install command), or uses a PyPI package that depends on non-Python dependencies during package installation, you must perform the following steps.
1. Add the required installation commands (e.g. the apt-get install commands) for the non-Python dependencies to the list of CUSTOM_COMMANDS in your setup.py file. See the Juliaset setup.py for an example; a sketch is also shown at the end of this section.

   Note: You must make sure that these commands are runnable on the remote worker (e.g. if you use apt-get, the remote worker needs apt-get support).

2. If you are using a PyPI package that depends on non-Python dependencies, add ['pip', 'install', '<your PyPI package>'] to the list of CUSTOM_COMMANDS in your setup.py file.

3. Structure your project so that the root directory contains the setup.py file, the main workflow file, and a directory with the rest of the files.

   root_dir/
     setup.py
     main.py
     other_files_dir/

   See the Juliaset project for an example that follows this required project structure.
4. Run your pipeline with the following command-line option:

   --setup_file /path/to/setup.py
Note: Because custom commands execute after the dependencies for your workflow are installed (by pip), you should omit the PyPI package dependency from the pipeline's requirements.txt file and from the install_requires parameter in the setuptools.setup() call of your setup.py file.
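The following is a minimal sketch of what such a setup.py can look like, modeled loosely on the Juliaset example; the package name, version, and the apt-get package are hypothetical, and your actual CUSTOM_COMMANDS will differ.

import subprocess

import setuptools
from distutils.command.build import build as _build  # as in the Juliaset example

# Hypothetical commands; replace them with the ones your pipeline needs.
CUSTOM_COMMANDS = [
    ['apt-get', 'update'],
    ['apt-get', '--assume-yes', 'install', 'libsnappy-dev'],
    # ['pip', 'install', '<your PyPI package>'],  # only if that package needs the non-Python dependencies above
]

class build(_build):
    # Run the custom commands as an extra step of the build on the remote worker.
    sub_commands = _build.sub_commands + [('CustomCommands', None)]

class CustomCommands(setuptools.Command):
    user_options = []

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        for command in CUSTOM_COMMANDS:
            # Each command must be runnable on the remote worker.
            subprocess.check_call(command)

setuptools.setup(
    name='PACKAGE-NAME',
    version='PACKAGE-VERSION',
    packages=setuptools.find_packages(),
    cmdclass={
        'build': build,
        'CustomCommands': CustomCommands,
    },
)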
Pre-building SDK Container Image
In pipeline execution modes where a Beam runner launches SDK workers in Docker containers, the additional pipeline dependencies (specified via --requirements_file and other runtime options) are installed into the containers at runtime. This can increase the worker startup time.
However, it may be possible to pre-build the SDK containers and perform the dependency installation once before the workers start by using the --prebuild_sdk_container_engine option. For instructions on how to use pre-building with Google Cloud Dataflow, see Pre-building the python SDK custom container image with extra dependencies.
NOTE: This feature is available only for the Dataflow Runner v2.
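As an illustration only, pre-building is driven by pipeline options similar to the following sketch; the engine value and registry URL shown here are assumptions, so consult the Dataflow instructions linked above for the exact flags and values supported by your SDK version.

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--requirements_file=requirements.txt',
    # Assumed values for illustration only; check the Dataflow documentation.
    '--prebuild_sdk_container_engine=cloud_build',
    '--docker_registry_push_url=gcr.io/my-project/prebuilt-beam-sdk',
])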
Pickling and Managing the Main Session
When the Python SDK submits the pipeline for execution to a remote runner, the pipeline contents, such as transform user code, are serialized (or pickled) into bytecode using libraries that perform the serialization (also called picklers). The default pickler library used by Beam is dill.
To use the cloudpickle pickler, supply the --pickle_library=cloudpickle pipeline option. The cloudpickle support is currently experimental.
By default, global imports, functions, and variables defined in the main pipeline module are not saved during the serialization of a Beam job.
Thus, one might encounter an unexpected NameError when running a DoFn on any remote runner. To resolve this, supply the main session content with the pipeline by setting the --save_main_session pipeline option. This will load the pickled state of the global namespace onto the Dataflow workers (if using DataflowRunner).

For example, see Handling NameErrors to set the main session on the DataflowRunner.
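As an illustration, the following sketch shows a module-level import that a DoFn uses at runtime; without --save_main_session (and with the default dill pickler), a remote worker may raise a NameError because the global re binding is not shipped with the pipeline. The word-splitting logic is only a made-up placeholder.

import re  # module-level import used inside the DoFn below

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class ExtractWords(beam.DoFn):
    def process(self, line):
        # On a remote worker, `re` is only defined here if the main session was saved.
        return re.findall(r"[A-Za-z']+", line)

# Saving the main session ships the pickled global namespace to the workers.
options = PipelineOptions(['--save_main_session'])

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create(['a minimal example'])
     | beam.ParDo(ExtractWords())
     | beam.Map(print))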
Managing the main session in the Python SDK is only necessary when using the dill pickler on a remote runner. Therefore, this issue does not occur in the DirectRunner.
Since serialization of the pipeline happens at job submission and deserialization happens at runtime, it is imperative that the same version of the pickling library is used at job submission and at runtime.

To ensure this, Beam typically sets a very narrow supported version range for pickling libraries. If, for whatever reason, users cannot use the version of dill or cloudpickle required by Beam and choose to install a custom version, they must also ensure that they use the same custom version at runtime (e.g. in their custom container, or by specifying a pipeline dependency requirement).
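As a hedged diagnostic sketch (not an official recipe), you can log the pickler version from inside a transform and compare it with the version installed in your submission environment; the function name below is made up.

import logging

import apache_beam as beam

def log_pickler_version(element):
    # Imported here so the reported version is the one installed on the worker.
    import dill
    logging.info('Runtime dill version: %s', dill.__version__)
    return element

# Example usage inside a pipeline: ... | beam.Map(log_pickler_version) | ...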