A transform for generic parallel processing.
ParDo transform considers each element in the input
performs some processing function (your user code) on that element,
and emits zero or more elements to an output
See more information in the Beam Programming Guide.
In the following examples, we explore how to create custom
DoFns and access
the timestamp and windowing information.
Example 1: ParDo with a simple DoFn
The following example defines a simple
DoFn class called
which stores the
delimiter as an object field.
process method is called once per element,
and it can yield zero or more output elements.
Example 2: ParDo with timestamp and window information
In this example, we add new parameters to the
process method to bind parameter values at runtime.
beam.DoFn.TimestampParambinds the timestamp information as an
beam.DoFn.WindowParambinds the window information as the appropriate
Example 3: ParDo with DoFn methods
can be customized with a number of methods that can help create more complex behaviors.
You can customize what a worker does when it starts and shuts down with
You can also customize what to do when a
bundle of elements
starts and finishes with
DoFn.setup(): Called whenever the
DoFninstance is deserialized on the worker. This means it can be called more than once per worker because multiple instances of a given
DoFnsubclass may be created (e.g., due to parallelization, or due to garbage collection after a period of disuse). This is a good place to connect to database instances, open network connections or other resources.
DoFn.start_bundle(): Called once per bundle of elements before calling
processon the first element of the bundle. This is a good place to start keeping track of the bundle elements.
DoFn.finish_bundle(): Called once per bundle of elements after calling
processafter the last element of the bundle, can yield zero or more elements. This is a good place to do batch calls on a bundle of elements, such as running a database query.
For example, you can initialize a batch in
start_bundle, add elements to the batch in
processinstead of yielding them, then running a batch query on those elements on
finish_bundle, and yielding all the results.
Note that yielded elements from
finish_bundlemust be of the type
apache_beam.utils.windowed_value.WindowedValue. You need to provide a timestamp as a unix timestamp, which you can get from the last processed element. You also need to provide a window, which you can get from the last processed element like in the example below.
DoFn.teardown(): Called once (as a best effort) per
DoFninstance when the
DoFninstance is shutting down. This is a good place to close database instances, close network connections or other resources.
teardownis called as a best effort and is not guaranteed. For example, if the worker crashes,
teardownmight not be called.
- [Issue 19394]
DoFn.teardown()metrics are lost.
- Map behaves the same, but produces exactly one output for each input.
- FlatMap behaves the same as
Map, but for each input it may produce zero or more outputs.
- Filter is useful if the function is just deciding whether to output an element or not.
Last updated on 2023/12/11
Have you found everything you were looking for?
Was it all useful and clear? Is there anything that you would like to change? Let us know!