Grouping elements for efficient external service calls using the GroupIntoBatches
-transform
- Java SDK
- Python SDK
Usually, authoring an Apache Beam pipeline can be done with out-of-the-box tools and transforms like ParDo’s, Window’s and GroupByKey’s. However, when you want more tight control, you can keep state in an otherwise stateless DoFn.
State is kept on a per-key and per-windows basis, and as such, the input to your stateful DoFn needs to be keyed (e.g. by the customer identifier if you’re tracking clicks from an e-commerce website).
Examples of use cases are: assigning a unique ID to each element, joining streams of data in ‘more exotic’ ways, or batching up API calls to external services. In this section we’ll go over the last one in particular.
Make sure to check the docs for deeper understanding on state and timers.
The GroupIntoBatches
-transform uses state and timers under the hood to allow the user to exercise tight control over the following parameters:
maxBufferDuration
: limits the amount of waitingtime for a batch to be emitted.batchSize
: limits the number of elements in one batch.batchSizeBytes
: (in Java only) limits the bytesize of one batch (using input coder to determine elementsize).elementByteSize
: (in Java only) limits the bytesize of one batch (using a user defined function to determine elementsize).
while abstracting away the implementation details from users.
The withShardedKey()
functionality increases parallelism by spreading one key over multiple threads.
The transforms are used in the following way in Java & Python:
Applying these transforms will output groups of elements in a batch on a per-key basis, which you can then use to call an external API in bulk rather than on a per-element basis, resulting in a lower overhead in your pipeline.
Last updated on 2025/01/19
Have you found everything you were looking for?
Was it all useful and clear? Is there anything that you would like to change? Let us know!