GroupIntoBatches

Batches the values of each key in a keyed input PCollection into lists of a desired batch size.

Examples

In the following example, we create a pipeline with a PCollection of produce keyed by season.

We use GroupIntoBatches to get fixed-size batches for every key. Each output is a (key, list of elements) pair, and the last batch for a key may hold fewer elements than the batch size.

import apache_beam as beam

with beam.Pipeline() as pipeline:
  batches_with_keys = (
      pipeline
      | 'Create produce' >> beam.Create([
          ('spring', '🍓'),
          ('spring', '🥕'),
          ('spring', '🍆'),
          ('spring', '🍅'),
          ('summer', '🥕'),
          ('summer', '🍅'),
          ('summer', '🌽'),
          ('fall', '🥕'),
          ('fall', '🍅'),
          ('winter', '🍆'),
      ])
      | 'Group into batches' >> beam.GroupIntoBatches(3)
      | beam.Map(print))

Output:

('spring', ['🍓', '🥕', '🍆'])
('summer', ['🥕', '🍅', '🌽'])
('spring', ['🍅'])
('fall', ['🥕', '🍅'])
('winter', ['🍆'])
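
To make the batching rule in this output easier to follow, here is a minimal plain-Python sketch of the same per-key logic. The helper name group_into_batches is hypothetical and is not part of Beam. It only illustrates the semantics on an in-memory list and is not how Beam implements the transform. A real runner may also emit the batches in a different order.

```python
from collections import defaultdict


def group_into_batches(pairs, batch_size):
    """Hypothetical helper: yield (key, batch) with at most batch_size values."""
    buffers = defaultdict(list)  # values buffered so far, per key
    for key, value in pairs:
        buffers[key].append(value)
        if len(buffers[key]) == batch_size:
            # Emit a batch as soon as it is full and reset that key's buffer.
            yield key, buffers.pop(key)
    # Leftover values for each key form a smaller final batch.
    for key, values in buffers.items():
        if values:
            yield key, values


produce = [
    ('spring', '🍓'),
    ('spring', '🥕'),
    ('spring', '🍆'),
    ('spring', '🍅'),
    ('summer', '🥕'),
    ('summer', '🍅'),
    ('summer', '🌽'),
    ('fall', '🥕'),
    ('fall', '🍅'),
    ('winter', '🍆'),
]

batches = list(group_into_batches(produce, 3))
for batch in batches:
    print(batch)
```

With a batch size of 3, 'spring' has four values and therefore yields one full batch plus a one-element remainder, while 'fall' and 'winter' never fill a batch and are emitted only as remainders.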

For unkeyed data and dynamic batch sizes, one may want to use BatchElements.
