Sample





Transforms for taking samples of the elements in a collection, or samples of the values associated with each key in a collection of key-value pairs.

Examples

Each of the following examples creates a pipeline with a PCollection and then takes a random sample of its elements in a different way.

Example 1: Sample elements from a PCollection

We use Sample.FixedSizeGlobally() to get a fixed-size random sample of elements from the entire PCollection. The result is a single list containing at most the requested number of elements.

import apache_beam as beam

with beam.Pipeline() as pipeline:
  sample = (
      pipeline
      | 'Create produce' >> beam.Create([
          '🍓 Strawberry',
          '🥕 Carrot',
          '🍆 Eggplant',
          '🍅 Tomato',
          '🥔 Potato',
      ])
      | 'Sample N elements' >> beam.combiners.Sample.FixedSizeGlobally(3)
      | beam.Map(print))

Output (one possible result; the sample is random, so it varies between runs):

['🥕 Carrot', '🍆 Eggplant', '🍅 Tomato']
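
To make the "fixed-size random sample" idea concrete, here is a plain-Python sketch of reservoir sampling, one standard way to draw a uniform sample of at most n elements in a single pass. It does not use Beam and is not meant to describe how Beam implements Sample.FixedSizeGlobally(); the helper name reservoir_sample and the plain-text produce names are assumptions made for illustration.

```python
import random


def reservoir_sample(elements, n, rng=None):
    """Return a uniform random sample of at most n items in one pass.

    Hypothetical helper for illustration only; not part of Apache Beam.
    """
    rng = rng or random.Random()
    reservoir = []
    for index, element in enumerate(elements):
        if index < n:
            # The first n elements fill the reservoir.
            reservoir.append(element)
        else:
            # Keep the new element with probability n / (index + 1)
            # by overwriting a randomly chosen slot.
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < n:
                reservoir[slot] = element
    return reservoir


produce = ['Strawberry', 'Carrot', 'Eggplant', 'Tomato', 'Potato']

# Three of the five items; which ones are chosen varies between runs.
sample = reservoir_sample(produce, 3)
print(sample)

# With fewer elements than the sample size, every element is kept.
small_sample = reservoir_sample(produce[:2], 3)
print(small_sample)
```

As in the Beam example, the sample size is an upper bound: an input with fewer elements than requested comes back whole.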




Example 2: Sample elements for each key

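We use Sample.FixedSizePerKey() to get a fixed-size random sample of the values for each unique key in a PCollection of key-value pairs. Each key is paired with a list of at most the requested number of values, so a key with fewer values keeps all of them.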

import apache_beam as beam

with beam.Pipeline() as pipeline:
  samples_per_key = (
      pipeline
      | 'Create produce' >> beam.Create([
          ('spring', '🍓'),
          ('spring', '🥕'),
          ('spring', '🍆'),
          ('spring', '🍅'),
          ('summer', '🥕'),
          ('summer', '🍅'),
          ('summer', '🌽'),
          ('fall', '🥕'),
          ('fall', '🍅'),
          ('winter', '🍆'),
      ])
      | 'Samples per key' >> beam.combiners.Sample.FixedSizePerKey(3)
      | beam.Map(print))

Output (one possible result; the sample is random, so it varies between runs):

('spring', ['🍓', '🥕', '🍆'])
('summer', ['🥕', '🍅', '🌽'])
('fall', ['🥕', '🍅'])
('winter', ['🍆'])
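
The per-key behavior can be sketched in plain Python as "group the values by key, then sample each group". This is only an illustration of the semantics under that assumption, not Beam's implementation; sample_per_key is a hypothetical helper, and the plain-text produce names stand in for the emoji used above.

```python
import random
from collections import defaultdict


def sample_per_key(pairs, n, rng=None):
    """Group values by key, then keep at most n random values per key.

    Hypothetical helper for illustration only; not part of Apache Beam.
    """
    rng = rng or random.Random()
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # random.sample raises ValueError when asked for more items than exist,
    # so the sample size is capped at the number of values for each key.
    return {
        key: rng.sample(values, min(n, len(values)))
        for key, values in grouped.items()
    }


pairs = [
    ('spring', 'strawberry'),
    ('spring', 'carrot'),
    ('spring', 'eggplant'),
    ('spring', 'tomato'),
    ('summer', 'carrot'),
    ('summer', 'tomato'),
    ('summer', 'corn'),
    ('fall', 'carrot'),
    ('fall', 'tomato'),
    ('winter', 'eggplant'),
]

samples = sample_per_key(pairs, 3)
for key, values in samples.items():
    # Only 'spring' has more than three values, so only it loses one.
    print((key, values))
```

Only the 'spring' key has more values than the sample size, so it is the only key whose list differs from run to run in content; the other keys always keep all of their values, although their order may change.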



