apache_beam.transforms.stats module

This module contains all statistics-related transforms.

The ApproximateUnique class will be deprecated [1]. Please consider using HllCount in the zetasketch extension module [2] instead.

[1] https://lists.apache.org/thread.html/501605df5027567099b81f18c080469661fb4264a002615fa1510502%40%3Cdev.beam.apache.org%3E

[2] https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html

class apache_beam.transforms.stats.ApproximateQuantiles

Bases: object

PTransform for summarizing the distribution of data with approximate N-tiles (e.g. quartiles, percentiles), either globally or per key.

Examples

in: list(range(101)), num_quantiles=5

out: [0, 25, 50, 75, 100]

in: [(i, 1 if i<10 else 1e-5) for i in range(101)], num_quantiles=5, weighted=True

out: [0, 2, 5, 7, 100]

in: [list(range(10)), …, list(range(90, 101))], num_quantiles=5, input_batched=True

out: [0, 25, 50, 75, 100]

in: [(list(range(10)), [1]*10), (list(range(10)), [0]*10), …, (list(range(90, 101)), [0]*11)], num_quantiles=5, input_batched=True, weighted=True

out: [0, 2, 5, 7, 100]
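To make the first two examples concrete, here is a minimal pure-Python sketch that computes exact N-tiles and reproduces their outputs. It is an illustration only: the function name exact_quantiles and its selection rule are assumptions made for this sketch, and it is not the approximation algorithm the transform itself uses.

```python
def exact_quantiles(values, num_quantiles, weights=None):
    """Exact N-tiles of `values` (hypothetical helper, illustration only)."""
    if weights is None:
        weights = [1] * len(values)  # unweighted: every element counts once
    pairs = sorted(zip(values, weights))
    total = sum(weight for _, weight in pairs)

    result = [pairs[0][0]]  # the first quantile is the minimum
    cumulative = 0.0
    index = 0
    for i in range(1, num_quantiles - 1):
        # Fraction i / (num_quantiles - 1) of the total weight.
        target = total * i / (num_quantiles - 1)
        # Skip elements whose cumulative weight does not yet exceed the target.
        while cumulative + pairs[index][1] <= target:
            cumulative += pairs[index][1]
            index += 1
        result.append(pairs[index][0])
    result.append(pairs[-1][0])  # the last quantile is the maximum
    return result


# First example: unweighted input.
print(exact_quantiles(list(range(101)), 5))  # [0, 25, 50, 75, 100]

# Second example: the ten smallest values carry almost all of the weight.
weighted_input = [(i, 1 if i < 10 else 1e-5) for i in range(101)]
values = [value for value, _ in weighted_input]
weights = [weight for _, weight in weighted_input]
print(exact_quantiles(values, 5, weights))  # [0, 2, 5, 7, 100]
```

In the weighted case the interior quantiles fall among the values 0 to 9 because those ten elements hold nearly all of the total weight, while the first and last entries remain the overall minimum and maximum.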

class Globally(num_quantiles, key=None, reverse=False, weighted=False, input_batched=False)

Bases: PTransform

PTransform that takes a PCollection and returns a PCollection whose single element is a list of the approximate N-tiles of the whole input collection.

Parameters:
  • num_quantiles – number of elements in the resulting quantiles values list.

  • key – (optional) a function that maps each element to a comparable key, similar to the key argument of Python’s sorting methods.

  • reverse – (optional) whether to order elements largest to smallest, rather than the default smallest to largest.

  • weighted – (optional) if set to True, the transform returns weighted quantiles. The input PCollection is then expected to contain tuples of input values with the corresponding weight.

  • input_batched – (optional) if set to True, the transform expects each element of input PCollection to be a batch, which is a list of elements for non-weighted case and a tuple of lists of elements and weights for weighted. Provides a way to accumulate multiple elements at a time more efficiently.
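The weighted and input_batched flags together define four input shapes. The sketch below normalizes each shape into (value, weight) pairs so the differences are easy to see. The helper name iter_weighted_values is hypothetical and exists only for this illustration; it is not part of the module.

```python
def iter_weighted_values(elements, weighted=False, input_batched=False):
    """Yield (value, weight) pairs from any of the four documented shapes.

    Hypothetical helper for illustration; not part of apache_beam.
    """
    for element in elements:
        if input_batched and weighted:
            # Each element is a tuple of two parallel lists: values and weights.
            batch_values, batch_weights = element
            yield from zip(batch_values, batch_weights)
        elif input_batched:
            # Each element is a list of values; every value has weight 1.
            for value in element:
                yield value, 1
        elif weighted:
            # Each element is a (value, weight) tuple.
            yield element
        else:
            # Each element is a bare value with weight 1.
            yield element, 1


plain = [3, 1, 2]
batched = [[3, 1], [2]]
weighted = [(3, 0.5), (1, 2.0), (2, 1.0)]
batched_weighted = [([3, 1], [0.5, 2.0]), ([2], [1.0])]

# The batched forms carry the same data as their unbatched counterparts.
assert list(iter_weighted_values(plain)) == list(
    iter_weighted_values(batched, input_batched=True))
assert list(iter_weighted_values(weighted, weighted=True)) == list(
    iter_weighted_values(batched_weighted, weighted=True, input_batched=True))
```

Batching does not change the result; per the parameter description, it only lets the transform accumulate several elements at a time.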

display_data()

expand(pcoll)

class PerKey(num_quantiles, key=None, reverse=False, weighted=False, input_batched=False)

Bases: PTransform

PTransform that takes a PCollection of key-value pairs and returns, for each key, a list of the approximate N-tiles of the values associated with that key.

Parameters:
  • num_quantiles – number of elements in the resulting quantiles values list.

  • key – (optional) a function that maps each element to a comparable key, similar to the key argument of Python’s sorting methods.

  • reverse – (optional) whether to order elements largest to smallest, rather than the default smallest to largest.

  • weighted – (optional) if set to True, the transform returns weighted quantiles. The input PCollection is then expected to contain tuples of input values with the corresponding weight.

  • input_batched – (optional) if set to True, the transform expects each element of input PCollection to be a batch, which is a list of elements for non-weighted case and a tuple of lists of elements and weights for weighted. Provides a way to accumulate multiple elements at a time more efficiently.
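The per-key behaviour, together with the key and reverse parameters, can be sketched in plain Python as follows. This is an exact, in-memory illustration under two assumptions: that reverse=True follows the convention of Python's sorted (largest first), and that the hypothetical function exact_quantiles_per_key stands in for the transform. It is not the transform's approximate algorithm.

```python
from collections import defaultdict


def exact_quantiles_per_key(pairs, num_quantiles, key=None, reverse=False):
    """Exact N-tiles of the values under each key (illustration only)."""
    grouped = defaultdict(list)
    for group_key, value in pairs:
        grouped[group_key].append(value)

    result = {}
    for group_key, values in grouped.items():
        # `key` and `reverse` behave as in Python's built-in sorted().
        ordered = sorted(values, key=key, reverse=reverse)
        last = len(ordered) - 1
        # Pick num_quantiles evenly spaced ranks, from first to last.
        result[group_key] = [
            ordered[i * last // (num_quantiles - 1)]
            for i in range(num_quantiles)
        ]
    return result


data = [("a", v) for v in range(101)] + [("b", v) for v in range(41)]

ascending = exact_quantiles_per_key(data, 5)
print(ascending["a"])  # [0, 25, 50, 75, 100]
print(ascending["b"])  # [0, 10, 20, 30, 40]

descending = exact_quantiles_per_key(data, 5, reverse=True)
print(descending["b"])  # [40, 30, 20, 10, 0]

# `key` orders elements by a derived value, here string length.
words = [("w", "ccc"), ("w", "a"), ("w", "bb")]
print(exact_quantiles_per_key(words, 3, key=len)["w"])  # ['a', 'bb', 'ccc']
```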

display_data()

expand(pcoll)

class apache_beam.transforms.stats.ApproximateUnique

Bases: object

Hashes the input elements and uses a sample of the hash values to extrapolate the size of the entire set of hash values, assuming the remaining hash values are as densely distributed as those in the sample.
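The idea of extrapolating from the density of a hash sample can be illustrated with a k-minimum-values estimator, a related and well-known technique. The sketch below is not the code used by ApproximateUnique; the names hash64 and estimate_unique, the choice of SHA-256, and the estimator formula are all assumptions made for this illustration.

```python
import hashlib
import heapq

HASH_SPACE = 2 ** 64  # hashes are treated as uniform integers in [0, 2**64)


def hash64(element):
    """Deterministic 64-bit hash of an element's repr (assumed hash choice)."""
    digest = hashlib.sha256(repr(element).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def estimate_unique(elements, sample_size):
    """Estimate the number of distinct elements from the smallest hashes."""
    heap = []  # max-heap (negated values) holding the smallest hashes seen
    members = set()  # hashes currently in the heap, to skip duplicates
    for element in elements:
        h = hash64(element)
        if h in members:
            continue
        if len(heap) < sample_size:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # Replace the largest retained hash with the new, smaller one.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)

    if len(heap) < sample_size:
        # The sample holds every distinct hash, so the count is exact.
        return len(heap)

    # sample_size hashes fall below kth_smallest; if hashes are uniformly
    # dense, the whole hash space holds proportionally more of them.
    kth_smallest = -heap[0]
    return round((sample_size - 1) * HASH_SPACE / kth_smallest)


data = [i % 5000 for i in range(20000)]  # 5000 distinct values, each repeated
print(estimate_unique(data, sample_size=1024))  # an estimate near 5000
print(estimate_unique(["a", "b", "a"], sample_size=16))  # 2
```

A larger sample size gives a tighter estimate at the cost of more memory, which is the trade-off the size and error parameters expose.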

class Globally(size=None, error=None)

Bases: PTransform

ApproximateUnique.Globally approximates the number of unique values in the input PCollection.

expand(pcoll)

class PerKey(size=None, error=None)

Bases: PTransform

ApproximateUnique.PerKey approximates the number of unique values per key.

expand(pcoll)

static parse_input_params(size=None, error=None)

Check if input params are valid and return sample size.

Parameters:
  • size – an int no smaller than 16, used as the sample size when estimating the number of unique values.

  • error – max estimation error, which is a float between 0.01 and 0.50. If error is given, sample size will be calculated from error with _get_sample_size_from_est_error function.

Returns:

sample size

Raises:

ValueError: If both size and error are given, or neither is given, or values are out of range.
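The validation rules above can be sketched as follows. The function name sketch_parse_input_params is hypothetical, and the conversion from error to sample size, ceil(4 / error²), is an assumption about what _get_sample_size_from_est_error computes; check the library source before relying on it.

```python
import math


def sketch_parse_input_params(size=None, error=None):
    """Validate size/error and return a sample size (illustrative sketch)."""
    # Exactly one of the two parameters must be provided.
    if (size is None) == (error is None):
        raise ValueError("Specify exactly one of size or error.")

    if size is not None:
        if not isinstance(size, int) or size < 16:
            raise ValueError("size must be an int no smaller than 16.")
        return size

    if not 0.01 <= error <= 0.50:
        raise ValueError("error must be a float between 0.01 and 0.50.")
    # ASSUMPTION: sample size derived from the error as ceil(4 / error**2).
    return math.ceil(4.0 / error ** 2)


print(sketch_parse_input_params(size=32))     # 32
print(sketch_parse_input_params(error=0.5))   # 16
print(sketch_parse_input_params(error=0.25))  # 64
```

Under this assumed formula, the largest allowed error of 0.50 maps to the smallest allowed size of 16, which is consistent with the two documented bounds.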