org.apache.beam.sdk.transforms.ApproximateUnique

@Deprecated public class ApproximateUnique extends Object

Deprecated.

Consider using ApproximateCountDistinct in the zetasketch extension module, which makes use of the HllCount implementation.

If ApproximateCountDistinct does not meet your needs, you can use HllCount directly. Direct usage also lets you save intermediate aggregation results into a sketch for later processing.

For example, to estimate the number of distinct elements in a PCollection<String>:


 PCollection<String> input = ...;
 PCollection<Long> countDistinct =
     input.apply(HllCount.Init.forStrings().globally()).apply(HllCount.Extract.globally());

For more details about using HllCount and the zetasketch extension module, see https://s.apache.org/hll-in-beam#bookmark=id.v6chsij1ixo7.

PTransforms for estimating the number of distinct elements in a PCollection, or the number of distinct values associated with each key in a PCollection of KVs.

Nested Class Summary

Nested Classes

| Modifier and Type | Class | Description |
| --- | --- | --- |
| static class | ApproximateUnique.ApproximateUniqueCombineFn<T> | Deprecated. CombineFn that computes an estimate of the number of distinct values that were combined. |
| static final class | ApproximateUnique.Globally<T> | Deprecated. PTransform for estimating the number of distinct elements in a PCollection. |
| static final class | ApproximateUnique.PerKey<K,V> | Deprecated. PTransform for estimating the number of distinct values associated with each key in a PCollection of KVs. |

Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| ApproximateUnique() | Deprecated. |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| static <T> ApproximateUnique.Globally<T> | globally(double maximumEstimationError) | Deprecated. Like globally(int), but specifies the desired maximum estimation error instead of the sample size. |
| static <T> ApproximateUnique.Globally<T> | globally(int sampleSize) | Deprecated. Returns a PTransform that takes a PCollection<T> and returns a PCollection<Long> containing a single value that is an estimate of the number of distinct elements in the input PCollection. |
| static <K, V> ApproximateUnique.PerKey<K,V> | perKey(double maximumEstimationError) | Deprecated. Like perKey(int), but specifies the desired maximum estimation error instead of the sample size. |
| static <K, V> ApproximateUnique.PerKey<K,V> | perKey(int sampleSize) | Deprecated. Returns a PTransform that takes a PCollection<KV<K, V>> and returns a PCollection<KV<K, Long>> that contains an output element mapping each distinct key in the input PCollection to an estimate of the number of distinct values associated with that key in the input PCollection. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- ApproximateUnique
  
  public ApproximateUnique()
  
  Deprecated.
Method Details
- globally
  
  public static <T> ApproximateUnique.Globally<T> globally(int sampleSize)
  
  Deprecated.
  Returns a PTransform that takes a PCollection<T> and returns a PCollection<Long> containing a single value that is an estimate of the number of distinct elements in the input PCollection.
  The sampleSize parameter controls the estimation error. The error is about 2 / sqrt(sampleSize), so for ApproximateUnique.globally(10000) the estimation error is about 2%. Similarly, for ApproximateUnique.globally(16) the estimation error is about 50%. If there are fewer than sampleSize distinct elements, the returned result is exact with extremely high probability (the chance of a hash collision is about sampleSize^2 / 2^65).
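The arithmetic in the paragraph above can be checked with a few lines of plain Java. This is only a sketch of the stated formulas; the class and method names (EstimationErrorSketch, approximateError, collisionChance) are hypothetical and are not part of the Beam API.

```java
// Hypothetical helper that evaluates the formulas quoted in the text.
// It is not part of Apache Beam.
final class EstimationErrorSketch {

    private EstimationErrorSketch() {}

    // Approximate estimation error for a given sample size: 2 / sqrt(sampleSize).
    static double approximateError(int sampleSize) {
        return 2.0 / Math.sqrt(sampleSize);
    }

    // Approximate chance of a hash collision: sampleSize^2 / 2^65.
    static double collisionChance(int sampleSize) {
        return Math.pow(sampleSize, 2) / Math.pow(2, 65);
    }
}
```

With these definitions, approximateError(10000) evaluates to about 0.02 and approximateError(16) to 0.5, matching the 2% and 50% figures quoted above.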
  This transform approximates the number of elements in a set by computing the top sampleSize hash values, and using that to extrapolate the size of the entire set of hash values by assuming the rest of the hash values are as densely distributed as the top sampleSize.
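The idea of keeping the largest hash values and extrapolating from their density can be illustrated in plain Java. The following is a simplified sketch, not Beam's implementation: the class name TopHashEstimator, the choice of a SplitMix64-style mixer as the hash function, the restriction to long elements, and the exact extrapolation formula are all assumptions made for illustration.

```java
import java.util.TreeSet;

// Simplified illustration of a "top sampleSize hashes" estimator.
// Not Beam's implementation; names and hash function are assumptions.
final class TopHashEstimator {
    private final int sampleSize;
    // The largest distinct hash values seen so far (at most sampleSize of them).
    private final TreeSet<Long> topHashes = new TreeSet<>();

    TopHashEstimator(int sampleSize) {
        if (sampleSize < 16) {
            throw new IllegalArgumentException("sampleSize should be >= 16");
        }
        this.sampleSize = sampleSize;
    }

    // SplitMix64-style finalizer: spreads inputs roughly uniformly over all 64-bit values.
    static long hash(long x) {
        x = (x ^ (x >>> 30)) * 0xbf58476d1ce4e5b9L;
        x = (x ^ (x >>> 27)) * 0x94d049bb133111ebL;
        return x ^ (x >>> 31);
    }

    void add(long element) {
        long h = hash(element);
        if (topHashes.size() < sampleSize) {
            topHashes.add(h); // duplicates are ignored by the set
        } else if (h > topHashes.first() && topHashes.add(h)) {
            topHashes.pollFirst(); // evict the smallest to keep exactly sampleSize hashes
        }
    }

    long estimate() {
        if (topHashes.size() < sampleSize) {
            return topHashes.size(); // fewer distinct hashes than the sample: count them directly
        }
        // The sample covers the interval from the smallest retained hash up to Long.MAX_VALUE.
        double covered = Math.max(1.0, (double) Long.MAX_VALUE - (double) topHashes.first());
        double hashSpace = Math.pow(2, 64);
        // Assume the rest of the hash space is as densely populated as the covered interval.
        return Math.round(sampleSize * hashSpace / covered);
    }
}
```

Feeding the same element twice does not change the result, because only distinct hash values are retained. When more than sampleSize distinct elements are added, estimate() returns an extrapolation rather than an exact count.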
  See also globally(double).
  Example of use:

    PCollection<String> pc = ...;
    PCollection<Long> approxNumDistinct =
        pc.apply(ApproximateUnique.<String>globally(1000));

  Type Parameters:
  
  T - the type of the elements in the input PCollection
  
  Parameters:
  
  sampleSize - the number of entries in the statistical sample; the higher this number, the more accurate the estimate will be; should be >= 16
  
  Throws:
  
  IllegalArgumentException - if the sampleSize argument is too small
- globally
  
  public static <T> ApproximateUnique.Globally<T> globally(double maximumEstimationError)
  
  Deprecated.
  
  Like globally(int), but specifies the desired maximum estimation error instead of the sample size.
  
  Type Parameters:
  
  T - the type of the elements in the input PCollection
  
  Parameters:
  
  maximumEstimationError - the maximum estimation error, which should be in the range [0.01, 0.5]
  
  Throws:
  
  IllegalArgumentException - if the maximumEstimationError argument is out of range
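To relate a maximum estimation error to a sample size, one can invert the 2 / sqrt(sampleSize) relationship described for globally(int). The sketch below assumes that this inversion is how the two parameters correspond; the conversion actually used by the library is not stated here, and the names SampleSizeSketch and sampleSizeFor are hypothetical.

```java
// Hypothetical helper, not part of Apache Beam. It assumes the error bound
// error = 2 / sqrt(sampleSize) and solves it for sampleSize.
final class SampleSizeSketch {

    private SampleSizeSketch() {}

    static long sampleSizeFor(double maximumEstimationError) {
        if (maximumEstimationError < 0.01 || maximumEstimationError > 0.5) {
            throw new IllegalArgumentException(
                "maximumEstimationError should be in the range [0.01, 0.5]");
        }
        // error = 2 / sqrt(n)  =>  n = 4 / error^2, rounded up to a whole sample.
        return (long) Math.ceil(4.0 / (maximumEstimationError * maximumEstimationError));
    }
}
```

Under this assumption, the upper bound of 0.5 on the error corresponds to the minimum sample size of 16, which is consistent with the constraint documented for globally(int).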
- perKey
  
  public static <K, V> ApproximateUnique.PerKey<K,V> perKey(int sampleSize)
  
  Deprecated.
  Returns a PTransform that takes a PCollection<KV<K, V>> and returns a PCollection<KV<K, Long>> that contains an output element mapping each distinct key in the input PCollection to an estimate of the number of distinct values associated with that key in the input PCollection.
  See globally(int) for an explanation of the sampleSize parameter. A separate sampling is computed for each distinct key of the input.
  See also perKey(double).
  Example of use:

    PCollection<KV<Integer, String>> pc = ...;
    PCollection<KV<Integer, Long>> approxNumDistinctPerKey =
        pc.apply(ApproximateUnique.<Integer, String>perKey(1000));

  Type Parameters:
  
  K - the type of the keys in the input and output PCollections
  
  V - the type of the values in the input PCollection
  
  Parameters:
  
  sampleSize - the number of entries in the statistical sample; the higher this number, the more accurate the estimate will be; should be >= 16
  
  Throws:
  
  IllegalArgumentException - if the sampleSize argument is too small
- perKey
  
  public static <K, V> ApproximateUnique.PerKey<K,V> perKey(double maximumEstimationError)
  
  Deprecated.
  
  Like perKey(int), but specifies the desired maximum estimation error instead of the sample size.
  
  Type Parameters:
  
  K - the type of the keys in the input and output PCollections
  
  V - the type of the values in the input PCollection
  
  Parameters:
  
  maximumEstimationError - the maximum estimation error, which should be in the range [0.01, 0.5]
  
  Throws:
  
  IllegalArgumentException - if the maximumEstimationError argument is out of range
