Class ApproximateUnique
PTransforms for estimating the number of distinct elements in a PCollection, or
the number of distinct values associated with each key in a PCollection of KVs.
Nested Class Summary
Nested Classes
Modifier and Type, Description
static class
    Deprecated. CombineFn that computes an estimate of the number of distinct values that were combined.
static final class
    Deprecated. PTransform for estimating the number of distinct elements in a PCollection.
static final class
    Deprecated. PTransform for estimating the number of distinct values associated with each key in a PCollection of KVs.
Constructor Summary
Constructors
ApproximateUnique()
    Deprecated.
Method Summary
Modifier and Type, Method, Description
static <T> ApproximateUnique.Globally<T> globally(double maximumEstimationError)
    Deprecated. Like globally(int), but specifies the desired maximum estimation error instead of the sample size.
static <T> ApproximateUnique.Globally<T> globally(int sampleSize)
    Deprecated. Returns a PTransform that takes a PCollection<T> and returns a PCollection<Long> containing a single value that is an estimate of the number of distinct elements in the input PCollection.
static <K,V> ApproximateUnique.PerKey<K,V> perKey(double maximumEstimationError)
    Deprecated. Like perKey(int), but specifies the desired maximum estimation error instead of the sample size.
static <K,V> ApproximateUnique.PerKey<K,V> perKey(int sampleSize)
    Deprecated. Returns a PTransform that takes a PCollection<KV<K, V>> and returns a PCollection<KV<K, Long>> that contains an output element mapping each distinct key in the input PCollection to an estimate of the number of distinct values associated with that key in the input PCollection.
-
Constructor Details
-
ApproximateUnique
public ApproximateUnique()
Deprecated.
-
-
Method Details
-
globally
static <T> ApproximateUnique.Globally<T> globally(int sampleSize)
Deprecated.
Returns a PTransform that takes a PCollection<T> and returns a PCollection<Long> containing a single value that is an estimate of the number of distinct elements in the input PCollection.

The sampleSize parameter controls the estimation error. The error is about 2 / sqrt(sampleSize), so for ApproximateUnique.globally(10000) the estimation error is about 2%. Similarly, for ApproximateUnique.globally(16) the estimation error is about 50%. If there are fewer than sampleSize distinct elements then the returned result will be exact with extremely high probability (the chance of a hash collision is about sampleSize^2 / 2^65).

This transform approximates the number of elements in a set by computing the top sampleSize hash values, and using that to extrapolate the size of the entire set of hash values by assuming the rest of the hash values are as densely distributed as the top sampleSize.

See also globally(double).

Example of use:

    PCollection<String> pc = ...;
    PCollection<Long> approxNumDistinct =
        pc.apply(ApproximateUnique.<String>globally(1000));

Type Parameters:
T - the type of the elements in the input PCollection
Parameters:
sampleSize - the number of entries in the statistical sample; the higher this number, the more accurate the estimate will be; should be >= 16
Throws:
IllegalArgumentException - if the sampleSize argument is too small
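The extrapolation described above can be illustrated outside of Beam with a small, self-contained sketch. Everything in it is hypothetical and chosen for illustration only: the class name TopHashEstimator, the mix64 stand-in hash, and the (sampleSize - 1) / coveredFraction formula are assumptions, not Beam's implementation. The sketch keeps the sampleSize largest hash values and assumes the rest of the hash space is as densely populated as the part the sample covers.

```java
import java.util.TreeSet;

/** Hypothetical illustration of estimating distinct counts from the largest hash values. */
final class TopHashEstimator {
    private final int sampleSize;
    // Distinct hash values seen so far, capped at the sampleSize largest ones.
    private final TreeSet<Long> topHashes = new TreeSet<>();

    TopHashEstimator(int sampleSize) {
        if (sampleSize < 16) {
            throw new IllegalArgumentException("sampleSize must be >= 16");
        }
        this.sampleSize = sampleSize;
    }

    // Assumption: the SplitMix64 finalizer stands in for a real 64-bit hash function.
    static long mix64(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    void add(long element) {
        // Equal elements hash identically, so the set ignores repeats.
        topHashes.add(mix64(element));
        if (topHashes.size() > sampleSize) {
            topHashes.pollFirst(); // discard the smallest retained hash
        }
    }

    long estimate() {
        if (topHashes.size() < sampleSize) {
            // Fewer distinct hashes than the sample can hold: report the count directly.
            return topHashes.size();
        }
        // Fraction of the signed 64-bit hash space lying at or above the smallest retained hash.
        double coveredFraction =
            ((double) Long.MAX_VALUE - (double) topHashes.first()) / Math.pow(2, 64);
        // Assume the uncovered part of the hash space is as dense as the covered part.
        return Math.round((sampleSize - 1) / coveredFraction);
    }
}
```

With fewer than sampleSize distinct inputs this sketch returns the exact count, mirroring the behavior described for the transform; beyond that point the result is an extrapolation whose accuracy improves as sampleSize grows.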
-
globally
static <T> ApproximateUnique.Globally<T> globally(double maximumEstimationError)
Deprecated.
Like globally(int), but specifies the desired maximum estimation error instead of the sample size.

Type Parameters:
T - the type of the elements in the input PCollection
Parameters:
maximumEstimationError - the maximum estimation error, which should be in the range [0.01, 0.5]
Throws:
IllegalArgumentException - if the maximumEstimationError argument is out of range
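To relate the two overloads, the stated approximation error = 2 / sqrt(sampleSize) can be inverted to sampleSize = 4 / error^2. The helper below is a minimal sketch under that assumption; the class and method names are hypothetical, and the exact conversion and rounding used by the library may differ.

```java
/** Hypothetical helpers relating sample size and estimation error via error = 2 / sqrt(sampleSize). */
final class EstimationErrorMath {
    private EstimationErrorMath() {}

    /** Approximate relative error for a given sample size, per the documented formula. */
    static double estimationError(int sampleSize) {
        if (sampleSize < 16) {
            throw new IllegalArgumentException("sampleSize must be >= 16");
        }
        return 2.0 / Math.sqrt(sampleSize);
    }

    /** Assumption: the smallest sample size whose approximate error does not exceed the target. */
    static long sampleSizeFor(double maximumEstimationError) {
        // The negated range check also rejects NaN.
        if (!(maximumEstimationError >= 0.01 && maximumEstimationError <= 0.5)) {
            throw new IllegalArgumentException(
                "maximumEstimationError must be in the range [0.01, 0.5]");
        }
        return (long) Math.ceil(4.0 / (maximumEstimationError * maximumEstimationError));
    }
}
```

Under this inversion, the upper end of the documented error range, 0.5, corresponds to the minimum sample size of 16, which is consistent with the sampleSize >= 16 requirement on globally(int).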
-
perKey
static <K,V> ApproximateUnique.PerKey<K,V> perKey(int sampleSize)
Deprecated.
Returns a PTransform that takes a PCollection<KV<K, V>> and returns a PCollection<KV<K, Long>> that contains an output element mapping each distinct key in the input PCollection to an estimate of the number of distinct values associated with that key in the input PCollection.

See globally(int) for an explanation of the sampleSize parameter. A separate sampling is computed for each distinct key of the input.

See also perKey(double).

Example of use:

    PCollection<KV<Integer, String>> pc = ...;
    PCollection<KV<Integer, Long>> approxNumDistinctPerKey =
        pc.apply(ApproximateUnique.<Integer, String>perKey(1000));

Type Parameters:
K - the type of the keys in the input and output PCollections
V - the type of the values in the input PCollection
Parameters:
sampleSize - the number of entries in the statistical sample; the higher this number, the more accurate the estimate will be; should be >= 16
Throws:
IllegalArgumentException - if the sampleSize argument is too small
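To make the output shape concrete, the sketch below computes the exact number of distinct values per key for a small in-memory list. It is not Beam code and does no sampling; the class and method names are hypothetical. It may serve as a ground truth when checking the per-key estimates on small inputs, where each key has far fewer distinct values than sampleSize.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical exact reference for the quantity that perKey estimates. */
final class PerKeyDistinctReference {
    private PerKeyDistinctReference() {}

    /** Maps each distinct key to the exact number of distinct values paired with it. */
    static <K, V> Map<K, Long> distinctValuesPerKey(List<Map.Entry<K, V>> pairs) {
        // A separate set of seen values is kept for each key.
        Map<K, Set<V>> seen = new HashMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            seen.computeIfAbsent(pair.getKey(), k -> new HashSet<>()).add(pair.getValue());
        }
        Map<K, Long> counts = new HashMap<>();
        for (Map.Entry<K, Set<V>> entry : seen.entrySet()) {
            counts.put(entry.getKey(), (long) entry.getValue().size());
        }
        return counts;
    }
}
```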
-
perKey
static <K,V> ApproximateUnique.PerKey<K,V> perKey(double maximumEstimationError)
Deprecated.
Like perKey(int), but specifies the desired maximum estimation error instead of the sample size.

Type Parameters:
K - the type of the keys in the input and output PCollections
V - the type of the values in the input PCollection
Parameters:
maximumEstimationError - the maximum estimation error, which should be in the range [0.01, 0.5]
Throws:
IllegalArgumentException - if the maximumEstimationError argument is out of range
-
Consider using ApproximateCountDistinct in the zetasketch extension module, which makes use of the HllCount implementation.

If ApproximateCountDistinct does not meet your needs, you can use HllCount directly. Direct usage also lets you save intermediate aggregation results into a sketch for later processing.

HllCount can be used, for example, to estimate the number of distinct elements in a PCollection<String>.

For more details about using HllCount and the zetasketch extension module, see https://s.apache.org/hll-in-beam#bookmark=id.v6chsij1ixo7.