Consider using ApproximateCountDistinct in the zetasketch extension
     module, which makes use of the HllCount implementation.
     
If ApproximateCountDistinct does not meet your needs, you can use HllCount
     directly. Direct usage also lets you save intermediate aggregation results
     into a sketch for later processing.
     
For example, to estimate the number of distinct elements in a PCollection<String>:
     
 PCollection<String> input = ...;
 PCollection<Long> countDistinct =
     input.apply(HllCount.Init.forStrings().globally()).apply(HllCount.Extract.globally());

For more details about HllCount and the zetasketch extension module,
     see https://s.apache.org/hll-in-beam#bookmark=id.v6chsij1ixo7.

@Deprecated
public class ApproximateUnique
extends java.lang.Object
PTransforms for estimating the number of distinct elements in a PCollection, or
 the number of distinct values associated with each key in a PCollection of KVs.

| Modifier and Type | Class and Description | 
|---|---|
| static class | ApproximateUnique.ApproximateUniqueCombineFn<T> Deprecated. CombineFn that computes an estimate of the number of distinct values that were combined. | 
| static class | ApproximateUnique.Globally<T> Deprecated. PTransform for estimating the number of distinct elements in a PCollection. | 
| static class | ApproximateUnique.PerKey<K,V> Deprecated. PTransform for estimating the number of distinct values associated with each key in a PCollection of KVs. | 

| Constructor and Description | 
|---|
| ApproximateUnique() Deprecated. | 

| Modifier and Type | Method and Description | 
|---|---|
| static <T> ApproximateUnique.Globally<T> | globally(double maximumEstimationError) Deprecated. Like globally(int), but specifies the desired maximum estimation error instead of the sample size. | 
| static <T> ApproximateUnique.Globally<T> | globally(int sampleSize) Deprecated. Returns a PTransform that takes a PCollection<T> and returns a PCollection<Long> containing a single value that is an estimate of the number of distinct elements in the input PCollection. | 
| static <K,V> ApproximateUnique.PerKey<K,V> | perKey(double maximumEstimationError) Deprecated. Like perKey(int), but specifies the desired maximum estimation error instead of the sample size. | 
| static <K,V> ApproximateUnique.PerKey<K,V> | perKey(int sampleSize) Deprecated. Returns a PTransform that takes a PCollection<KV<K, V>> and returns a PCollection<KV<K, Long>> that contains an output element mapping each distinct key in the input PCollection to an estimate of the number of distinct values associated with that key in the input PCollection. | 

public static <T> ApproximateUnique.Globally<T> globally(int sampleSize)
Returns a PTransform that takes a PCollection<T> and returns a PCollection<Long> containing a single value that is an estimate of the number of distinct
 elements in the input PCollection.
 The sampleSize parameter controls the estimation error. The error is about
 2 / sqrt(sampleSize), so for ApproximateUnique.globally(10000) the estimation error is
 about 2%. Similarly, for ApproximateUnique.globally(16) the estimation error is about 50%. If
 there are fewer than sampleSize distinct elements, the returned result will be
 exact with extremely high probability (the chance of a hash collision is about sampleSize^2 / 2^65).
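
The relationship between sampleSize and the estimation error can be checked with a few lines of plain Java. This is a minimal sketch that only evaluates the approximation 2 / sqrt(sampleSize) stated above and its algebraic inverse; the class and method names are hypothetical and are not part of the Beam API, and the inverse is an assumption derived from the formula, not a statement about how globally(double) is implemented.

```java
final class SampleSizeError {
    // Approximate relative error for a given sample size: 2 / sqrt(sampleSize).
    static double estimationError(int sampleSize) {
        return 2.0 / Math.sqrt(sampleSize);
    }

    // Smallest sample size whose approximate error is at most maxError,
    // obtained by inverting the formula above: sampleSize >= 4 / maxError^2.
    // (Assumption: derived from the documented formula, not taken from Beam.)
    static long sampleSizeFor(double maxError) {
        return (long) Math.ceil(4.0 / (maxError * maxError));
    }

    public static void main(String[] args) {
        System.out.println(estimationError(10000)); // prints 0.02
        System.out.println(estimationError(16));    // prints 0.5
        System.out.println(sampleSizeFor(0.5));     // prints 16
        System.out.println(sampleSizeFor(0.25));    // prints 64
    }
}
```

Because the error shrinks with the square root of the sample size, halving the error requires roughly four times as many sample entries.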
 
This transform approximates the number of elements in a set by computing the top sampleSize hash values and using them to extrapolate the size of the entire set of hash
 values, assuming the rest of the hash values are as densely distributed as the top sampleSize.
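
The idea described above can be sketched in plain Java without any Beam dependency. The following is an illustrative estimator, not Beam's implementation: the class name TopHashEstimator, the FNV-1a hash with a MurmurHash3-style finalizer, and the (sampleSize - 1) / density extrapolation are all assumptions chosen for the example. It keeps the largest sampleSize distinct 64-bit hashes and extrapolates from how much of the hash space they cover.

```java
import java.nio.charset.StandardCharsets;
import java.util.TreeSet;

final class TopHashEstimator {
    private final int sampleSize;
    // The largest distinct hash values seen so far, at most sampleSize of them.
    private final TreeSet<Long> largest = new TreeSet<>();

    TopHashEstimator(int sampleSize) {
        if (sampleSize < 16) {
            throw new IllegalArgumentException("sampleSize should be >= 16");
        }
        this.sampleSize = sampleSize;
    }

    // Hypothetical 64-bit hash: FNV-1a followed by the MurmurHash3 finalizer.
    static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }

    void add(String element) {
        long h = hash64(element);
        if (largest.size() < sampleSize) {
            largest.add(h); // duplicates are ignored by the set
        } else if (h > largest.first() && largest.add(h)) {
            largest.pollFirst(); // evict the smallest to keep sampleSize entries
        }
    }

    long estimate() {
        // Fewer than sampleSize distinct hashes: the count is exact
        // (ignoring hash collisions).
        if (largest.size() < sampleSize) {
            return largest.size();
        }
        // Fraction of the 2^64 hash space covered by the kept hashes.
        double covered =
            ((double) Long.MAX_VALUE - (double) largest.first()) / 0x1p64;
        // Assume the rest of the space is as densely populated.
        return Math.round((sampleSize - 1) / covered);
    }
}
```

Feeding every element of a collection through add and then calling estimate gives an approximate distinct count using memory proportional to sampleSize rather than to the number of distinct elements.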
 
See also globally(double).
 
Example of use:
 PCollection<String> pc = ...;
 PCollection<Long> approxNumDistinct =
     pc.apply(ApproximateUnique.<String>globally(1000));

 T - the type of the elements in the input PCollection
 sampleSize - the number of entries in the statistical sample; the higher this number, the
     more accurate the estimate will be; should be >= 16
 java.lang.IllegalArgumentException - if the sampleSize argument is too small

public static <T> ApproximateUnique.Globally<T> globally(double maximumEstimationError)
Like globally(int), but specifies the desired maximum estimation error instead of the
 sample size.
 T - the type of the elements in the input PCollection
 maximumEstimationError - the maximum estimation error, which should be in the range [0.01, 0.5]
 java.lang.IllegalArgumentException - if the maximumEstimationError argument is out of range

public static <K,V> ApproximateUnique.PerKey<K,V> perKey(int sampleSize)
Returns a PTransform that takes a PCollection<KV<K, V>> and returns a PCollection<KV<K, Long>> that contains an output element mapping each distinct key in the
 input PCollection to an estimate of the number of distinct values associated with that
 key in the input PCollection.
 See globally(int) for an explanation of the sampleSize parameter. A
 separate sampling is computed for each distinct key of the input.
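
To make the per-key output shape concrete, here is a small exact reference computation in plain Java. It is only a sketch of the quantity that perKey approximates, for example when checking results on small test inputs; the class and method names are hypothetical and not part of Beam, and it stores every distinct value per key instead of a bounded sample.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ExactDistinctPerKey {
    // Returns, for each distinct key, the exact number of distinct values
    // associated with it. perKey(...) produces an estimate of this count.
    static <K, V> Map<K, Long> distinctPerKey(List<Map.Entry<K, V>> pairs) {
        // One independent set of values per key.
        Map<K, Set<V>> valuesByKey = new HashMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            valuesByKey
                .computeIfAbsent(pair.getKey(), k -> new HashSet<>())
                .add(pair.getValue());
        }
        Map<K, Long> counts = new HashMap<>();
        for (Map.Entry<K, Set<V>> e : valuesByKey.entrySet()) {
            counts.put(e.getKey(), (long) e.getValue().size());
        }
        return counts;
    }
}
```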
 
See also perKey(double).
 
Example of use:
 PCollection<KV<Integer, String>> pc = ...;
 PCollection<KV<Integer, Long>> approxNumDistinctPerKey =
     pc.apply(ApproximateUnique.<Integer, String>perKey(1000));

 K - the type of the keys in the input and output PCollections
 V - the type of the values in the input PCollection
 sampleSize - the number of entries in the statistical sample; the higher this number, the
     more accurate the estimate will be; should be >= 16
 java.lang.IllegalArgumentException - if the sampleSize argument is too small

public static <K,V> ApproximateUnique.PerKey<K,V> perKey(double maximumEstimationError)
Like perKey(int), but specifies the desired maximum estimation error instead of the
 sample size.
 K - the type of the keys in the input and output PCollections
 V - the type of the values in the input PCollection
 maximumEstimationError - the maximum estimation error, which should be in the range [0.01, 0.5]
 java.lang.IllegalArgumentException - if the maximumEstimationError argument is out of range