Class ApproximateCountDistinct

java.lang.Object
org.apache.beam.sdk.extensions.zetasketch.ApproximateCountDistinct

public class ApproximateCountDistinct extends Object
PTransforms for estimating the number of distinct elements in a PCollection, or the number of distinct values associated with each key in a PCollection of KVs.

This transform is built on the HllCount implementation. Use HllCount directly if you need access to the sketches themselves.

If the element type is not one of Byte, Integer, Double, or String, supply a function that maps each element to a Long using ApproximateCountDistinct.Globally.via(org.apache.beam.sdk.transforms.ProcessFunction<T, java.lang.Long>) or ApproximateCountDistinct.PerKey.via(org.apache.beam.sdk.transforms.ProcessFunction<org.apache.beam.sdk.values.KV<K, V>, org.apache.beam.sdk.values.KV<K, java.lang.Long>>).
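The mapping function passed to via must turn each element into a Long such that equal elements yield equal values. The following is a minimal sketch of one way to do that in plain Java, assuming a hypothetical value type Foo with two fields; the class FooFingerprint, its method of, and the choice of 64-bit FNV-1a hashing are illustrative assumptions and not part of the Beam API.

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper (not part of Beam): derives a 64-bit key from a custom type.
final class FooFingerprint {

  // Hypothetical element type standing in for "Foo" in the examples.
  static final class Foo {
    final String name;
    final int version;

    Foo(String name, int version) {
      this.name = name;
      this.version = version;
    }
  }

  private static final long FNV_OFFSET_BASIS = 0xcbf29ce484222325L;
  private static final long FNV_PRIME = 0x100000001b3L;

  // 64-bit FNV-1a over a canonical encoding of the fields.
  // Equal field values always produce the same long.
  static long of(Foo foo) {
    byte[] bytes = (foo.name + '\u0000' + foo.version).getBytes(StandardCharsets.UTF_8);
    long hash = FNV_OFFSET_BASIS;
    for (byte b : bytes) {
      hash ^= (b & 0xff);
      hash *= FNV_PRIME;
    }
    return hash;
  }
}
```

Under these assumptions, such a helper could be supplied as the mapper, for example via(foo -> FooFingerprint.of(foo)). A 64-bit key leaves less room for accidental collisions than the 32-bit value returned by hashCode().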

Examples

Example 1: Approximate count of a PCollection<Integer> with a specified precision


 p.apply("Int", Create.of(ints)).apply("IntHLL", ApproximateCountDistinct.<Integer>globally()
   .withPercision(PRECISION));
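
As a rough guide to choosing PRECISION, the sketch below computes the textbook HyperLogLog relative standard error, 1.04 / sqrt(m) with m = 2^precision registers. This is an assumption based on the classic HyperLogLog analysis, not a documented guarantee of HllCount; the class and method names are hypothetical.

```java
// Illustrative helper (not part of Beam): textbook HyperLogLog error estimate.
final class HllErrorEstimate {

  // Number of registers a HyperLogLog sketch keeps at the given precision.
  static long registers(int precision) {
    if (precision < 1 || precision > 30) {
      throw new IllegalArgumentException("precision out of range for this sketch: " + precision);
    }
    return 1L << precision;
  }

  // Classic relative standard error: 1.04 / sqrt(number of registers).
  static double relativeStandardError(int precision) {
    return 1.04 / Math.sqrt((double) registers(precision));
  }

  public static void main(String[] args) {
    // Precision 10 gives 1024 registers, so the estimate is 1.04 / 32 = 0.0325.
    System.out.println(relativeStandardError(10));
  }
}
```

Each increment of the precision doubles the number of registers and shrinks the estimated error by a factor of about 1.41, so higher precision trades memory for accuracy.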

 

Example 2: Approximate count per key of a PCollection<KV<Integer,Long>>


 PCollection<KV<Integer, Long>> result =
   p.apply("Long", Create.of(longs)).apply("LongHLL", ApproximateCountDistinct.perKey());

 

Example 3: Approximate count per key of a PCollection<KV<Integer,Foo>> using a mapping function


 // foos is a placeholder for an Iterable<KV<Integer, Foo>>; the output holds one Long count per key.
 PCollection<KV<Integer, Long>> approxResultInteger =
   p.apply("Int", Create.of(foos))
     .apply("IntHLL", ApproximateCountDistinct.<Integer, Foo>perKey()
       .via(kv -> KV.of(kv.getKey(), (long) kv.getValue().hashCode())));