java.lang.Object
org.apache.beam.sdk.extensions.zetasketch.HllCount

public final class HllCount extends Object
PTransforms to compute HyperLogLogPlusPlus (HLL++) sketches on data streams based on the ZetaSketch implementation.

HLL++ is an algorithm implemented by Google that estimates the count of distinct elements in a data stream. It requires significantly less memory than exact computation, whose memory use grows linearly with the number of distinct elements, at the cost of a small error. Because sketches can be merged, cardinalities of arbitrary breakdowns can be computed from HLL++ sketches. See this published paper for details about the algorithm.
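
To make the idea of a mergeable sketch concrete, the following is a minimal sketch of a plain HyperLogLog estimator in standard Java. It is an illustration only: it is not ZetaSketch, it omits the bias corrections and sparse representation that distinguish HLL++, and its byte layout is unrelated to the sketches HllCount produces. The class name ToyHll, its methods, and the choice of the SplitMix64 mixer as the hash are all assumptions made for this example.

```java
/**
 * Toy HyperLogLog sketch for illustration only. This is NOT ZetaSketch and
 * NOT the HLL++ variant used by HllCount; all names here are hypothetical.
 */
final class ToyHll {
    private final int precision;    // p: number of hash bits used as the register index
    private final byte[] registers; // m = 2^p registers, each holding the largest rank seen

    ToyHll(int precision) {
        // The alpha constant used in estimate() is only valid for m >= 128 (p >= 7).
        if (precision < 7 || precision > 16) {
            throw new IllegalArgumentException("precision must be in [7, 16]");
        }
        this.precision = precision;
        this.registers = new byte[1 << precision];
    }

    /** Stand-in 64-bit hash (assumption): the SplitMix64 mixing function. */
    private static long hash(long value) {
        long z = value + 0x9e3779b97f4a7c15L;
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    /** Records one element: the top p hash bits pick a register, the rest give a rank. */
    void add(long value) {
        long h = hash(value);
        int index = (int) (h >>> (64 - precision));
        long rest = h << precision; // remaining bits, left-aligned
        int rank = Math.min(Long.numberOfLeadingZeros(rest), 64 - precision) + 1;
        if (rank > registers[index]) {
            registers[index] = (byte) rank;
        }
    }

    /** Merges another sketch into this one by taking the register-wise maximum. */
    void merge(ToyHll other) {
        if (other.precision != precision) {
            throw new IllegalArgumentException("precision mismatch");
        }
        for (int i = 0; i < registers.length; i++) {
            if (other.registers[i] > registers[i]) {
                registers[i] = other.registers[i];
            }
        }
    }

    /** Estimates the number of distinct elements added so far. */
    double estimate() {
        int m = registers.length;
        double sum = 0.0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2.0, -r);
            if (r == 0) {
                zeros++;
            }
        }
        double alpha = 0.7213 / (1.0 + 1.079 / m);
        double raw = alpha * m * m / sum;
        // Small-range correction: fall back to linear counting while registers are empty.
        if (raw <= 2.5 * m && zeros > 0) {
            return m * Math.log((double) m / zeros);
        }
        return raw;
    }

    public static void main(String[] args) {
        ToyHll a = new ToyHll(12);
        ToyHll b = new ToyHll(12);
        ToyHll direct = new ToyHll(12);
        // Two overlapping partitions whose union holds 100,000 distinct values.
        for (long v = 0; v < 60_000; v++) { a.add(v); direct.add(v); }
        for (long v = 40_000; v < 100_000; v++) { b.add(v); direct.add(v); }
        a.merge(b);
        System.out.println("merged estimate: " + a.estimate());
        System.out.println("direct estimate: " + direct.estimate());
    }
}
```

Because merging takes the register-wise maximum, the merged sketch is identical to one built directly over the union of the inputs, so overlapping elements are not double-counted. This is the property that lets per-partition sketches be combined into cardinalities for arbitrary breakdowns. Memory use is fixed at 2^p registers regardless of how many elements are added.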

HLL++ functions are also supported in Google Cloud BigQuery. The HllCount PTransforms provided here produce and consume sketches compatible with BigQuery.

For detailed design of this class, see https://s.apache.org/hll-in-beam.

Examples

Example 1: Create a long-type sketch for a PCollection<Long> and specify the precision


 PCollection<Long> input = ...;
 int p = ...;
 PCollection<byte[]> sketch = input.apply(HllCount.Init.forLongs().withPrecision(p).globally());
 

Example 2: Create a bytes-type sketch per key for a PCollection<KV<String, byte[]>>


 PCollection<KV<String, byte[]>> input = ...;
 PCollection<KV<String, byte[]>> sketch = input.apply(HllCount.Init.forBytes().perKey());
 

Example 3: Merge existing sketches in a PCollection<byte[]> into a new one


 PCollection<byte[]> sketches = ...;
 PCollection<byte[]> mergedSketch = sketches.apply(HllCount.MergePartial.globally());
 

Example 4: Estimate the count of distinct elements in a PCollection<String>


 PCollection<String> input = ...;
 PCollection<Long> countDistinct =
     input.apply(HllCount.Init.forStrings().globally()).apply(HllCount.Extract.globally());
 
Note: Currently HllCount does not work on FnAPI workers. See Issue #19698.