Class HllCount
java.lang.Object
org.apache.beam.sdk.extensions.zetasketch.HllCount
PTransform
s to compute HyperLogLogPlusPlus (HLL++) sketches on data streams based on the
ZetaSketch implementation.
HLL++ is an algorithm implemented by Google that estimates the count of distinct elements in a data stream. HLL++ requires significantly less memory than the linear memory needed for exact computation, at the cost of a small error. Cardinalities of arbitrary breakdowns can be computed using the HLL++ sketch. See this published paper for details about the algorithm.
HLL++ functions are also supported in Google Cloud
BigQuery. The HllCount PTransform
s provided here produce and consume sketches
compatible with BigQuery.
For detailed design of this class, see https://s.apache.org/hll-in-beam.
Examples
Example 1: Create long-type sketch for a PCollection<Long>
and specify precision
PCollection<Long> input = ...;
int p = ...;
PCollection<byte[]> sketch = input.apply(HllCount.Init.forLongs().withPrecision(p).globally());
Example 2: Create bytes-type sketch for a PCollection<KV<String, byte[]>>
PCollection<KV<String, byte[]>> input = ...;
PCollection<KV<String, byte[]>> sketch = input.apply(HllCount.Init.forBytes().perKey());
Example 3: Merge existing sketches in a PCollection<byte[]>
into a new one
PCollection<byte[]> sketches = ...;
PCollection<byte[]> mergedSketch = sketches.apply(HllCount.MergePartial.globally());
Example 4: Estimates the count of distinct elements in a PCollection<String>
PCollection<String> input = ...;
PCollection<Long> countDistinct =
input.apply(HllCount.Init.forStrings().globally()).apply(HllCount.Extract.globally());
Note: Currently HllCount does not work on FnAPI workers. See Issue #19698.-
Nested Class Summary
Nested ClassesModifier and TypeClassDescriptionstatic final class
ProvidesPTransform
s to extract the estimated count of distinct elements (asLong
s) from each HLL++ sketch.static final class
ProvidesPTransform
s to aggregate inputs into HLL++ sketches.static final class
ProvidesPTransform
s to merge HLL++ sketches into a new sketch. -
Field Summary
FieldsModifier and TypeFieldDescriptionstatic final int
The defaultprecision
value used inHllCount.Init.Builder.withPrecision(int)
is 15.static final int
The maximumprecision
value you can set inHllCount.Init.Builder.withPrecision(int)
is 24.static final int
The minimumprecision
value you can set inHllCount.Init.Builder.withPrecision(int)
is 10. -
Method Summary
Modifier and TypeMethodDescriptionstatic byte[]
Converts the passed-in sketch fromByteBuffer
tobyte[]
, mappingnull ByteBuffer
s (representing empty sketches) to emptybyte[]
s.
-
Field Details
-
MINIMUM_PRECISION
public static final int MINIMUM_PRECISIONThe minimumprecision
value you can set inHllCount.Init.Builder.withPrecision(int)
is 10.- See Also:
-
MAXIMUM_PRECISION
public static final int MAXIMUM_PRECISIONThe maximumprecision
value you can set inHllCount.Init.Builder.withPrecision(int)
is 24.- See Also:
-
DEFAULT_PRECISION
public static final int DEFAULT_PRECISIONThe defaultprecision
value used inHllCount.Init.Builder.withPrecision(int)
is 15.- See Also:
-
-
Method Details
-
getSketchFromByteBuffer
Converts the passed-in sketch fromByteBuffer
tobyte[]
, mappingnull ByteBuffer
s (representing empty sketches) to emptybyte[]
s.Utility method to convert sketches materialized with ZetaSQL/BigQuery to valid inputs for Beam
HllCount
transforms.
-