public class ApproximateUnique
extends java.lang.Object

PTransforms for estimating the number of distinct elements in a PCollection, or the number of distinct values associated with each key in a PCollection of KVs.

Modifier and Type | Class and Description |
---|---|
static class | ApproximateUnique.ApproximateUniqueCombineFn<T> CombineFn that computes an estimate of the number of distinct values that were combined. |

Constructor and Description |
---|
ApproximateUnique() |

Modifier and Type | Method and Description |
---|---|
static <T> org.apache.beam.sdk.transforms.ApproximateUnique.Globally<T> | globally(double maximumEstimationError) Like globally(int), but specifies the desired maximum estimation error instead of the sample size. |
static <T> org.apache.beam.sdk.transforms.ApproximateUnique.Globally<T> | globally(int sampleSize) Returns a PTransform that takes a PCollection<T> and returns a PCollection<Long> containing a single value that is an estimate of the number of distinct elements in the input PCollection. |
static <K,V> org.apache.beam.sdk.transforms.ApproximateUnique.PerKey<K,V> | perKey(double maximumEstimationError) Like perKey(int), but specifies the desired maximum estimation error instead of the sample size. |
static <K,V> org.apache.beam.sdk.transforms.ApproximateUnique.PerKey<K,V> | perKey(int sampleSize) Returns a PTransform that takes a PCollection<KV<K, V>> and returns a PCollection<KV<K, Long>> that contains an output element mapping each distinct key in the input PCollection to an estimate of the number of distinct values associated with that key in the input PCollection. |

public static <T> org.apache.beam.sdk.transforms.ApproximateUnique.Globally<T> globally(int sampleSize)

Returns a PTransform that takes a PCollection<T> and returns a PCollection<Long> containing a single value that is an estimate of the number of distinct elements in the input PCollection.

The sampleSize parameter controls the estimation error. The error is about 2 / sqrt(sampleSize), so for ApproximateUnique.globally(10000) the estimation error is about 2%. Similarly, for ApproximateUnique.globally(16) the estimation error is about 50%. If there are fewer than sampleSize distinct elements then the returned result will be exact with extremely high probability (the chance of a hash collision is about sampleSize^2 / 2^65).
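
The arithmetic behind these figures can be checked with a few lines of Java. This is a minimal sketch: the class and method names (ErrorEstimate, approximateError, sampleSizeFor) are hypothetical helpers and not part of the Beam API. The inverse direction simply solves error = 2 / sqrt(sampleSize) for sampleSize, which is an assumption about how a desired error could map to a sample size, not a statement of what the library does internally.

```java
public class ErrorEstimate {
    // Approximate relative error for a given sample size, using the
    // rule of thumb from the text: error ~= 2 / sqrt(sampleSize).
    public static double approximateError(int sampleSize) {
        return 2.0 / Math.sqrt(sampleSize);
    }

    // Hypothetical inverse (an assumption, not the library's code):
    // the smallest sample size whose approximate error does not exceed
    // the requested error, i.e. ceil(4 / error^2).
    public static long sampleSizeFor(double error) {
        return (long) Math.ceil(4.0 / (error * error));
    }

    public static void main(String[] args) {
        System.out.println(approximateError(10000)); // 0.02
        System.out.println(approximateError(16));    // 0.5
        System.out.println(sampleSizeFor(0.5));      // 16
    }
}
```

Under this rule of thumb, halving the error requires roughly four times the sample size.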

This transform approximates the number of elements in a set by computing the top sampleSize hash values, and using that to extrapolate the size of the entire set of hash values by assuming the rest of the hash values are as densely distributed as the top sampleSize.
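
The idea described above can be sketched in plain Java without any Beam dependency. Everything here is illustrative: TopHashEstimator and its methods are hypothetical names, mix64 (the SplitMix64 finalizer) stands in for whatever hash the transform applies to elements, and the extrapolation is a simplified linear one, so it may differ from the formula the library actually uses.

```java
import java.util.TreeSet;

public class TopHashEstimator {
    private final int sampleSize;
    // The largest distinct hash values seen so far (at most sampleSize of them).
    private final TreeSet<Long> sample = new TreeSet<>();

    public TopHashEstimator(int sampleSize) {
        if (sampleSize < 16) {
            throw new IllegalArgumentException("sampleSize should be >= 16");
        }
        this.sampleSize = sampleSize;
    }

    // 64-bit mixing function (SplitMix64 finalizer); an assumed stand-in hash.
    public static long mix64(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    public void add(long element) {
        long hash = mix64(element);
        if (sample.size() < sampleSize) {
            sample.add(hash); // TreeSet ignores duplicate hashes
        } else if (hash > sample.first() && sample.add(hash)) {
            sample.pollFirst(); // evict the smallest to keep only the top sampleSize
        }
    }

    public long estimate() {
        if (sample.size() < sampleSize) {
            return sample.size(); // fewer distinct hashes than sampleSize: count them
        }
        long smallest = sample.first();
        // Fraction of the 64-bit hash space lying between the smallest kept
        // hash and Long.MAX_VALUE.
        double covered = ((double) Long.MAX_VALUE - (double) smallest) / Math.pow(2, 64);
        // Assume the rest of the hash space is as densely populated as the sample.
        return Math.round(sampleSize / covered);
    }
}
```

Feeding the estimator fewer than sampleSize distinct values returns their count directly; beyond that, the result is an extrapolation whose accuracy improves as sampleSize grows.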

See also globally(double).

Example of use:
PCollection<String> pc = ...;
PCollection<Long> approxNumDistinct =
pc.apply(ApproximateUnique.<String>globally(1000));

T - the type of the elements in the input PCollection
sampleSize - the number of entries in the statistical sample; the higher this number, the more accurate the estimate will be; should be >= 16
java.lang.IllegalArgumentException - if the sampleSize argument is too small

public static <T> org.apache.beam.sdk.transforms.ApproximateUnique.Globally<T> globally(double maximumEstimationError)

Like globally(int), but specifies the desired maximum estimation error instead of the sample size.

T - the type of the elements in the input PCollection
maximumEstimationError - the maximum estimation error, which should be in the range [0.01, 0.5]
java.lang.IllegalArgumentException - if the maximumEstimationError argument is out of range

public static <K,V> org.apache.beam.sdk.transforms.ApproximateUnique.PerKey<K,V> perKey(int sampleSize)

Returns a PTransform that takes a PCollection<KV<K, V>> and returns a PCollection<KV<K, Long>> that contains an output element mapping each distinct key in the input PCollection to an estimate of the number of distinct values associated with that key in the input PCollection.

See globally(int) for an explanation of the sampleSize parameter. A separate sampling is computed for each distinct key of the input.
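
To make "a separate sampling for each distinct key" concrete, the following plain-Java sketch keeps one independent sample of top hash values per key. It is an illustration under assumptions only: PerKeySamples, add, and estimates are hypothetical names, mix64 is an assumed stand-in hash, and the linear extrapolation is a simplification that may not match the library's own computation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

public class PerKeySamples<K> {
    private final int sampleSize;
    // One independent sample of the largest value hashes per distinct key.
    private final Map<K, TreeSet<Long>> samples = new HashMap<>();

    public PerKeySamples(int sampleSize) {
        this.sampleSize = sampleSize;
    }

    // Assumed stand-in hash (SplitMix64 finalizer), not the library's hash.
    private static long mix64(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    public void add(K key, long value) {
        TreeSet<Long> sample = samples.computeIfAbsent(key, k -> new TreeSet<>());
        sample.add(mix64(value));
        if (sample.size() > sampleSize) {
            sample.pollFirst(); // drop the smallest hash to stay at sampleSize
        }
    }

    // Maps each distinct key to an estimate of its number of distinct values.
    public Map<K, Long> estimates() {
        Map<K, Long> result = new HashMap<>();
        for (Map.Entry<K, TreeSet<Long>> entry : samples.entrySet()) {
            result.put(entry.getKey(), estimate(entry.getValue()));
        }
        return result;
    }

    private long estimate(TreeSet<Long> sample) {
        if (sample.size() < sampleSize) {
            return sample.size(); // sample not full: count the distinct hashes
        }
        long smallest = sample.first();
        double covered = ((double) Long.MAX_VALUE - (double) smallest) / Math.pow(2, 64);
        return Math.round(sampleSize / covered);
    }
}
```

Because each key owns its sample, a key with many distinct values does not crowd out the estimate for a key with few.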

See also perKey(double).

Example of use:
PCollection<KV<Integer, String>> pc = ...;
PCollection<KV<Integer, Long>> approxNumDistinctPerKey =
pc.apply(ApproximateUnique.<Integer, String>perKey(1000));

K - the type of the keys in the input and output PCollections
V - the type of the values in the input PCollection
sampleSize - the number of entries in the statistical sample; the higher this number, the more accurate the estimate will be; should be >= 16
java.lang.IllegalArgumentException - if the sampleSize argument is too small

public static <K,V> org.apache.beam.sdk.transforms.ApproximateUnique.PerKey<K,V> perKey(double maximumEstimationError)

Like perKey(int), but specifies the desired maximum estimation error instead of the sample size.

K - the type of the keys in the input and output PCollections
V - the type of the values in the input PCollection
maximumEstimationError - the maximum estimation error, which should be in the range [0.01, 0.5]
java.lang.IllegalArgumentException - if the maximumEstimationError argument is out of range