public final class TDigestQuantiles
extends java.lang.Object
PTransform
s for getting information about quantiles in a stream.
This class uses the T-Digest structure introduced by Ted Dunning, and more precisely the
MergingDigest
implementation.
The paper and implementation are available on Ted Dunning's Github profile
Only one parameter can be tuned in order to control the tradeoff between the estimation
accuracy and the memory use.
Stream elements are compressed into a linked list of centroids. The compression factor cf
is used to limit the number of elements represented by each centroid as well as the total
number of centroids.
The relative error will always be a small fraction of 1% for values at extreme quantiles and
always be less than 3/cf at middle quantiles.
By default the compression factor is set to 100, which guarantees a relative error less than 3%.
There are 2 ways of using this class:
PTransform
s that return a PCollection
which contains a MergingDigest
for querying the value at a given quantile or the approximate quantile
position of an element.
TDigestQuantiles.TDigestQuantilesFn
CombineFn
that is exposed in order to make
advanced processing involving the MergingDigest
.
The simplest use is to call the globally()
or perKey()
method in order to
retrieve the digest, and then to query the structure.
PCollection<Double> pc = ...;
PCollection<MergingDigest> countMinSketch = pc.apply(TDigestQuantiles
.globally()); // .perKey()
One can tune the compression factor cf
in order to control accuracy and memory.
This tuning works exactly the same for globally()
and perKey()
.
double cf = 500;
PCollection<Double> pc = ...;
PCollection<MergingDigest> countMinSketch = pc.apply(TDigestQuantiles
.globally() // .perKey()
.withCompression(cf);
This example shows how to query the resulting structure, for example to build PCollection
of KV
s with each pair corresponding to a couple (quantile, value).
PCollection<MergingDigest> pc = ...;
PCollection<KV<Double, Double>> quantiles = pc.apply(ParDo.of(
new DoFn<MergingDigest, KV<Double, Double>>() {
@ProcessElement
public void processElement(ProcessContext c) {
double[] quantiles = {0.01, 0.25, 0.5, 0.75, 0.99}
for (double q : quantiles) {
c.output(KV.of(q, c.element().quantile(q));
}
}}));
One can also retrieve the approximate quantile position of a given element in the stream using
cdf(double)
method instead of quantile(double)
.
The CombineFn
does the same thing as the PTransform
s but it can be used for
doing stateful processing or in CombineFns.ComposedCombineFn
.
This example is not really interesting but it shows how one can properly create a TDigestQuantiles.TDigestQuantilesFn
.
double cf = 250;
PCollection<Double> input = ...;
PCollection<MergingDigest> output = input.apply(Combine
.globally(TDigestQuantilesFn.create(cf)));
Modifier and Type | Class and Description |
---|---|
static class |
TDigestQuantiles.GlobalDigest
Implementation of
globally() . |
static class |
TDigestQuantiles.PerKeyDigest<K>
Implementation of
perKey() . |
static class |
TDigestQuantiles.TDigestQuantilesFn
Implements the
Combine.CombineFn of TDigestQuantiles transforms. |
Constructor and Description |
---|
TDigestQuantiles() |
Modifier and Type | Method and Description |
---|---|
static TDigestQuantiles.GlobalDigest |
globally()
Compute the stream in order to build a T-Digest structure (MergingDigest) for keeping track of
the stream distribution and returns a
PCollection<MergingDigest> . |
static <K> TDigestQuantiles.PerKeyDigest<K> |
perKey()
Like
globally() , but builds a digest for each key in the stream. |
public static TDigestQuantiles.GlobalDigest globally()
PCollection<MergingDigest>
. public static <K> TDigestQuantiles.PerKeyDigest<K> perKey()
globally()
, but builds a digest for each key in the stream.K
- the type of the keys