Class ApproximateQuantiles.ApproximateQuantilesCombineFn<T,ComparatorT extends Comparator<T> & Serializable>
- Type Parameters:
T
- the type of the values being combined
- All Implemented Interfaces:
Serializable
,CombineFnBase.GlobalCombineFn<T,
,org.apache.beam.sdk.transforms.ApproximateQuantiles.QuantileState<T, ComparatorT>, List<T>> HasDisplayData
- Enclosing class:
ApproximateQuantiles
ApproximateQuantilesCombineFn
combiner gives an idea of the distribution of a
collection of values using approximate N
-tiles. The output of this combiner is a
List
of size numQuantiles
, containing the input values' minimum value,
numQuantiles-2
intermediate values, and maximum value, in sorted order, so for traditional
N
-tiles, one should use ApproximateQuantilesCombineFn#create(N+1)
.
If there are fewer values to combine than numQuantiles
, then the result List
will contain all the values being combined, in sorted order.
Values are ordered using either a specified Comparator
or the values' natural
ordering.
To evaluate the quantiles we use the "New Algorithm" described here:
[MRL98] Manku, Rajagopalan & Lindsay, "Approximate Medians and other Quantiles in One Pass and with Limited Memory", Proc. 1998 ACM SIGMOD, Vol 27, No 2, p 426-435, June 1998. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.6.6513&rep=rep1&type=pdf
The default error bound is 1 / N
, though in practice the accuracy tends to be much
better.
See create(int, Comparator, long, double)
for more information about the meaning of
epsilon
, and withEpsilon(double)
for a convenient way to adjust it.
- See Also:
-
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.beam.sdk.transforms.Combine.AccumulatingCombineFn
Combine.AccumulatingCombineFn.Accumulator<InputT,
AccumT, OutputT> -
Field Summary
FieldsModifier and TypeFieldDescriptionstatic final long
The cost (in time and space) to compute quantiles to a given accuracy is a function of the total number of elements in the data set. -
Method Summary
Modifier and TypeMethodDescriptionstatic <T extends Comparable<T>>
ApproximateQuantiles.ApproximateQuantilesCombineFn<T, Top.Natural<T>> create
(int numQuantiles) Likecreate(int, Comparator)
, but sorts values using their natural ordering.static <T,
ComparatorT extends Comparator<T> & Serializable>
ApproximateQuantiles.ApproximateQuantilesCombineFn<T, ComparatorT> create
(int numQuantiles, ComparatorT compareFn) Returns an approximate quantiles combiner with the givencompareFn
and desired number of quantiles.static <T,
ComparatorT extends Comparator<T> & Serializable>
ApproximateQuantiles.ApproximateQuantilesCombineFn<T, ComparatorT> create
(int numQuantiles, ComparatorT compareFn, long maxNumElements, double epsilon) Creates an approximate quantiles combiner with the givencompareFn
and desired number of quantiles.org.apache.beam.sdk.transforms.ApproximateQuantiles.QuantileState
<T, ComparatorT> Returns a new, mutable accumulator value, representing the accumulation of zero input values.TypeVariable
<?> Returns theTypeVariable
ofAccumT
.Coder
<org.apache.beam.sdk.transforms.ApproximateQuantiles.QuantileState<T, ComparatorT>> getAccumulatorCoder
(CoderRegistry registry, Coder<T> elementCoder) Returns theCoder
to use for accumulatorAccumT
values, or null if it is not able to be inferred.getDefaultOutputCoder
(CoderRegistry registry, Coder<T> inputCoder) Returns theCoder
to use by default for outputOutputT
values, or null if it is not able to be inferred.Returns the error message for not supported default values in Combine.globally().TypeVariable
<?> Returns theTypeVariable
ofInputT
.TypeVariable
<?> Returns theTypeVariable
ofOutputT
.void
populateDisplayData
(DisplayData.Builder builder) Register display data for the given transform or component.withEpsilon
(double epsilon) Returns anApproximateQuantilesCombineFn
that's like this one except that it uses the specifiedepsilon
value.withMaxInputSize
(long maxNumElements) Returns anApproximateQuantilesCombineFn
that's like this one except that it uses the specifiedmaxNumElements
value.Methods inherited from class org.apache.beam.sdk.transforms.Combine.AccumulatingCombineFn
addInput, extractOutput, mergeAccumulators
Methods inherited from class org.apache.beam.sdk.transforms.Combine.CombineFn
apply, compact, defaultValue, getInputType, getOutputType
-
Field Details
-
DEFAULT_MAX_NUM_ELEMENTS
public static final long DEFAULT_MAX_NUM_ELEMENTSThe cost (in time and space) to compute quantiles to a given accuracy is a function of the total number of elements in the data set. If an estimate is not known or specified, we use this as an upper bound. If this is too low, errors may exceed the requested tolerance; if too high, efficiency may be non-optimal. The impact is logarithmic with respect to this value, so this default should be fine for most uses.- See Also:
-
-
Method Details
-
create
public static <T,ComparatorT extends Comparator<T> & Serializable> ApproximateQuantiles.ApproximateQuantilesCombineFn<T,ComparatorT> create(int numQuantiles, ComparatorT compareFn) Returns an approximate quantiles combiner with the givencompareFn
and desired number of quantiles. A total ofnumQuantiles
elements will appear in the output list, including the minimum and maximum.The
Comparator
must beSerializable
.The default error bound is
1 / numQuantiles
, which holds as long as the number of elements is less thanDEFAULT_MAX_NUM_ELEMENTS
. -
create
public static <T extends Comparable<T>> ApproximateQuantiles.ApproximateQuantilesCombineFn<T,Top.Natural<T>> create(int numQuantiles) Likecreate(int, Comparator)
, but sorts values using their natural ordering. -
withEpsilon
public ApproximateQuantiles.ApproximateQuantilesCombineFn<T,ComparatorT> withEpsilon(double epsilon) Returns anApproximateQuantilesCombineFn
that's like this one except that it uses the specifiedepsilon
value. Does not modify this combiner.See
create(int, Comparator, long, double)
for more information about the meaning ofepsilon
. -
withMaxInputSize
public ApproximateQuantiles.ApproximateQuantilesCombineFn<T,ComparatorT> withMaxInputSize(long maxNumElements) Returns anApproximateQuantilesCombineFn
that's like this one except that it uses the specifiedmaxNumElements
value. Does not modify this combiner.See
create(int, Comparator, long, double)
for more information about the meaning ofmaxNumElements
. -
create
public static <T,ComparatorT extends Comparator<T> & Serializable> ApproximateQuantiles.ApproximateQuantilesCombineFn<T,ComparatorT> create(int numQuantiles, ComparatorT compareFn, long maxNumElements, double epsilon) Creates an approximate quantiles combiner with the givencompareFn
and desired number of quantiles. A total ofnumQuantiles
elements will appear in the output list, including the minimum and maximum.The
Comparator
must beSerializable
.The default error bound is
epsilon
, which holds as long as the number of elements is less thanmaxNumElements
. Specifically, if one considers the input as a sorted list x_1, ..., x_N, then the distance between the each exact quantile x_c and its approximation x_c' is bounded by|c - c'| < epsilon * N
. Note that these errors are worst-case scenarios; in practice the accuracy tends to be much better. -
createAccumulator
public org.apache.beam.sdk.transforms.ApproximateQuantiles.QuantileState<T,ComparatorT> createAccumulator()Description copied from class:Combine.CombineFn
Returns a new, mutable accumulator value, representing the accumulation of zero input values.- Specified by:
createAccumulator
in classCombine.CombineFn<T,
org.apache.beam.sdk.transforms.ApproximateQuantiles.QuantileState<T, ComparatorT extends Comparator<T> & Serializable>, List<T>>
-
getAccumulatorCoder
public Coder<org.apache.beam.sdk.transforms.ApproximateQuantiles.QuantileState<T,ComparatorT>> getAccumulatorCoder(CoderRegistry registry, Coder<T> elementCoder) Description copied from interface:CombineFnBase.GlobalCombineFn
Returns theCoder
to use for accumulatorAccumT
values, or null if it is not able to be inferred.By default, uses the knowledge of the
Coder
being used forInputT
values and the enclosingPipeline
'sCoderRegistry
to try to infer the Coder forAccumT
values.This is the Coder used to send data through a communication-intensive shuffle step, so a compact and efficient representation may have significant performance benefits.
- Specified by:
getAccumulatorCoder
in interfaceCombineFnBase.GlobalCombineFn<T,
org.apache.beam.sdk.transforms.ApproximateQuantiles.QuantileState<T, ComparatorT extends Comparator<T> & Serializable>, List<T>>
-
populateDisplayData
Register display data for the given transform or component.populateDisplayData(DisplayData.Builder)
is invoked by Pipeline runners to collect display data viaDisplayData.from(HasDisplayData)
. Implementations may callsuper.populateDisplayData(builder)
in order to register display data in the current namespace, but should otherwise usesubcomponent.populateDisplayData(builder)
to use the namespace of the subcomponent.By default, does not register any display data. Implementors may override this method to provide their own display data.
- Specified by:
populateDisplayData
in interfaceHasDisplayData
- Parameters:
builder
- The builder to populate with display data.- See Also:
-
getDefaultOutputCoder
public Coder<List<T>> getDefaultOutputCoder(CoderRegistry registry, Coder<T> inputCoder) throws CannotProvideCoderException Description copied from interface:CombineFnBase.GlobalCombineFn
Returns theCoder
to use by default for outputOutputT
values, or null if it is not able to be inferred.By default, uses the knowledge of the
Coder
being used for inputInputT
values and the enclosingPipeline
'sCoderRegistry
to try to infer the Coder forOutputT
values.- Specified by:
getDefaultOutputCoder
in interfaceCombineFnBase.GlobalCombineFn<InputT,
AccumT, OutputT> - Throws:
CannotProvideCoderException
-
getIncompatibleGlobalWindowErrorMessage
Description copied from interface:CombineFnBase.GlobalCombineFn
Returns the error message for not supported default values in Combine.globally().- Specified by:
getIncompatibleGlobalWindowErrorMessage
in interfaceCombineFnBase.GlobalCombineFn<InputT,
AccumT, OutputT>
-
getInputTVariable
Returns theTypeVariable
ofInputT
. -
getAccumTVariable
Returns theTypeVariable
ofAccumT
. -
getOutputTVariable
Returns theTypeVariable
ofOutputT
.
-