Class ApproximateQuantiles.ApproximateQuantilesCombineFn<T,ComparatorT extends Comparator<T> & Serializable>

java.lang.Object
org.apache.beam.sdk.transforms.Combine.CombineFn<T,org.apache.beam.sdk.transforms.ApproximateQuantiles.QuantileState<T,ComparatorT>,List<T>>
org.apache.beam.sdk.transforms.Combine.AccumulatingCombineFn<T,org.apache.beam.sdk.transforms.ApproximateQuantiles.QuantileState<T,ComparatorT>,List<T>>
org.apache.beam.sdk.transforms.ApproximateQuantiles.ApproximateQuantilesCombineFn<T,ComparatorT>
Type Parameters:
T - the type of the values being combined
All Implemented Interfaces:
Serializable, CombineFnBase.GlobalCombineFn<T,org.apache.beam.sdk.transforms.ApproximateQuantiles.QuantileState<T,ComparatorT>,List<T>>, HasDisplayData
Enclosing class:
ApproximateQuantiles

public static class ApproximateQuantiles.ApproximateQuantilesCombineFn<T,ComparatorT extends Comparator<T> & Serializable> extends Combine.AccumulatingCombineFn<T,org.apache.beam.sdk.transforms.ApproximateQuantiles.QuantileState<T,ComparatorT>,List<T>>
The ApproximateQuantilesCombineFn combiner gives an idea of the distribution of a collection of values using approximate N-tiles. The output of this combiner is a List of size numQuantiles, containing the input values' minimum value, numQuantiles-2 intermediate values, and maximum value, in sorted order, so for traditional N-tiles, one should use ApproximateQuantilesCombineFn#create(N+1).

If there are fewer values to combine than numQuantiles, then the result List will contain all the values being combined, in sorted order.

Values are ordered using either a specified Comparator or the values' natural ordering.

To evaluate the quantiles we use the "New Algorithm" described here:

   [MRL98] Manku, Rajagopalan & Lindsay, "Approximate Medians and other
   Quantiles in One Pass and with Limited Memory", Proc. 1998 ACM
   SIGMOD, Vol 27, No 2, p 426-435, June 1998.
   http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.6.6513&rep=rep1&type=pdf
 

The default error bound is 1 / N, though in practice the accuracy tends to be much better.

See create(int, Comparator, long, double) for more information about the meaning of epsilon, and withEpsilon(double) for a convenient way to adjust it.

See Also:
  • Field Details

    • DEFAULT_MAX_NUM_ELEMENTS

      public static final long DEFAULT_MAX_NUM_ELEMENTS
      The cost (in time and space) to compute quantiles to a given accuracy is a function of the total number of elements in the data set. If an estimate is not known or specified, we use this as an upper bound. If this is too low, errors may exceed the requested tolerance; if too high, efficiency may be non-optimal. The impact is logarithmic with respect to this value, so this default should be fine for most uses.
      See Also:
  • Method Details

    • create

      public static <T, ComparatorT extends Comparator<T> & Serializable> ApproximateQuantiles.ApproximateQuantilesCombineFn<T,ComparatorT> create(int numQuantiles, ComparatorT compareFn)
      Returns an approximate quantiles combiner with the given compareFn and desired number of quantiles. A total of numQuantiles elements will appear in the output list, including the minimum and maximum.

      The Comparator must be Serializable.

      The default error bound is 1 / numQuantiles, which holds as long as the number of elements is less than DEFAULT_MAX_NUM_ELEMENTS.

    • create

      public static <T extends Comparable<T>> ApproximateQuantiles.ApproximateQuantilesCombineFn<T,Top.Natural<T>> create(int numQuantiles)
      Like create(int, Comparator), but sorts values using their natural ordering.
    • withEpsilon

      Returns an ApproximateQuantilesCombineFn that's like this one except that it uses the specified epsilon value. Does not modify this combiner.

      See create(int, Comparator, long, double) for more information about the meaning of epsilon.

    • withMaxInputSize

      public ApproximateQuantiles.ApproximateQuantilesCombineFn<T,ComparatorT> withMaxInputSize(long maxNumElements)
      Returns an ApproximateQuantilesCombineFn that's like this one except that it uses the specified maxNumElements value. Does not modify this combiner.

      See create(int, Comparator, long, double) for more information about the meaning of maxNumElements.

    • create

      public static <T, ComparatorT extends Comparator<T> & Serializable> ApproximateQuantiles.ApproximateQuantilesCombineFn<T,ComparatorT> create(int numQuantiles, ComparatorT compareFn, long maxNumElements, double epsilon)
      Creates an approximate quantiles combiner with the given compareFn and desired number of quantiles. A total of numQuantiles elements will appear in the output list, including the minimum and maximum.

      The Comparator must be Serializable.

      The default error bound is epsilon, which holds as long as the number of elements is less than maxNumElements. Specifically, if one considers the input as a sorted list x_1, ..., x_N, then the distance between the each exact quantile x_c and its approximation x_c' is bounded by |c - c'| < epsilon * N. Note that these errors are worst-case scenarios; in practice the accuracy tends to be much better.

    • createAccumulator

      public org.apache.beam.sdk.transforms.ApproximateQuantiles.QuantileState<T,ComparatorT> createAccumulator()
      Description copied from class: Combine.CombineFn
      Returns a new, mutable accumulator value, representing the accumulation of zero input values.
      Specified by:
      createAccumulator in class Combine.CombineFn<T,org.apache.beam.sdk.transforms.ApproximateQuantiles.QuantileState<T,ComparatorT extends Comparator<T> & Serializable>,List<T>>
    • getAccumulatorCoder

      public Coder<org.apache.beam.sdk.transforms.ApproximateQuantiles.QuantileState<T,ComparatorT>> getAccumulatorCoder(CoderRegistry registry, Coder<T> elementCoder)
      Description copied from interface: CombineFnBase.GlobalCombineFn
      Returns the Coder to use for accumulator AccumT values, or null if it is not able to be inferred.

      By default, uses the knowledge of the Coder being used for InputT values and the enclosing Pipeline's CoderRegistry to try to infer the Coder for AccumT values.

      This is the Coder used to send data through a communication-intensive shuffle step, so a compact and efficient representation may have significant performance benefits.

      Specified by:
      getAccumulatorCoder in interface CombineFnBase.GlobalCombineFn<T,org.apache.beam.sdk.transforms.ApproximateQuantiles.QuantileState<T,ComparatorT extends Comparator<T> & Serializable>,List<T>>
    • populateDisplayData

      public void populateDisplayData(DisplayData.Builder builder)
      Register display data for the given transform or component.

      populateDisplayData(DisplayData.Builder) is invoked by Pipeline runners to collect display data via DisplayData.from(HasDisplayData). Implementations may call super.populateDisplayData(builder) in order to register display data in the current namespace, but should otherwise use subcomponent.populateDisplayData(builder) to use the namespace of the subcomponent.

      By default, does not register any display data. Implementors may override this method to provide their own display data.

      Specified by:
      populateDisplayData in interface HasDisplayData
      Parameters:
      builder - The builder to populate with display data.
      See Also:
    • getDefaultOutputCoder

      public Coder<List<T>> getDefaultOutputCoder(CoderRegistry registry, Coder<T> inputCoder) throws CannotProvideCoderException
      Description copied from interface: CombineFnBase.GlobalCombineFn
      Returns the Coder to use by default for output OutputT values, or null if it is not able to be inferred.

      By default, uses the knowledge of the Coder being used for input InputT values and the enclosing Pipeline's CoderRegistry to try to infer the Coder for OutputT values.

      Specified by:
      getDefaultOutputCoder in interface CombineFnBase.GlobalCombineFn<InputT,AccumT,OutputT>
      Throws:
      CannotProvideCoderException
    • getIncompatibleGlobalWindowErrorMessage

      public String getIncompatibleGlobalWindowErrorMessage()
      Description copied from interface: CombineFnBase.GlobalCombineFn
      Returns the error message for not supported default values in Combine.globally().
      Specified by:
      getIncompatibleGlobalWindowErrorMessage in interface CombineFnBase.GlobalCombineFn<InputT,AccumT,OutputT>
    • getInputTVariable

      public TypeVariable<?> getInputTVariable()
      Returns the TypeVariable of InputT.
    • getAccumTVariable

      public TypeVariable<?> getAccumTVariable()
      Returns the TypeVariable of AccumT.
    • getOutputTVariable

      public TypeVariable<?> getOutputTVariable()
      Returns the TypeVariable of OutputT.