Class ApproximateQuantiles

java.lang.Object
org.apache.beam.sdk.transforms.ApproximateQuantiles

public class ApproximateQuantiles extends Object
PTransforms for estimating the distribution of a PCollection's data using approximate N-tiles (e.g. quartiles, percentiles), either globally or per key.
  • Method Details

    • globally

      public static <T, ComparatorT extends Comparator<T> & Serializable> PTransform<PCollection<T>,PCollection<List<T>>> globally(int numQuantiles, ComparatorT compareFn)
      Returns a PTransform that takes a PCollection<T> and returns a PCollection<List<T>> whose single value is a List of the approximate N-tiles of the elements of the input PCollection. This gives an idea of the distribution of the input elements.

      The computed List is of size numQuantiles, and contains the input elements' minimum value, numQuantiles-2 intermediate values, and maximum value, in sorted order, using the given Comparator to order values. To compute traditional N-tiles, use ApproximateQuantiles.globally(N+1, compareFn). For example, quartiles (N = 4) need numQuantiles = 5: the minimum, the three quartile boundaries, and the maximum.

      If there are fewer input elements than numQuantiles, then the result List will contain all the input elements, in sorted order.
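      The shape of the result List described above can be illustrated without Beam. The following is a minimal sketch in plain Java, not the transform's actual algorithm: it sorts the whole input and picks evenly spaced positions, whereas the transform computes approximate values. The class name ExactQuantilesSketch, the method quantiles, the index-picking rule, and the numQuantiles >= 2 check are all assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper: exact, in-memory stand-in for the documented output shape. */
public class ExactQuantilesSketch {

  /**
   * Returns a list of at most numQuantiles elements: the minimum, numQuantiles-2
   * intermediate values, and the maximum, ordered by compareFn. If the input has
   * no more than numQuantiles elements, all of them are returned in sorted order.
   */
  public static <T> List<T> quantiles(
      List<T> input, int numQuantiles, Comparator<? super T> compareFn) {
    // Assumption of this sketch: at least the minimum and maximum are requested.
    if (numQuantiles < 2) {
      throw new IllegalArgumentException("numQuantiles must be at least 2");
    }
    List<T> sorted = new ArrayList<>(input);
    sorted.sort(compareFn);
    int n = sorted.size();
    if (n <= numQuantiles) {
      return sorted;
    }
    List<T> result = new ArrayList<>(numQuantiles);
    for (int i = 0; i < numQuantiles; i++) {
      // Evenly spaced positions from index 0 (minimum) to index n-1 (maximum).
      long index = Math.round((double) i * (n - 1) / (numQuantiles - 1));
      result.add(sorted.get((int) index));
    }
    return result;
  }
}
```

      With the integers 0 through 100 as input and numQuantiles = 5, this sketch returns [0, 25, 50, 75, 100], i.e. the minimum, the three quartile boundaries, and the maximum. With the three-element input [3, 1, 2] and numQuantiles = 5, it returns all elements sorted: [1, 2, 3].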

      The argument Comparator must be Serializable.
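      The bound ComparatorT extends Comparator<T> & Serializable means the comparator's static type must implement both interfaces. One possible way to obtain such a value in plain Java is sketched below. The names SerializableComparatorSketch, SerializableOrdering, STRING_COMPARE_FN, and isJavaSerializable are hypothetical and exist only for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Comparator;

/** Hypothetical illustration of a comparator that is also Serializable. */
public class SerializableComparatorSketch {

  /** Local helper interface (an assumption of this sketch) combining both bounds. */
  public interface SerializableOrdering<T> extends Comparator<T>, Serializable {}

  /** A method reference targeting a Serializable functional interface is serializable. */
  public static final SerializableOrdering<String> STRING_COMPARE_FN = String::compareTo;

  /** Returns true if the object can be written with standard Java serialization. */
  public static boolean isJavaSerializable(Object o) {
    try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
      out.writeObject(o);
      return true;
    } catch (IOException e) {
      // NotSerializableException is a subclass of IOException.
      return false;
    }
  }
}
```

      A lambda assigned to a plain Comparator<String> variable does not satisfy the bound and is not serializable, which is why the helper interface (or an equivalent intersection cast) is used here.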

      Example of use:

      
       PCollection<String> pc = ...;
       PCollection<List<String>> quantiles =
           pc.apply(ApproximateQuantiles.globally(11, stringCompareFn));
       
      Type Parameters:
      T - the type of the elements in the input PCollection
      Parameters:
      numQuantiles - the number of elements in the resulting quantile values List
      compareFn - the function to use to order the elements
    • globally

      public static <T extends Comparable<T>> PTransform<PCollection<T>,PCollection<List<T>>> globally(int numQuantiles)
      Like globally(int, Comparator), but sorts using the elements' natural ordering.
      Type Parameters:
      T - the type of the elements in the input PCollection
      Parameters:
      numQuantiles - the number of elements in the resulting quantile values List
    • perKey

      public static <K, V, ComparatorT extends Comparator<V> & Serializable> PTransform<PCollection<KV<K,V>>,PCollection<KV<K,List<V>>>> perKey(int numQuantiles, ComparatorT compareFn)
      Returns a PTransform that takes a PCollection<KV<K, V>> and returns a PCollection<KV<K, List<V>>> that contains an output element mapping each distinct key in the input PCollection to a List of the approximate N-tiles of the values associated with that key in the input PCollection. This gives an idea of the distribution of the input values for each key.

      Each of the computed Lists is of size numQuantiles, and contains the input values' minimum value, numQuantiles-2 intermediate values, and maximum value, in sorted order, using the given Comparator to order values. To compute traditional N-tiles, use ApproximateQuantiles.perKey(N+1, compareFn).

      If a key has fewer than numQuantiles values associated with it, then that key's output List will contain all the key's input values, in sorted order.
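      The per-key behavior can be pictured as grouping values by key and then computing a quantile List for each group independently. The sketch below does this exactly and in memory with plain Java; it is an illustration under stated assumptions, not the transform's approximate algorithm. The class PerKeyQuantilesSketch, its methods, the use of Map.Entry in place of KV, and the index-picking rule are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: exact, in-memory stand-in for the per-key output shape. */
public class PerKeyQuantilesSketch {

  /** Maps each distinct key to a sorted list of at most numQuantiles of its values. */
  public static <K, V> Map<K, List<V>> quantilesPerKey(
      List<Map.Entry<K, V>> input, int numQuantiles, Comparator<? super V> compareFn) {
    // Assumption of this sketch: at least the minimum and maximum are requested.
    if (numQuantiles < 2) {
      throw new IllegalArgumentException("numQuantiles must be at least 2");
    }
    // Group the values by key, preserving first-seen key order.
    Map<K, List<V>> grouped = new LinkedHashMap<>();
    for (Map.Entry<K, V> kv : input) {
      grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
    }
    // Compute one quantile list per key.
    Map<K, List<V>> result = new LinkedHashMap<>();
    for (Map.Entry<K, List<V>> e : grouped.entrySet()) {
      result.put(e.getKey(), pick(e.getValue(), numQuantiles, compareFn));
    }
    return result;
  }

  /** Sorts the values and picks evenly spaced positions from minimum to maximum. */
  private static <V> List<V> pick(
      List<V> values, int numQuantiles, Comparator<? super V> compareFn) {
    List<V> sorted = new ArrayList<>(values);
    sorted.sort(compareFn);
    int n = sorted.size();
    if (n <= numQuantiles) {
      return sorted; // Few values for this key: return all of them, sorted.
    }
    List<V> result = new ArrayList<>(numQuantiles);
    for (int i = 0; i < numQuantiles; i++) {
      long index = Math.round((double) i * (n - 1) / (numQuantiles - 1));
      result.add(sorted.get((int) index));
    }
    return result;
  }
}
```

      For example, with numQuantiles = 3, a key holding the values d, a, c, b, e maps to [a, c, e], while a key holding only z and y maps to [y, z].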

      The argument Comparator must be Serializable.

      Example of use:

      
       PCollection<KV<Integer, String>> pc = ...;
       PCollection<KV<Integer, List<String>>> quantilesPerKey =
           pc.apply(ApproximateQuantiles.perKey(11, stringCompareFn));
       

      See Combine.PerKey for how this affects timestamps and windowing.

      Type Parameters:
      K - the type of the keys in the input and output PCollections
      V - the type of the values in the input PCollection
      Parameters:
      numQuantiles - the number of elements in the resulting quantile values List
      compareFn - the function to use to order the elements
    • perKey

      public static <K, V extends Comparable<V>> PTransform<PCollection<KV<K,V>>,PCollection<KV<K,List<V>>>> perKey(int numQuantiles)
      Like perKey(int, Comparator), but sorts values using their natural ordering.
      Type Parameters:
      K - the type of the keys in the input and output PCollections
      V - the type of the values in the input PCollection
      Parameters:
      numQuantiles - the number of elements in the resulting quantile values List