public class Sample
extends java.lang.Object

PTransforms for taking samples of the elements in a PCollection, or samples of the values associated with each key in a PCollection of KVs.

fixedSizeGlobally(int) and fixedSizePerKey(int) compute uniformly random samples. any(long) is faster, but provides no uniformity guarantees.

combineFn(int) can also be used manually, in combination with state and with the Combine transform.
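To make the trade-off behind any(long) concrete, the plain-Java sketch below shows one way a sampler can be fast yet non-uniform: it keeps the first elements it encounters and then stops. This is an illustration only, not Beam code and not a statement about how Beam implements any(long); the class TakeAnySketch and the method takeAny are hypothetical names introduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: TakeAnySketch and takeAny are hypothetical names, not part of Beam.
final class TakeAnySketch {

    // Keeps the first `limit` elements encountered, then stops reading the input.
    // Cheap, but elements late in the input can never be selected, so the
    // result is not a uniform sample.
    static <T> List<T> takeAny(Iterable<T> input, long limit) {
        List<T> taken = new ArrayList<>();
        for (T element : input) {
            if (taken.size() >= limit) {
                break;
            }
            taken.add(element);
        }
        return taken;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", "b", "c", "d");
        System.out.println(takeAny(input, 2));  // prints [a, b]
        System.out.println(takeAny(input, 10)); // prints [a, b, c, d]
    }
}
```

When the limit is at least the input size, every element is returned, which matches the contract documented for any(long).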
Modifier and Type | Class and Description |
---|---|
static class | Sample.FixedSizedSampleFn<T> CombineFn that computes a fixed-size sample of a collection of values. |
Constructor and Description |
---|
Sample() |
Modifier and Type | Method and Description |
---|---|
static <T> PTransform<PCollection<T>,PCollection<T>> | any(long limit) Sample#any(long) takes a PCollection<T> and a limit, and produces a new PCollection<T> containing up to limit elements of the input PCollection. |
static <T> Combine.CombineFn<T,?,java.lang.Iterable<T>> | anyCombineFn(int sampleSize) Returns a Combine.CombineFn that computes a fixed-sized, potentially non-uniform sample of its inputs. |
static <T> Combine.CombineFn<T,?,T> | anyValueCombineFn() Returns a Combine.CombineFn that computes a single, potentially non-uniform sample value of its inputs. |
static <T> Combine.CombineFn<T,?,java.lang.Iterable<T>> | combineFn(int sampleSize) Returns a Combine.CombineFn that computes a fixed-sized uniform sample of its inputs. |
static <T> PTransform<PCollection<T>,PCollection<java.lang.Iterable<T>>> | fixedSizeGlobally(int sampleSize) Returns a PTransform that takes a PCollection<T>, selects sampleSize elements, uniformly at random, and returns a PCollection<Iterable<T>> containing the selected elements. |
static <K,V> PTransform<PCollection<KV<K,V>>,PCollection<KV<K,java.lang.Iterable<V>>>> | fixedSizePerKey(int sampleSize) Returns a PTransform that takes an input PCollection<KV<K, V>> and returns a PCollection<KV<K, Iterable<V>>> that contains an output element mapping each distinct key in the input PCollection to a sample of sampleSize values associated with that key in the input PCollection, taken uniformly at random. |
public static <T> Combine.CombineFn<T,?,java.lang.Iterable<T>> combineFn(int sampleSize)
Returns a Combine.CombineFn that computes a fixed-sized uniform sample of its inputs.

public static <T> Combine.CombineFn<T,?,java.lang.Iterable<T>> anyCombineFn(int sampleSize)
Returns a Combine.CombineFn that computes a fixed-sized, potentially non-uniform sample of its inputs.

public static <T> Combine.CombineFn<T,?,T> anyValueCombineFn()
Returns a Combine.CombineFn that computes a single, potentially non-uniform sample value of its inputs.

public static <T> PTransform<PCollection<T>,PCollection<T>> any(long limit)
Sample#any(long) takes a PCollection<T> and a limit, and produces a new PCollection<T> containing up to limit elements of the input PCollection.

If limit is greater than or equal to the size of the input PCollection, then all the input's elements will be selected.

Example of use:
PCollection<String> input = ...;
PCollection<String> output = input.apply(Sample.<String>any(100));
T - the type of the elements of the input and output PCollections
limit - the number of elements to take from the input

public static <T> PTransform<PCollection<T>,PCollection<java.lang.Iterable<T>>> fixedSizeGlobally(int sampleSize)
Returns a PTransform that takes a PCollection<T>, selects sampleSize elements, uniformly at random, and returns a PCollection<Iterable<T>> containing the selected elements. If the input PCollection has fewer than sampleSize elements, then the output Iterable<T> will be all the input's elements.

All of the elements of the output PCollection should fit into the main memory of a single worker machine. This operation does not run in parallel.

Example of use:
PCollection<String> pc = ...;
PCollection<Iterable<String>> sampleOfSize10 =
pc.apply(Sample.fixedSizeGlobally(10));
T - the type of the elements
sampleSize - the number of elements to select; must be >= 0
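
The contract above (a uniform sample of exactly sampleSize elements, or the whole input when it is smaller) can be illustrated outside Beam with classic reservoir sampling (Algorithm R). The plain-Java sketch below is an assumption-laden illustration of that technique, not Beam's implementation, which may use a different algorithm; ReservoirSketch and sample are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Illustration only: ReservoirSketch and sample are hypothetical names, not part of Beam.
final class ReservoirSketch {

    // Reservoir sampling (Algorithm R): one pass over the input, keeping at most
    // sampleSize elements so that each input element is equally likely to be kept.
    static <T> List<T> sample(Iterable<T> input, int sampleSize, Random random) {
        if (sampleSize < 0) {
            throw new IllegalArgumentException("sampleSize must be >= 0");
        }
        List<T> reservoir = new ArrayList<>(sampleSize);
        int seen = 0;
        for (T element : input) {
            seen++;
            if (reservoir.size() < sampleSize) {
                // Fill phase: the first sampleSize elements are always kept.
                reservoir.add(element);
            } else {
                // Replace phase: pick a slot in [0, seen); the new element
                // survives only if the slot falls inside the reservoir.
                int slot = random.nextInt(seen);
                if (slot < sampleSize) {
                    reservoir.set(slot, element);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> input = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            input.add(i);
        }
        // Ten elements chosen uniformly at random from 0..99.
        System.out.println(sample(input, 10, new Random()));
        // Fewer inputs than sampleSize: all inputs are returned.
        System.out.println(sample(Arrays.asList(1, 2, 3), 10, new Random())); // prints [1, 2, 3]
    }
}
```

The reservoir holds at most sampleSize elements at any time, which mirrors the requirement that the output fit in the memory of a single worker.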

public static <K,V> PTransform<PCollection<KV<K,V>>,PCollection<KV<K,java.lang.Iterable<V>>>> fixedSizePerKey(int sampleSize)
Returns a PTransform that takes an input PCollection<KV<K, V>> and returns a PCollection<KV<K, Iterable<V>>> that contains an output element mapping each distinct key in the input PCollection to a sample of sampleSize values associated with that key in the input PCollection, taken uniformly at random. If a key in the input PCollection has fewer than sampleSize values associated with it, then the output Iterable<V> associated with that key will be all the values associated with that key in the input PCollection.

Example of use:
PCollection<KV<String, Integer>> pc = ...;
PCollection<KV<String, Iterable<Integer>>> sampleOfSize10PerKey =
pc.apply(Sample.<String, Integer>fixedSizePerKey(10));
K - the type of the keys
V - the type of the values
sampleSize - the number of values to select for each distinct key; must be >= 0
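
The per-key behaviour described above can be sketched in plain Java by grouping values by key and then sampling each group independently. The sketch below uses shuffle-and-truncate for the uniform sample, which is a simplification chosen for brevity and not Beam's implementation; PerKeySampleSketch and samplePerKey are hypothetical names, and Map.Entry stands in for Beam's KV.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Illustration only: PerKeySampleSketch and samplePerKey are hypothetical names, not part of Beam.
final class PerKeySampleSketch {

    // Groups values by key, then keeps a uniform sample of up to sampleSize values per key.
    static <K, V> Map<K, List<V>> samplePerKey(
            List<Map.Entry<K, V>> pairs, int sampleSize, Random random) {
        if (sampleSize < 0) {
            throw new IllegalArgumentException("sampleSize must be >= 0");
        }
        // Step 1: collect all values for each distinct key.
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        // Step 2: sample each group. A uniformly shuffled prefix is a uniform sample.
        Map<K, List<V>> result = new LinkedHashMap<>();
        for (Map.Entry<K, List<V>> entry : grouped.entrySet()) {
            List<V> values = entry.getValue();
            if (values.size() > sampleSize) {
                Collections.shuffle(values, random);
                values = new ArrayList<>(values.subList(0, sampleSize));
            }
            // Keys with fewer than sampleSize values keep all of their values.
            result.put(entry.getKey(), values);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (int i = 1; i <= 5; i++) {
            pairs.add(new SimpleEntry<>("a", i));
        }
        pairs.add(new SimpleEntry<>("b", 7));
        // Key "a" maps to two of its five values; key "b" maps to [7].
        System.out.println(samplePerKey(pairs, 2, new Random()));
    }
}
```

Unlike a streaming reservoir, this sketch materializes every value for a key before sampling, so it is only suitable for illustrating the output shape on small inputs.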