public class Sample
extends java.lang.Object
PTransforms for taking samples of the elements in a PCollection, or samples of the values associated with each key in a PCollection of KVs.
combineFn(int) can also be used manually, in combination with state and with the Combine transform.
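To use a sampling combine function with state or with Combine, its intermediate result has to be mergeable: partial samples built on different workers must combine into one valid sample. The following is a minimal plain-Java sketch of one way to get that property (random-priority sampling); it is an illustration, not a statement of how Sample is implemented. The class SampleAccumulatorSketch and its methods addInput, merge and extractOutput are hypothetical names that loosely mirror the phases of a combine function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

/** Hypothetical stand-in for a mergeable sample accumulator; not part of Beam. */
final class SampleAccumulatorSketch<T> {
  /** An element paired with the random priority drawn when it was added. */
  private static final class Tagged<T> {
    final long priority;
    final T value;

    Tagged(long priority, T value) {
      this.priority = priority;
      this.value = value;
    }
  }

  private final int sampleSize;
  private final Random random;
  // Max-heap on priority: the head is the entry to evict when the heap overflows.
  private final PriorityQueue<Tagged<T>> kept =
      new PriorityQueue<>((a, b) -> Long.compare(b.priority, a.priority));

  SampleAccumulatorSketch(int sampleSize, Random random) {
    if (sampleSize < 0) {
      throw new IllegalArgumentException("sampleSize must be >= 0");
    }
    this.sampleSize = sampleSize;
    this.random = random;
  }

  /** Adds one input, tagging it with a fresh random priority. */
  void addInput(T value) {
    offer(new Tagged<>(random.nextLong(), value));
  }

  /** Folds a different accumulator into this one, keeping the lowest priorities overall. */
  void merge(SampleAccumulatorSketch<T> other) {
    for (Tagged<T> tagged : other.kept) {
      offer(tagged);
    }
  }

  /** Returns the sampled values; their order carries no meaning. */
  List<T> extractOutput() {
    List<T> out = new ArrayList<>(kept.size());
    for (Tagged<T> tagged : kept) {
      out.add(tagged.value);
    }
    return out;
  }

  private void offer(Tagged<T> tagged) {
    kept.add(tagged);
    if (kept.size() > sampleSize) {
      kept.poll(); // evict the entry with the largest priority
    }
  }
}
```

Because every element keeps the priority it was given when added, the sampleSize lowest priorities of the merged accumulators are the same ones a single accumulator would have kept, so merging does not bias the sample.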
Modifier and Type | Class and Description |
---|---|
static class | Sample.FixedSizedSampleFn<T>: CombineFn that computes a fixed-size sample of a collection of values. |
Constructor and Description |
---|
Sample() |
Modifier and Type | Method and Description |
---|---|
static <T> PTransform<PCollection<T>,PCollection<T>> | any(long limit): Sample#any(long) takes a PCollection<T> and a limit, and produces a new PCollection<T> containing up to limit elements of the input PCollection. |
static <T> Combine.CombineFn<T,?,java.lang.Iterable<T>> | combineFn(int sampleSize): Returns a Combine.CombineFn that computes a fixed-sized sample of its inputs. |
static <T> PTransform<PCollection<T>,PCollection<java.lang.Iterable<T>>> | fixedSizeGlobally(int sampleSize): Returns a PTransform that takes a PCollection<T>, selects sampleSize elements, uniformly at random, and returns a PCollection<Iterable<T>> containing the selected elements. |
static <K,V> PTransform<PCollection<KV<K,V>>,PCollection<KV<K,java.lang.Iterable<V>>>> | fixedSizePerKey(int sampleSize): Returns a PTransform that takes an input PCollection<KV<K, V>> and returns a PCollection<KV<K, Iterable<V>>> that contains an output element mapping each distinct key in the input PCollection to a sample of sampleSize values associated with that key in the input PCollection, taken uniformly at random. |
public static <T> Combine.CombineFn<T,?,java.lang.Iterable<T>> combineFn(int sampleSize)
Returns a Combine.CombineFn that computes a fixed-sized sample of its inputs.

public static <T> PTransform<PCollection<T>,PCollection<T>> any(long limit)
Sample#any(long) takes a PCollection<T> and a limit, and produces a new PCollection<T> containing up to limit elements of the input PCollection.

If limit is greater than or equal to the size of the input PCollection, then all the input's elements will be selected.

All of the elements of the output PCollection should fit into main memory of a single worker machine. This operation does not run in parallel.
Example of use:
PCollection<String> input = ...;
PCollection<String> output = input.apply(Sample.<String>any(100));
T - the type of the elements of the input and output PCollections
limit - the number of elements to take from the input

public static <T> PTransform<PCollection<T>,PCollection<java.lang.Iterable<T>>> fixedSizeGlobally(int sampleSize)
Returns a PTransform that takes a PCollection<T>, selects sampleSize elements, uniformly at random, and returns a PCollection<Iterable<T>> containing the selected elements. If the input PCollection has fewer than sampleSize elements, then the output Iterable<T> will be all the input's elements.

All of the elements of the output PCollection should fit into main memory of a single worker machine. This operation does not run in parallel.
Example of use:
PCollection<String> pc = ...;
PCollection<Iterable<String>> sampleOfSize10 =
    pc.apply(Sample.fixedSizeGlobally(10));
T - the type of the elements
sampleSize - the number of elements to select; must be >= 0
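The contract described for fixedSizeGlobally (a uniform random sample of sampleSize elements, or every element when the input is smaller) can be illustrated without Beam. Below is a minimal sketch using reservoir sampling over a plain Iterable; the class ReservoirSketch and its sample method are hypothetical names chosen for this illustration, and the sketch makes no claim about how the transform is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical illustration of fixed-size uniform sampling; not part of Beam. */
final class ReservoirSketch {
  /**
   * Returns sampleSize elements of input chosen uniformly at random,
   * or all elements if input has fewer than sampleSize of them.
   */
  static <T> List<T> sample(Iterable<T> input, int sampleSize, Random random) {
    if (sampleSize < 0) {
      throw new IllegalArgumentException("sampleSize must be >= 0");
    }
    List<T> reservoir = new ArrayList<>();
    int seen = 0; // assumes fewer than Integer.MAX_VALUE elements
    for (T element : input) {
      seen++;
      if (reservoir.size() < sampleSize) {
        // Still filling the reservoir: keep everything.
        reservoir.add(element);
      } else if (sampleSize > 0) {
        // Keep the new element with probability sampleSize / seen,
        // overwriting a uniformly chosen slot.
        int slot = random.nextInt(seen);
        if (slot < sampleSize) {
          reservoir.set(slot, element);
        }
      }
    }
    return reservoir;
  }
}
```

The whole sample is held in one in-memory list, which mirrors the note above that the output should fit into the main memory of a single worker machine.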
public static <K,V> PTransform<PCollection<KV<K,V>>,PCollection<KV<K,java.lang.Iterable<V>>>> fixedSizePerKey(int sampleSize)
Returns a PTransform that takes an input PCollection<KV<K, V>> and returns a PCollection<KV<K, Iterable<V>>> that contains an output element mapping each distinct key in the input PCollection to a sample of sampleSize values associated with that key in the input PCollection, taken uniformly at random. If a key in the input PCollection has fewer than sampleSize values associated with it, then the output Iterable<V> associated with that key will be all the values associated with that key in the input PCollection.
Example of use:
PCollection<KV<String, Integer>> pc = ...;
PCollection<KV<String, Iterable<Integer>>> sampleOfSize10PerKey =
    pc.apply(Sample.<String, Integer>fixedSizePerKey(10));
K - the type of the keys
V - the type of the values
sampleSize - the number of values to select for each distinct key; must be >= 0
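The per-key contract can be sketched in plain Java as well: group the values by key, then take a uniform sample of at most sampleSize values from each group. The sketch below uses java.util.Map.Entry as a stand-in for KV and a shuffle-then-truncate step for the sampling. The class PerKeySampleSketch and its samplePerKey method are hypothetical names, and the sketch only illustrates the input/output shape, not how the transform is executed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hypothetical illustration of per-key fixed-size sampling; not part of Beam. */
final class PerKeySampleSketch {
  /** Maps each distinct key to a uniform random sample of up to sampleSize of its values. */
  static <K, V> Map<K, List<V>> samplePerKey(
      Iterable<Map.Entry<K, V>> input, int sampleSize, Random random) {
    if (sampleSize < 0) {
      throw new IllegalArgumentException("sampleSize must be >= 0");
    }
    // Step 1: group all values under their key.
    Map<K, List<V>> grouped = new LinkedHashMap<>();
    for (Map.Entry<K, V> kv : input) {
      grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
    }
    // Step 2: shuffle each group and keep a prefix; a key with fewer than
    // sampleSize values keeps all of them.
    Map<K, List<V>> result = new LinkedHashMap<>();
    for (Map.Entry<K, List<V>> group : grouped.entrySet()) {
      List<V> values = group.getValue();
      Collections.shuffle(values, random);
      int keep = Math.min(sampleSize, values.size());
      result.put(group.getKey(), new ArrayList<>(values.subList(0, keep)));
    }
    return result;
  }
}
```

Shuffling needs every value of a key in memory at once, so this form is only suitable for small inputs; it was chosen here because it makes the uniformity of the sample easy to see.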