Class PCollection<T>
- Type Parameters:
T
- the type of the elements of thisPCollection
PCollection<T>
is an immutable collection of values of type
T
. A PCollection
can contain either a bounded or unbounded number of elements. Bounded
and unbounded PCollections
are produced as the output of PTransforms
(including root PTransforms like Read
and Create
), and can be passed
as the inputs of other PTransforms.
Some root transforms produce bounded PCollections
and others produce unbounded ones.
For example, GenerateSequence.from(long)
with GenerateSequence.to(long)
produces a fixed set
of integers, so it produces a bounded PCollection
. GenerateSequence.from(long)
without
a GenerateSequence.to(long)
produces all integers as an infinite stream, so it produces an
unbounded PCollection
.
Each element in a PCollection
has an associated timestamp. Readers assign timestamps
to elements when they create PCollections
, and other PTransforms
propagate these timestamps from their input to their output. See the documentation
on BoundedSource.BoundedReader
and UnboundedSource.UnboundedReader
for more information on how these readers
produce timestamps and watermarks.
Additionally, a PCollection
has an associated WindowFn
and each element is
assigned to a set of windows. By default, the windowing function is GlobalWindows
and all
elements are assigned into a single default window. This default can be overridden with the
Window
PTransform
.
See the individual PTransform
subclasses for specific information on how they
propagate timestamps and windowing.
-
Nested Class Summary
Nested ClassesModifier and TypeClassDescriptionstatic enum
The enumeration of cases for whether aPCollection
is bounded. -
Method Summary
Modifier and TypeMethodDescription<OutputT extends POutput>
OutputTapply
(String name, PTransform<? super PCollection<T>, OutputT> t) Applies the givenPTransform
to this inputPCollection
, usingname
to identify this specific application of the transform.<OutputT extends POutput>
OutputTapply
(PTransform<? super PCollection<T>, OutputT> t) of thePTransform
.static <T> PCollection
<T> createPrimitiveOutputInternal
(Pipeline pipeline, WindowingStrategy<?, ?> windowingStrategy, PCollection.IsBounded isBounded, @Nullable Coder<T> coder) For internal use only; no backwards-compatibility guarantees.static <T> PCollection
<T> createPrimitiveOutputInternal
(Pipeline pipeline, WindowingStrategy<?, ?> windowingStrategy, PCollection.IsBounded isBounded, @Nullable Coder<T> coder, TupleTag<?> tag) For internal use only; no backwards-compatibility guarantees.expand()
void
finishSpecifying
(PInput input, PTransform<?, ?> transform) After building, finalizes thisPValue
to make it ready for running.void
finishSpecifyingOutput
(String transformName, PInput input, PTransform<?, ?> transform) As part of applying the producingPTransform
, finalizes this output to make it ready for being used as an input and for running.getCoder()
Returns theCoder
used by thisPCollection
to encode and decode the values stored in it.Returns the attached schema's fromRowFunction.getName()
Returns the name of thisPCollection
.Returns the attached schema.Returns the attached schema's toRowFunction.Returns aTypeDescriptor<T>
with some reflective information aboutT
, if possible.WindowingStrategy
<?, ?> Returns theWindowingStrategy
of thisPCollection
.boolean
Returns whether thisPCollection
has an attached schema.Sets theCoder
used by thisPCollection
to encode and decode the values stored in it.setIsBoundedInternal
(PCollection.IsBounded isBounded) For internal use only; no backwards-compatibility guarantees.Sets the name of thisPCollection
.setRowSchema
(Schema schema) Sets a schema on this PCollection.setSchema
(Schema schema, TypeDescriptor<T> typeDescriptor, SerializableFunction<T, Row> toRowFunction, SerializableFunction<Row, T> fromRowFunction) Sets aSchema
on thisPCollection
.setTypeDescriptor
(TypeDescriptor<T> typeDescriptor) Sets theTypeDescriptor<T>
for thisPCollection<T>
.setWindowingStrategyInternal
(WindowingStrategy<?, ?> windowingStrategy) For internal use only; no backwards-compatibility guarantees.Methods inherited from class org.apache.beam.sdk.values.PValueBase
getKindString, getPipeline, toString
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface org.apache.beam.sdk.values.PInput
getPipeline
Methods inherited from interface org.apache.beam.sdk.values.POutput
getPipeline
-
Method Details
-
finishSpecifyingOutput
Description copied from interface:POutput
As part of applying the producingPTransform
, finalizes this output to make it ready for being used as an input and for running.This includes ensuring that all
PCollections
haveCoders
specified or defaulted.Automatically invoked whenever this
POutput
is output, afterPOutput.finishSpecifyingOutput(String, PInput, PTransform)
has been called on each componentPValue
returned byPOutput.expand()
.- Specified by:
finishSpecifyingOutput
in interfacePOutput
- Overrides:
finishSpecifyingOutput
in classPValueBase
-
finishSpecifying
After building, finalizes thisPValue
to make it ready for running. Automatically invoked whenever thePValue
is "used" (e.g., when apply() is called on it) and when the Pipeline is run (useful if this is aPValue
with no consumers).- Specified by:
finishSpecifying
in interfacePValue
- Overrides:
finishSpecifying
in classPValueBase
- Parameters:
input
- thePInput
thePTransform
was applied to to produce this outputtransform
- thePTransform
that produced thisPValue
-
getTypeDescriptor
Returns aTypeDescriptor<T>
with some reflective information aboutT
, if possible. May returnnull
if no information is available. Subclasses may override this to enable betterCoder
inference. -
getName
Returns the name of thisPCollection
.By default, the name of a
PCollection
is based on the name of thePTransform
that produces it. It can be specified explicitly by callingsetName(java.lang.String)
.- Specified by:
getName
in interfacePValue
- Overrides:
getName
in classPValueBase
- Throws:
IllegalStateException
- if the name hasn't been set yet
-
expand
Description copied from interface:PValue
Expands thisPOutput
into a list of its component outputPValues
.- A
PValue
expands to itself. - A tuple or list of
PValues
(such asPCollectionTuple
orPCollectionList
) expands to its componentPValue PValues
.
Not intended to be invoked directly by user code..
- A
-
setName
Sets the name of thisPCollection
. Returnsthis
.- Overrides:
setName
in classPValueBase
- Throws:
IllegalStateException
- if thisPCollection
has already been finalized and may no longer be set. Onceapply(org.apache.beam.sdk.transforms.PTransform<? super org.apache.beam.sdk.values.PCollection<T>, OutputT>)
has been called, this will be the case.
-
getCoder
Returns theCoder
used by thisPCollection
to encode and decode the values stored in it.- Throws:
IllegalStateException
- if theCoder
hasn't been set, and couldn't be inferred.
-
setCoder
- Throws:
IllegalStateException
- if thisPCollection
has already been finalized and may no longer be set. Onceapply(org.apache.beam.sdk.transforms.PTransform<? super org.apache.beam.sdk.values.PCollection<T>, OutputT>)
has been called, this will be the case.
-
setRowSchema
Sets a schema on this PCollection.Can only be called on a
PCollection<Row>
. -
setSchema
public PCollection<T> setSchema(Schema schema, TypeDescriptor<T> typeDescriptor, SerializableFunction<T, Row> toRowFunction, SerializableFunction<Row, T> fromRowFunction) Sets aSchema
on thisPCollection
. -
hasSchema
public boolean hasSchema()Returns whether thisPCollection
has an attached schema. -
getSchema
Returns the attached schema. -
getToRowFunction
Returns the attached schema's toRowFunction. -
getFromRowFunction
Returns the attached schema's fromRowFunction. -
apply
of thePTransform
.- Returns:
- the output of the applied
PTransform
-
apply
public <OutputT extends POutput> OutputT apply(String name, PTransform<? super PCollection<T>, OutputT> t) Applies the givenPTransform
to this inputPCollection
, usingname
to identify this specific application of the transform. This name is used in various places, including the monitoring UI, logging, and to stably identify this application node in the job graph.- Returns:
- the output of the applied
PTransform
-
getWindowingStrategy
Returns theWindowingStrategy
of thisPCollection
. -
isBounded
-
setTypeDescriptor
Sets theTypeDescriptor<T>
for thisPCollection<T>
. This may allow the enclosingPCollectionTuple
,PCollectionList
, orPTransform<?, PCollection<T>>
, etc., to provide more detailed reflective information. -
setWindowingStrategyInternal
@Internal public PCollection<T> setWindowingStrategyInternal(WindowingStrategy<?, ?> windowingStrategy) For internal use only; no backwards-compatibility guarantees. -
setIsBoundedInternal
For internal use only; no backwards-compatibility guarantees. -
createPrimitiveOutputInternal
@Internal public static <T> PCollection<T> createPrimitiveOutputInternal(Pipeline pipeline, WindowingStrategy<?, ?> windowingStrategy, PCollection.IsBounded isBounded, @Nullable Coder<T> coder) For internal use only; no backwards-compatibility guarantees. -
createPrimitiveOutputInternal
@Internal public static <T> PCollection<T> createPrimitiveOutputInternal(Pipeline pipeline, WindowingStrategy<?, ?> windowingStrategy, PCollection.IsBounded isBounded, @Nullable Coder<T> coder, TupleTag<?> tag) For internal use only; no backwards-compatibility guarantees.
-