InputT
- the type of the (main) input elementsOutputT
- the type of the (main) output elementspublic abstract class DoFn<InputT,OutputT> extends java.lang.Object implements java.io.Serializable, HasDisplayData
ParDo
providing the code to use to process elements of the input PCollection
.
See ParDo
for more explanation, examples of use, and discussion of constraints on
DoFn
s, including their serializability, lack of access to global shared mutable state,
requirements for failure tolerance, and benefits of optimization.
DoFns
can be tested by using TestPipeline
. You can verify their
functional correctness in a local test using the DirectRunner
as well as running
integration tests with your production runner of choice. Typically, you can generate the input
data using Create.of(java.lang.Iterable<T>)
or other transforms. However, if you need to test the behavior of
DoFn.StartBundle
and DoFn.FinishBundle
with particular bundle boundaries, you can use
TestStream
.
Implementations must define a method annotated with DoFn.ProcessElement
that satisfies the
requirements described there. See the DoFn.ProcessElement
for details.
Example usage:
PCollection<String> lines = ... ;
PCollection<String> words =
lines.apply(ParDo.of(new DoFn<String, String>()) {
@ProcessElement
public void processElement( @Element String element, BoundedWindow window) {
...
}}));
Modifier and Type | Class and Description |
---|---|
static interface |
DoFn.AlwaysFetched
Annotation for declaring that a state parameter is always fetched.
|
static interface |
DoFn.BoundedPerElement
Annotation on a splittable
DoFn
specifying that the DoFn performs a bounded amount of work per input element, so
applying it to a bounded PCollection will produce also a bounded PCollection . |
static interface |
DoFn.BundleFinalizer
A parameter that is accessible during
@StartBundle , @ProcessElement and @FinishBundle that allows the caller
to register a callback that will be invoked after the bundle has been successfully completed
and the runner has commit the output. |
static interface |
DoFn.Element
Parameter annotation for the input element for
DoFn.ProcessElement , DoFn.GetInitialRestriction , DoFn.GetSize , DoFn.SplitRestriction , DoFn.GetInitialWatermarkEstimatorState , DoFn.NewWatermarkEstimator , and DoFn.NewTracker
methods. |
static interface |
DoFn.FieldAccess
Annotation for specifying specific fields that are accessed in a Schema PCollection.
|
static interface |
DoFn.FinishBundle
Annotation for the method to use to finish processing a batch of elements.
|
class |
DoFn.FinishBundleContext
Information accessible while within the
DoFn.FinishBundle method. |
static interface |
DoFn.GetInitialRestriction
Annotation for the method that maps an element to an initial restriction for a splittable
DoFn . |
static interface |
DoFn.GetInitialWatermarkEstimatorState
Annotation for the method that maps an element and restriction to initial watermark estimator
state for a splittable
DoFn . |
static interface |
DoFn.GetRestrictionCoder
Annotation for the method that returns the coder to use for the restriction of a splittable
DoFn . |
static interface |
DoFn.GetSize
Annotation for the method that returns the corresponding size for an element and restriction
pair.
|
static interface |
DoFn.GetWatermarkEstimatorStateCoder
Annotation for the method that returns the coder to use for the watermark estimator state of a
splittable
DoFn . |
static interface |
DoFn.Key
Parameter annotation for dereferencing input element key in
KV pair. |
static interface |
DoFn.MultiOutputReceiver
Receives tagged output for a multi-output function.
|
static interface |
DoFn.NewTracker
Annotation for the method that creates a new
RestrictionTracker for the restriction of
a splittable DoFn . |
static interface |
DoFn.NewWatermarkEstimator
Annotation for the method that creates a new
WatermarkEstimator for the watermark state
of a splittable DoFn . |
static interface |
DoFn.OnTimer
Annotation for registering a callback for a timer.
|
class |
DoFn.OnTimerContext
Information accessible when running a
DoFn.OnTimer method. |
static interface |
DoFn.OnTimerFamily
Annotation for registering a callback for a timerFamily.
|
static interface |
DoFn.OnWindowExpiration
Annotation for the method to use for performing actions on window expiration.
|
class |
DoFn.OnWindowExpirationContext |
static interface |
DoFn.OutputReceiver<T>
Receives values of the given type.
|
class |
DoFn.ProcessContext
Information accessible when running a
DoFn.ProcessElement method. |
static class |
DoFn.ProcessContinuation
When used as a return value of
DoFn.ProcessElement , indicates whether there is more work to
be done for the current element. |
static interface |
DoFn.ProcessElement
Annotation for the method to use for processing elements.
|
static interface |
DoFn.RequiresStableInput
Experimental - no backwards compatibility guarantees.
|
static interface |
DoFn.RequiresTimeSortedInput
Experimental - no backwards compatibility guarantees.
|
static interface |
DoFn.Restriction
Parameter annotation for the restriction for
DoFn.GetSize , DoFn.SplitRestriction , DoFn.GetInitialWatermarkEstimatorState , DoFn.NewWatermarkEstimator , and DoFn.NewTracker
methods. |
static interface |
DoFn.Setup
Annotation for the method to use to prepare an instance for processing bundles of elements.
|
static interface |
DoFn.SideInput
Parameter annotation for the SideInput for a
DoFn.ProcessElement method. |
static interface |
DoFn.SplitRestriction
Annotation for the method that splits restriction of a splittable
DoFn into multiple parts to
be processed in parallel. |
static interface |
DoFn.StartBundle
Annotation for the method to use to prepare an instance for processing a batch of elements.
|
class |
DoFn.StartBundleContext
Information accessible while within the
DoFn.StartBundle method. |
static interface |
DoFn.StateId
Annotation for declaring and dereferencing state cells.
|
static interface |
DoFn.Teardown
Annotation for the method to use to clean up this instance before it is discarded.
|
static interface |
DoFn.TimerFamily
Parameter annotation for the TimerMap for a
DoFn.ProcessElement method. |
static interface |
DoFn.TimerId
Annotation for declaring and dereferencing timers.
|
static interface |
DoFn.Timestamp
Parameter annotation for the input element timestamp for
DoFn.ProcessElement , DoFn.GetInitialRestriction , DoFn.GetSize , DoFn.SplitRestriction , DoFn.GetInitialWatermarkEstimatorState , DoFn.NewWatermarkEstimator , and DoFn.NewTracker
methods. |
static interface |
DoFn.UnboundedPerElement
Annotation on a splittable
DoFn
specifying that the DoFn performs an unbounded amount of work per input element, so
applying it to a bounded PCollection will produce an unbounded PCollection . |
static interface |
DoFn.WatermarkEstimatorState
Parameter annotation for the watermark estimator state for the
DoFn.NewWatermarkEstimator
method. |
class |
DoFn.WindowedContext
Information accessible to all methods in this
DoFn where the context is in some window. |
Constructor and Description |
---|
DoFn() |
Modifier and Type | Method and Description |
---|---|
Duration |
getAllowedTimestampSkew()
Deprecated.
This method permits a
DoFn to emit elements behind the watermark. These
elements are considered late, and if behind the allowed lateness of a downstream PCollection may be silently dropped. See
https://issues.apache.org/jira/browse/BEAM-644 for details on a replacement. |
TypeDescriptor<InputT> |
getInputTypeDescriptor()
Returns a
TypeDescriptor capturing what is known statically about the input type of
this DoFn instance's most-derived class. |
TypeDescriptor<OutputT> |
getOutputTypeDescriptor()
Returns a
TypeDescriptor capturing what is known statically about the output type of
this DoFn instance's most-derived class. |
void |
populateDisplayData(DisplayData.Builder builder)
Register display data for the given transform or component.
|
void |
prepareForProcessing()
Deprecated.
use
DoFn.Setup or DoFn.StartBundle instead. This method will be removed in a
future release. |
@Deprecated public Duration getAllowedTimestampSkew()
DoFn
to emit elements behind the watermark. These
elements are considered late, and if behind the allowed lateness
of a downstream PCollection
may be silently dropped. See
https://issues.apache.org/jira/browse/BEAM-644 for details on a replacement.DoFn.WindowedContext.outputWithTimestamp(OutputT, org.joda.time.Instant)
.
The default value is Duration.ZERO
, in which case timestamps can only be shifted
forward to future. For infinite skew, return Duration.millis(Long.MAX_VALUE)
.
public TypeDescriptor<InputT> getInputTypeDescriptor()
TypeDescriptor
capturing what is known statically about the input type of
this DoFn
instance's most-derived class.
See getOutputTypeDescriptor()
for more discussion.
public TypeDescriptor<OutputT> getOutputTypeDescriptor()
TypeDescriptor
capturing what is known statically about the output type of
this DoFn
instance's most-derived class.
In the normal case of a concrete DoFn
subclass with no generic type parameters of
its own (including anonymous inner classes), this will be a complete non-generic type, which is
good for choosing a default output Coder<O>
for the output PCollection<O>
.
@Deprecated public final void prepareForProcessing()
DoFn.Setup
or DoFn.StartBundle
instead. This method will be removed in a
future release.DoFn
construction to prepare for processing. This method should be called
by runners before any processing methods.public void populateDisplayData(DisplayData.Builder builder)
populateDisplayData(DisplayData.Builder)
is invoked by Pipeline runners to collect
display data via DisplayData.from(HasDisplayData)
. Implementations may call super.populateDisplayData(builder)
in order to register display data in the current namespace,
but should otherwise use subcomponent.populateDisplayData(builder)
to use the namespace
of the subcomponent.
By default, does not register any display data. Implementors may override this method to provide their own display data.
populateDisplayData
in interface HasDisplayData
builder
- The builder to populate with display data.HasDisplayData