InputT
- the type of the (main) input elementsOutputT
- the type of the (main) output elementspublic abstract class DoFn<InputT,OutputT> extends java.lang.Object implements java.io.Serializable, HasDisplayData
ParDo
providing the code to use to process elements of the input PCollection
.
See ParDo
for more explanation, examples of use, and discussion of constraints on
DoFn
s, including their serializability, lack of access to global shared mutable state,
requirements for failure tolerance, and benefits of optimization.
DoFns
can be tested by using TestPipeline
. You can verify their
functional correctness in a local test using the DirectRunner
as well as running
integration tests with your production runner of choice. Typically, you can generate the input
data using Create.of(java.lang.Iterable<T>)
or other transforms. However, if you need to test the behavior of
DoFn.StartBundle
and DoFn.FinishBundle
with particular bundle boundaries, you can use
TestStream
.
Implementations must define a method annotated with DoFn.ProcessElement
that satisfies the
requirements described there. See the DoFn.ProcessElement
for details.
Example usage:
PCollection<String> lines = ... ;
PCollection<String> words =
lines.apply(ParDo.of(new DoFn<String, String>()) {
@ProcessElement
public void processElement( @Element String element, BoundedWindow window) {
...
}}));
Modifier and Type | Class and Description |
---|---|
static interface |
DoFn.BoundedPerElement
Annotation on a splittable
DoFn
specifying that the DoFn performs a bounded amount of work per input element, so
applying it to a bounded PCollection will produce also a bounded PCollection . |
static interface |
DoFn.Element
Parameter annotation for the input element for a
DoFn.ProcessElement method. |
static interface |
DoFn.FinishBundle
Annotation for the method to use to finish processing a batch of elements.
|
class |
DoFn.FinishBundleContext
Information accessible while within the
DoFn.FinishBundle method. |
static interface |
DoFn.GetInitialRestriction
Annotation for the method that maps an element to an initial restriction for a splittable
DoFn . |
static interface |
DoFn.GetRestrictionCoder
Annotation for the method that returns the coder to use for the restriction of a splittable
DoFn . |
static interface |
DoFn.MultiOutputReceiver
Receives tagged output for a multi-output function.
|
static interface |
DoFn.NewTracker
Annotation for the method that creates a new
RestrictionTracker for the restriction of
a splittable DoFn . |
static interface |
DoFn.OnTimer
Annotation for registering a callback for a timer.
|
class |
DoFn.OnTimerContext
Information accessible when running a
DoFn.OnTimer method. |
static interface |
DoFn.OutputReceiver<T>
Receives values of the given type.
|
class |
DoFn.ProcessContext
Information accessible when running a
DoFn.ProcessElement method. |
static class |
DoFn.ProcessContinuation
When used as a return value of
DoFn.ProcessElement , indicates whether there is more work to
be done for the current element. |
static interface |
DoFn.ProcessElement
Annotation for the method to use for processing elements.
|
static interface |
DoFn.RequiresStableInput
Experimental - no backwards compatibility guarantees.
|
static interface |
DoFn.Setup
Annotation for the method to use to prepare an instance for processing bundles of elements.
|
static interface |
DoFn.SplitRestriction
Annotation for the method that splits restriction of a splittable
DoFn into multiple parts to
be processed in parallel. |
static interface |
DoFn.StartBundle
Annotation for the method to use to prepare an instance for processing a batch of elements.
|
class |
DoFn.StartBundleContext
Information accessible while within the
DoFn.StartBundle method. |
static interface |
DoFn.StateId
Annotation for declaring and dereferencing state cells.
|
static interface |
DoFn.Teardown
Annotation for the method to use to clean up this instance before it is discarded.
|
static interface |
DoFn.TimerId
Annotation for declaring and dereferencing timers.
|
static interface |
DoFn.Timestamp
Parameter annotation for the input element timestamp for a
DoFn.ProcessElement method. |
static interface |
DoFn.UnboundedPerElement
Annotation on a splittable
DoFn
specifying that the DoFn performs an unbounded amount of work per input element, so
applying it to a bounded PCollection will produce an unbounded PCollection . |
class |
DoFn.WindowedContext
Information accessible to all methods in this
DoFn where the context is in some window. |
Constructor and Description |
---|
DoFn() |
Modifier and Type | Method and Description |
---|---|
Duration |
getAllowedTimestampSkew()
Deprecated.
This method permits a
DoFn to emit elements behind the watermark. These
elements are considered late, and if behind the
allowed lateness of a downstream
PCollection may be silently dropped. See
https://issues.apache.org/jira/browse/BEAM-644 for details on a replacement. |
TypeDescriptor<InputT> |
getInputTypeDescriptor()
Returns a
TypeDescriptor capturing what is known statically
about the input type of this DoFn instance's most-derived
class. |
TypeDescriptor<OutputT> |
getOutputTypeDescriptor()
Returns a
TypeDescriptor capturing what is known statically
about the output type of this DoFn instance's
most-derived class. |
void |
populateDisplayData(DisplayData.Builder builder)
Register display data for the given transform or component.
|
void |
prepareForProcessing()
Deprecated.
use
DoFn.Setup or DoFn.StartBundle instead. This method will be removed in a
future release. |
@Deprecated public Duration getAllowedTimestampSkew()
DoFn
to emit elements behind the watermark. These
elements are considered late, and if behind the
allowed lateness
of a downstream
PCollection
may be silently dropped. See
https://issues.apache.org/jira/browse/BEAM-644 for details on a replacement.DoFn.WindowedContext.outputWithTimestamp(OutputT, org.joda.time.Instant)
.
The default value is Duration.ZERO
, in which case timestamps can only be shifted
forward to future. For infinite skew, return Duration.millis(Long.MAX_VALUE)
.
public TypeDescriptor<InputT> getInputTypeDescriptor()
TypeDescriptor
capturing what is known statically
about the input type of this DoFn
instance's most-derived
class.
See getOutputTypeDescriptor()
for more discussion.
public TypeDescriptor<OutputT> getOutputTypeDescriptor()
TypeDescriptor
capturing what is known statically
about the output type of this DoFn
instance's
most-derived class.
In the normal case of a concrete DoFn
subclass with
no generic type parameters of its own (including anonymous inner
classes), this will be a complete non-generic type, which is good
for choosing a default output Coder<O>
for the output
PCollection<O>
.
@Deprecated public final void prepareForProcessing()
DoFn.Setup
or DoFn.StartBundle
instead. This method will be removed in a
future release.DoFn
construction to prepare for processing.
This method should be called by runners before any processing methods.public void populateDisplayData(DisplayData.Builder builder)
populateDisplayData(DisplayData.Builder)
is invoked by Pipeline runners to collect
display data via DisplayData.from(HasDisplayData)
. Implementations may call
super.populateDisplayData(builder)
in order to register display data in the current
namespace, but should otherwise use subcomponent.populateDisplayData(builder)
to use
the namespace of the subcomponent.
By default, does not register any display data. Implementors may override this method to provide their own display data.
populateDisplayData
in interface HasDisplayData
builder
- The builder to populate with display data.HasDisplayData