InputT
- the type of the (main) input elementsOutputT
- the type of the (main) output elementspublic abstract class DoFn<InputT,OutputT> extends java.lang.Object implements java.io.Serializable, HasDisplayData
ParDo
providing the code to use to process
elements of the input
PCollection
.
See ParDo
for more explanation, examples of use, and
discussion of constraints on DoFn
s, including their
serializability, lack of access to global shared mutable state,
requirements for failure tolerance, and benefits of optimization.
DoFn
s can be tested in a particular
Pipeline
by running that Pipeline
on sample input
and then checking its output. Unit testing of a DoFn
,
separately from any ParDo
transform or Pipeline
,
can be done via the DoFnTester
harness.
Implementations must define a method annotated with DoFn.ProcessElement
that satisfies the requirements described there. See the DoFn.ProcessElement
for details.
Example usage:
PCollection<String> lines = ... ;
PCollection<String> words =
lines.apply(ParDo.of(new DoFn<String, String>()) {
@ProcessElement
public void processElement(ProcessContext c, BoundedWindow window) {
...
}}));
Modifier and Type | Class and Description |
---|---|
static interface |
DoFn.BoundedPerElement
Annotation on a splittable
DoFn
specifying that the DoFn performs a bounded amount of work per input element, so
applying it to a bounded PCollection will produce also a bounded PCollection . |
static interface |
DoFn.FinishBundle
Annotation for the method to use to finish processing a batch of elements.
|
class |
DoFn.FinishBundleContext
Information accessible while within the
DoFn.FinishBundle method. |
static interface |
DoFn.GetInitialRestriction
Annotation for the method that maps an element to an initial restriction for a splittable
DoFn . |
static interface |
DoFn.GetRestrictionCoder
Annotation for the method that returns the coder to use for the restriction of a splittable
DoFn . |
static interface |
DoFn.NewTracker
Annotation for the method that creates a new
RestrictionTracker for the restriction of
a splittable DoFn . |
static interface |
DoFn.OnTimer
Annotation for registering a callback for a timer.
|
class |
DoFn.OnTimerContext
Information accessible when running a
DoFn.OnTimer method. |
static interface |
DoFn.OutputReceiver<T>
Receives values of the given type.
|
class |
DoFn.ProcessContext
Information accessible when running a
DoFn.ProcessElement method. |
class |
DoFn.ProcessContinuation
Temporary, do not use.
|
static interface |
DoFn.ProcessElement
Annotation for the method to use for processing elements.
|
static interface |
DoFn.Setup
Annotation for the method to use to prepare an instance for processing bundles of elements.
|
static interface |
DoFn.SplitRestriction
Annotation for the method that splits restriction of a splittable
DoFn into multiple parts to
be processed in parallel. |
static interface |
DoFn.StartBundle
Annotation for the method to use to prepare an instance for processing a batch of elements.
|
class |
DoFn.StartBundleContext
Information accessible while within the
DoFn.StartBundle method. |
static interface |
DoFn.StateId
Annotation for declaring and dereferencing state cells.
|
static interface |
DoFn.Teardown
Annotation for the method to use to clean up this instance after processing bundles of
elements.
|
static interface |
DoFn.TimerId
Annotation for declaring and dereferencing timers.
|
static interface |
DoFn.UnboundedPerElement
Annotation on a splittable
DoFn
specifying that the DoFn performs an unbounded amount of work per input element, so
applying it to a bounded PCollection will produce an unbounded PCollection . |
class |
DoFn.WindowedContext
Information accessible to all methods in this
DoFn where the context is in some window. |
Constructor and Description |
---|
DoFn() |
Modifier and Type | Method and Description |
---|---|
Duration |
getAllowedTimestampSkew()
Deprecated.
This method permits a
DoFn to emit elements behind the watermark. These
elements are considered late, and if behind the
allowed lateness of a downstream
PCollection may be silently dropped. See
https://issues.apache.org/jira/browse/BEAM-644 for details on a replacement. |
TypeDescriptor<InputT> |
getInputTypeDescriptor()
Returns a
TypeDescriptor capturing what is known statically
about the input type of this DoFn instance's most-derived
class. |
TypeDescriptor<OutputT> |
getOutputTypeDescriptor()
Returns a
TypeDescriptor capturing what is known statically
about the output type of this DoFn instance's
most-derived class. |
void |
populateDisplayData(DisplayData.Builder builder)
Register display data for the given transform or component.
|
void |
prepareForProcessing()
Deprecated.
|
@Deprecated public Duration getAllowedTimestampSkew()
DoFn
to emit elements behind the watermark. These
elements are considered late, and if behind the
allowed lateness
of a downstream
PCollection
may be silently dropped. See
https://issues.apache.org/jira/browse/BEAM-644 for details on a replacement.DoFn.WindowedContext.outputWithTimestamp(OutputT, org.joda.time.Instant)
.
The default value is Duration.ZERO
, in which case timestamps can only be shifted
forward to future. For infinite skew, return Duration.millis(Long.MAX_VALUE)
.
public TypeDescriptor<InputT> getInputTypeDescriptor()
TypeDescriptor
capturing what is known statically
about the input type of this DoFn
instance's most-derived
class.
See getOutputTypeDescriptor()
for more discussion.
public TypeDescriptor<OutputT> getOutputTypeDescriptor()
TypeDescriptor
capturing what is known statically
about the output type of this DoFn
instance's
most-derived class.
In the normal case of a concrete DoFn
subclass with
no generic type parameters of its own (including anonymous inner
classes), this will be a complete non-generic type, which is good
for choosing a default output Coder<O>
for the output
PCollection<O>
.
@Deprecated public final void prepareForProcessing()
DoFn
construction to prepare for processing.
This method should be called by runners before any processing methods.public void populateDisplayData(DisplayData.Builder builder)
populateDisplayData(DisplayData.Builder)
is invoked by Pipeline runners to collect
display data via DisplayData.from(HasDisplayData)
. Implementations may call
super.populateDisplayData(builder)
in order to register display data in the current
namespace, but should otherwise use subcomponent.populateDisplayData(builder)
to use
the namespace of the subcomponent.
By default, does not register any display data. Implementors may override this method to provide their own display data.
populateDisplayData
in interface HasDisplayData
builder
- The builder to populate with display data.HasDisplayData