Package org.apache.beam.sdk.transforms
Annotation Interface DoFn.GetSize
Annotation for the method that returns the corresponding size for an element and restriction
pair.
Signature: double getSize(<arguments>);
This method must satisfy the following constraints:
- If one of its arguments is tagged with the
DoFn.Elementannotation, then it will be passed the current element being processed; the argument must be of typeInputT. Note that automatic conversion ofRows andDoFn.FieldAccessparameters are currently unsupported. - If one of its arguments is tagged with the
DoFn.Restrictionannotation, then it will be passed the current restriction being processed; the argument must be of typeRestrictionT. - If one of its arguments is tagged with the
DoFn.Timestampannotation, then it will be passed the timestamp of the current element being processed; the argument must be of typeInstant. - If one of its arguments is a
RestrictionTracker, then it will be passed a tracker that is initialized for the currentDoFn.Restriction. The argument must be of the exact typeRestrictionTracker<RestrictionT, PositionT>. - If one of its arguments is a subtype of
BoundedWindow, then it will be passed the window of the current element. When applied byParDothe subtype ofBoundedWindowmust match the type of windows on the inputPCollection. If the window is not accessed a runner may perform additional optimizations. - If one of its arguments is of type
PaneInfo, then it will be passed information about the current triggering pane. - If one of the parameters is of type
PipelineOptions, then it will be passed the options for the current pipeline.
Returns a non-negative double representing the size of the current element and restriction.
Splittable DoFns should only provide this method if the default RestrictionTracker.HasProgress implementation within the RestrictionTracker is an
inaccurate representation of known work.
It is up to each splittable to convert between their natural representation of outstanding work and this representation. For example:
- Block based file source (e.g. Avro): The number of bytes that will be read from the file.
- Pull based queue based source (e.g. Pubsub): The local/global size available in number of
messages or number of
message bytesthat have not been processed. - Key range based source (e.g. Shuffle, Bigtable, ...): Typically
1.0unless additional details such as the number of bytes for keys and values is known for the key range.
Must be thread safe. Will be invoked concurrently during bundle processing due to runner initiated splitting and progress estimation.