@Experimental(value=SOURCE_SINK) public abstract static class UnboundedSource.UnboundedReader<OutputT> extends Source.Reader<OutputT>
Reader
that reads an unbounded amount of input.
A given UnboundedReader
object will only be accessed by a single thread at once.
Modifier and Type | Field and Description |
---|---|
static long |
BACKLOG_UNKNOWN
Constant representing an unknown amount of backlog.
|
Constructor and Description |
---|
UnboundedReader() |
Modifier and Type | Method and Description |
---|---|
abstract boolean |
advance()
Advances the reader to the next valid record.
|
abstract UnboundedSource.CheckpointMark |
getCheckpointMark()
Returns a
UnboundedSource.CheckpointMark representing the progress of this UnboundedReader . |
byte[] |
getCurrentRecordId()
Returns a unique identifier for the current record.
|
abstract UnboundedSource<OutputT,?> |
getCurrentSource()
Returns the
UnboundedSource that created this reader. |
long |
getSplitBacklogBytes()
Returns the size of the backlog of unread data in the underlying data source represented by
this split of this source.
|
long |
getTotalBacklogBytes()
Returns the size of the backlog of unread data in the underlying data source represented by
all splits of this source.
|
abstract Instant |
getWatermark()
Returns a timestamp before or at the timestamps of all future elements read by this reader.
|
abstract boolean |
start()
Initializes the reader and advances the reader to the first record.
|
close, getCurrent, getCurrentTimestamp
public static final long BACKLOG_UNKNOWN
public abstract boolean start() throws java.io.IOException
This method will be called exactly once. The invocation will occur prior to calling
advance()
or Source.Reader.getCurrent()
. This method may perform expensive operations that
are needed to initialize the reader.
Returns true
if a record was read, false
if there is no more input
currently available. Future calls to advance()
may return true
once more data
is available. Regardless of the return value of start
, start
will not be
called again on the same UnboundedReader
object; it will only be called again when a
new reader object is constructed for the same source, e.g. on recovery.
start
in class Source.Reader<OutputT>
true
if a record was read, false
if there is no more input available.java.io.IOException
public abstract boolean advance() throws java.io.IOException
Returns true
if a record was read, false
if there is no more input
available. Future calls to advance()
may return true
once more data is
available.
advance
in class Source.Reader<OutputT>
true
if a record was read, false
if there is no more input available.java.io.IOException
public byte[] getCurrentRecordId() throws java.util.NoSuchElementException
It is only necessary to override this if UnboundedSource.requiresDeduping()
has been overridden to
return true.
For example, this could be a hash of the record contents, or a logical ID present in the record. If this is generated as a hash of the record contents, it should be at least 16 bytes (128 bits) to avoid collisions.
This method has the same restrictions on when it can be called as Source.Reader.getCurrent()
and
Source.Reader.getCurrentTimestamp()
.
public abstract Instant getWatermark()
This can be approximate. If records are read that violate this guarantee, they will be
considered late, which will affect how they will be processed. See
Window
for more information on
late data and how to handle it.
However, this value should be as late as possible. Downstream windows may not be able to close until this watermark passes their end.
For example, a source may know that the records it reads will be in timestamp order. In this case, the watermark can be the timestamp of the last record read. For a source that does not have natural timestamps, timestamps can be set to the time of reading, in which case the watermark is the current clock time.
See Window
and
Trigger
for more
information on timestamps and watermarks.
May be called after advance()
or start()
has returned false, but not before
start()
has been called.
public abstract UnboundedSource.CheckpointMark getCheckpointMark()
UnboundedSource.CheckpointMark
representing the progress of this UnboundedReader
.
If this UnboundedReader
does not support checkpoints, it may return a
CheckpointMark which does nothing, like:
public UnboundedSource.CheckpointMark getCheckpointMark() {
return new UnboundedSource.CheckpointMark() {
public void finalizeCheckpoint() throws IOException {
// nothing to do
}
};
}
All elements read between the last time this method was called (or since this reader was
created, if this method has not been called on this reader) until this method is called will
be processed together as a bundle. (An element is considered 'read' if it could be returned
by a call to Source.Reader.getCurrent()
.)
Once the result of processing those elements and the returned checkpoint have been durably
committed, UnboundedSource.CheckpointMark.finalizeCheckpoint()
will be called at most once at some
later point on the returned UnboundedSource.CheckpointMark
object. Checkpoint finalization is
best-effort, and checkpoints may not be finalized. If duplicate elements may be produced if
checkpoints are not finalized in a timely manner, UnboundedSource.requiresDeduping()
should be overridden to return true, and getCurrentRecordId()
should
be overridden to return unique record IDs.
A checkpoint will be committed to durable storage only if all all previous checkpoints produced by the same reader have also been committed.
The returned object should not be modified.
May not be called before start()
has been called.
public long getSplitBacklogBytes()
One of this or getTotalBacklogBytes()
should be overridden in order to allow the
runner to scale the amount of resources allocated to the pipeline.
public long getTotalBacklogBytes()
One of this or getSplitBacklogBytes()
should be overridden in order to allow the
runner to scale the amount of resources allocated to the pipeline.
public abstract UnboundedSource<OutputT,?> getCurrentSource()
UnboundedSource
that created this reader. This will not change over the
life of the reader.getCurrentSource
in class Source.Reader<OutputT>