Class UnboundedSource<OutputT,CheckpointMarkT extends UnboundedSource.CheckpointMark>
- Type Parameters:
OutputT
- Type of records output by this source.CheckpointMarkT
- Type of checkpoint marks used by the readers of this source.
- All Implemented Interfaces:
Serializable
,HasDisplayData
- Direct Known Subclasses:
UnboundedSolaceSource
,UnboundedSourceImpl
Source
that reads an unbounded amount of input and, because of that, supports some
additional operations such as checkpointing, watermarks, and record ids.
- Checkpointing allows sources to not re-read the same data again in the case of failures.
- Watermarks allow for downstream parts of the pipeline to know up to what point in time the data is complete.
- Record ids allow for efficient deduplication of input records; many streaming sources do not guarantee that a given record will only be read a single time.
See Window
and Trigger
for more information on timestamps and
watermarks.
- See Also:
-
Nested Class Summary
Nested ClassesModifier and TypeClassDescriptionstatic interface
A marker representing the progress and state of anUnboundedSource.UnboundedReader
.static class
AReader
that reads an unbounded amount of input.Nested classes/interfaces inherited from class org.apache.beam.sdk.io.Source
Source.Reader<T>
-
Constructor Summary
Constructors -
Method Summary
Modifier and TypeMethodDescriptionabstract UnboundedSource.UnboundedReader
<OutputT> createReader
(PipelineOptions options, @Nullable CheckpointMarkT checkpointMark) Create a newUnboundedSource.UnboundedReader
to read from this source, resuming from the given checkpoint if present.abstract Coder
<CheckpointMarkT> Returns aCoder
for encoding and decoding the checkpoints for this source.boolean
If offsetBasedDeduplicationSupported returns true, then the UnboundedSource needs to provide the following: UnboundedReader which provides offsets that are unique for each element and lexicographically ordered.boolean
Returns whether this source requires explicit deduping.abstract List
<? extends UnboundedSource<OutputT, CheckpointMarkT>> split
(int desiredNumSplits, PipelineOptions options) Returns a list ofUnboundedSource
objects representing the instances of this source that should be used when executing the workflow.Methods inherited from class org.apache.beam.sdk.io.Source
getDefaultOutputCoder, getOutputCoder, populateDisplayData, validate
-
Constructor Details
-
UnboundedSource
public UnboundedSource()
-
-
Method Details
-
split
public abstract List<? extends UnboundedSource<OutputT,CheckpointMarkT>> split(int desiredNumSplits, PipelineOptions options) throws Exception Returns a list ofUnboundedSource
objects representing the instances of this source that should be used when executing the workflow. Each split should return a separate partition of the input data.For example, for a source reading from a growing directory of files, each split could correspond to a prefix of file names.
Some sources are not splittable, such as reading from a single TCP stream. In that case, only a single split should be returned.
Some data sources automatically partition their data among readers. For these types of inputs,
n
identical replicas of the top-level source can be returned.The size of the returned list should be as close to
desiredNumSplits
as possible, but does not have to match exactly. A low number of splits will limit the amount of parallelism in the source.- Throws:
Exception
-
createReader
public abstract UnboundedSource.UnboundedReader<OutputT> createReader(PipelineOptions options, @Nullable CheckpointMarkT checkpointMark) throws IOException Create a newUnboundedSource.UnboundedReader
to read from this source, resuming from the given checkpoint if present.- Throws:
IOException
-
getCheckpointMarkCoder
Returns aCoder
for encoding and decoding the checkpoints for this source. -
requiresDeduping
public boolean requiresDeduping()Returns whether this source requires explicit deduping.This is needed if the underlying data source can return the same record multiple times, such a queuing system with a pull-ack model. Sources where the records read are uniquely identified by the persisted state in the CheckpointMark do not need this.
Generally, if
UnboundedSource.CheckpointMark.finalizeCheckpoint()
is overridden, this method should return true. Checkpoint finalization is best-effort, and readers can be resumed from a checkpoint that has not been finalized. -
offsetBasedDeduplicationSupported
public boolean offsetBasedDeduplicationSupported()If offsetBasedDeduplicationSupported returns true, then the UnboundedSource needs to provide the following:- UnboundedReader which provides offsets that are unique for each element and lexicographically ordered.
- CheckpointMark which provides an offset greater than all elements read and less than or equal to the next offset that will be read.
-