Class UnboundedSource<OutputT,CheckpointMarkT extends UnboundedSource.CheckpointMark>

java.lang.Object
org.apache.beam.sdk.io.Source<OutputT>
org.apache.beam.sdk.io.UnboundedSource<OutputT,CheckpointMarkT>
Type Parameters:
OutputT - Type of records output by this source.
CheckpointMarkT - Type of checkpoint marks used by the readers of this source.
All Implemented Interfaces:
Serializable, HasDisplayData
Direct Known Subclasses:
UnboundedSolaceSource, UnboundedSourceImpl

public abstract class UnboundedSource<OutputT,CheckpointMarkT extends UnboundedSource.CheckpointMark> extends Source<OutputT>
A Source that reads an unbounded amount of input and, because of that, supports some additional operations such as checkpointing, watermarks, and record ids.
  • Checkpointing allows sources to not re-read the same data again in the case of failures.
  • Watermarks allow for downstream parts of the pipeline to know up to what point in time the data is complete.
  • Record ids allow for efficient deduplication of input records; many streaming sources do not guarantee that a given record will only be read a single time.

See Window and Trigger for more information on timestamps and watermarks.

See Also:
  • Constructor Details

    • UnboundedSource

      public UnboundedSource()
  • Method Details

    • split

      public abstract List<? extends UnboundedSource<OutputT,CheckpointMarkT>> split(int desiredNumSplits, PipelineOptions options) throws Exception
      Returns a list of UnboundedSource objects representing the instances of this source that should be used when executing the workflow. Each split should return a separate partition of the input data.

      For example, for a source reading from a growing directory of files, each split could correspond to a prefix of file names.

      Some sources are not splittable, such as reading from a single TCP stream. In that case, only a single split should be returned.

      Some data sources automatically partition their data among readers. For these types of inputs, n identical replicas of the top-level source can be returned.

      The size of the returned list should be as close to desiredNumSplits as possible, but does not have to match exactly. A low number of splits will limit the amount of parallelism in the source.

      Throws:
      Exception
    • createReader

      public abstract UnboundedSource.UnboundedReader<OutputT> createReader(PipelineOptions options, @Nullable CheckpointMarkT checkpointMark) throws IOException
      Create a new UnboundedSource.UnboundedReader to read from this source, resuming from the given checkpoint if present.
      Throws:
      IOException
    • getCheckpointMarkCoder

      public abstract Coder<CheckpointMarkT> getCheckpointMarkCoder()
      Returns a Coder for encoding and decoding the checkpoints for this source.
    • requiresDeduping

      public boolean requiresDeduping()
      Returns whether this source requires explicit deduping.

      This is needed if the underlying data source can return the same record multiple times, such a queuing system with a pull-ack model. Sources where the records read are uniquely identified by the persisted state in the CheckpointMark do not need this.

      Generally, if UnboundedSource.CheckpointMark.finalizeCheckpoint() is overridden, this method should return true. Checkpoint finalization is best-effort, and readers can be resumed from a checkpoint that has not been finalized.

    • offsetBasedDeduplicationSupported

      public boolean offsetBasedDeduplicationSupported()
      If offsetBasedDeduplicationSupported returns true, then the UnboundedSource needs to provide the following:
      • UnboundedReader which provides offsets that are unique for each element and lexicographically ordered.
      • CheckpointMark which provides an offset greater than all elements read and less than or equal to the next offset that will be read.