java.lang.Object

org.apache.beam.sdk.io.Source<OutputT>

org.apache.beam.sdk.io.UnboundedSource<OutputT,CheckpointMarkT>

Type Parameters:: OutputT - Type of records output by this source.; CheckpointMarkT - Type of checkpoint marks used by the readers of this source.

All Implemented Interfaces:: Serializable, HasDisplayData

Direct Known Subclasses:: UnboundedSolaceSource, UnboundedSourceImpl

public abstract class UnboundedSource<OutputT,CheckpointMarkT extends UnboundedSource.CheckpointMark> extends Source<OutputT>

A Source that reads an unbounded amount of input and, because of that, supports some additional operations such as checkpointing, watermarks, and record ids.

Checkpointing allows sources to not re-read the same data again in the case of failures.
Watermarks allow for downstream parts of the pipeline to know up to what point in time the data is complete.
Record ids allow for efficient deduplication of input records; many streaming sources do not guarantee that a given record will only be read a single time.

See Window and Trigger for more information on timestamps and watermarks.

See Also:

Nested Class Summary

Nested Classes

Modifier and Type

Class

Description

static interface

UnboundedSource.CheckpointMark

A marker representing the progress and state of an UnboundedSource.UnboundedReader.

static class

UnboundedSource.UnboundedReader<OutputT>

A Reader that reads an unbounded amount of input.

Nested classes/interfaces inherited from class org.apache.beam.sdk.io.Source
Source.Reader<T>
Constructor Summary

Constructors

Constructor

Description

UnboundedSource()
Method Summary

Modifier and Type

Method

Description

abstract UnboundedSource.UnboundedReader<OutputT>

createReader(PipelineOptions options, @Nullable CheckpointMarkT checkpointMark)

Create a new UnboundedSource.UnboundedReader to read from this source, resuming from the given checkpoint if present.

abstract Coder<CheckpointMarkT>

getCheckpointMarkCoder()

Returns a Coder for encoding and decoding the checkpoints for this source.

boolean

offsetBasedDeduplicationSupported()

If offsetBasedDeduplicationSupported returns true, then the UnboundedSource needs to provide the following: UnboundedReader which provides offsets that are unique for each element and lexicographically ordered.

boolean

requiresDeduping()

Returns whether this source requires explicit deduping.

abstract List<? extends UnboundedSource<OutputT,CheckpointMarkT>>

split(int desiredNumSplits, PipelineOptions options)

Returns a list of UnboundedSource objects representing the instances of this source that should be used when executing the workflow.

Methods inherited from class org.apache.beam.sdk.io.Source
getDefaultOutputCoder, getOutputCoder, populateDisplayData, validate

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- UnboundedSource
  
  public UnboundedSource()
Method Details
- split
  
  public abstract List<? extends UnboundedSource<OutputT,CheckpointMarkT>> split(int desiredNumSplits, PipelineOptions options) throws Exception
  
  Returns a list of UnboundedSource objects representing the instances of this source that should be used when executing the workflow. Each split should return a separate partition of the input data.
  For example, for a source reading from a growing directory of files, each split could correspond to a prefix of file names.
  Some sources are not splittable, such as reading from a single TCP stream. In that case, only a single split should be returned.
  Some data sources automatically partition their data among readers. For these types of inputs, n identical replicas of the top-level source can be returned.
  The size of the returned list should be as close to desiredNumSplits as possible, but does not have to match exactly. A low number of splits will limit the amount of parallelism in the source.
  
  Throws:
  
  Exception
- createReader
  
  public abstract UnboundedSource.UnboundedReader<OutputT> createReader(PipelineOptions options, @Nullable CheckpointMarkT checkpointMark) throws IOException
  
  Create a new UnboundedSource.UnboundedReader to read from this source, resuming from the given checkpoint if present.
  
  Throws:
  
  IOException
- getCheckpointMarkCoder
  
  public abstract Coder<CheckpointMarkT> getCheckpointMarkCoder()
  
  Returns a Coder for encoding and decoding the checkpoints for this source.
- requiresDeduping
  
  public boolean requiresDeduping()
  
  Returns whether this source requires explicit deduping.
  This is needed if the underlying data source can return the same record multiple times, such a queuing system with a pull-ack model. Sources where the records read are uniquely identified by the persisted state in the CheckpointMark do not need this.
  Generally, if UnboundedSource.CheckpointMark.finalizeCheckpoint() is overridden, this method should return true. Checkpoint finalization is best-effort, and readers can be resumed from a checkpoint that has not been finalized.
- offsetBasedDeduplicationSupported
  
  public boolean offsetBasedDeduplicationSupported()
  If offsetBasedDeduplicationSupported returns true, then the UnboundedSource needs to provide the following:
  
  UnboundedReader which provides offsets that are unique for each element and lexicographically ordered.
  CheckpointMark which provides an offset greater than all elements read and less than or equal to the next offset that will be read.

Class UnboundedSource<OutputT,CheckpointMarkT extends UnboundedSource.CheckpointMark>

Nested Class Summary

Nested classes/interfaces inherited from class org.apache.beam.sdk.io.Source

Constructor Summary

Method Summary

Methods inherited from class org.apache.beam.sdk.io.Source

Methods inherited from class java.lang.Object

Constructor Details

UnboundedSource

Method Details

split

createReader

getCheckpointMarkCoder

requiresDeduping

offsetBasedDeduplicationSupported