Class OffsetBasedSource<T>

Type Parameters:
T - Type of records represented by the source.
All Implemented Interfaces:
Serializable, HasDisplayData
Direct Known Subclasses:
FileBasedSource

public abstract class OffsetBasedSource<T> extends BoundedSource<T>
A BoundedSource that uses offsets to define starting and ending positions.

OffsetBasedSource is a common base class for all bounded sources where the input can be represented as a single range, and an input can be efficiently processed in parallel by splitting the range into a set of disjoint ranges whose union is the original range. This class should be used for sources that can be cheaply read starting at any given offset. OffsetBasedSource stores the range and implements splitting into bundles.

Extend OffsetBasedSource to implement your own offset-based custom source. FileBasedSource, which is a subclass of this, adds additional functionality useful for custom sources that are based on files. If possible implementors should start from FileBasedSource instead of OffsetBasedSource.

Consult RangeTracker for important semantics common to all sources defined by a range of positions of a certain type, including the semantics of split points (OffsetBasedSource.OffsetBasedReader.isAtSplitPoint()).

See Also:
  • Constructor Details

    • OffsetBasedSource

      public OffsetBasedSource(long startOffset, long endOffset, long minBundleSize)
      Parameters:
      startOffset - starting offset (inclusive) of the source. Must be non-negative.
      endOffset - ending offset (exclusive) of the source. Use Long.MAX_VALUE to indicate that the entire source after startOffset should be read. Must be > startOffset.
      minBundleSize - minimum bundle size in offset units that should be used when splitting the source into sub-sources. This value may not be respected if the total range of the source is smaller than the specified minBundleSize. Must be non-negative.
  • Method Details

    • getStartOffset

      public long getStartOffset()
      Returns the starting offset of the source.
    • getEndOffset

      public long getEndOffset()
      Returns the specified ending offset of the source. Any returned value greater than or equal to getMaxEndOffset(PipelineOptions) should be treated as getMaxEndOffset(PipelineOptions).
    • getMinBundleSize

      public long getMinBundleSize()
      Returns the minimum bundle size that should be used when splitting the source into sub-sources. This value may not be respected if the total range of the source is smaller than the specified minBundleSize.
    • getEstimatedSizeBytes

      public long getEstimatedSizeBytes(PipelineOptions options) throws Exception
      Description copied from class: BoundedSource
      An estimate of the total size (in bytes) of the data that would be read from this source. This estimate is in terms of external storage size, before any decompression or other processing done by the reader.

      If there is no way to estimate the size of the source implementations MAY return 0L.

      Specified by:
      getEstimatedSizeBytes in class BoundedSource<T>
      Throws:
      Exception
    • split

      public List<? extends OffsetBasedSource<T>> split(long desiredBundleSizeBytes, PipelineOptions options) throws Exception
      Description copied from class: BoundedSource
      Splits the source into bundles of approximately desiredBundleSizeBytes.
      Specified by:
      split in class BoundedSource<T>
      Throws:
      Exception
    • validate

      public void validate()
      Description copied from class: Source
      Checks that this source is valid, before it can be used in a pipeline.

      It is recommended to use Preconditions for implementing this method.

      Overrides:
      validate in class Source<T>
    • toString

      public String toString()
      Overrides:
      toString in class Object
    • getBytesPerOffset

      public long getBytesPerOffset()
      Returns approximately how many bytes of data correspond to a single offset in this source. Used for translation between this source's range and methods defined in terms of bytes, such as getEstimatedSizeBytes(org.apache.beam.sdk.options.PipelineOptions) and split(long, org.apache.beam.sdk.options.PipelineOptions).

      Defaults to 1 byte, which is the common case for, e.g., file sources.

    • getMaxEndOffset

      public abstract long getMaxEndOffset(PipelineOptions options) throws Exception
      Returns the actual ending offset of the current source. The value returned by this function will be used to clip the end of the range [startOffset, endOffset) such that the range used is [startOffset, min(endOffset, maxEndOffset)).

      As an example in which OffsetBasedSource is used to implement a file source, suppose that this source was constructed with an endOffset of Long.MAX_VALUE to indicate that a file should be read to the end. Then this function should determine the actual, exact size of the file in bytes and return it.

      Throws:
      Exception
    • createSourceForSubrange

      public abstract OffsetBasedSource<T> createSourceForSubrange(long start, long end)
      Returns an OffsetBasedSource for a subrange of the current source. The subrange [start, end) must be within the range [startOffset, endOffset) of the current source, i.e. startOffset <= start < end <= endOffset.
    • populateDisplayData

      public void populateDisplayData(DisplayData.Builder builder)
      Description copied from class: Source
      Register display data for the given transform or component.

      populateDisplayData(DisplayData.Builder) is invoked by Pipeline runners to collect display data via DisplayData.from(HasDisplayData). Implementations may call super.populateDisplayData(builder) in order to register display data in the current namespace, but should otherwise use subcomponent.populateDisplayData(builder) to use the namespace of the subcomponent.

      By default, does not register any display data. Implementors may override this method to provide their own display data.

      Specified by:
      populateDisplayData in interface HasDisplayData
      Overrides:
      populateDisplayData in class Source<T>
      Parameters:
      builder - The builder to populate with display data.
      See Also: