T - Type of records represented by the source.public abstract class OffsetBasedSource<T> extends BoundedSource<T>
BoundedSource that uses offsets to define starting and ending positions.
OffsetBasedSource is a common base class for all bounded sources where the input can
be represented as a single range, and an input can be efficiently processed in parallel by
splitting the range into a set of disjoint ranges whose union is the original range. This class
should be used for sources that can be cheaply read starting at any given offset. OffsetBasedSource stores the range and implements splitting into bundles.
Extend OffsetBasedSource to implement your own offset-based custom source. FileBasedSource, which is a subclass of this, adds additional functionality useful for custom
sources that are based on files. If possible implementors should start from FileBasedSource instead of OffsetBasedSource.
Consult RangeTracker for important semantics common to all sources defined by a range
of positions of a certain type, including the semantics of split points (OffsetBasedSource.OffsetBasedReader.isAtSplitPoint()).
BoundedSource,
FileBasedSource,
RangeTracker,
Serialized Form| Modifier and Type | Class and Description |
|---|---|
static class |
OffsetBasedSource.OffsetBasedReader<T>
A
Source.Reader that implements code common to readers of all OffsetBasedSources. |
BoundedSource.BoundedReader<T>Source.Reader<T>| Constructor and Description |
|---|
OffsetBasedSource(long startOffset,
long endOffset,
long minBundleSize) |
| Modifier and Type | Method and Description |
|---|---|
abstract OffsetBasedSource<T> |
createSourceForSubrange(long start,
long end)
Returns an
OffsetBasedSource for a subrange of the current source. |
long |
getBytesPerOffset()
Returns approximately how many bytes of data correspond to a single offset in this source.
|
long |
getEndOffset()
Returns the specified ending offset of the source.
|
long |
getEstimatedSizeBytes(PipelineOptions options)
An estimate of the total size (in bytes) of the data that would be read from this source.
|
abstract long |
getMaxEndOffset(PipelineOptions options)
Returns the actual ending offset of the current source.
|
long |
getMinBundleSize()
Returns the minimum bundle size that should be used when splitting the source into sub-sources.
|
long |
getStartOffset()
Returns the starting offset of the source.
|
void |
populateDisplayData(DisplayData.Builder builder)
Register display data for the given transform or component.
|
java.util.List<? extends OffsetBasedSource<T>> |
split(long desiredBundleSizeBytes,
PipelineOptions options)
Splits the source into bundles of approximately
desiredBundleSizeBytes. |
java.lang.String |
toString() |
void |
validate()
Checks that this source is valid, before it can be used in a pipeline.
|
createReadergetDefaultOutputCoder, getOutputCoderpublic OffsetBasedSource(long startOffset,
long endOffset,
long minBundleSize)
startOffset - starting offset (inclusive) of the source. Must be non-negative.endOffset - ending offset (exclusive) of the source. Use Long.MAX_VALUE to
indicate that the entire source after startOffset should be read. Must be >
startOffset.minBundleSize - minimum bundle size in offset units that should be used when splitting the
source into sub-sources. This value may not be respected if the total range of the source
is smaller than the specified minBundleSize. Must be non-negative.public long getStartOffset()
public long getEndOffset()
getMaxEndOffset(PipelineOptions) should be treated as getMaxEndOffset(PipelineOptions).public long getMinBundleSize()
minBundleSize.public long getEstimatedSizeBytes(PipelineOptions options) throws java.lang.Exception
BoundedSourceIf there is no way to estimate the size of the source implementations MAY return 0L.
getEstimatedSizeBytes in class BoundedSource<T>java.lang.Exceptionpublic java.util.List<? extends OffsetBasedSource<T>> split(long desiredBundleSizeBytes, PipelineOptions options) throws java.lang.Exception
BoundedSourcedesiredBundleSizeBytes.split in class BoundedSource<T>java.lang.Exceptionpublic void validate()
SourceIt is recommended to use Preconditions for implementing this
method.
public java.lang.String toString()
toString in class java.lang.Objectpublic long getBytesPerOffset()
getEstimatedSizeBytes(org.apache.beam.sdk.options.PipelineOptions) and split(long, org.apache.beam.sdk.options.PipelineOptions).
Defaults to 1 byte, which is the common case for, e.g., file sources.
public abstract long getMaxEndOffset(PipelineOptions options) throws java.lang.Exception
[startOffset, endOffset) such that the range
used is [startOffset, min(endOffset, maxEndOffset)).
As an example in which OffsetBasedSource is used to implement a file source, suppose
that this source was constructed with an endOffset of Long.MAX_VALUE to
indicate that a file should be read to the end. Then this function should determine the actual,
exact size of the file in bytes and return it.
java.lang.Exceptionpublic abstract OffsetBasedSource<T> createSourceForSubrange(long start, long end)
OffsetBasedSource for a subrange of the current source. The subrange [start, end) must be within the range [startOffset, endOffset) of the current source,
i.e. startOffset <= start < end <= endOffset.public void populateDisplayData(DisplayData.Builder builder)
SourcepopulateDisplayData(DisplayData.Builder) is invoked by Pipeline runners to collect
display data via DisplayData.from(HasDisplayData). Implementations may call super.populateDisplayData(builder) in order to register display data in the current namespace,
but should otherwise use subcomponent.populateDisplayData(builder) to use the namespace
of the subcomponent.
By default, does not register any display data. Implementors may override this method to provide their own display data.
populateDisplayData in interface HasDisplayDatapopulateDisplayData in class Source<T>builder - The builder to populate with display data.HasDisplayData