Type Parameters:
T - Type of records represented by the source.
public abstract class OffsetBasedSource<T> extends BoundedSource<T>
A BoundedSource that uses offsets to define starting and ending positions.
OffsetBasedSource is a common base class for all bounded sources where the input can be represented as a single range, and the input can be efficiently processed in parallel by splitting the range into a set of disjoint ranges whose union is the original range. This class should be used for sources that can be cheaply read starting at any given offset. OffsetBasedSource stores the range and implements splitting into bundles.
Extend OffsetBasedSource to implement your own offset-based custom source. FileBasedSource, which is a subclass of this class, adds functionality useful for custom sources that are based on files. If possible, implementors should start from FileBasedSource instead of OffsetBasedSource.
Consult RangeTracker for important semantics common to all sources defined by a range of positions of a certain type, including the semantics of split points (OffsetBasedSource.OffsetBasedReader.isAtSplitPoint()).
See Also:
BoundedSource, FileBasedSource, RangeTracker, Serialized Form
Modifier and Type | Class and Description |
---|---|
static class | OffsetBasedSource.OffsetBasedReader<T> A Source.Reader that implements code common to readers of all OffsetBasedSources. |
Nested classes inherited from class BoundedSource: BoundedSource.BoundedReader<T>
Nested classes inherited from class Source: Source.Reader<T>
Constructor and Description |
---|
OffsetBasedSource(long startOffset, long endOffset, long minBundleSize) |
Modifier and Type | Method and Description |
---|---|
abstract OffsetBasedSource<T> | createSourceForSubrange(long start, long end) Returns an OffsetBasedSource for a subrange of the current source. |
long | getBytesPerOffset() Returns approximately how many bytes of data correspond to a single offset in this source. |
long | getEndOffset() Returns the specified ending offset of the source. |
long | getEstimatedSizeBytes(PipelineOptions options) An estimate of the total size (in bytes) of the data that would be read from this source. |
abstract long | getMaxEndOffset(PipelineOptions options) Returns the actual ending offset of the current source. |
long | getMinBundleSize() Returns the minimum bundle size that should be used when splitting the source into sub-sources. |
long | getStartOffset() Returns the starting offset of the source. |
void | populateDisplayData(DisplayData.Builder builder) Register display data for the given transform or component. |
java.util.List<? extends OffsetBasedSource<T>> | split(long desiredBundleSizeBytes, PipelineOptions options) Splits the source into bundles of approximately desiredBundleSizeBytes. |
java.lang.String | toString() |
void | validate() Checks that this source is valid, before it can be used in a pipeline. |
Methods inherited from class BoundedSource: createReader
Methods inherited from class Source: getDefaultOutputCoder, getOutputCoder
public OffsetBasedSource(long startOffset, long endOffset, long minBundleSize)
Parameters:
startOffset - starting offset (inclusive) of the source. Must be non-negative.
endOffset - ending offset (exclusive) of the source. Use Long.MAX_VALUE to indicate that the entire source after startOffset should be read. Must be > startOffset.
minBundleSize - minimum bundle size in offset units that should be used when splitting the source into sub-sources. This value may not be respected if the total range of the source is smaller than the specified minBundleSize. Must be non-negative.
public long getStartOffset()
Returns the starting offset of the source.
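public long getEndOffset()
Returns the specified ending offset of the source. Any returned value greater than or equal to getMaxEndOffset(PipelineOptions) should be treated as getMaxEndOffset(PipelineOptions).
public long getMinBundleSize()
Returns the minimum bundle size that should be used when splitting the source into sub-sources. This value may not be respected if the total range of the source is smaller than the specified minBundleSize.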
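public long getEstimatedSizeBytes(PipelineOptions options) throws java.lang.Exception
Description copied from class: BoundedSource
An estimate of the total size (in bytes) of the data that would be read from this source. If there is no way to estimate the size of the source, implementations MAY return 0L.
Specified by:
getEstimatedSizeBytes in class BoundedSource<T>
Throws:
java.lang.Exception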
public java.util.List<? extends OffsetBasedSource<T>> split(long desiredBundleSizeBytes, PipelineOptions options) throws java.lang.Exception
Description copied from class: BoundedSource
Splits the source into bundles of approximately desiredBundleSizeBytes.
Specified by:
split in class BoundedSource<T>
Throws:
java.lang.Exception
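The splitting behaviour described for this class (disjoint sub-ranges whose union is the original range, with a minimum bundle size) can be modelled with plain range arithmetic. The following is a minimal standalone sketch, not Beam's implementation: the class `RangeSplitSketch`, its nested `Range` type, and the rule that a short tail is absorbed into the last range are all assumptions made for illustration, and sizes are expressed in offset units rather than bytes.

```java
import java.util.ArrayList;
import java.util.List;

/** Standalone model of splitting a half-open offset range; not Beam code. */
final class RangeSplitSketch {

    /** A half-open range [from, to) of offsets (hypothetical helper type). */
    static final class Range {
        final long from;
        final long to;

        Range(long from, long to) {
            this.from = from;
            this.to = to;
        }
    }

    /**
     * Splits [start, end) into consecutive disjoint ranges of about
     * desiredSize offsets each. A tail shorter than minSize is absorbed
     * into the last range instead of becoming its own bundle.
     */
    static List<Range> split(long start, long end, long desiredSize, long minSize) {
        if (start < 0 || end <= start) {
            throw new IllegalArgumentException("need 0 <= start < end");
        }
        // Never step by less than one offset or less than the minimum size.
        long step = Math.max(Math.max(1L, desiredSize), minSize);
        List<Range> ranges = new ArrayList<>();
        long from = start;
        while (from < end) {
            long remaining = end - from;
            // Take everything that is left if it fits in one step, or if
            // cutting a full step would leave a tail shorter than minSize.
            boolean takeAll = remaining <= step || remaining - step < minSize;
            long to = takeAll ? end : from + step;
            ranges.add(new Range(from, to));
            from = to;
        }
        return ranges;
    }
}
```

Under these assumptions, `split(0, 100, 30, 10)` produces `[0, 30)`, `[30, 60)`, `[60, 90)`, `[90, 100)`, while raising the minimum size to 11 merges the tail and produces `[0, 30)`, `[30, 60)`, `[60, 100)`. Each resulting range satisfies the `startOffset <= start < end <= endOffset` condition that `createSourceForSubrange` requires.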
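public void validate()
Description copied from class: Source
Checks that this source is valid, before it can be used in a pipeline.
It is recommended to use Preconditions for implementing this method.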
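As an illustration of the kind of checks a `validate()` override might perform, the sketch below enforces the constraints stated for the constructor parameters. It is a standalone model under stated assumptions: `OffsetRangeChecks` and its `require` helper are hypothetical names, and `require` is a dependency-free stand-in for a `Preconditions`-style argument check (the text does not say which `Preconditions` class is meant; Guava's is a likely candidate).

```java
/** Standalone model of argument checks for an offset range; not Beam code. */
final class OffsetRangeChecks {

    /** Throws IllegalArgumentException if a documented constraint is violated. */
    static void validate(long startOffset, long endOffset, long minBundleSize) {
        require(startOffset >= 0,
                "startOffset must be non-negative, got %d", startOffset);
        require(endOffset > startOffset,
                "endOffset (%d) must be greater than startOffset (%d)",
                endOffset, startOffset);
        require(minBundleSize >= 0,
                "minBundleSize must be non-negative, got %d", minBundleSize);
    }

    // Hypothetical stand-in for a Preconditions-style checkArgument helper.
    // The message is only formatted when the check fails.
    private static void require(boolean condition, String template, Object... args) {
        if (!condition) {
            throw new IllegalArgumentException(String.format(template, args));
        }
    }
}
```

Failing fast with a message that names the offending value tends to make misconfigured sources easier to diagnose before the pipeline runs.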
public java.lang.String toString()
Overrides:
toString in class java.lang.Object
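public long getBytesPerOffset()
Returns approximately how many bytes of data correspond to a single offset in this source. Used for translation between this source's range and methods defined in terms of bytes, such as getEstimatedSizeBytes(org.apache.beam.sdk.options.PipelineOptions) and split(long, org.apache.beam.sdk.options.PipelineOptions).
Defaults to 1 byte, which is the common case for, e.g., file sources.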
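To make the offset-to-byte translation concrete, here is a small standalone sketch of how a bytes-per-offset factor could be applied in both directions. It is an illustration under stated assumptions rather than Beam's code: `OffsetByteMath` and both method names are hypothetical, and the rounding and clamping choices are one plausible reading of the descriptions above.

```java
/** Standalone model of converting between offsets and bytes; not Beam code. */
final class OffsetByteMath {

    /** Estimated size in bytes of the half-open range [startOffset, endOffset). */
    static long estimatedSizeBytes(long startOffset, long endOffset, long bytesPerOffset) {
        // multiplyExact throws ArithmeticException on overflow instead of wrapping.
        return Math.multiplyExact(bytesPerOffset, endOffset - startOffset);
    }

    /**
     * Converts a desired bundle size in bytes to offset units, never going
     * below one offset or below the configured minimum bundle size.
     */
    static long bundleSizeInOffsets(long desiredBundleSizeBytes,
                                    long bytesPerOffset,
                                    long minBundleSize) {
        long inOffsets = desiredBundleSizeBytes / bytesPerOffset;
        return Math.max(Math.max(1L, inOffsets), minBundleSize);
    }
}
```

For example, with 8 bytes per offset a range of 100 offsets is estimated at 800 bytes, and a desired bundle size of 64 bytes corresponds to 8 offsets.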
public abstract long getMaxEndOffset(PipelineOptions options) throws java.lang.Exception
Returns the actual ending offset of the current source. The returned value is used to set an upper bound on the range [startOffset, endOffset) such that the range used is [startOffset, min(endOffset, maxEndOffset)).
As an example in which OffsetBasedSource is used to implement a file source, suppose that this source was constructed with an endOffset of Long.MAX_VALUE to indicate that a file should be read to the end. Then this function should determine the actual, exact size of the file in bytes and return it.
Throws:
java.lang.Exception
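The clamping rule `[startOffset, min(endOffset, maxEndOffset))` and the file example can be sketched without any Beam classes. In the sketch below, `EffectiveRangeSketch`, `effectiveEnd`, and `maxEndOffsetForFile` are hypothetical names; using `java.nio.file.Files.size` as the maximum end offset is an assumption that fits a plain local file with one byte per offset.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Standalone model of clamping an end offset; not Beam code. */
final class EffectiveRangeSketch {

    /** End offset actually used: the smaller of the requested and the real end. */
    static long effectiveEnd(long endOffset, long maxEndOffset) {
        return Math.min(endOffset, maxEndOffset);
    }

    /** For a local file read byte by byte, the real end is the file size in bytes. */
    static long maxEndOffsetForFile(Path file) throws IOException {
        return Files.size(file);
    }
}
```

With a 1024-byte file, an `endOffset` of `Long.MAX_VALUE` resolves to 1024, while an `endOffset` of 500 is left unchanged because it is already within the file.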
public abstract OffsetBasedSource<T> createSourceForSubrange(long start, long end)
Returns an OffsetBasedSource for a subrange of the current source. The subrange [start, end) must be within the range [startOffset, endOffset) of the current source, i.e. startOffset <= start < end <= endOffset.
public void populateDisplayData(DisplayData.Builder builder)
Description copied from class: Source
Register display data for the given transform or component.
populateDisplayData(DisplayData.Builder) is invoked by Pipeline runners to collect display data via DisplayData.from(HasDisplayData). Implementations may call super.populateDisplayData(builder) in order to register display data in the current namespace, but should otherwise use subcomponent.populateDisplayData(builder) to use the namespace of the subcomponent.
By default, does not register any display data. Implementors may override this method to provide their own display data.
Specified by:
populateDisplayData in interface HasDisplayData
Overrides:
populateDisplayData in class Source<T>
Parameters:
builder - The builder to populate with display data.
See Also:
HasDisplayData