Class OffsetBasedSource<T>
- Type Parameters:
T
- Type of records represented by the source.
- All Implemented Interfaces:
Serializable
,HasDisplayData
- Direct Known Subclasses:
FileBasedSource
BoundedSource
that uses offsets to define starting and ending positions.
OffsetBasedSource
is a common base class for all bounded sources where the input can
be represented as a single range, and an input can be efficiently processed in parallel by
splitting the range into a set of disjoint ranges whose union is the original range. This class
should be used for sources that can be cheaply read starting at any given offset. OffsetBasedSource
stores the range and implements splitting into bundles.
Extend OffsetBasedSource
to implement your own offset-based custom source. FileBasedSource
, which is a subclass of this, adds additional functionality useful for custom
sources that are based on files. If possible implementors should start from FileBasedSource
instead of OffsetBasedSource
.
Consult RangeTracker
for important semantics common to all sources defined by a range
of positions of a certain type, including the semantics of split points (OffsetBasedSource.OffsetBasedReader.isAtSplitPoint()
).
- See Also:
-
Nested Class Summary
Nested ClassesModifier and TypeClassDescriptionstatic class
ASource.Reader
that implements code common to readers of allOffsetBasedSource
s.Nested classes/interfaces inherited from class org.apache.beam.sdk.io.BoundedSource
BoundedSource.BoundedReader<T>
Nested classes/interfaces inherited from class org.apache.beam.sdk.io.Source
Source.Reader<T>
-
Constructor Summary
Constructors -
Method Summary
Modifier and TypeMethodDescriptionabstract OffsetBasedSource
<T> createSourceForSubrange
(long start, long end) Returns anOffsetBasedSource
for a subrange of the current source.long
Returns approximately how many bytes of data correspond to a single offset in this source.long
Returns the specified ending offset of the source.long
getEstimatedSizeBytes
(PipelineOptions options) An estimate of the total size (in bytes) of the data that would be read from this source.abstract long
getMaxEndOffset
(PipelineOptions options) Returns the actual ending offset of the current source.long
Returns the minimum bundle size that should be used when splitting the source into sub-sources.long
Returns the starting offset of the source.void
populateDisplayData
(DisplayData.Builder builder) Register display data for the given transform or component.List
<? extends OffsetBasedSource<T>> split
(long desiredBundleSizeBytes, PipelineOptions options) Splits the source into bundles of approximatelydesiredBundleSizeBytes
.toString()
void
validate()
Checks that this source is valid, before it can be used in a pipeline.Methods inherited from class org.apache.beam.sdk.io.BoundedSource
createReader
Methods inherited from class org.apache.beam.sdk.io.Source
getDefaultOutputCoder, getOutputCoder
-
Constructor Details
-
OffsetBasedSource
public OffsetBasedSource(long startOffset, long endOffset, long minBundleSize) - Parameters:
startOffset
- starting offset (inclusive) of the source. Must be non-negative.endOffset
- ending offset (exclusive) of the source. UseLong.MAX_VALUE
to indicate that the entire source afterstartOffset
should be read. Must be> startOffset
.minBundleSize
- minimum bundle size in offset units that should be used when splitting the source into sub-sources. This value may not be respected if the total range of the source is smaller than the specifiedminBundleSize
. Must be non-negative.
-
-
Method Details
-
getStartOffset
public long getStartOffset()Returns the starting offset of the source. -
getEndOffset
public long getEndOffset()Returns the specified ending offset of the source. Any returned value greater than or equal togetMaxEndOffset(PipelineOptions)
should be treated asgetMaxEndOffset(PipelineOptions)
. -
getMinBundleSize
public long getMinBundleSize()Returns the minimum bundle size that should be used when splitting the source into sub-sources. This value may not be respected if the total range of the source is smaller than the specifiedminBundleSize
. -
getEstimatedSizeBytes
Description copied from class:BoundedSource
An estimate of the total size (in bytes) of the data that would be read from this source. This estimate is in terms of external storage size, before any decompression or other processing done by the reader.If there is no way to estimate the size of the source implementations MAY return 0L.
- Specified by:
getEstimatedSizeBytes
in classBoundedSource<T>
- Throws:
Exception
-
split
public List<? extends OffsetBasedSource<T>> split(long desiredBundleSizeBytes, PipelineOptions options) throws Exception Description copied from class:BoundedSource
Splits the source into bundles of approximatelydesiredBundleSizeBytes
.- Specified by:
split
in classBoundedSource<T>
- Throws:
Exception
-
validate
public void validate()Description copied from class:Source
Checks that this source is valid, before it can be used in a pipeline.It is recommended to use
Preconditions
for implementing this method. -
toString
-
getBytesPerOffset
public long getBytesPerOffset()Returns approximately how many bytes of data correspond to a single offset in this source. Used for translation between this source's range and methods defined in terms of bytes, such asgetEstimatedSizeBytes(org.apache.beam.sdk.options.PipelineOptions)
andsplit(long, org.apache.beam.sdk.options.PipelineOptions)
.Defaults to
1
byte, which is the common case for, e.g., file sources. -
getMaxEndOffset
Returns the actual ending offset of the current source. The value returned by this function will be used to clip the end of the range[startOffset, endOffset)
such that the range used is[startOffset, min(endOffset, maxEndOffset))
.As an example in which
OffsetBasedSource
is used to implement a file source, suppose that this source was constructed with anendOffset
ofLong.MAX_VALUE
to indicate that a file should be read to the end. Then this function should determine the actual, exact size of the file in bytes and return it.- Throws:
Exception
-
createSourceForSubrange
Returns anOffsetBasedSource
for a subrange of the current source. The subrange[start, end)
must be within the range[startOffset, endOffset)
of the current source, i.e.startOffset <= start < end <= endOffset
. -
populateDisplayData
Description copied from class:Source
Register display data for the given transform or component.populateDisplayData(DisplayData.Builder)
is invoked by Pipeline runners to collect display data viaDisplayData.from(HasDisplayData)
. Implementations may callsuper.populateDisplayData(builder)
in order to register display data in the current namespace, but should otherwise usesubcomponent.populateDisplayData(builder)
to use the namespace of the subcomponent.By default, does not register any display data. Implementors may override this method to provide their own display data.
- Specified by:
populateDisplayData
in interfaceHasDisplayData
- Overrides:
populateDisplayData
in classSource<T>
- Parameters:
builder
- The builder to populate with display data.- See Also:
-