T
- The type of records to be read from the source.@Experimental(value=SOURCE_SINK) public abstract class BlockBasedSource<T> extends FileBasedSource<T>
BlockBasedSource
is a FileBasedSource
where a file consists of blocks of
records.
BlockBasedSource
should be derived from when a file format does not support efficient
seeking to a record in the file, but can support efficient seeking to a block. Alternatively,
records in the file cannot be offset-addressed, but blocks can (it is not possible to say
that record {code i} starts at offset m
, but it is possible to say that block j
starts at offset n
).
The records that will be read from a BlockBasedSource
that corresponds to a subrange
of a file [startOffset, endOffset)
are those records such that the record is contained in
a block that starts at offset i
, where i >= startOffset
and
i < endOffset
. In other words, a record will be read from the source if its first byte is
contained in a block that begins within the range described by the source.
This entails that it is possible to determine the start offsets of all blocks in a file.
Progress reporting for reading from a BlockBasedSource
is inaccurate. A BlockBasedSource.BlockBasedReader
reports its current offset as (offset of current block) + (current block
size) * (fraction of block consumed)
. However, only the offset of the current block is required
to be accurately reported by subclass implementations. As such, in the worst case, the current
offset is only updated at block boundaries.
BlockBasedSource
supports dynamic splitting. However, because records in a BlockBasedSource
are not required to have offsets and progress reporting is inaccurate, BlockBasedReader
only supports splitting at block boundaries.
In other words, BlockBasedSource.BlockBasedReader.atSplitPoint
returns true iff the current record is the
first record in a block. See FileBasedSource.FileBasedReader
for discussion about split
points.
Modifier and Type | Class and Description |
---|---|
protected static class |
BlockBasedSource.Block<T>
A
Block represents a block of records that can be read. |
protected static class |
BlockBasedSource.BlockBasedReader<T>
A
Reader that reads records from a BlockBasedSource . |
FileBasedSource.FileBasedReader<T>, FileBasedSource.Mode
OffsetBasedSource.OffsetBasedReader<T>
BoundedSource.BoundedReader<T>
Source.Reader<T>
Constructor and Description |
---|
BlockBasedSource(MatchResult.Metadata metadata,
long minBundleSize,
long startOffset,
long endOffset)
Creates a
BlockBasedSource for a single file. |
BlockBasedSource(java.lang.String fileOrPatternSpec,
long minBundleSize)
Creates a
BlockBasedSource based on a file name or pattern. |
Modifier and Type | Method and Description |
---|---|
protected abstract BlockBasedSource<T> |
createForSubrangeOfFile(MatchResult.Metadata metadata,
long start,
long end)
Creates a
BlockBasedSource for the specified range in a single file. |
protected abstract BlockBasedSource.BlockBasedReader<T> |
createSingleFileReader(PipelineOptions options)
Creates a
BlockBasedReader . |
createReader, createSourceForSubrange, getEstimatedSizeBytes, getFileOrPatternSpec, getFileOrPatternSpecProvider, getMaxEndOffset, getMode, getSingleFileMetadata, isSplittable, populateDisplayData, split, toString, validate
getBytesPerOffset, getEndOffset, getMinBundleSize, getStartOffset
getDefaultOutputCoder
public BlockBasedSource(java.lang.String fileOrPatternSpec, long minBundleSize)
BlockBasedSource
based on a file name or pattern. Subclasses must call this
constructor when creating a BlockBasedSource
for a file pattern. See
FileBasedSource
for more information.public BlockBasedSource(MatchResult.Metadata metadata, long minBundleSize, long startOffset, long endOffset)
BlockBasedSource
for a single file. Subclasses must call this constructor
when implementing createForSubrangeOfFile(org.apache.beam.sdk.io.fs.MatchResult.Metadata, long, long)
. See documentation in
FileBasedSource
.protected abstract BlockBasedSource<T> createForSubrangeOfFile(MatchResult.Metadata metadata, long start, long end)
BlockBasedSource
for the specified range in a single file.createForSubrangeOfFile
in class FileBasedSource<T>
metadata
- file backing the new FileBasedSource
.start
- starting byte offset of the new FileBasedSource
.end
- ending byte offset of the new FileBasedSource
. May be Long.MAX_VALUE,
in which case it will be inferred using FileBasedSource.getMaxEndOffset(org.apache.beam.sdk.options.PipelineOptions)
.protected abstract BlockBasedSource.BlockBasedReader<T> createSingleFileReader(PipelineOptions options)
BlockBasedReader
.createSingleFileReader
in class FileBasedSource<T>