T - The type of records to be read from the source.@Experimental(value=SOURCE_SINK) public abstract class BlockBasedSource<T> extends FileBasedSource<T>
BlockBasedSource is a FileBasedSource where a file consists of blocks of
records.
BlockBasedSource should be derived from when a file format does not support efficient
seeking to a record in the file, but can support efficient seeking to a block. Alternatively,
records in the file cannot be offset-addressed, but blocks can (it is not possible to say that
record {code i} starts at offset m, but it is possible to say that block j starts
at offset n).
The records that will be read from a BlockBasedSource that corresponds to a subrange
of a file [startOffset, endOffset) are those records such that the record is contained in
a block that starts at offset i, where i >= startOffset and i <
endOffset. In other words, a record will be read from the source if its first byte is contained
in a block that begins within the range described by the source.
This entails that it is possible to determine the start offsets of all blocks in a file.
Progress reporting for reading from a BlockBasedSource is inaccurate. A BlockBasedSource.BlockBasedReader reports its current offset as (offset of current block) + (current block
size) * (fraction of block consumed). However, only the offset of the current block is required
to be accurately reported by subclass implementations. As such, in the worst case, the current
offset is only updated at block boundaries.
BlockBasedSource supports dynamic splitting. However, because records in a BlockBasedSource are not required to have offsets and progress reporting is inaccurate, BlockBasedReader only supports splitting at block boundaries. In other words, BlockBasedSource.BlockBasedReader.atSplitPoint returns true iff the current record is the first record in a
block. See FileBasedSource.FileBasedReader for discussion about split points.
| Modifier and Type | Class and Description |
|---|---|
protected static class |
BlockBasedSource.Block<T>
A
Block represents a block of records that can be read. |
protected static class |
BlockBasedSource.BlockBasedReader<T>
A
Reader that reads records from a BlockBasedSource. |
FileBasedSource.FileBasedReader<T>, FileBasedSource.ModeOffsetBasedSource.OffsetBasedReader<T>BoundedSource.BoundedReader<T>Source.Reader<T>| Constructor and Description |
|---|
BlockBasedSource(MatchResult.Metadata metadata,
long minBundleSize,
long startOffset,
long endOffset)
Creates a
BlockBasedSource for a single file. |
BlockBasedSource(java.lang.String fileOrPatternSpec,
EmptyMatchTreatment emptyMatchTreatment,
long minBundleSize)
Creates a
BlockBasedSource based on a file name or pattern. |
BlockBasedSource(java.lang.String fileOrPatternSpec,
long minBundleSize)
Like
BlockBasedSource(String, EmptyMatchTreatment, long) but with a default EmptyMatchTreatment of EmptyMatchTreatment.DISALLOW. |
BlockBasedSource(ValueProvider<java.lang.String> fileOrPatternSpec,
EmptyMatchTreatment emptyMatchTreatment,
long minBundleSize)
|
BlockBasedSource(ValueProvider<java.lang.String> fileOrPatternSpec,
long minBundleSize)
|
| Modifier and Type | Method and Description |
|---|---|
protected abstract BlockBasedSource<T> |
createForSubrangeOfFile(MatchResult.Metadata metadata,
long start,
long end)
Creates a
BlockBasedSource for the specified range in a single file. |
protected abstract BlockBasedSource.BlockBasedReader<T> |
createSingleFileReader(PipelineOptions options)
Creates a
BlockBasedReader. |
createReader, createSourceForSubrange, getEmptyMatchTreatment, getEstimatedSizeBytes, getFileOrPatternSpec, getFileOrPatternSpecProvider, getMaxEndOffset, getMode, getSingleFileMetadata, isSplittable, populateDisplayData, split, toString, validategetBytesPerOffset, getEndOffset, getMinBundleSize, getStartOffsetgetDefaultOutputCoder, getOutputCoderpublic BlockBasedSource(java.lang.String fileOrPatternSpec,
EmptyMatchTreatment emptyMatchTreatment,
long minBundleSize)
BlockBasedSource based on a file name or pattern. Subclasses must call this
constructor when creating a BlockBasedSource for a file pattern. See FileBasedSource for more information.public BlockBasedSource(java.lang.String fileOrPatternSpec,
long minBundleSize)
BlockBasedSource(String, EmptyMatchTreatment, long) but with a default EmptyMatchTreatment of EmptyMatchTreatment.DISALLOW.public BlockBasedSource(ValueProvider<java.lang.String> fileOrPatternSpec, long minBundleSize)
public BlockBasedSource(ValueProvider<java.lang.String> fileOrPatternSpec, EmptyMatchTreatment emptyMatchTreatment, long minBundleSize)
public BlockBasedSource(MatchResult.Metadata metadata, long minBundleSize, long startOffset, long endOffset)
BlockBasedSource for a single file. Subclasses must call this constructor
when implementing createForSubrangeOfFile(org.apache.beam.sdk.io.fs.MatchResult.Metadata, long, long). See documentation in FileBasedSource.protected abstract BlockBasedSource<T> createForSubrangeOfFile(MatchResult.Metadata metadata, long start, long end)
BlockBasedSource for the specified range in a single file.createForSubrangeOfFile in class FileBasedSource<T>metadata - file backing the new FileBasedSource.start - starting byte offset of the new FileBasedSource.end - ending byte offset of the new FileBasedSource. May be Long.MAX_VALUE, in
which case it will be inferred using FileBasedSource.getMaxEndOffset(org.apache.beam.sdk.options.PipelineOptions).protected abstract BlockBasedSource.BlockBasedReader<T> createSingleFileReader(PipelineOptions options)
BlockBasedReader.createSingleFileReader in class FileBasedSource<T>