Class BlockBasedSource<T>
- Type Parameters:
T
- The type of records to be read from the source.
- All Implemented Interfaces:
Serializable
,HasDisplayData
- Direct Known Subclasses:
AvroSource
BlockBasedSource
is a FileBasedSource
where a file consists of blocks of
records.
BlockBasedSource
should be derived from when a file format does not support efficient
seeking to a record in the file, but can support efficient seeking to a block. Alternatively,
records in the file cannot be offset-addressed, but blocks can (it is not possible to say that
record {code i} starts at offset m
, but it is possible to say that block j
starts
at offset n
).
The records that will be read from a BlockBasedSource
that corresponds to a subrange
of a file [startOffset, endOffset)
are those records such that the record is contained in
a block that starts at offset i
, where i >= startOffset
and i <
endOffset
. In other words, a record will be read from the source if its first byte is contained
in a block that begins within the range described by the source.
This entails that it is possible to determine the start offsets of all blocks in a file.
Progress reporting for reading from a BlockBasedSource
is inaccurate. A BlockBasedSource.BlockBasedReader
reports its current offset as (offset of current block) + (current block
size) * (fraction of block consumed)
. However, only the offset of the current block is required
to be accurately reported by subclass implementations. As such, in the worst case, the current
offset is only updated at block boundaries.
BlockBasedSource
supports dynamic splitting. However, because records in a
BlockBasedSource
are not required to have offsets and progress reporting is inaccurate,
BlockBasedReader
only supports splitting at block boundaries. In other words, BlockBasedSource.BlockBasedReader.atSplitPoint
returns true iff the current record is the first record in a
block. See FileBasedSource.FileBasedReader
for discussion about split points.
- See Also:
-
Nested Class Summary
Nested ClassesModifier and TypeClassDescriptionprotected static class
ABlock
represents a block of records that can be read.static class
AReader
that reads records from aBlockBasedSource
.Nested classes/interfaces inherited from class org.apache.beam.sdk.io.FileBasedSource
FileBasedSource.FileBasedReader<T>, FileBasedSource.Mode
Nested classes/interfaces inherited from class org.apache.beam.sdk.io.OffsetBasedSource
OffsetBasedSource.OffsetBasedReader<T>
Nested classes/interfaces inherited from class org.apache.beam.sdk.io.BoundedSource
BoundedSource.BoundedReader<T>
Nested classes/interfaces inherited from class org.apache.beam.sdk.io.Source
Source.Reader<T>
-
Constructor Summary
ConstructorsConstructorDescriptionBlockBasedSource
(String fileOrPatternSpec, long minBundleSize) LikeBlockBasedSource(String, EmptyMatchTreatment, long)
but with a defaultEmptyMatchTreatment
ofEmptyMatchTreatment.DISALLOW
.BlockBasedSource
(String fileOrPatternSpec, EmptyMatchTreatment emptyMatchTreatment, long minBundleSize) Creates aBlockBasedSource
based on a file name or pattern.BlockBasedSource
(MatchResult.Metadata metadata, long minBundleSize, long startOffset, long endOffset) Creates aBlockBasedSource
for a single file.BlockBasedSource
(ValueProvider<String> fileOrPatternSpec, long minBundleSize) BlockBasedSource
(ValueProvider<String> fileOrPatternSpec, EmptyMatchTreatment emptyMatchTreatment, long minBundleSize) -
Method Summary
Modifier and TypeMethodDescriptionprotected abstract BlockBasedSource
<T> createForSubrangeOfFile
(MatchResult.Metadata metadata, long start, long end) Creates aBlockBasedSource
for the specified range in a single file.protected abstract BlockBasedSource.BlockBasedReader
<T> createSingleFileReader
(PipelineOptions options) Creates aBlockBasedReader
.Methods inherited from class org.apache.beam.sdk.io.FileBasedSource
createReader, createSourceForSubrange, getEmptyMatchTreatment, getEstimatedSizeBytes, getFileOrPatternSpec, getFileOrPatternSpecProvider, getMaxEndOffset, getMode, getSingleFileMetadata, isSplittable, populateDisplayData, split, toString, validate
Methods inherited from class org.apache.beam.sdk.io.OffsetBasedSource
getBytesPerOffset, getEndOffset, getMinBundleSize, getStartOffset
Methods inherited from class org.apache.beam.sdk.io.Source
getDefaultOutputCoder, getOutputCoder
-
Constructor Details
-
BlockBasedSource
public BlockBasedSource(String fileOrPatternSpec, EmptyMatchTreatment emptyMatchTreatment, long minBundleSize) Creates aBlockBasedSource
based on a file name or pattern. Subclasses must call this constructor when creating aBlockBasedSource
for a file pattern. SeeFileBasedSource
for more information. -
BlockBasedSource
LikeBlockBasedSource(String, EmptyMatchTreatment, long)
but with a defaultEmptyMatchTreatment
ofEmptyMatchTreatment.DISALLOW
. -
BlockBasedSource
-
BlockBasedSource
public BlockBasedSource(ValueProvider<String> fileOrPatternSpec, EmptyMatchTreatment emptyMatchTreatment, long minBundleSize) -
BlockBasedSource
public BlockBasedSource(MatchResult.Metadata metadata, long minBundleSize, long startOffset, long endOffset) Creates aBlockBasedSource
for a single file. Subclasses must call this constructor when implementingcreateForSubrangeOfFile(org.apache.beam.sdk.io.fs.MatchResult.Metadata, long, long)
. See documentation inFileBasedSource
.
-
-
Method Details
-
createForSubrangeOfFile
protected abstract BlockBasedSource<T> createForSubrangeOfFile(MatchResult.Metadata metadata, long start, long end) Creates aBlockBasedSource
for the specified range in a single file.- Specified by:
createForSubrangeOfFile
in classFileBasedSource<T>
- Parameters:
metadata
- file backing the newFileBasedSource
.start
- starting byte offset of the newFileBasedSource
.end
- ending byte offset of the newFileBasedSource
. May be Long.MAX_VALUE, in which case it will be inferred usingFileBasedSource.getMaxEndOffset(org.apache.beam.sdk.options.PipelineOptions)
.
-
createSingleFileReader
protected abstract BlockBasedSource.BlockBasedReader<T> createSingleFileReader(PipelineOptions options) Creates aBlockBasedReader
.- Specified by:
createSingleFileReader
in classFileBasedSource<T>
-