T
- The type of records contained in the block.@Experimental(value=SOURCE_SINK) public static class AvroSource.AvroReader<T> extends BlockBasedSource.BlockBasedReader<T>
BlockBasedSource.BlockBasedReader
for reading blocks from Avro files.
An Avro Object Container File consists of a header followed by a 16-bit sync marker and then a sequence of blocks, where each block begins with two encoded longs representing the total number of records in the block and the block's size in bytes, followed by the block's (optionally-encoded) records. Each block is terminated by a 16-bit sync marker.
SPLIT_POINTS_UNKNOWN
Constructor and Description |
---|
AvroReader(AvroSource<T> source)
Reads Avro records of type
T from the specified source. |
Modifier and Type | Method and Description |
---|---|
org.apache.beam.sdk.io.AvroSource.AvroBlock<T> |
getCurrentBlock()
Returns the current block (the block that was read by the last successful call to
BlockBasedSource.BlockBasedReader.readNextBlock() ). |
long |
getCurrentBlockOffset()
Returns the largest offset such that starting to read from that offset includes the current
block.
|
long |
getCurrentBlockSize()
Returns the size of the current block in bytes as it is represented in the underlying file,
if possible.
|
AvroSource<T> |
getCurrentSource()
Returns a
Source describing the same input that this Reader currently reads
(including items already read). |
long |
getSplitPointsRemaining()
Returns the total amount of parallelism in the unprocessed part of this reader's current
BoundedSource (as would be returned by BoundedSource.BoundedReader.getCurrentSource() ). |
boolean |
readNextBlock()
Read the next block from the input.
|
protected void |
startReading(java.nio.channels.ReadableByteChannel channel)
Performs any initialization of the subclass of
FileBasedReader that involves IO
operations. |
getCurrent, getCurrentOffset, getFractionConsumed, isAtSplitPoint, readNextRecord
advanceImpl, allowsDynamicSplitting, close, startImpl
advance, getSplitPointsConsumed, isDone, isStarted, splitAtFraction, start
getCurrentTimestamp
public AvroReader(AvroSource<T> source)
T
from the specified source.public AvroSource<T> getCurrentSource()
BoundedSource.BoundedReader
Source
describing the same input that this Reader
currently reads
(including items already read).
Reader subclasses can use this method for convenience to access unchanging properties of the source being read. Alternatively, they can cache these properties in the constructor.
The framework will call this method in the course of dynamic work rebalancing, e.g. after
a successful BoundedSource.BoundedReader.splitAtFraction(double)
call.
Remember that Source
objects must always be immutable. However, the return value
of this function may be affected by dynamic work rebalancing, happening asynchronously via
BoundedSource.BoundedReader.splitAtFraction(double)
, meaning it can return a different Source
object. However, the returned object itself will still itself be immutable. Callers
must take care not to rely on properties of the returned source that may be asynchronously
changed as a result of this process (e.g. do not cache an end offset when reading a file).
For convenience, subclasses should usually return the most concrete subclass of Source
possible. In practice, the implementation of this method should nearly always be one
of the following:
BoundedSource.BoundedReader.getCurrentSource()
: delegate to base class. In this case, it is almost always an error
for the subclass to maintain its own copy of the source.
public FooReader(FooSource<T> source) {
super(source);
}
public FooSource<T> getCurrentSource() {
return (FooSource<T>)super.getCurrentSource();
}
private final FooSource<T> source;
public FooReader(FooSource<T> source) {
this.source = source;
}
public FooSource<T> getCurrentSource() {
return source;
}
BoundedSource.BoundedReader
that explicitly supports dynamic work rebalancing:
maintain a variable pointing to an immutable source object, and protect it with
synchronization.
private FooSource<T> source;
public FooReader(FooSource<T> source) {
this.source = source;
}
public synchronized FooSource<T> getCurrentSource() {
return source;
}
public synchronized FooSource<T> splitAtFraction(double fraction) {
...
FooSource<T> primary = ...;
FooSource<T> residual = ...;
this.source = primary;
return residual;
}
getCurrentSource
in class FileBasedSource.FileBasedReader<T>
public boolean readNextBlock() throws java.io.IOException
BlockBasedSource.BlockBasedReader
readNextBlock
in class BlockBasedSource.BlockBasedReader<T>
java.io.IOException
public org.apache.beam.sdk.io.AvroSource.AvroBlock<T> getCurrentBlock()
BlockBasedSource.BlockBasedReader
BlockBasedSource.BlockBasedReader.readNextBlock()
). May return null initially, or if no block has been
successfully read.getCurrentBlock
in class BlockBasedSource.BlockBasedReader<T>
public long getCurrentBlockOffset()
BlockBasedSource.BlockBasedReader
getCurrentBlockOffset
in class BlockBasedSource.BlockBasedReader<T>
public long getCurrentBlockSize()
BlockBasedSource.BlockBasedReader
0
if the size of the current block is unknown.
The size returned by this method must be such that for two successive blocks A and B,
offset(A) + size(A) <= offset(B)
. If this is not satisfied, the progress reported by
the BlockBasedReader
will be non-monotonic and will interfere with the quality (but
not correctness) of dynamic work rebalancing.
This method and BlockBasedSource.Block.getFractionOfBlockConsumed()
are used to provide an estimate
of progress within a block (getCurrentBlock().getFractionOfBlockConsumed() *
getCurrentBlockSize()
). It is acceptable for the result of this computation to be 0
,
but progress estimation will be inaccurate.
getCurrentBlockSize
in class BlockBasedSource.BlockBasedReader<T>
public long getSplitPointsRemaining()
BoundedSource.BoundedReader
BoundedSource
(as would be returned by BoundedSource.BoundedReader.getCurrentSource()
). This corresponds
to all unprocessed split point records (see RangeTracker
), including the last split
point returned, in the remainder part of the source.
This function should be implemented only in addition to BoundedSource.BoundedReader.getSplitPointsConsumed()
and only if an exact value can be returned.
Consider the following examples: (1) An input that can be read in parallel down to the
individual records, such as CountingSource.upTo(long)
, is called "perfectly splittable".
(2) a "block-compressed" file format such as AvroIO
, in which a block of records has
to be read as a whole, but different blocks can be read in parallel. (3) An "unsplittable"
input such as a cursor in a database.
Assume for examples (1) and (2) that the number of records or blocks remaining is known:
reader
for which the last call to Source.Reader.start()
or Source.Reader.advance()
has returned true should should not return 0, because this reader itself
represents parallelism at least 1. This condition holds independent of whether the
input is splittable.
Source.Reader.start()
or Source.Reader.advance()
) has returned false
should return a value of 0. This condition holds independent of whether the input is
splittable.
Defaults to BoundedSource.BoundedReader.SPLIT_POINTS_UNKNOWN
. Any value less than 0 will be interpreted as
unknown.
BoundedSource.BoundedReader
for information about thread safety.getSplitPointsRemaining
in class OffsetBasedSource.OffsetBasedReader<T>
BoundedSource.BoundedReader.getSplitPointsConsumed()
protected void startReading(java.nio.channels.ReadableByteChannel channel) throws java.io.IOException
FileBasedSource.FileBasedReader
FileBasedReader
that involves IO
operations. Will only be invoked once and before that invocation the base class will seek the
channel to the source's starting offset.
Provided ReadableByteChannel
is for the file represented by the source of this
reader. Subclass may use the channel
to build a higher level IO abstraction, e.g., a
BufferedReader or an XML parser.
If the corresponding source is for a subrange of a file, channel
is guaranteed to
be an instance of the type SeekableByteChannel
.
After this method is invoked the base class will not be reading data from the channel or adjusting the position of the channel. But the base class is responsible for properly closing the channel.
startReading
in class FileBasedSource.FileBasedReader<T>
channel
- a byte channel representing the file backing the reader.java.io.IOException