T - The type of records to be read from the source.
@Experimental(value=SOURCE_SINK) public class AvroSource<T> extends BlockBasedSource<T>
AvroIO.Read
.
A FileBasedSource for reading Avro files.
To read a PCollection of objects from one or more Avro files, use from(java.lang.String) to specify the path(s) of the files to read. The AvroSource that is returned will read objects of type GenericRecord with the schema(s) that were written at file creation. To further configure the AvroSource to read with a user-defined schema, or to return records of a type other than GenericRecord, use withSchema(Schema) (using an Avro Schema), withSchema(String) (using a JSON schema), or withSchema(Class) (to return objects of the Avro-generated class specified).
An AvroSource can be read from using the Read transform. For example:
// Assumes 'file' is a java.io.File and 'pipeline' is an existing Pipeline.
AvroSource<MyType> source = AvroSource.from(file.toPath().toString()).withSchema(MyType.class);
PCollection<MyType> records = pipeline.apply(Read.from(source));
This class's implementation is based on the Avro 1.7.7 specification and implements parsing of some parts of Avro Object Container Files. The rationale for doing so is that the Avro API does not provide efficient ways of computing the precise offsets of blocks within a file, which is necessary to support dynamic work rebalancing. However, whenever it is possible to use the Avro API in a way that supports maintaining precise offsets, this class uses the Avro API.
Avro Object Container Files store records in blocks. Each block contains a collection of records. Blocks may be compressed with a codec such as bzip2, deflate, or snappy. Blocks are delineated from one another by a 16-byte sync marker.
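To make the role of the sync marker concrete, the sketch below scans an in-memory buffer for a known 16-byte marker and reports the offset just past each occurrence. In the Avro container format the file header and every block end with the marker, so each reported offset is where the next block would begin (or the end of the data). SyncMarkerScanner and blockStartsAfterMarkers are hypothetical names used only for illustration; they are not part of Beam or Avro, and a real reader would work on a stream rather than on a byte array.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not a Beam or Avro class): finds 16-byte sync markers in a buffer. */
final class SyncMarkerScanner {
    /** Length of an Avro sync marker in bytes. */
    static final int SYNC_SIZE = 16;

    private SyncMarkerScanner() {}

    /**
     * Returns, for every occurrence of syncMarker in data, the offset of the
     * first byte after that occurrence.
     */
    static List<Long> blockStartsAfterMarkers(byte[] data, byte[] syncMarker) {
        if (syncMarker.length != SYNC_SIZE) {
            throw new IllegalArgumentException("sync marker must be 16 bytes");
        }
        List<Long> starts = new ArrayList<>();
        int i = 0;
        while (i + SYNC_SIZE <= data.length) {
            byte[] window = Arrays.copyOfRange(data, i, i + SYNC_SIZE);
            if (Arrays.equals(window, syncMarker)) {
                // The byte after the marker is where the next block would start.
                starts.add((long) (i + SYNC_SIZE));
                i += SYNC_SIZE;
            } else {
                i++;
            }
        }
        return starts;
    }
}
```

For a 40-byte buffer made of 3 payload bytes, the marker, 5 payload bytes, and the marker again, the method returns the offsets 19 and 40.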
An AvroSource for a subrange of a single file contains records in the blocks such that the start offset of the block is greater than or equal to the start offset of the source and less than the end offset of the source.
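The ownership rule above can be sketched as a simple filter over block start offsets. BlockAssignment and blocksOwnedBy are hypothetical names chosen for this illustration and do not exist in Beam; the sketch assumes the block start offsets are already known.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration (not a Beam class) of which blocks a subrange source reads. */
final class BlockAssignment {
    private BlockAssignment() {}

    /**
     * Returns the start offsets of the blocks owned by a source covering
     * [sourceStart, sourceEnd): those with sourceStart <= blockStart < sourceEnd.
     */
    static List<Long> blocksOwnedBy(long[] blockStarts, long sourceStart, long sourceEnd) {
        List<Long> owned = new ArrayList<>();
        for (long blockStart : blockStarts) {
            if (blockStart >= sourceStart && blockStart < sourceEnd) {
                owned.add(blockStart);
            }
        }
        return owned;
    }
}
```

With blocks starting at offsets 100, 250, 400, and 700, a source for the range [0, 300) owns the blocks at 100 and 250, and a source for [300, 800) owns the blocks at 400 and 700. Only the start offset matters: the block starting at 250 belongs to the first source even if its bytes extend past offset 300, so adjacent subranges read every block exactly once.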
To use XZ-encoded Avro files, please include an explicit dependency on xz-1.5.jar, which has been marked as optional in the Maven sdk/pom.xml.
<dependency>
<groupId>org.tukaani</groupId>
<artifactId>xz</artifactId>
<version>1.5</version>
</dependency>
Permission requirements depend on the PipelineRunner that is used to execute the pipeline. Please refer to the documentation of the corresponding PipelineRunners for more details.
Modifier and Type | Class and Description |
---|---|
static class | AvroSource.AvroReader<T> A BlockBasedSource.BlockBasedReader for reading blocks from Avro files. |
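Nested classes/interfaces inherited from class BlockBasedSource: BlockBasedSource.Block<T>, BlockBasedSource.BlockBasedReader<T>
Nested classes/interfaces inherited from class FileBasedSource: FileBasedSource.FileBasedReader<T>, FileBasedSource.Mode
Nested classes/interfaces inherited from class OffsetBasedSource: OffsetBasedSource.OffsetBasedReader<T>
Nested classes/interfaces inherited from class BoundedSource: BoundedSource.BoundedReader<T>
Nested classes/interfaces inherited from class Source: Source.Reader<T>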
Modifier and Type | Method and Description |
---|---|
BlockBasedSource<T> | createForSubrangeOfFile(MatchResult.Metadata fileMetadata, long start, long end) Creates a BlockBasedSource for the specified range in a single file. |
BlockBasedSource<T> | createForSubrangeOfFile(java.lang.String fileName, long start, long end) Deprecated. |
protected BlockBasedSource.BlockBasedReader<T> | createSingleFileReader(PipelineOptions options) Creates a BlockBasedReader. |
static AvroSource<GenericRecord> | from(java.lang.String fileNameOrPattern) Creates an AvroSource that reads from the given file name or pattern ("glob"). |
AvroCoder<T> | getDefaultOutputCoder() Returns the default Coder to use for the data read from this source. |
java.lang.String | getSchema() |
void | validate() Checks that this source is valid, before it can be used in a pipeline. |
AvroSource<T> | withMinBundleSize(long minBundleSize) Returns an AvroSource that's like this one but uses the supplied minimum bundle size. |
<X> AvroSource<X> | withSchema(java.lang.Class<X> clazz) Returns an AvroSource that's like this one but reads files containing records of the type of the given class. |
AvroSource<GenericRecord> | withSchema(Schema schema) Returns an AvroSource that's like this one but reads files containing records that conform to the given schema. |
AvroSource<GenericRecord> | withSchema(java.lang.String schema) Returns an AvroSource that's like this one but reads files containing records that conform to the given schema. |
Methods inherited from class FileBasedSource: createReader, createSourceForSubrange, getEstimatedSizeBytes, getFileOrPatternSpec, getFileOrPatternSpecProvider, getMaxEndOffset, getMode, getSingleFileMetadata, isSplittable, populateDisplayData, split, toString
Methods inherited from class OffsetBasedSource: getBytesPerOffset, getEndOffset, getMinBundleSize, getStartOffset
public static AvroSource<GenericRecord> from(java.lang.String fileNameOrPattern)
Creates an AvroSource that reads from the given file name or pattern ("glob"). The returned source can be further configured by calling one of the withSchema methods, for example withSchema(java.lang.Class) to return a type other than GenericRecord.
public AvroSource<GenericRecord> withSchema(java.lang.String schema)
Returns an AvroSource that's like this one but reads files containing records that conform to the given schema.
Does not modify this object.
public AvroSource<GenericRecord> withSchema(Schema schema)
Returns an AvroSource that's like this one but reads files containing records that conform to the given schema.
Does not modify this object.
public <X> AvroSource<X> withSchema(java.lang.Class<X> clazz)
Returns an AvroSource that's like this one but reads files containing records of the type of the given class.
Does not modify this object.
public AvroSource<T> withMinBundleSize(long minBundleSize)
Returns an AvroSource that's like this one but uses the supplied minimum bundle size. Refer to OffsetBasedSource for a description of minBundleSize and its use.
Does not modify this object.
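The "does not modify this object" contract shared by the withSchema and withMinBundleSize methods is the usual immutable "with" pattern: each call returns a fresh copy carrying the changed setting. The sketch below shows that pattern on a hypothetical SourceConfig class; it is an illustration of the contract only and is not how AvroSource is actually implemented.

```java
/** Hypothetical immutable configuration object (not a Beam class). */
final class SourceConfig {
    private final String filePattern;
    private final long minBundleSize;

    SourceConfig(String filePattern, long minBundleSize) {
        this.filePattern = filePattern;
        this.minBundleSize = minBundleSize;
    }

    /** Returns a copy with the new minimum bundle size; this object is left unchanged. */
    SourceConfig withMinBundleSize(long newMinBundleSize) {
        return new SourceConfig(filePattern, newMinBundleSize);
    }

    String getFilePattern() {
        return filePattern;
    }

    long getMinBundleSize() {
        return minBundleSize;
    }
}
```

Because every call yields a new object, a partially configured source can be shared and specialized in several places without one caller's settings leaking into another's. It also means the return value must be kept: calling such a method and discarding the result has no effect.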
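public void validate()
Description copied from class: Source
Checks that this source is valid, before it can be used in a pipeline.
It is recommended to use Preconditions for implementing this method.
validate in class FileBasedSource<T>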
@Deprecated public BlockBasedSource<T> createForSubrangeOfFile(java.lang.String fileName, long start, long end) throws java.io.IOException
Deprecated.
Throws: java.io.IOException
public BlockBasedSource<T> createForSubrangeOfFile(MatchResult.Metadata fileMetadata, long start, long end)
Description copied from class: BlockBasedSource
Creates a BlockBasedSource for the specified range in a single file.
createForSubrangeOfFile in class BlockBasedSource<T>
Parameters:
fileMetadata - file backing the new FileBasedSource.
start - starting byte offset of the new FileBasedSource.
end - ending byte offset of the new FileBasedSource. May be Long.MAX_VALUE, in which case it will be inferred using FileBasedSource.getMaxEndOffset(org.apache.beam.sdk.options.PipelineOptions).
protected BlockBasedSource.BlockBasedReader<T> createSingleFileReader(PipelineOptions options)
Description copied from class: BlockBasedSource
Creates a BlockBasedReader.
createSingleFileReader in class BlockBasedSource<T>
public AvroCoder<T> getDefaultOutputCoder()
Description copied from class: Source
Returns the default Coder to use for the data read from this source.
getDefaultOutputCoder in class Source<T>
public java.lang.String getSchema()