T
- The type of records to be read from the source.@Experimental(value=SOURCE_SINK) public class AvroSource<T> extends BlockBasedSource<T>
AvroIO.Read
.
A FileBasedSource
for reading Avro files.
To read a PCollection
of objects from one or more Avro files, use
from(org.apache.beam.sdk.options.ValueProvider<java.lang.String>)
to specify the path(s) of the files to read. The AvroSource
that
is returned will read objects of type GenericRecord
with the schema(s) that were written
at file creation. To further configure the AvroSource
to read with a user-defined schema,
or to return records of a type other than GenericRecord
, use
withSchema(Schema)
(using an Avro Schema
),
withSchema(String)
(using a JSON schema), or
withSchema(Class)
(to return objects of the Avro-generated class specified).
An AvroSource
can be read from using the Read
transform. For example:
AvroSource<MyType> source = AvroSource.from(file.toPath()).withSchema(MyType.class);
PCollection<MyType> records = Read.from(mySource);
This class's implementation is based on the Avro 1.7.7 specification and implements parsing of some parts of Avro Object Container Files. The rationale for doing so is that the Avro API does not provide efficient ways of computing the precise offsets of blocks within a file, which is necessary to support dynamic work rebalancing. However, whenever it is possible to use the Avro API in a way that supports maintaining precise offsets, this class uses the Avro API.
Avro Object Container files store records in blocks. Each block contains a collection of records. Blocks may be encoded (e.g., with bzip2, deflate, snappy, etc.). Blocks are delineated from one another by a 16-byte sync marker.
An AvroSource
for a subrange of a single file contains records in the blocks such that
the start offset of the block is greater than or equal to the start offset of the source and less
than the end offset of the source.
To use XZ-encoded Avro files, please include an explicit dependency on xz-1.5.jar
,
which has been marked as optional in the Maven sdk/pom.xml
.
<dependency>
<groupId>org.tukaani</groupId>
<artifactId>xz</artifactId>
<version>1.5</version>
</dependency>
Permission requirements depend on the PipelineRunner
that is used to execute the
pipeline. Please refer to the documentation of corresponding PipelineRunner
s for
more details.
Modifier and Type | Class and Description |
---|---|
static class |
AvroSource.AvroReader<T>
A
BlockBasedSource.BlockBasedReader for reading blocks from Avro files. |
BlockBasedSource.Block<T>, BlockBasedSource.BlockBasedReader<T>
FileBasedSource.FileBasedReader<T>
OffsetBasedSource.OffsetBasedReader<T>
BoundedSource.BoundedReader<T>
Source.Reader<T>
Modifier and Type | Method and Description |
---|---|
BlockBasedSource<T> |
createForSubrangeOfFile(MatchResult.Metadata fileMetadata,
long start,
long end)
Creates a
BlockBasedSource for the specified range in a single file. |
BlockBasedSource<T> |
createForSubrangeOfFile(java.lang.String fileName,
long start,
long end)
Deprecated.
Used by Dataflow worker
|
protected BlockBasedSource.BlockBasedReader<T> |
createSingleFileReader(PipelineOptions options)
Creates a
BlockBasedReader . |
static AvroSource<GenericRecord> |
from(java.lang.String fileNameOrPattern)
Like
from(ValueProvider) . |
static AvroSource<GenericRecord> |
from(ValueProvider<java.lang.String> fileNameOrPattern)
Reads from the given file name or pattern ("glob").
|
Coder<T> |
getOutputCoder()
Returns the
Coder to use for the data read from this source. |
void |
validate()
Checks that this source is valid, before it can be used in a pipeline.
|
AvroSource<T> |
withEmptyMatchTreatment(EmptyMatchTreatment emptyMatchTreatment) |
AvroSource<T> |
withMinBundleSize(long minBundleSize)
Sets the minimum bundle size.
|
<X> AvroSource<X> |
withParseFn(SerializableFunction<GenericRecord,X> parseFn,
Coder<X> coder)
Reads
GenericRecord of unspecified schema and maps them to instances of a custom type
using the given parseFn and encoded using the given coder. |
<X> AvroSource<X> |
withSchema(java.lang.Class<X> clazz)
Reads files containing records of the given class.
|
AvroSource<GenericRecord> |
withSchema(Schema schema)
Like
withSchema(String) . |
AvroSource<GenericRecord> |
withSchema(java.lang.String schema)
Reads files containing records that conform to the given schema.
|
createReader, createSourceForSubrange, getEmptyMatchTreatment, getEstimatedSizeBytes, getFileOrPatternSpec, getFileOrPatternSpecProvider, getMaxEndOffset, getMode, getSingleFileMetadata, isSplittable, populateDisplayData, split, toString
getBytesPerOffset, getEndOffset, getMinBundleSize, getStartOffset
getDefaultOutputCoder
public static AvroSource<GenericRecord> from(ValueProvider<java.lang.String> fileNameOrPattern)
withSchema(java.lang.String)
to return a type other than GenericRecord
.public static AvroSource<GenericRecord> from(java.lang.String fileNameOrPattern)
from(ValueProvider)
.public AvroSource<T> withEmptyMatchTreatment(EmptyMatchTreatment emptyMatchTreatment)
public AvroSource<GenericRecord> withSchema(java.lang.String schema)
public AvroSource<GenericRecord> withSchema(Schema schema)
withSchema(String)
.public <X> AvroSource<X> withSchema(java.lang.Class<X> clazz)
public <X> AvroSource<X> withParseFn(SerializableFunction<GenericRecord,X> parseFn, Coder<X> coder)
GenericRecord
of unspecified schema and maps them to instances of a custom type
using the given parseFn
and encoded using the given coder.public AvroSource<T> withMinBundleSize(long minBundleSize)
OffsetBasedSource
for a description of minBundleSize
and its use.public void validate()
Source
It is recommended to use Preconditions
for implementing
this method.
validate
in class FileBasedSource<T>
@Deprecated public BlockBasedSource<T> createForSubrangeOfFile(java.lang.String fileName, long start, long end) throws java.io.IOException
java.io.IOException
public BlockBasedSource<T> createForSubrangeOfFile(MatchResult.Metadata fileMetadata, long start, long end)
BlockBasedSource
BlockBasedSource
for the specified range in a single file.createForSubrangeOfFile
in class BlockBasedSource<T>
fileMetadata
- file backing the new FileBasedSource
.start
- starting byte offset of the new FileBasedSource
.end
- ending byte offset of the new FileBasedSource
. May be Long.MAX_VALUE,
in which case it will be inferred using FileBasedSource.getMaxEndOffset(org.apache.beam.sdk.options.PipelineOptions)
.protected BlockBasedSource.BlockBasedReader<T> createSingleFileReader(PipelineOptions options)
BlockBasedSource
BlockBasedReader
.createSingleFileReader
in class BlockBasedSource<T>