T - The type of records to be read from the source.public class AvroSource<T> extends BlockBasedSource<T>
AvroIO.Read.
A FileBasedSource for reading Avro files.
To read a PCollection of objects from one or more Avro files, use from(org.apache.beam.sdk.options.ValueProvider<java.lang.String>) to specify the path(s) of the files to read. The AvroSource that is
returned will read objects of type GenericRecord with the schema(s) that were written at
file creation. To further configure the AvroSource to read with a user-defined schema, or
to return records of a type other than GenericRecord, use withSchema(Schema) (using an Avro Schema), withSchema(String) (using a JSON schema), or withSchema(Class) (to
return objects of the Avro-generated class specified).
An AvroSource can be read from using the Read transform. For example:
AvroSource<MyType> source = AvroSource.from(file.toPath()).withSchema(MyType.class);
PCollection<MyType> records = Read.from(mySource);
This class's implementation is based on the Avro 1.7.7 specification and implements parsing of some parts of Avro Object Container Files. The rationale for doing so is that the Avro API does not provide efficient ways of computing the precise offsets of blocks within a file, which is necessary to support dynamic work rebalancing. However, whenever it is possible to use the Avro API in a way that supports maintaining precise offsets, this class uses the Avro API.
Avro Object Container files store records in blocks. Each block contains a collection of records. Blocks may be encoded (e.g., with bzip2, deflate, snappy, etc.). Blocks are delineated from one another by a 16-byte sync marker.
An AvroSource for a subrange of a single file contains records in the blocks such that
the start offset of the block is greater than or equal to the start offset of the source and less
than the end offset of the source.
To use XZ-encoded Avro files, please include an explicit dependency on xz-1.8.jar,
which has been marked as optional in the Maven sdk/pom.xml.
<dependency>
<groupId>org.tukaani</groupId>
<artifactId>xz</artifactId>
<version>1.8</version>
</dependency>
Permission requirements depend on the PipelineRunner that is used to execute the
pipeline. Please refer to the documentation of corresponding PipelineRunners for more
details.
| Modifier and Type | Class and Description |
|---|---|
static class |
AvroSource.AvroReader<T>
A
BlockBasedReader for reading blocks from Avro files. |
static interface |
AvroSource.DatumReaderFactory<T> |
BlockBasedSource.Block<T>, BlockBasedSource.BlockBasedReader<T>FileBasedSource.FileBasedReader<T>OffsetBasedSource.OffsetBasedReader<T>BoundedSource.BoundedReader<T>Source.Reader<T>| Modifier and Type | Method and Description |
|---|---|
BlockBasedSource<T> |
createForSubrangeOfFile(MatchResult.Metadata fileMetadata,
long start,
long end)
Creates a
BlockBasedSource for the specified range in a single file. |
BlockBasedSource<T> |
createForSubrangeOfFile(java.lang.String fileName,
long start,
long end)
Deprecated.
Used by Dataflow worker
|
protected BlockBasedSource.BlockBasedReader<T> |
createSingleFileReader(PipelineOptions options)
Creates a
BlockBasedReader. |
static AvroSource<GenericRecord> |
from(MatchResult.Metadata metadata) |
static AvroSource<GenericRecord> |
from(java.lang.String fileNameOrPattern)
Like
from(ValueProvider). |
static AvroSource<GenericRecord> |
from(ValueProvider<java.lang.String> fileNameOrPattern)
Reads from the given file name or pattern ("glob").
|
Coder<T> |
getOutputCoder()
Returns the
Coder to use for the data read from this source. |
void |
validate()
Checks that this source is valid, before it can be used in a pipeline.
|
AvroSource<T> |
withCoder(Coder<T> coder)
Specifies the coder for the result of the
AvroSource. |
AvroSource<T> |
withDatumReaderFactory(AvroSource.DatumReaderFactory<?> factory)
Sets a custom
AvroSource.DatumReaderFactory for reading. |
AvroSource<T> |
withEmptyMatchTreatment(EmptyMatchTreatment emptyMatchTreatment) |
AvroSource<T> |
withMinBundleSize(long minBundleSize)
Sets the minimum bundle size.
|
<X> AvroSource<X> |
withParseFn(SerializableFunction<GenericRecord,X> parseFn,
Coder<X> coder)
Reads
GenericRecord of unspecified schema and maps them to instances of a custom type
using the given parseFn and encoded using the given coder. |
<X> AvroSource<X> |
withSchema(java.lang.Class<X> clazz)
Reads files containing records of the given class.
|
AvroSource<GenericRecord> |
withSchema(Schema schema)
Like
withSchema(String). |
AvroSource<GenericRecord> |
withSchema(java.lang.String schema)
Reads files containing records that conform to the given schema.
|
createReader, createSourceForSubrange, getEmptyMatchTreatment, getEstimatedSizeBytes, getFileOrPatternSpec, getFileOrPatternSpecProvider, getMaxEndOffset, getMode, getSingleFileMetadata, isSplittable, populateDisplayData, split, toStringgetBytesPerOffset, getEndOffset, getMinBundleSize, getStartOffsetgetDefaultOutputCoderpublic static AvroSource<GenericRecord> from(ValueProvider<java.lang.String> fileNameOrPattern)
withSchema(java.lang.String) to return a type other than GenericRecord.public static AvroSource<GenericRecord> from(MatchResult.Metadata metadata)
public static AvroSource<GenericRecord> from(java.lang.String fileNameOrPattern)
from(ValueProvider).public AvroSource<T> withEmptyMatchTreatment(EmptyMatchTreatment emptyMatchTreatment)
public AvroSource<GenericRecord> withSchema(java.lang.String schema)
public AvroSource<GenericRecord> withSchema(Schema schema)
withSchema(String).public <X> AvroSource<X> withSchema(java.lang.Class<X> clazz)
public <X> AvroSource<X> withParseFn(SerializableFunction<GenericRecord,X> parseFn, Coder<X> coder)
GenericRecord of unspecified schema and maps them to instances of a custom type
using the given parseFn and encoded using the given coder.public AvroSource<T> withMinBundleSize(long minBundleSize)
OffsetBasedSource for a description of minBundleSize and its use.public AvroSource<T> withDatumReaderFactory(AvroSource.DatumReaderFactory<?> factory)
AvroSource.DatumReaderFactory for reading. Pass a AvroDatumFactory to also use the factory for the AvroCoderpublic AvroSource<T> withCoder(Coder<T> coder)
AvroSource.public void validate()
SourceIt is recommended to use Preconditions for implementing this method.
validate in class FileBasedSource<T>@Deprecated public BlockBasedSource<T> createForSubrangeOfFile(java.lang.String fileName, long start, long end) throws java.io.IOException
java.io.IOExceptionpublic BlockBasedSource<T> createForSubrangeOfFile(MatchResult.Metadata fileMetadata, long start, long end)
BlockBasedSourceBlockBasedSource for the specified range in a single file.createForSubrangeOfFile in class BlockBasedSource<T>fileMetadata - file backing the new FileBasedSource.start - starting byte offset of the new FileBasedSource.end - ending byte offset of the new FileBasedSource. May be Long.MAX_VALUE, in
which case it will be inferred using FileBasedSource.getMaxEndOffset(org.apache.beam.sdk.options.PipelineOptions).protected BlockBasedSource.BlockBasedReader<T> createSingleFileReader(PipelineOptions options)
BlockBasedSourceBlockBasedReader.createSingleFileReader in class BlockBasedSource<T>