T - The type of records to be read from the source.@Experimental(value=SOURCE_SINK) public class AvroSource<T> extends BlockBasedSource<T>
AvroIO.Read.
A FileBasedSource for reading Avro files.
To read a PCollection of objects from one or more Avro files, use
from(java.lang.String) to specify the path(s) of the files to read. The AvroSource that
is returned will read objects of type GenericRecord with the schema(s) that were written
at file creation. To further configure the AvroSource to read with a user-defined schema,
or to return records of a type other than GenericRecord, use
withSchema(Schema) (using an Avro Schema),
withSchema(String) (using a JSON schema), or
withSchema(Class) (to return objects of the Avro-generated class specified).
An AvroSource can be read from using the Read transform. For example:
AvroSource<MyType> source = AvroSource.from(file.toPath()).withSchema(MyType.class);
PCollection<MyType> records = Read.from(mySource);
This class's implementation is based on the Avro 1.7.7 specification and implements parsing of some parts of Avro Object Container Files. The rationale for doing so is that the Avro API does not provide efficient ways of computing the precise offsets of blocks within a file, which is necessary to support dynamic work rebalancing. However, whenever it is possible to use the Avro API in a way that supports maintaining precise offsets, this class uses the Avro API.
Avro Object Container files store records in blocks. Each block contains a collection of records. Blocks may be encoded (e.g., with bzip2, deflate, snappy, etc.). Blocks are delineated from one another by a 16-byte sync marker.
An AvroSource for a subrange of a single file contains records in the blocks such that
the start offset of the block is greater than or equal to the start offset of the source and less
than the end offset of the source.
To use XZ-encoded Avro files, please include an explicit dependency on xz-1.5.jar,
which has been marked as optional in the Maven sdk/pom.xml.
<dependency>
<groupId>org.tukaani</groupId>
<artifactId>xz</artifactId>
<version>1.5</version>
</dependency>
Permission requirements depend on the PipelineRunner that is used to execute the
pipeline. Please refer to the documentation of corresponding PipelineRunners for
more details.
| Modifier and Type | Class and Description |
|---|---|
static class |
AvroSource.AvroReader<T>
A
BlockBasedSource.BlockBasedReader for reading blocks from Avro files. |
BlockBasedSource.Block<T>, BlockBasedSource.BlockBasedReader<T>FileBasedSource.FileBasedReader<T>, FileBasedSource.ModeOffsetBasedSource.OffsetBasedReader<T>BoundedSource.BoundedReader<T>Source.Reader<T>| Modifier and Type | Method and Description |
|---|---|
BlockBasedSource<T> |
createForSubrangeOfFile(MatchResult.Metadata fileMetadata,
long start,
long end)
Creates a
BlockBasedSource for the specified range in a single file. |
BlockBasedSource<T> |
createForSubrangeOfFile(java.lang.String fileName,
long start,
long end)
Deprecated.
|
protected BlockBasedSource.BlockBasedReader<T> |
createSingleFileReader(PipelineOptions options)
Creates a
BlockBasedReader. |
static AvroSource<GenericRecord> |
from(java.lang.String fileNameOrPattern)
Creates an
AvroSource that reads from the given file name or pattern ("glob"). |
AvroCoder<T> |
getDefaultOutputCoder()
Returns the default
Coder to use for the data read from this source. |
java.lang.String |
getSchema() |
void |
validate()
Checks that this source is valid, before it can be used in a pipeline.
|
AvroSource<T> |
withMinBundleSize(long minBundleSize)
Returns an
AvroSource that's like this one but uses the supplied minimum bundle size. |
<X> AvroSource<X> |
withSchema(java.lang.Class<X> clazz)
Returns an
AvroSource that's like this one but reads files containing records of the
type of the given class. |
AvroSource<GenericRecord> |
withSchema(Schema schema)
Returns an
AvroSource that's like this one but reads files containing records that
conform to the given schema. |
AvroSource<GenericRecord> |
withSchema(java.lang.String schema)
Returns an
AvroSource that's like this one but reads files containing records that
conform to the given schema. |
createReader, createSourceForSubrange, getEstimatedSizeBytes, getFileOrPatternSpec, getFileOrPatternSpecProvider, getMaxEndOffset, getMode, getSingleFileMetadata, isSplittable, populateDisplayData, split, toStringgetBytesPerOffset, getEndOffset, getMinBundleSize, getStartOffsetpublic static AvroSource<GenericRecord> from(java.lang.String fileNameOrPattern)
AvroSource that reads from the given file name or pattern ("glob"). The
returned source can be further configured by calling withSchema(java.lang.String) to return a type other
than GenericRecord.public AvroSource<GenericRecord> withSchema(java.lang.String schema)
AvroSource that's like this one but reads files containing records that
conform to the given schema.
Does not modify this object.
public AvroSource<GenericRecord> withSchema(Schema schema)
AvroSource that's like this one but reads files containing records that
conform to the given schema.
Does not modify this object.
public <X> AvroSource<X> withSchema(java.lang.Class<X> clazz)
AvroSource that's like this one but reads files containing records of the
type of the given class.
Does not modify this object.
public AvroSource<T> withMinBundleSize(long minBundleSize)
AvroSource that's like this one but uses the supplied minimum bundle size.
Refer to OffsetBasedSource for a description of minBundleSize and its use.
Does not modify this object.
public void validate()
SourceIt is recommended to use Preconditions for implementing
this method.
validate in class FileBasedSource<T>@Deprecated public BlockBasedSource<T> createForSubrangeOfFile(java.lang.String fileName, long start, long end) throws java.io.IOException
java.io.IOExceptionpublic BlockBasedSource<T> createForSubrangeOfFile(MatchResult.Metadata fileMetadata, long start, long end)
BlockBasedSourceBlockBasedSource for the specified range in a single file.createForSubrangeOfFile in class BlockBasedSource<T>fileMetadata - file backing the new FileBasedSource.start - starting byte offset of the new FileBasedSource.end - ending byte offset of the new FileBasedSource. May be Long.MAX_VALUE,
in which case it will be inferred using FileBasedSource.getMaxEndOffset(org.apache.beam.sdk.options.PipelineOptions).protected BlockBasedSource.BlockBasedReader<T> createSingleFileReader(PipelineOptions options)
BlockBasedSourceBlockBasedReader.createSingleFileReader in class BlockBasedSource<T>public AvroCoder<T> getDefaultOutputCoder()
SourceCoder to use for the data read from this source.getDefaultOutputCoder in class Source<T>public java.lang.String getSchema()