Class AvroSource<T>
- Type Parameters:
T- The type of records to be read from the source.
- All Implemented Interfaces:
Serializable,HasDisplayData
AvroIO.Read.
A FileBasedSource for reading Avro files.
To read a PCollection of objects from one or more Avro files, use from(org.apache.beam.sdk.options.ValueProvider<java.lang.String>) to specify the path(s) of the files to read. The AvroSource that is
returned will read objects of type GenericRecord with the schema(s) that were written at
file creation. To further configure the AvroSource to read with a user-defined schema, or
to return records of a type other than GenericRecord, use withSchema(Schema) (using an Avro Schema), withSchema(String) (using a JSON schema), or withSchema(Class) (to
return objects of the Avro-generated class specified).
An AvroSource can be read from using the Read transform. For example:
AvroSource<MyType> source = AvroSource.from(file.toPath()).withSchema(MyType.class);
PCollection<MyType> records = Read.from(mySource);
This class's implementation is based on the Avro 1.7.7 specification and implements parsing of some parts of Avro Object Container Files. The rationale for doing so is that the Avro API does not provide efficient ways of computing the precise offsets of blocks within a file, which is necessary to support dynamic work rebalancing. However, whenever it is possible to use the Avro API in a way that supports maintaining precise offsets, this class uses the Avro API.
Avro Object Container files store records in blocks. Each block contains a collection of records. Blocks may be encoded (e.g., with bzip2, deflate, snappy, etc.). Blocks are delineated from one another by a 16-byte sync marker.
An AvroSource for a subrange of a single file contains records in the blocks such that
the start offset of the block is greater than or equal to the start offset of the source and less
than the end offset of the source.
To use XZ-encoded Avro files, please include an explicit dependency on xz-1.8.jar,
which has been marked as optional in the Maven sdk/pom.xml.
<dependency>
<groupId>org.tukaani</groupId>
<artifactId>xz</artifactId>
<version>1.8</version>
</dependency>
Permissions
Permission requirements depend on the PipelineRunner that is used to execute the
pipeline. Please refer to the documentation of corresponding PipelineRunners for more
details.
- See Also:
-
Nested Class Summary
Nested ClassesModifier and TypeClassDescriptionstatic classABlockBasedSource.BlockBasedReaderfor reading blocks from Avro files.static interfaceNested classes/interfaces inherited from class org.apache.beam.sdk.io.BlockBasedSource
BlockBasedSource.Block<T>, BlockBasedSource.BlockBasedReader<T>Nested classes/interfaces inherited from class org.apache.beam.sdk.io.FileBasedSource
FileBasedSource.FileBasedReader<T>Nested classes/interfaces inherited from class org.apache.beam.sdk.io.OffsetBasedSource
OffsetBasedSource.OffsetBasedReader<T>Nested classes/interfaces inherited from class org.apache.beam.sdk.io.BoundedSource
BoundedSource.BoundedReader<T>Nested classes/interfaces inherited from class org.apache.beam.sdk.io.Source
Source.Reader<T> -
Method Summary
Modifier and TypeMethodDescriptioncreateForSubrangeOfFile(String fileName, long start, long end) Deprecated.Used by Dataflow workercreateForSubrangeOfFile(MatchResult.Metadata fileMetadata, long start, long end) Creates aBlockBasedSourcefor the specified range in a single file.protected BlockBasedSource.BlockBasedReader<T> createSingleFileReader(PipelineOptions options) Creates aBlockBasedReader.static AvroSource<GenericRecord> Likefrom(ValueProvider).static AvroSource<GenericRecord> from(MatchResult.Metadata metadata) static AvroSource<GenericRecord> from(ValueProvider<String> fileNameOrPattern) Reads from the given file name or pattern ("glob").Returns theCoderto use for the data read from this source.voidvalidate()Checks that this source is valid, before it can be used in a pipeline.Specifies the coder for the result of theAvroSource.withDatumReaderFactory(AvroSource.DatumReaderFactory<?> factory) Sets a customAvroSource.DatumReaderFactoryfor reading.withEmptyMatchTreatment(EmptyMatchTreatment emptyMatchTreatment) withMinBundleSize(long minBundleSize) Sets the minimum bundle size.<X> AvroSource<X> withParseFn(SerializableFunction<GenericRecord, X> parseFn, Coder<X> coder) ReadsGenericRecordof unspecified schema and maps them to instances of a custom type using the givenparseFnand encoded using the given coder.<X> AvroSource<X> withSchema(Class<X> clazz) Reads files containing records of the given class.withSchema(String schema) Reads files containing records that conform to the given schema.withSchema(Schema schema) LikewithSchema(String).Methods inherited from class org.apache.beam.sdk.io.FileBasedSource
createReader, createSourceForSubrange, getEmptyMatchTreatment, getEstimatedSizeBytes, getFileOrPatternSpec, getFileOrPatternSpecProvider, getMaxEndOffset, getMode, getSingleFileMetadata, isSplittable, populateDisplayData, split, toStringMethods inherited from class org.apache.beam.sdk.io.OffsetBasedSource
getBytesPerOffset, getEndOffset, getMinBundleSize, getStartOffsetMethods inherited from class org.apache.beam.sdk.io.Source
getDefaultOutputCoder
-
Method Details
-
from
Reads from the given file name or pattern ("glob"). The returned source needs to be further configured by callingwithSchema(java.lang.String)to return a type other thanGenericRecord. -
from
-
from
Likefrom(ValueProvider). -
withEmptyMatchTreatment
-
withSchema
Reads files containing records that conform to the given schema. -
withSchema
LikewithSchema(String). -
withSchema
Reads files containing records of the given class. -
withParseFn
ReadsGenericRecordof unspecified schema and maps them to instances of a custom type using the givenparseFnand encoded using the given coder. -
withMinBundleSize
Sets the minimum bundle size. Refer toOffsetBasedSourcefor a description ofminBundleSizeand its use. -
withDatumReaderFactory
Sets a customAvroSource.DatumReaderFactoryfor reading. Pass aAvroDatumFactoryto also use the factory for theAvroCoder -
withCoder
Specifies the coder for the result of theAvroSource. -
validate
public void validate()Description copied from class:SourceChecks that this source is valid, before it can be used in a pipeline.It is recommended to use
Preconditionsfor implementing this method.- Overrides:
validatein classFileBasedSource<T>
-
createForSubrangeOfFile
@Deprecated public BlockBasedSource<T> createForSubrangeOfFile(String fileName, long start, long end) throws IOException Deprecated.Used by Dataflow workerUsed by the Dataflow worker. Do not introduce new usages. Do not delete without confirming that Dataflow ValidatesRunner tests pass.- Throws:
IOException
-
createForSubrangeOfFile
public BlockBasedSource<T> createForSubrangeOfFile(MatchResult.Metadata fileMetadata, long start, long end) Description copied from class:BlockBasedSourceCreates aBlockBasedSourcefor the specified range in a single file.- Specified by:
createForSubrangeOfFilein classBlockBasedSource<T>- Parameters:
fileMetadata- file backing the newFileBasedSource.start- starting byte offset of the newFileBasedSource.end- ending byte offset of the newFileBasedSource. May be Long.MAX_VALUE, in which case it will be inferred usingFileBasedSource.getMaxEndOffset(org.apache.beam.sdk.options.PipelineOptions).
-
createSingleFileReader
Description copied from class:BlockBasedSourceCreates aBlockBasedReader.- Specified by:
createSingleFileReaderin classBlockBasedSource<T>
-
getOutputCoder
Description copied from class:SourceReturns theCoderto use for the data read from this source.- Overrides:
getOutputCoderin classSource<T>
-