public class AvroIO
extends java.lang.Object
PTransforms for reading and writing Avro files.
To read a PCollection from one or more Avro files, use one of the AvroIO read methods listed below,
and call AvroIO.Read.from(java.lang.String) to specify the filename or filepattern to read from.
See FileSystems for information on supported file systems and filepatterns.
To read specific records, such as Avro-generated classes, use read(Class).
To read GenericRecords, use readGenericRecords(Schema), which takes
a Schema object, or readGenericRecords(String), which takes an Avro schema as a
JSON-encoded string. An exception will be thrown if a record doesn't match the specified
schema.
For example:
Pipeline p = ...;

// A simple Read of a local file (only runs locally):
PCollection<AvroAutoGenClass> records =
    p.apply(AvroIO.read(AvroAutoGenClass.class).from("/path/to/file.avro"));

// A Read from a GCS file (runs locally and using remote execution):
Schema schema = new Schema.Parser().parse(new File("schema.avsc"));
PCollection<GenericRecord> records =
    p.apply(AvroIO.readGenericRecords(schema)
        .from("gs://my_bucket/path/to/records-*.avro"));
To write a PCollection to one or more Avro files, use AvroIO.Write, calling
AvroIO.Write.to(String) to specify the output filename prefix. The default
DefaultFilenamePolicy uses this prefix, together with a
ShardNameTemplate (set via AvroIO.Write.withShardNameTemplate(String)) and an optional
filename suffix (set via AvroIO.Write.withSuffix(String)), to generate sharded output
filenames. You can override this default filename policy using
AvroIO.Write.withFilenamePolicy(FileBasedSink.FilenamePolicy) to specify a custom file naming
policy.
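As a rough illustration of how a prefix, a shard name template, and a suffix could combine into sharded filenames, here is a minimal, library-free Java sketch. It is not Beam's DefaultFilenamePolicy implementation: the class ShardNameSketch, the method shardName, and the template value "-SSSSS-of-NNNNN" (runs of S standing for the zero-padded shard index, runs of N for the zero-padded shard count) are assumptions made for illustration.

```java
public class ShardNameSketch {
    // Assumed template for illustration only; check ShardNameTemplate for the real default.
    static final String ASSUMED_TEMPLATE = "-SSSSS-of-NNNNN";

    // Builds prefix + expanded template + suffix.
    // Each run of 'S' becomes the shard index and each run of 'N' the shard count,
    // zero-padded to the length of the run. Other characters are copied as-is.
    static String shardName(String prefix, String template, String suffix,
                            int shard, int numShards) {
        StringBuilder out = new StringBuilder(prefix);
        int i = 0;
        while (i < template.length()) {
            char c = template.charAt(i);
            if (c == 'S' || c == 'N') {
                int j = i;
                while (j < template.length() && template.charAt(j) == c) {
                    j++;
                }
                int width = j - i;
                int value = (c == 'S') ? shard : numShards;
                out.append(String.format("%0" + width + "d", value));
                i = j;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.append(suffix).toString();
    }

    public static void main(String[] args) {
        // Print the names of three hypothetical shards.
        for (int shard = 0; shard < 3; shard++) {
            System.out.println(shardName(
                "gs://my_bucket/path/to/numbers", ASSUMED_TEMPLATE, ".avro", shard, 3));
        }
    }
}
```

Under these assumptions, shard 0 of 3 with prefix gs://my_bucket/path/to/numbers and suffix .avro would be named gs://my_bucket/path/to/numbers-00000-of-00003.avro.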
By default, all input is put into the global window before writing. If per-window writes are
desired (for example, when using a streaming runner),
AvroIO.Write.withWindowedWrites() will cause windowing and triggering to be
preserved. When producing windowed writes, the number of output shards must be set explicitly
using AvroIO.Write.withNumShards(int), although some runners may set this for you to a
runner-chosen value, in which case you may not need to set it yourself. A
FileBasedSink.FilenamePolicy must be set, and unique windows and triggers must produce
unique filenames.
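The uniqueness requirement for windowed writes can be pictured with a small, library-free Java sketch. It does not implement the FileBasedSink.FilenamePolicy interface: the class WindowedNameSketch, the method windowedName, and the naming scheme itself are hypothetical. The point is only that encoding the window bounds, the pane (trigger firing) index, and the shard number into the name keeps distinct windows and firings from colliding.

```java
public class WindowedNameSketch {
    // Hypothetical naming scheme: prefix, window start and end (epoch millis),
    // pane index, zero-padded shard index and shard count, then the suffix.
    static String windowedName(String prefix, long windowStartMillis, long windowEndMillis,
                               long paneIndex, int shard, int numShards, String suffix) {
        return String.format("%s-%d-%d-pane-%d-%05d-of-%05d%s",
            prefix, windowStartMillis, windowEndMillis, paneIndex, shard, numShards, suffix);
    }

    public static void main(String[] args) {
        // Two firings (panes) of the same one-minute window, same shard.
        String first = windowedName("out", 0L, 60000L, 0L, 1, 4, ".avro");
        String second = windowedName("out", 0L, 60000L, 1L, 1, 4, ".avro");
        System.out.println(first);
        System.out.println(second);
        // The names differ because the pane index is part of the name.
        System.out.println(first.equals(second));
    }
}
```

A policy that left out the pane index would give both firings the same name, so a later firing could overwrite or clash with an earlier one.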
To write specific records, such as Avro-generated classes, use write(Class).
To write GenericRecords, use either writeGenericRecords(Schema),
which takes a Schema object, or writeGenericRecords(String), which takes a schema
as a JSON-encoded string. An exception will be thrown if a record doesn't match the
specified schema.
For example:
// A simple Write to a local file (only runs locally):
PCollection<AvroAutoGenClass> records = ...;
records.apply(AvroIO.write(AvroAutoGenClass.class).to("/path/to/file.avro"));

// A Write to a sharded GCS file (runs locally and using remote execution):
Schema schema = new Schema.Parser().parse(new File("schema.avsc"));
PCollection<GenericRecord> records = ...;
records.apply("WriteToAvro", AvroIO.writeGenericRecords(schema)
    .to("gs://my_bucket/path/to/numbers")
    .withSuffix(".avro"));
By default, AvroIO.Write produces output files that are compressed using
CodecFactory.deflateCodec(6). This default can
be overridden using AvroIO.Write.withCodec(org.apache.avro.file.CodecFactory).
| Modifier and Type | Class and Description |
|---|---|
| static class | AvroIO.Read<T>: Implementation of read(java.lang.Class<T>). |
| static class | AvroIO.Write<T>: Implementation of write(java.lang.Class<T>). |
| Modifier and Type | Method and Description |
|---|---|
| static <T> AvroIO.Read<T> | read(java.lang.Class<T> recordClass): Reads records of the given type from an Avro file (or multiple Avro files matching a pattern). |
| static AvroIO.Read<GenericRecord> | readGenericRecords(Schema schema): Reads Avro file(s) containing records of the specified schema. |
| static AvroIO.Read<GenericRecord> | readGenericRecords(java.lang.String schema): Reads Avro file(s) containing records of the specified schema. |
| static <T> AvroIO.Write<T> | write(java.lang.Class<T> recordClass): Writes a PCollection to an Avro file (or multiple Avro files matching a sharding pattern). |
| static AvroIO.Write<GenericRecord> | writeGenericRecords(Schema schema): Writes Avro records of the specified schema. |
| static AvroIO.Write<GenericRecord> | writeGenericRecords(java.lang.String schema): Writes Avro records of the specified schema. |
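public static <T> AvroIO.Read<T> read(java.lang.Class<T> recordClass)
Reads records of the given type from an Avro file (or multiple Avro files matching a pattern).
The schema must be specified using one of the withSchema functions.
public static AvroIO.Read<GenericRecord> readGenericRecords(Schema schema)
Reads Avro file(s) containing records of the specified schema.
public static AvroIO.Read<GenericRecord> readGenericRecords(java.lang.String schema)
Reads Avro file(s) containing records of the specified schema.
public static <T> AvroIO.Write<T> write(java.lang.Class<T> recordClass)
Writes a PCollection to an Avro file (or multiple Avro files matching a sharding
pattern).
public static AvroIO.Write<GenericRecord> writeGenericRecords(Schema schema)
Writes Avro records of the specified schema.
public static AvroIO.Write<GenericRecord> writeGenericRecords(java.lang.String schema)
Writes Avro records of the specified schema.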