public class AvroIO
extends java.lang.Object
PTransform
s for reading and writing Avro files.
To read a PCollection
from one or more Avro files, use AvroIO.read()
,
using AvroIO.Read.from(java.lang.String)
to specify the filename or filepattern to read from.
See FileSystems
for information on supported file systems and filepatterns.
To read specific records, such as Avro-generated classes, use read(Class)
.
To read GenericRecords
, use readGenericRecords(Schema)
which takes
a Schema
object, or readGenericRecords(String)
which takes an Avro schema in a
JSON-encoded string form. An exception will be thrown if a record doesn't match the specified
schema.
For example:
Pipeline p = ...;
// A simple Read of a local file (only runs locally):
PCollection<AvroAutoGenClass> records =
p.apply(AvroIO.read(AvroAutoGenClass.class).from("/path/to/file.avro"));
// A Read from a GCS file (runs locally and using remote execution):
Schema schema = new Schema.Parser().parse(new File("schema.avsc"));
PCollection<GenericRecord> records =
p.apply(AvroIO.readGenericRecords(schema)
.from("gs://my_bucket/path/to/records-*.avro"));
To write a PCollection
to one or more Avro files, use AvroIO.Write
, using
AvroIO.write().to(String)
to specify the output filename prefix. The default
DefaultFilenamePolicy
will use this prefix, in conjunction with a
ShardNameTemplate
(set via AvroIO.Write.withShardNameTemplate(String)
) and optional
filename suffix (set via AvroIO.Write.withSuffix(String)
, to generate output filenames in a
sharded way. You can override this default write filename policy using
AvroIO.Write.withFilenamePolicy(FileBasedSink.FilenamePolicy)
to specify a custom file naming
policy.
By default, all input is put into the global window before writing. If per-window writes are
desired - for example, when using a streaming runner -
AvroIO.Write.withWindowedWrites()
will cause windowing and triggering to be
preserved. When producing windowed writes, the number of output shards must be set explicitly
using AvroIO.Write.withNumShards(int)
; some runners may set this for you to a
runner-chosen value, so you may need not set it yourself. A
FileBasedSink.FilenamePolicy
must be set, and unique windows and triggers must produce
unique filenames.
To write specific records, such as Avro-generated classes, use write(Class)
.
To write GenericRecords
, use either writeGenericRecords(Schema)
which takes a Schema
object, or writeGenericRecords(String)
which takes a schema
in a JSON-encoded string form. An exception will be thrown if a record doesn't match the
specified schema.
For example:
// A simple Write to a local file (only runs locally):
PCollection<AvroAutoGenClass> records = ...;
records.apply(AvroIO.write(AvroAutoGenClass.class).to("/path/to/file.avro"));
// A Write to a sharded GCS file (runs locally and using remote execution):
Schema schema = new Schema.Parser().parse(new File("schema.avsc"));
PCollection<GenericRecord> records = ...;
records.apply("WriteToAvro", AvroIO.writeGenericRecords(schema)
.to("gs://my_bucket/path/to/numbers")
.withSuffix(".avro"));
By default, AvroIO.Write
produces output files that are compressed using the
CodecFactory.deflateCodec(6)
. This default can
be changed or overridden using AvroIO.Write.withCodec(org.apache.avro.file.CodecFactory)
.
Modifier and Type | Class and Description |
---|---|
static class |
AvroIO.Read<T>
Implementation of
read(java.lang.Class<T>) . |
static class |
AvroIO.Write<T>
Implementation of
write(java.lang.Class<T>) . |
Modifier and Type | Method and Description |
---|---|
static <T> AvroIO.Read<T> |
read(java.lang.Class<T> recordClass)
Reads records of the given type from an Avro file (or multiple Avro files matching a pattern).
|
static AvroIO.Read<GenericRecord> |
readGenericRecords(Schema schema)
Reads Avro file(s) containing records of the specified schema.
|
static AvroIO.Read<GenericRecord> |
readGenericRecords(java.lang.String schema)
Reads Avro file(s) containing records of the specified schema.
|
static <T> AvroIO.Write<T> |
write(java.lang.Class<T> recordClass)
Writes a
PCollection to an Avro file (or multiple Avro files matching a sharding
pattern). |
static AvroIO.Write<GenericRecord> |
writeGenericRecords(Schema schema)
Writes Avro records of the specified schema.
|
static AvroIO.Write<GenericRecord> |
writeGenericRecords(java.lang.String schema)
Writes Avro records of the specified schema.
|
public static <T> AvroIO.Read<T> read(java.lang.Class<T> recordClass)
The schema must be specified using one of the withSchema
functions.
public static AvroIO.Read<GenericRecord> readGenericRecords(Schema schema)
public static AvroIO.Read<GenericRecord> readGenericRecords(java.lang.String schema)
public static <T> AvroIO.Write<T> write(java.lang.Class<T> recordClass)
PCollection
to an Avro file (or multiple Avro files matching a sharding
pattern).public static AvroIO.Write<GenericRecord> writeGenericRecords(Schema schema)
public static AvroIO.Write<GenericRecord> writeGenericRecords(java.lang.String schema)