@Experimental(value=SOURCE_SINK) public class ParquetIO extends java.lang.Object
ParquetIO source returns a PCollection for
Parquet files. The elements in the PCollection are Avro GenericRecord.
To configure the ParquetIO.Read, you have to provide the file patterns (from) of the Parquet
files and the schema.
For example:
PCollection<GenericRecord> records = pipeline.apply(ParquetIO.read(SCHEMA).from("/foo/bar"));
...
As ParquetIO.Read is based on FileIO, it supports any filesystem (hdfs, ...).
For more advanced use cases, like reading each file in a PCollection
of FileIO.ReadableFile, use the ParquetIO.ReadFiles transform.
For example:
PCollection<FileIO.ReadableFile> files = pipeline
.apply(FileIO.match().filepattern(options.getInputFilepattern())
.apply(FileIO.readMatches());
PCollection<GenericRecord> output = files.apply(ParquetIO.readFiles(SCHEMA));
ParquetIO.Sink allows you to write a PCollection of GenericRecord
into a Parquet file. It can be used with the general-purpose FileIO transforms
with FileIO.write/writeDynamic specifically.
For example:
pipeline
.apply(...) // PCollection<GenericRecord>
.apply(FileIO.<GenericRecord>
.write()
.via(ParquetIO.sink(SCHEMA))
.to("destination/path")
This IO API is considered experimental and may break or receive backwards-incompatible changes in future versions of the Apache Beam SDK.
| Modifier and Type | Class and Description |
|---|---|
static class |
ParquetIO.Read
Implementation of
read(Schema). |
static class |
ParquetIO.ReadFiles
Implementation of
readFiles(Schema). |
static class |
ParquetIO.Sink
Implementation of
sink(org.apache.avro.Schema). |
| Modifier and Type | Method and Description |
|---|---|
static ParquetIO.Read |
read(Schema schema)
Reads
GenericRecord from a Parquet file (or multiple Parquet files matching
the pattern). |
static ParquetIO.ReadFiles |
readFiles(Schema schema)
Like
read(Schema), but reads each file in a PCollection
of FileIO.ReadableFile,
which allows more flexible usage. |
static ParquetIO.Sink |
sink(Schema schema)
Creates a
ParquetIO.Sink that, for use with FileIO.write(). |
public static ParquetIO.Read read(Schema schema)
GenericRecord from a Parquet file (or multiple Parquet files matching
the pattern).public static ParquetIO.ReadFiles readFiles(Schema schema)
read(Schema), but reads each file in a PCollection
of FileIO.ReadableFile,
which allows more flexible usage.public static ParquetIO.Sink sink(Schema schema)
ParquetIO.Sink that, for use with FileIO.write().