@Experimental(value=SOURCE_SINK) public class ParquetIO extends java.lang.Object
The ParquetIO source returns a PCollection for Parquet files. The elements in the PCollection are Avro GenericRecord.
To configure the ParquetIO.Read, you have to provide the file patterns (from) of the Parquet files and the schema.
For example:
PCollection<GenericRecord> records = pipeline.apply(ParquetIO.read(SCHEMA).from("/foo/bar"));
...
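The example above uses a SCHEMA constant without showing where it comes from. A minimal sketch of one way to define it, assuming Avro's SchemaBuilder is available on the classpath; the class name ParquetReadSketch, the record name User, and its two fields are hypothetical and only serve as illustration:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.values.PCollection;

public class ParquetReadSketch {
  // Hypothetical two-field Avro schema; replace it with the schema of your own files.
  static final Schema SCHEMA =
      SchemaBuilder.record("User")
          .namespace("example.avro")
          .fields()
          .requiredString("name")
          .requiredInt("age")
          .endRecord();

  // Applies ParquetIO.read to the given pipeline; filePattern is caller-supplied.
  static PCollection<GenericRecord> readUsers(Pipeline pipeline, String filePattern) {
    return pipeline.apply(ParquetIO.read(SCHEMA).from(filePattern));
  }
}
```

The schema could equally be parsed from an .avsc file; building it in code just keeps the sketch self-contained.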
As ParquetIO.Read is based on FileIO, it supports any filesystem (hdfs, ...).
For more advanced use cases, like reading each file in a PCollection of FileIO.ReadableFile, use the ParquetIO.ReadFiles transform.
For example:
PCollection<FileIO.ReadableFile> files = pipeline
.apply(FileIO.match().filepattern(options.getInputFilepattern()))
.apply(FileIO.readMatches());
PCollection<GenericRecord> output = files.apply(ParquetIO.readFiles(SCHEMA));
ParquetIO.Sink allows you to write a PCollection of GenericRecord into a Parquet file. It can be used with the general-purpose FileIO transforms, specifically FileIO.write/writeDynamic.
By default, ParquetIO.Sink produces output files that are compressed using CompressionCodecName.SNAPPY. This default can be overridden using ParquetIO.Sink.withCompressionCodec(CompressionCodecName).
For example:
pipeline
.apply(...) // PCollection<GenericRecord>
.apply(FileIO
.<GenericRecord>write()
.via(ParquetIO.sink(SCHEMA)
.withCompressionCodec(CompressionCodecName.SNAPPY))
.to("destination/path"));
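Since the example above passes the default codec explicitly, a sketch of overriding it may be useful. The following assumes GZIP is the desired codec; the class name ParquetWriteSketch, the method name gzipParquetWrite, and the ".parquet" suffix are illustrative choices, not part of ParquetIO itself:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetWriteSketch {
  // Builds a FileIO write transform whose sink compresses with GZIP instead of SNAPPY.
  // The schema and output directory are supplied by the caller.
  static FileIO.Write<Void, GenericRecord> gzipParquetWrite(Schema schema, String outputDir) {
    return FileIO.<GenericRecord>write()
        .via(ParquetIO.sink(schema).withCompressionCodec(CompressionCodecName.GZIP))
        .to(outputDir)
        .withSuffix(".parquet"); // assumed file suffix, purely for readability of output names
  }
}
```

The returned transform can then be applied to a PCollection of GenericRecord in the same way as the write shown above.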
This IO API is considered experimental and may break or receive backwards-incompatible changes in future versions of the Apache Beam SDK.
Modifier and Type | Class and Description |
---|---|
static class | ParquetIO.Read: Implementation of read(Schema). |
static class | ParquetIO.ReadFiles: Implementation of readFiles(Schema). |
static class | ParquetIO.Sink: Implementation of sink(org.apache.avro.Schema). |
Modifier and Type | Method and Description |
---|---|
static ParquetIO.Read | read(Schema schema): Reads GenericRecord from a Parquet file (or multiple Parquet files matching the pattern). |
static ParquetIO.ReadFiles | readFiles(Schema schema): Like read(Schema), but reads each file in a PCollection of FileIO.ReadableFile, which allows more flexible usage. |
static ParquetIO.Sink | sink(Schema schema): Creates a ParquetIO.Sink for use with FileIO.write(). |
public static ParquetIO.Read read(Schema schema)
Reads GenericRecord from a Parquet file (or multiple Parquet files matching the pattern).
public static ParquetIO.ReadFiles readFiles(Schema schema)
Like read(Schema), but reads each file in a PCollection of FileIO.ReadableFile, which allows more flexible usage.
public static ParquetIO.Sink sink(Schema schema)
Creates a ParquetIO.Sink for use with FileIO.write().