Class XmlIO
-
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
static class | XmlIO.Read<T> | Implementation of read().
static class | XmlIO.ReadFiles<T> | Implementation of readFiles().
static class | XmlIO.Sink<T> | Implementation of sink(java.lang.Class<T>).
static class | XmlIO.Write<T> | Implementation of write().
-
Constructor Summary
Constructors
XmlIO()
-
Method Summary
Modifier and Type | Method | Description
static <T> XmlIO.Read<T> | read() | Reads XML files as a PCollection of a given type mapped via JAXB.
static <T> XmlIO.ReadFiles<T> | readFiles() | Like read(), but reads each file in a PCollection of FileIO.ReadableFile, which allows more flexible usage via different configuration options of FileIO.match() and FileIO.readMatches() that are not explicitly provided for read().
static <T> XmlIO.Sink<T> | sink(java.lang.Class<T>) | Outputs records as XML-formatted elements using JAXB.
static <T> XmlIO.Write<T> | write() | Writes all elements in the input PCollection to a single XML file using sink(java.lang.Class<T>).
-
Constructor Details
-
XmlIO
public XmlIO()
-
-
Method Details
-
read
Reads XML files as a PCollection of a given type mapped via JAXB.
The XML files must be of the following form, where root and record are XML element names that are defined by the user:
    <root>
      <record> ... </record>
      <record> ... </record>
      <record> ... </record>
      ...
      <record> ... </record>
    </root>
That is, the XML document must contain a single root element whose children are all record elements. The records may contain arbitrary XML content; however, that content must not contain the start <record> or end </record> tags. This restriction enables reading large XML files in parallel from different offsets in the file.
Root and/or record elements may additionally contain an arbitrary number of XML attributes. Users must also provide the class of a JAXB-annotated Java type that can be used to convert records into Java objects and vice versa using JAXB marshalling/unmarshalling mechanisms. Reading the source generates a PCollection of the given JAXB-annotated Java type. Optionally, users may provide a minimum size of a bundle that should be created for the source.
Example:
    PCollection<Record> output = p.apply(XmlIO.<Record>read()
        .from(file.toPath().toString())
        .withRootElement("root")
        .withRecordElement("record")
        .withRecordClass(Record.class));
By default, the UTF-8 charset is used. To specify a different charset, use XmlIO.Read.withCharset(java.nio.charset.Charset).
Currently, only XML files that use single-byte characters are supported. Using a file that contains multi-byte characters may result in data loss or duplication.
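Because multi-byte input may lose or duplicate data, it can help to check sample content before passing a file to read(). The following is a minimal sketch using only the JDK. SingleByteCheck and isSingleByteOnly are hypothetical names, not part of XmlIO, and the check is a heuristic on already-decoded text rather than a guarantee about how the source splits a file.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of XmlIO: detects text that needs multi-byte encoding. */
final class SingleByteCheck {

  /**
   * Returns true if every character of text can be encoded in the given charset
   * and encodes to exactly one byte.
   */
  static boolean isSingleByteOnly(String text, Charset charset) {
    CharsetEncoder encoder = charset.newEncoder();
    // Iterate by code point so that a surrogate pair is treated as one character.
    return text.codePoints().allMatch(cp -> {
      String ch = new String(Character.toChars(cp));
      // canEncode guards against unmappable characters being replaced by a 1-byte '?'.
      return encoder.canEncode(ch) && ch.getBytes(charset).length == 1;
    });
  }

  public static void main(String[] args) {
    // Plain ASCII content: one byte per character in UTF-8.
    System.out.println(isSingleByteOnly("<record>abc</record>", StandardCharsets.UTF_8)); // true
    // U+00E9 (e with acute accent) takes two bytes in UTF-8.
    System.out.println(isSingleByteOnly("caf\u00e9", StandardCharsets.UTF_8)); // false
  }
}
```

The same text can pass under one charset and fail under another: U+00E9 is a single byte in ISO-8859-1 but two bytes in UTF-8, so the check should use the charset that will be given to XmlIO.Read.withCharset(java.nio.charset.Charset).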
Type Parameters:
T - Type of the objects that represent the records of the XML file. The PCollection generated by this source will be of this type.
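To make the document shape required by read() concrete, the sketch below checks it with the JDK's StAX parser: one root element, only record elements directly beneath it, and no record element nested inside another record. XmlShapeCheck and hasExpectedShape are hypothetical names introduced for illustration. This is not how XmlIO validates or splits its input, only a way to test a sample file under those assumptions.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper, not part of XmlIO: checks the root/record document shape. */
final class XmlShapeCheck {

  /**
   * Returns true if the document's root element is named root, every direct child
   * of the root is named record, and no record element appears inside another record.
   * Returns false for input that is not well-formed XML.
   */
  static boolean hasExpectedShape(String xml, String root, String record) {
    try {
      XMLInputFactory factory = XMLInputFactory.newInstance();
      // DTDs are not needed for this check, so disable them.
      factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
      XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
      int depth = 0;
      while (reader.hasNext()) {
        int event = reader.next();
        if (event == XMLStreamConstants.START_ELEMENT) {
          depth++;
          String name = reader.getLocalName();
          if (depth == 1 && !name.equals(root)) return false;   // wrong root element
          if (depth == 2 && !name.equals(record)) return false; // non-record child of root
          if (depth > 2 && name.equals(record)) return false;   // record tag inside a record
        } else if (event == XMLStreamConstants.END_ELEMENT) {
          depth--;
        }
      }
      return true;
    } catch (XMLStreamException e) {
      return false; // not well-formed
    }
  }

  public static void main(String[] args) {
    String ok = "<root><record><a>1</a></record><record/></root>";
    String nested = "<root><record><record/></record></root>";
    System.out.println(hasExpectedShape(ok, "root", "record"));     // true
    System.out.println(hasExpectedShape(nested, "root", "record")); // false
  }
}
```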
-
readFiles
Like read(), but reads each file in a PCollection of FileIO.ReadableFile, which allows more flexible usage via different configuration options of FileIO.match() and FileIO.readMatches() that are not explicitly provided for read().
For example:
    PCollection<ReadableFile> files = p
        .apply(FileIO.match()
            .filepattern(options.getInputFilepatternProvider())
            .continuously(
                Duration.standardSeconds(30),
                afterTimeSinceNewOutput(Duration.standardMinutes(5))))
        .apply(FileIO.readMatches().withCompression(GZIP));
    PCollection<Record> output = files.apply(XmlIO.<Record>readFiles()
        .withRootElement("root")
        .withRecordElement("record")
        .withRecordClass(Record.class));
-
write
Writes all elements in the input PCollection to a single XML file using sink(java.lang.Class<T>).
For more configurable usage, use sink(java.lang.Class<T>) directly with FileIO.write() or FileIO.writeDynamic().
-
sink
Outputs records as XML-formatted elements using JAXB.
The produced file consists of a single root element containing one sub-element per element written to the sink.
The given class is used to marshal records in an input PCollection to their XML representation and must be bindable using JAXB annotations.
For example, consider the following class with JAXB annotations:
    @XmlRootElement(name = "word_count_result")
    @XmlType(propOrder = {"word", "frequency"})
    public class WordFrequency {
      public String word;
      public long frequency;
    }
The following produces XML output with a root element named "words" from a PCollection of WordFrequency objects:
    p.apply(FileIO.<WordFrequency>write()
        .via(XmlIO.sink(WordFrequency.class).withRootElement("words"))
        .to(prefixAndShardTemplate("...", DEFAULT_UNWINDOWED_SHARD_TEMPLATE + ".xml")));
The output will look like:
    <words>
      <word_count_result>
        <word>decreased</word>
        <frequency>1</frequency>
      </word_count_result>
      <word_count_result>
        <word>War</word>
        <frequency>4</frequency>
      </word_count_result>
      <word_count_result>
        <word>empress'</word>
        <frequency>14</frequency>
      </word_count_result>
      <word_count_result>
        <word>stoops</word>
        <frequency>6</frequency>
      </word_count_result>
      ...
    </words>
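The output shape described above (one root element, one sub-element per record) can be reproduced by hand to see exactly what a reader of the file should expect. The sketch below uses the JDK's StAX writer instead of JAXB so that it runs without extra dependencies. WordsXmlSketch and toXml are hypothetical names, and the element names are taken from the WordFrequency example. XmlIO.sink itself marshals via JAXB, so this is an illustration of the layout, not of the implementation.

```java
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/** Hypothetical illustration, not XmlIO's implementation: hand-writes the output layout. */
final class WordsXmlSketch {

  /** Writes one word_count_result sub-element per map entry inside a single root element. */
  static String toXml(String rootElement, Map<String, Long> frequencies) {
    StringWriter out = new StringWriter();
    try {
      XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
      xml.writeStartElement(rootElement); // single root element
      for (Map.Entry<String, Long> entry : frequencies.entrySet()) {
        xml.writeStartElement("word_count_result"); // one sub-element per record
        xml.writeStartElement("word");
        xml.writeCharacters(entry.getKey());
        xml.writeEndElement();
        xml.writeStartElement("frequency");
        xml.writeCharacters(Long.toString(entry.getValue()));
        xml.writeEndElement();
        xml.writeEndElement();
      }
      xml.writeEndElement();
      xml.flush();
      xml.close();
    } catch (XMLStreamException e) {
      throw new IllegalStateException("Failed to write XML", e);
    }
    return out.toString();
  }

  public static void main(String[] args) {
    Map<String, Long> frequencies = new LinkedHashMap<>(); // keeps insertion order
    frequencies.put("War", 4L);
    frequencies.put("stoops", 6L);
    System.out.println(toXml("words", frequencies));
  }
}
```

Unlike the formatted sample output above, this sketch emits the elements without indentation or line breaks.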
-