public class XmlIO
extends java.lang.Object
Modifier and Type | Class and Description |
---|---|
static class | XmlIO.Read<T> Implementation of read(). |
static class | XmlIO.Write<T> Implementation of write(). |
Constructor and Description |
---|
XmlIO() |
Modifier and Type | Method and Description |
---|---|
static <T> XmlIO.Read<T> | read() Reads XML files. |
static <T> XmlIO.Write<T> | write() A FileBasedSink that outputs records as XML-formatted elements. |
public static <T> XmlIO.Read<T> read()
Reads XML files and creates a PCollection of a given type. Please note the example given below.
The XML file must be of the following form, where root and record are XML element names that are defined by the user:
<root>
  <record> ... </record>
  <record> ... </record>
  <record> ... </record>
  ...
  <record> ... </record>
</root>
The XML document should contain a single root element whose children are all record elements. The records may contain arbitrary XML content; however, that content must not contain the start <record> or end </record> tags. This restriction enables reading from large XML files in parallel from different offsets in the file.
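To illustrate why the restriction matters, here is a minimal sketch of how a reader that starts at an arbitrary byte offset could locate the next record by scanning for the record start tag. This is not Beam's actual implementation; RecordScanner and nextRecordStart are hypothetical names used only for illustration, and the sketch assumes a single-byte-compatible encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Beam: finds record boundaries by tag scanning.
final class RecordScanner {

    /**
     * Returns the byte offset of the first start tag of recordElement
     * at or after fromOffset, or -1 if there is none.
     */
    static int nextRecordStart(byte[] xml, String recordElement, int fromOffset) {
        byte[] tag = ("<" + recordElement).getBytes(StandardCharsets.UTF_8);
        // Stop early enough that the byte following the tag name can be inspected.
        for (int i = Math.max(fromOffset, 0); i + tag.length < xml.length; i++) {
            if (startsWithAt(xml, tag, i) && endsTagName(xml[i + tag.length])) {
                return i;
            }
        }
        return -1;
    }

    private static boolean startsWithAt(byte[] data, byte[] prefix, int offset) {
        for (int j = 0; j < prefix.length; j++) {
            if (data[offset + j] != prefix[j]) {
                return false;
            }
        }
        return true;
    }

    // The tag name must end here, so that <recordset> is not mistaken for <record>.
    private static boolean endsTagName(byte b) {
        return b == '>' || b == '/' || b == ' ' || b == '\t' || b == '\n' || b == '\r';
    }

    public static void main(String[] args) {
        byte[] xml = "<root><record>a</record><record>b</record></root>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(nextRecordStart(xml, "record", 0)); // prints 6
        System.out.println(nextRecordStart(xml, "record", 7)); // prints 24
    }
}
```

If a record's content were allowed to contain a nested <record> tag, a scan like this could not tell a real record boundary from nested content, which is why the tags must not appear inside records.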
Root and/or record elements may additionally contain an arbitrary number of XML attributes.
Additionally, users must provide a class of a JAXB-annotated Java type that can be used to convert records into Java objects and vice versa using JAXB marshalling/unmarshalling mechanisms. Reading the source will generate a PCollection of the given JAXB-annotated Java type.
Optionally users may provide a minimum size of a bundle that should be created for the source.
The following example shows how to use this method in a Beam pipeline:
PCollection<Record> output = p.apply(XmlIO.<Record>read()
.from(file.toPath().toString())
.withRootElement("root")
.withRecordElement("record")
.withRecordClass(Record.class));
By default, the UTF-8 charset is used. If your file uses a different charset, specify it as follows:
PCollection<Record> output = p.apply(XmlIO.<Record>read()
.from(file.toPath().toString())
.withRootElement("root")
.withRecordElement("record")
.withRecordClass(Record.class)
.withCharset(StandardCharsets.ISO_8859_1));
StandardCharsets provides static references to common charsets.
Currently, only XML files that use single-byte characters are supported. Using a file that contains multi-byte characters may result in data loss or duplication.
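As a rough pre-flight check for the single-byte limitation, the following sketch tests whether a piece of text encodes to exactly one byte per character in a given charset. It uses only the JDK; SingleByteCheck and isSingleByte are hypothetical names, and the check describes the text itself, not how any particular runner handles the file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Beam.
final class SingleByteCheck {

    /** Returns true if every character of text encodes to exactly one byte in charset. */
    static boolean isSingleByte(String text, Charset charset) {
        // canEncode guards against unmappable characters, which getBytes
        // would otherwise silently replace with a substitute byte.
        return charset.newEncoder().canEncode(text)
                && text.getBytes(charset).length == text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café"
        System.out.println(isSingleByte(text, StandardCharsets.ISO_8859_1)); // prints true
        System.out.println(isSingleByte(text, StandardCharsets.UTF_8));      // prints false
    }
}
```

The same text can be safe in one charset and not in another: the accented character takes one byte in ISO-8859-1 but two bytes in UTF-8.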
Permission requirements depend on the PipelineRunner that is used to execute the Beam pipeline. Please refer to the documentation of the corresponding PipelineRunners for more details.
T - Type of the objects that represent the records of the XML file. The PCollection generated by this source will be of this type.
public static <T> XmlIO.Write<T> write()
A FileBasedSink that outputs records as XML-formatted elements. Writes a PCollection of records from JAXB-annotated classes to a single file location.
Given a PCollection containing records of type T that can be marshalled to XML elements, this Sink will produce a single file consisting of a single root element that contains all of the elements in the PCollection.
XML Sinks are created with a base filename to write to, a root element name that will be used for the root element of the output files, and a class to bind to an XML element. This class will be used in the marshalling of records in an input PCollection to their XML representation and must be able to be bound using JAXB annotations (checked at pipeline construction time).
XML Sinks can be written to using the XmlIO.Write transform:
p.apply(XmlIO.<Type>write()
.withRecordClass(Type.class)
.withRootElement(root_element)
.toFilenamePrefix(output_filename));
For example, consider the following class with JAXB annotations:
@XmlRootElement(name = "word_count_result")
@XmlType(propOrder = {"word", "frequency"})
public class WordFrequency {
  private String word;
  private long frequency;
  public WordFrequency() { }
  public WordFrequency(String word, long frequency) {
    this.word = word;
    this.frequency = frequency;
  }
  public void setWord(String word) {
    this.word = word;
  }
  public void setFrequency(long frequency) {
    this.frequency = frequency;
  }
  public long getFrequency() {
    return frequency;
  }
  public String getWord() {
    return word;
  }
}
The following will produce XML output with a root element named "words" from a PCollection of WordFrequency objects:
p.apply(XmlIO.<WordFrequency>write()
.withRecordClass(WordFrequency.class)
.withRootElement("words")
.toFilenamePrefix(output_file));
The output will look like:
<words>
  <word_count_result>
    <word>decreased</word>
    <frequency>1</frequency>
  </word_count_result>
  <word_count_result>
    <word>War</word>
    <frequency>4</frequency>
  </word_count_result>
  <word_count_result>
    <word>empress'</word>
    <frequency>14</frequency>
  </word_count_result>
  <word_count_result>
    <word>stoops</word>
    <frequency>6</frequency>
  </word_count_result>
  ...
</words>
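To check output of this shape outside a pipeline, a small reader built on the JDK's DOM parser can be enough. The sketch below assumes the whole file fits in memory and that every word_count_result element has one word and one frequency child, as in the sample above; WordCountOutputReader is a hypothetical name and is not part of Beam.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Beam.
final class WordCountOutputReader {

    /** Parses XML shaped like the sample output into a word-to-frequency map. */
    static Map<String, Long> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList results =
                    doc.getDocumentElement().getElementsByTagName("word_count_result");
            // LinkedHashMap keeps the records in document order.
            Map<String, Long> frequencies = new LinkedHashMap<>();
            for (int i = 0; i < results.getLength(); i++) {
                Element result = (Element) results.item(i);
                String word = result.getElementsByTagName("word").item(0).getTextContent();
                String count = result.getElementsByTagName("frequency").item(0).getTextContent();
                frequencies.put(word, Long.parseLong(count.trim()));
            }
            return frequencies;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse word count XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<words>"
                + "<word_count_result><word>War</word><frequency>4</frequency></word_count_result>"
                + "<word_count_result><word>stoops</word><frequency>6</frequency></word_count_result>"
                + "</words>";
        System.out.println(parse(xml)); // prints {War=4, stoops=6}
    }
}
```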
By default the UTF-8 charset is used. This can be overridden, for example:
p.apply(XmlIO.<Type>write()
.withRecordClass(Type.class)
.withRootElement(root_element)
.withCharset(StandardCharsets.ISO_8859_1)
.toFilenamePrefix(output_filename));