Class ContextualTextIO
PTransforms that read text files and collect contextual information of the elements in the input.
Prefer TextIO when the files do not contain multi-line records and additional record metadata is not required.
Reading from text files
To read a PCollection from one or more text files, use ContextualTextIO.read(). To instantiate a transform, use ContextualTextIO.Read.from(String) and specify the path of the file(s) to be read.
Alternatively, if the filenames to be read are themselves in a PCollection, you can use FileIO to match them and readFiles() to read them.
read() returns a PCollection of Rows with schema RecordWithMetadata.getSchema(), each corresponding to one line of an input UTF-8 text file (split into lines delimited by '\n', '\r', '\r\n', or a delimiter specified via ContextualTextIO.Read.withDelimiter(byte[])).
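To make the default delimiter rule concrete, here is a minimal plain-Java sketch of splitting text at '\n', '\r', or '\r\n'. It is an illustration only, not Beam's implementation; the class name DefaultLineSplitSketch and the method splitLines are hypothetical and do not exist in the Beam API.

```java
import java.util.ArrayList;
import java.util.List;

final class DefaultLineSplitSketch {
  // Hypothetical helper (not part of Beam): splits text at "\n", "\r", or "\r\n".
  static List<String> splitLines(String text) {
    List<String> lines = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      if (c == '\n' || c == '\r') {
        // Treat "\r\n" as one delimiter by skipping the '\n' that follows '\r'.
        if (c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n') {
          i++;
        }
        lines.add(current.toString());
        current.setLength(0);
      } else {
        current.append(c);
      }
    }
    // A final line without a trailing delimiter is still emitted.
    if (current.length() > 0) {
      lines.add(current.toString());
    }
    return lines;
  }

  public static void main(String[] args) {
    // Mixed delimiters in a single input.
    System.out.println(splitLines("a\nb\rc\r\nd")); // prints [a, b, c, d]
  }
}
```

In a pipeline, each such line would be carried in a Row together with its metadata rather than as a bare String.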
Filepattern expansion and watching
By default, the filepatterns are expanded only once. The combination of FileIO.Match.continuously(Duration, TerminationCondition) and readFiles() allows streaming of new files matching the filepattern(s).
By default, read() prohibits filepatterns that match no files, and readFiles() allows them in case the filepattern contains a glob wildcard character. Use ContextualTextIO.Read.withEmptyMatchTreatment(org.apache.beam.sdk.io.fs.EmptyMatchTreatment) or FileIO.Match.withEmptyMatchTreatment(EmptyMatchTreatment) plus readFiles() to configure this behavior.
Example 1: reading a file or filepattern.
Pipeline p = ...;
// A simple Read of a file:
PCollection<Row> records = p.apply(ContextualTextIO.read().from("/local/path/to/file.txt"));
Example 2: reading a PCollection of filenames.
Pipeline p = ...;
// E.g. the filenames might be computed from other data in the pipeline, or
// read from a data source.
PCollection<String> filenames = ...;
// Read all files in the collection.
PCollection<Row> records =
    filenames
        .apply(FileIO.matchAll())
        .apply(FileIO.readMatches())
        .apply(ContextualTextIO.readFiles());
Example 3: streaming new files matching a filepattern.
Pipeline p = ...;
PCollection<Row> records = p.apply(ContextualTextIO.read()
    .from("/local/path/to/files/*")
    .watchForNewFiles(
        // Check for new files every minute
        Duration.standardMinutes(1),
        // Stop watching the filepattern if no new files appear within an hour
        afterTimeSinceNewOutput(Duration.standardHours(1))));
Example 4: reading a file or file pattern of RFC4180-compliant CSV files with fields that may contain line breaks.
An example of such a file:
"aaa","b CRLF bb","ccc" CRLF zzz,yyy,xxx
Pipeline p = ...;
PCollection<Row> records = p.apply(ContextualTextIO.read()
    .from("/local/path/to/files/*.csv")
    .withHasMultilineCSVRecords(true));
Example 5: reading while watching for new files.
Pipeline p = ...;
PCollection<Row> records = p.apply(FileIO.match()
        .filepattern("filepattern")
        .continuously(
            Duration.millis(100),
            Watch.Growth.afterTimeSinceNewOutput(Duration.standardSeconds(3))))
    .apply(FileIO.readMatches())
    .apply(ContextualTextIO.readFiles());
Example 6: reading with recordNum metadata.
Pipeline p = ...;
PCollection<Row> records = p.apply(ContextualTextIO.read()
    .from("/local/path/to/files/*.csv")
    .withRecordNumMetadata());
NOTE: When using ContextualTextIO.Read.withHasMultilineCSVRecords(Boolean), a single reader will be used to process the file, rather than multiple readers which can read from different offsets. For a large file this can result in lower performance.
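One way to see why multi-line CSV records are read sequentially: whether a line break ends a record depends on whether it sits inside a quoted field, and that quote state depends on the text before it. The following plain-Java sketch illustrates the idea; it is not Beam's implementation, and the class name MultilineCsvSketch and the method splitRecords are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class MultilineCsvSketch {
  // Hypothetical helper (not part of Beam): splits CSV text into records,
  // treating line breaks inside double-quoted fields as field content.
  static List<String> splitRecords(String text) {
    List<String> records = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    boolean inQuotes = false;
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      if (c == '"') {
        // An escaped quote ("") toggles twice, leaving the state unchanged.
        inQuotes = !inQuotes;
        current.append(c);
      } else if (!inQuotes && (c == '\n' || c == '\r')) {
        // Outside quotes, "\n", "\r", or "\r\n" ends the record.
        if (c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n') {
          i++;
        }
        records.add(current.toString());
        current.setLength(0);
      } else {
        // Inside quotes, line breaks are kept as part of the field.
        current.append(c);
      }
    }
    if (current.length() > 0) {
      records.add(current.toString());
    }
    return records;
  }

  public static void main(String[] args) {
    // The file from Example 4, with CRLF written out as "\r\n".
    String csv = "\"aaa\",\"b\r\nbb\",\"ccc\"\r\nzzz,yyy,xxx";
    System.out.println(splitRecords(csv).size()); // prints 2
  }
}
```

A reader that started in the middle of this input could not tell whether the first line break it met was inside a quoted field, which is consistent with the single-reader behavior described in the note above.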
NOTE: Use ContextualTextIO.Read.withRecordNumMetadata() when recordNum metadata is required. Computing absolute record positions currently introduces a grouping step, which increases the resources used by the pipeline. By default, withRecordNumMetadata is false; in that case, record objects will not contain absolute record positions within the entire file, but will still contain positions relative to their respective offsets.
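The relationship between relative and absolute positions can be pictured with a small plain-Java sketch. It is a simplification under stated assumptions, not Beam's implementation: it assumes each range is identified by its start offset, that the record count of every range is known, and that positions are zero-based. The class RecordNumSketch and its methods are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

final class RecordNumSketch {
  // Hypothetical helper: for each range start offset, the number of records
  // contained in all ranges that start earlier in the file.
  static Map<Long, Long> recordsBeforeRange(Map<Long, Long> recordsPerRange) {
    Map<Long, Long> before = new LinkedHashMap<>();
    long running = 0;
    // TreeMap iterates range offsets in ascending order.
    for (Map.Entry<Long, Long> e : new TreeMap<>(recordsPerRange).entrySet()) {
      before.put(e.getKey(), running);
      running += e.getValue();
    }
    return before;
  }

  // Absolute position = records in earlier ranges + position within this range.
  static long absoluteRecordNum(
      Map<Long, Long> before, long rangeOffset, long recordNumInOffset) {
    return before.get(rangeOffset) + recordNumInOffset;
  }

  public static void main(String[] args) {
    // Assumed input: three ranges starting at offsets 0, 100, and 200,
    // holding 3, 2, and 4 records respectively.
    Map<Long, Long> counts = Map.of(0L, 3L, 100L, 2L, 200L, 4L);
    Map<Long, Long> before = recordsBeforeRange(counts);
    System.out.println(absoluteRecordNum(before, 200L, 1L)); // prints 6
  }
}
```

Collecting the per-range counts needs information from every range of the file, which gives an intuition for why absolute positions require a grouping step while relative positions do not.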
Reading a very large number of files
If it is known that the filepattern will match a very large number of files (e.g. tens of thousands or more), use ContextualTextIO.Read.withHintMatchesManyFiles() for better performance and scalability. Note that it may decrease performance if the filepattern matches only a small number of files.
Nested Class Summary
static class ContextualTextIO.Read
    Implementation of read().
static class ContextualTextIO.ReadFiles
    Implementation of readFiles().
Method Summary
static ContextualTextIO.Read read()
    A PTransform that reads from one or more text files and returns a bounded PCollection containing one element for each line in the input files.
static ContextualTextIO.ReadFiles readFiles()
    Like read(), but reads each file in a PCollection of FileIO.ReadableFile, returned by FileIO.readMatches().
Method Details
read
    A PTransform that reads from one or more text files and returns a bounded PCollection containing one element for each line in the input files.
readFiles
    Like read(), but reads each file in a PCollection of FileIO.ReadableFile, returned by FileIO.readMatches().