Class TikaIO
Tika is able to extract text and metadata from files in many well-known text, binary, and scientific formats.

The entry points are parse() and parseFiles(). They parse a set of files and return a PCollection containing one ParseResult per file. parse() implements the common case of parsing all files matching a single filepattern, while parseFiles() should be used for all use cases requiring more control, in combination with FileIO.match() and FileIO.readMatches() (see their respective documentation).
parse() does not automatically uncompress compressed files: they are passed to Tika as-is.
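Because compressed files reach Tika unchanged, a caller who wants the uncompressed content parsed may need to decompress it first. The following is a minimal stand-alone sketch of that idea using only java.util.zip; it does not use TikaIO or Beam, and the class and method names (GzipBeforeParseSketch, looksGzipped, gzip, gunzip) are hypothetical and chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of TikaIO: shows what "as-is" means for gzip data.
class GzipBeforeParseSketch {
    // Gzip streams start with the two magic bytes 0x1f 0x8b.
    static boolean looksGzipped(byte[] data) {
        return data.length >= 2
            && (data[0] & 0xff) == 0x1f
            && (data[1] & 0xff) == 0x8b;
    }

    // Compresses a byte array so the example has gzip input to work with.
    static byte[] gzip(byte[] plain) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(plain);
        }
        return buffer.toByteArray();
    }

    // Decompresses gzip bytes; the result is what a caller would hand to a parser.
    static byte[] gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "hello tika".getBytes(StandardCharsets.UTF_8);
        byte[] compressed = gzip(original);
        // The compressed bytes differ from the original text and carry the gzip header.
        System.out.println("gzipped? " + looksGzipped(compressed));
        System.out.println(new String(gunzip(compressed), StandardCharsets.UTF_8));
    }
}
```

Inside a pipeline, this kind of pre-processing belongs to the "more control" cases that the text assigns to parseFiles() together with FileIO.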
It's possible that some files will partially or completely fail to parse. In that case, the respective ParseResult will be marked unsuccessful (see ParseResult.isSuccess()) and will contain the error, available via ParseResult.getError().
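The success check described above can be sketched in plain Java. This is an illustration only: FakeResult is a hypothetical stand-in that mirrors just the two accessors named in the text (isSuccess() and getError()) plus an assumed getLocation(); it is not the real ParseResult class, and it does not model partial failures that still carry content.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ParseOutcomeSketch {
    // Hypothetical stand-in for ParseResult; only isSuccess() and getError()
    // come from the documentation, getLocation() is an assumption.
    static final class FakeResult {
        private final String location;
        private final Throwable error; // null when parsing succeeded

        FakeResult(String location, Throwable error) {
            this.location = location;
            this.error = error;
        }

        String getLocation() { return location; }
        boolean isSuccess() { return error == null; }
        Throwable getError() { return error; }
    }

    // Collects a one-line description of every unsuccessful result.
    static List<String> describeFailures(List<FakeResult> results) {
        List<String> failures = new ArrayList<>();
        for (FakeResult r : results) {
            if (!r.isSuccess()) {
                failures.add(r.getLocation() + ": " + r.getError().getMessage());
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        List<FakeResult> results = Arrays.asList(
            new FakeResult("a.pdf", null),
            new FakeResult("b.pdf", new IllegalStateException("truncated file")));
        System.out.println(describeFailures(results));
    }
}
```

In a real pipeline the same check would run per element of the PCollection of ParseResult rather than over an in-memory list.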
Example: using parse() to parse all PDF files in a directory on GCS.

    Pipeline p = ...;

    PCollection<ParseResult> results =
        p.apply(TikaIO.parse().filepattern("gs://my-bucket/files/*.pdf"));
Example: using parseFiles() in combination with FileIO to continuously parse new PDF files arriving into the directory.

    Pipeline p = ...;

    PCollection<ParseResult> results =
        p.apply(FileIO.match().filepattern("gs://my-bucket/files/*.pdf")
                .continuously(...))
         .apply(FileIO.readMatches())
         .apply(TikaIO.parseFiles());

Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
static class | TikaIO.Parse | Implementation of parse().
static class | TikaIO.ParseFiles | Implementation of parseFiles().
Constructor Summary
Constructors
TikaIO()
Method Summary
Modifier and Type | Method | Description
static TikaIO.Parse | parse() | Parses files matching a given filepattern.
static TikaIO.ParseFiles | parseFiles() | Parses files in a PCollection of FileIO.ReadableFile.
Constructor Details
TikaIO
public TikaIO()
Method Details
parse
Parses files matching a given filepattern.
parseFiles
Parses files in a PCollection of FileIO.ReadableFile.