@Experimental(value=SOURCE_SINK) public class TikaIO extends java.lang.Object
Tika can extract text and metadata from files in many well-known text, binary, and scientific formats.
The entry points are parse() and parseFiles(). They parse a set of files and return a PCollection containing one ParseResult per file. parse() implements the common case of parsing all files matching a single filepattern, while parseFiles() should be used for all use cases requiring more control, in combination with FileIO.match() and FileIO.readMatches() (see their respective documentation).
parse() does not automatically uncompress compressed files: they are passed to Tika as-is.
Some files may fail to parse, either partially or completely. In that case, the corresponding ParseResult is marked unsuccessful (see ParseResult.isSuccess()) and contains the error, available via ParseResult.getError().
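The success check described above can be sketched in plain Java. This is a minimal illustration, not Beam code: `Outcome` is a hypothetical stand-in for ParseResult that mirrors only isSuccess() and getError() (plus an illustrative file location field), and `triage` is an assumed helper name. It only shows the pattern of separating failed results from successful ones before using the extracted content.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ParseResultTriage {

  /** Hypothetical stand-in for ParseResult; not the Beam class. */
  static final class Outcome {
    private final String fileLocation; // assumption: illustrative field
    private final Throwable error;     // null when parsing succeeded

    Outcome(String fileLocation, Throwable error) {
      this.fileLocation = fileLocation;
      this.error = error;
    }

    boolean isSuccess() { return error == null; }
    Throwable getError() { return error; }
    String getFileLocation() { return fileLocation; }
  }

  /** Splits outcomes into successes (key true) and failures (key false). */
  static Map<Boolean, List<Outcome>> triage(List<Outcome> outcomes) {
    return outcomes.stream().collect(Collectors.partitioningBy(Outcome::isSuccess));
  }

  public static void main(String[] args) {
    List<Outcome> outcomes = Arrays.asList(
        new Outcome("gs://my-bucket/files/a.pdf", null),
        new Outcome("gs://my-bucket/files/b.pdf", new java.io.IOException("truncated file")));

    Map<Boolean, List<Outcome>> byStatus = triage(outcomes);

    // Report each failure together with the error it carries.
    for (Outcome failed : byStatus.get(false)) {
      System.err.println(failed.getFileLocation() + " failed: " + failed.getError().getMessage());
    }
    System.out.println(byStatus.get(true).size() + " file(s) parsed successfully");
  }
}
```

In an actual pipeline the same check would presumably run inside a transform applied to the PCollection of ParseResult, so that failed files can be routed to a separate output instead of being silently dropped.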
Example: using parse() to parse all PDF files in a directory on GCS.

    Pipeline p = ...;

    PCollection<ParseResult> results =
        p.apply(TikaIO.parse().filepattern("gs://my-bucket/files/*.pdf"));

Example: using parseFiles() in combination with FileIO to continuously parse new PDF files arriving into the directory.

    Pipeline p = ...;

    PCollection<ParseResult> results =
        p.apply(FileIO.match().filepattern("gs://my-bucket/files/*.pdf")
                .continuously(...))
         .apply(FileIO.readMatches())
         .apply(TikaIO.parseFiles());

Modifier and Type | Class and Description |
---|---|
static class | TikaIO.Parse: Implementation of parse(). |
static class | TikaIO.ParseFiles: Implementation of parseFiles(). |
Constructor and Description |
---|
TikaIO() |
Modifier and Type | Method and Description |
---|---|
static TikaIO.Parse | parse(): Parses files matching a given filepattern. |
static TikaIO.ParseFiles | parseFiles(): Parses files in a PCollection of FileIO.ReadableFile. |
public static TikaIO.Parse parse()
Parses files matching a given filepattern.
public static TikaIO.ParseFiles parseFiles()
Parses files in a PCollection of FileIO.ReadableFile.