Class TikaIO

java.lang.Object
org.apache.beam.sdk.io.tika.TikaIO

public class TikaIO extends Object
Transforms for parsing arbitrary files using Apache Tika.

Tika can extract text and metadata from files in many well-known text, binary, and scientific formats.

The entry points are parse() and parseFiles(). Both parse a set of files and return a PCollection containing one ParseResult per file. parse() covers the common case of parsing all files that match a single filepattern. parseFiles() is for use cases that need more control, and is used in combination with FileIO.match() and FileIO.readMatches() (see their respective documentation).

parse() does not automatically uncompress compressed files: they are passed to Tika as-is.

Some files may fail to parse, either partially or completely. In that case, the corresponding ParseResult is marked unsuccessful (see ParseResult.isSuccess()) and contains the error, available via ParseResult.getError().
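The success/failure split described above can be sketched in plain Java. This is a minimal illustration rather than Beam code: Outcome is a hypothetical stand-in for ParseResult so that the sketch runs without Beam on the classpath. It mirrors isSuccess() and getError() as named above, and its getFileLocation() and getContent() accessors, along with the helper methods, are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ParseOutcomeSketch {

  /** Hypothetical stand-in for ParseResult; not the Beam class. */
  public static final class Outcome {
    private final String fileLocation;
    private final String content;
    private final Throwable error; // null when parsing succeeded

    public Outcome(String fileLocation, String content, Throwable error) {
      this.fileLocation = fileLocation;
      this.content = content;
      this.error = error;
    }

    public boolean isSuccess() { return error == null; }
    public Throwable getError() { return error; }
    public String getFileLocation() { return fileLocation; }
    public String getContent() { return content; }
  }

  /** Collects the extracted text of every outcome that parsed successfully. */
  public static List<String> successfulContents(List<Outcome> outcomes) {
    List<String> contents = new ArrayList<>();
    for (Outcome outcome : outcomes) {
      if (outcome.isSuccess()) {
        contents.add(outcome.getContent());
      }
    }
    return contents;
  }

  /** Builds one "location: message" line for every outcome that failed. */
  public static List<String> failureReports(List<Outcome> outcomes) {
    List<String> reports = new ArrayList<>();
    for (Outcome outcome : outcomes) {
      if (!outcome.isSuccess()) {
        reports.add(outcome.getFileLocation() + ": " + outcome.getError().getMessage());
      }
    }
    return reports;
  }

  public static void main(String[] args) {
    // Sample data invented for illustration only.
    List<Outcome> outcomes = Arrays.asList(
        new Outcome("a.pdf", "hello", null),
        new Outcome("b.pdf", "", new java.io.IOException("corrupt")));

    System.out.println(successfulContents(outcomes));
    System.out.println(failureReports(outcomes));
  }
}
```

In a real pipeline the same check would typically run per element of the PCollection of ParseResult, for example to route failed files to a separate output for inspection.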

Example: using parse() to parse all PDF files in a directory on GCS.


 Pipeline p = ...;

 PCollection<ParseResult> results =
   p.apply(TikaIO.parse().filepattern("gs://my-bucket/files/*.pdf"));
 

Example: using parseFiles() in combination with FileIO to continuously parse new PDF files as they arrive in the directory.


 Pipeline p = ...;

 PCollection<ParseResult> results =
   p.apply(FileIO.match().filepattern("gs://my-bucket/files/*.pdf")
       .continuously(...))
    .apply(FileIO.readMatches())
    .apply(TikaIO.parseFiles());