org.apache.beam.sdk.io.tika.TikaIO

public class TikaIO extends Object

Transforms for parsing arbitrary files using Apache Tika.

Tika can extract text and metadata from files in many well-known text, binary, and scientific formats.

The entry points are parse() and parseFiles(). Both parse a set of files and return a PCollection containing one ParseResult per file. parse() covers the common case of parsing all files that match a single filepattern. parseFiles() is for use cases that need more control, and is used in combination with FileIO.match() and FileIO.readMatches() (see their respective documentation).

parse() does not automatically uncompress compressed files: they are passed to Tika as-is.

Some files may partially or completely fail to parse. In that case, the corresponding ParseResult is marked unsuccessful (see ParseResult.isSuccess()) and contains the error, available via ParseResult.getError().
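
The success/failure check can be sketched in plain Java. This is a stand-in with no Beam dependency, not the library's implementation: Outcome is a hypothetical class that mirrors only the isSuccess() and getError() accessors named above, and its other accessors, the failedLocations helper, and the file names are assumptions made for illustration. In a real pipeline the same check would run inside a transform applied to the PCollection of ParseResult.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class ParseOutcomeSketch {

  /** Hypothetical stand-in for ParseResult; NOT the Beam class. */
  static final class Outcome {
    private final String fileLocation;
    private final String content;  // may hold partial text even when error != null
    private final Throwable error; // null when parsing succeeded

    Outcome(String fileLocation, String content, Throwable error) {
      this.fileLocation = fileLocation;
      this.content = content;
      this.error = error;
    }

    boolean isSuccess() { return error == null; }
    Throwable getError() { return error; }
    String getFileLocation() { return fileLocation; }
    String getContent() { return content; }
  }

  /** Collects the locations of all outcomes marked unsuccessful. */
  static List<String> failedLocations(List<Outcome> outcomes) {
    List<String> failed = new ArrayList<>();
    for (Outcome o : outcomes) {
      if (!o.isSuccess()) {
        failed.add(o.getFileLocation());
      }
    }
    return failed;
  }

  public static void main(String[] args) {
    // Made-up sample data: one clean parse and one partial failure.
    List<Outcome> outcomes = List.of(
        new Outcome("a.pdf", "full text", null),
        new Outcome("b.pdf", "partial text", new IOException("truncated file")));

    for (Outcome o : outcomes) {
      if (o.isSuccess()) {
        System.out.println(o.getFileLocation() + ": " + o.getContent().length() + " chars");
      } else {
        // A failed outcome still carries its error for logging or dead-lettering.
        System.out.println(o.getFileLocation() + ": failed, " + o.getError().getMessage());
      }
    }
  }
}
```

Because a partial failure may still have produced some text, a caller that wants to keep it would inspect the content of unsuccessful results rather than discard them outright.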

Example: using parse() to parse all PDF files in a directory on GCS.


 Pipeline p = ...;

 PCollection<ParseResult> results =
   p.apply(TikaIO.parse().filepattern("gs://my-bucket/files/*.pdf"));

Example: using parseFiles() in combination with FileIO to continuously parse new PDF files arriving in the directory.


 Pipeline p = ...;

 PCollection<ParseResult> results =
   p.apply(FileIO.match().filepattern("gs://my-bucket/files/*.pdf")
       .continuously(...))
    .apply(FileIO.readMatches())
    .apply(TikaIO.parseFiles());

Nested Class Summary

Nested Classes

Modifier and Type    Class                Description
static class         TikaIO.Parse         Implementation of parse().
static class         TikaIO.ParseFiles    Implementation of parseFiles().

Constructor Summary

Constructors

Constructor    Description
TikaIO()

Method Summary

Modifier and Type           Method          Description
static TikaIO.Parse         parse()         Parses files matching a given filepattern.
static TikaIO.ParseFiles    parseFiles()    Parses files in a PCollection of FileIO.ReadableFile.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- TikaIO
  
  public TikaIO()

Method Details
- parse
  
  public static TikaIO.Parse parse()
  
  Parses files matching a given filepattern.
- parseFiles
  
  public static TikaIO.ParseFiles parseFiles()
  
  Parses files in a PCollection of FileIO.ReadableFile.
