@Experimental(value=SOURCE_SINK)
public class TikaIO extends java.lang.Object
Provides a PTransform for parsing arbitrary files using Apache Tika. Files in many well-known text, binary, or scientific formats can be processed.
To read a PCollection from one or more files, use TikaIO.Read.from(String) to specify the path of the file(s) to be read.
TikaIO.Read returns a bounded PCollection of Strings, each corresponding to a sequence of characters reported by the Apache Tika SAX parser.
Example:
Pipeline p = ...;
// A simple Read of a local PDF file (only runs locally):
PCollection<String> content = p.apply(TikaIO.read().from("/local/path/to/file.pdf"));
Warning: the API of this IO is likely to change in the next release.

| Modifier and Type | Class and Description |
|---|---|
| static class | TikaIO.Read: Implementation of read(). |

| Constructor and Description |
|---|
| TikaIO() |

| Modifier and Type | Method and Description |
|---|---|
| static TikaIO.Read | read(): A PTransform that parses one or more files and returns a bounded PCollection containing one element for each sequence of characters reported by the Apache Tika SAX parser. |

public static TikaIO.Read read()
A PTransform that parses one or more files and returns a bounded PCollection
containing one element for each sequence of characters reported by the Apache Tika SAX parser.
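
The phrase "one element for each sequence of characters reported by the SAX parser" refers to the SAX `characters` callback, which a parser invokes once per chunk of text it encounters. The following is a minimal sketch of that callback model only. It uses the JDK's built-in XML SAX parser as a stand-in, not Apache Tika or TikaIO, and the class name `SaxChunks` and method `chunks` are hypothetical names chosen for illustration. Note that a SAX parser may split a single run of text across several `characters` calls, so the number of chunks is not guaranteed to match the number of text nodes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of Beam or Tika.
public class SaxChunks {

  // Collects every character sequence reported through the SAX
  // characters() callback, in the order the parser reports them.
  static List<String> chunks(String xml) {
    List<String> out = new ArrayList<>();
    DefaultHandler handler = new DefaultHandler() {
      @Override
      public void characters(char[] ch, int start, int length) {
        // Each callback becomes one list entry, analogous to one
        // String element in the PCollection described above.
        out.add(new String(ch, start, length));
      }
    };
    try {
      SAXParserFactory.newInstance()
          .newSAXParser()
          .parse(new InputSource(new StringReader(xml)), handler);
    } catch (Exception e) {
      // Wrap checked parser exceptions to keep the sketch short.
      throw new IllegalStateException(e);
    }
    return out;
  }

  public static void main(String[] args) {
    for (String chunk : chunks("<doc><p>Hello</p><p>World</p></doc>")) {
      System.out.println(chunk);
    }
  }
}
```

Because chunk boundaries are decided by the parser, downstream code that needs the whole text of a document would have to join the reported sequences rather than rely on a particular element count.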