@Experimental(value=SOURCE_SINK) public class TikaIO extends java.lang.Object
PTransform for parsing arbitrary files using Apache Tika. Files in many well-known text, binary or scientific formats can be processed.
To read a PCollection from one or more files, use TikaIO.Read.from(String) to specify the path of the file(s) to be read.
TikaIO.Read returns a bounded PCollection of Strings, each corresponding to a sequence of characters reported by the Apache Tika SAX Parser.
Example:
Pipeline p = ...;
// A simple Read of a local PDF file (only runs locally):
PCollection<String> content = p.apply(TikaIO.read().from("/local/path/to/file.pdf"));
Warning: the API of this IO is likely to change in the next release.

Modifier and Type | Class and Description |
---|---|
static class | TikaIO.Read: Implementation of read(). |
Constructor and Description |
---|
TikaIO() |
Modifier and Type | Method and Description |
---|---|
static TikaIO.Read | read(): A PTransform that parses one or more files and returns a bounded PCollection containing one element for each sequence of characters reported by the Apache Tika SAX Parser. |
public static TikaIO.Read read()
A PTransform that parses one or more files and returns a bounded PCollection containing one element for each sequence of characters reported by the Apache Tika SAX Parser.
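
The phrase "one element for each sequence of characters reported by the SAX parser" may be easier to picture with a small stand-alone sketch. The code below is not TikaIO code and does not use Tika or Beam. It uses the JDK's built-in SAX parser on a sample XML string, on the assumption that Tika reports text through the same standard SAX characters(char[], int, int) callback. The class name SaxCharacterRuns, the method collect, and the sample input are hypothetical and exist only for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of Beam or Tika.
public class SaxCharacterRuns {

  // Returns one list entry per characters() callback issued by the SAX parser.
  // A parser is free to split a long text node into several callbacks,
  // so the number of entries is not guaranteed to match the number of text nodes.
  static List<String> collect(String xml) throws Exception {
    List<String> runs = new ArrayList<>();
    DefaultHandler handler = new DefaultHandler() {
      @Override
      public void characters(char[] ch, int start, int length) {
        // Copy the reported slice; the parser may reuse the buffer afterwards.
        runs.add(new String(ch, start, length));
      }
    };
    SAXParserFactory.newInstance()
        .newSAXParser()
        .parse(new InputSource(new StringReader(xml)), handler);
    return runs;
  }

  public static void main(String[] args) throws Exception {
    // Each printed line corresponds to one characters() callback.
    for (String run : collect("<doc><p>Hello</p><p>World</p></doc>")) {
      System.out.println(run);
    }
  }
}
```

Under that assumption, each String in the PCollection returned by read() plays the role of one entry in the list above. Because a SAX parser decides where to split text, downstream transforms should not rely on an element lining up with a paragraph or a line of the source file.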