public abstract static class TikaIO.Read extends PTransform<PBegin,PCollection<java.lang.String>>
TikaIO.read()
Modifier and Type | Field and Description |
---|---|
static long | DEFAULT_QUEUE_MAX_POLL_TIME |
static long | DEFAULT_QUEUE_POLL_TIME |
Fields inherited from class PTransform: name
Constructor and Description |
---|
Read() |
Modifier and Type | Method and Description |
---|---|
PCollection<java.lang.String> | expand(PBegin input): Override this method to specify how this PTransform should be expanded on the given InputT. |
TikaIO.Read | from(java.lang.String filepattern): A PTransform that parses one or more files with the given filename or filename pattern and returns a bounded PCollection containing one element for each sequence of characters reported by the Apache Tika SAX parser. |
TikaIO.Read | from(ValueProvider<java.lang.String> filepattern): Same as from(filepattern), but accepting a ValueProvider. |
protected Coder<java.lang.String> | getDefaultOutputCoder(): Returns the default Coder to use for the output of this single-output PTransform. |
void | populateDisplayData(DisplayData.Builder builder): Register display data for the given transform or component. |
TikaIO.Read | withContentTypeHint(java.lang.String contentType): Returns a new transform which will use the provided content type hint to make the file parser detection more efficient. |
TikaIO.Read | withInputMetadata(Metadata metadata): Returns a new transform which will use the provided input metadata for parsing the files. |
TikaIO.Read | withMinimumTextlength(java.lang.Integer value): Returns a new transform which will operate on the text blocks with the given minimum text length. |
TikaIO.Read | withOptions(TikaOptions options): Path to Tika configuration resource. |
TikaIO.Read | withParseSynchronously(java.lang.Boolean value): Returns a new transform which will use the synchronous reader. |
TikaIO.Read | withQueueMaxPollTime(java.lang.Long value): Returns a new transform which will use the specified queue max poll time. |
TikaIO.Read | withQueuePollTime(java.lang.Long value): Returns a new transform which will use the specified queue poll time. |
TikaIO.Read | withReadOutputMetadata(java.lang.Boolean value): Returns a new transform which will report the metadata. |
TikaIO.Read | withTikaConfigPath(java.lang.String tikaConfigPath): Returns a new transform which will use the custom TikaConfig. |
TikaIO.Read | withTikaConfigPath(ValueProvider<java.lang.String> tikaConfigPath): Same as withTikaConfigPath(tikaConfigPath), but accepting a ValueProvider. |
Methods inherited from class PTransform: getAdditionalInputs, getDefaultOutputCoder, getDefaultOutputCoder, getKindString, getName, toString, validate
public static final long DEFAULT_QUEUE_POLL_TIME
public static final long DEFAULT_QUEUE_MAX_POLL_TIME
public TikaIO.Read from(java.lang.String filepattern)
A PTransform that parses one or more files with the given filename or filename pattern and returns a bounded PCollection containing one element for each sequence of characters reported by the Apache Tika SAX parser.
Filepattern can be a local path (if running locally), or a Google Cloud Storage filename or filename pattern of the form "gs://<bucket>/<filepath>" (if running locally or using a remote execution service).
Standard Java Filesystem glob patterns ("*", "?", "[..]") are supported.
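The glob syntax mentioned above can be explored with the JDK alone. The following is a minimal sketch that uses `java.nio.file.PathMatcher` to show how "*", "?", and "[..]" behave; the class name `GlobPatternSketch` and the file names are hypothetical, and the sketch only illustrates standard Java glob rules. It does not show how TikaIO itself resolves a filepattern, which may differ in detail (for example for "gs://" paths).

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobPatternSketch {

    /** Returns true if fileName matches the glob under java.nio glob rules. */
    static boolean matches(String glob, String fileName) {
        // The "glob:" prefix selects glob syntax (as opposed to "regex:").
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(fileName));
    }

    public static void main(String[] args) {
        // "*" matches zero or more characters within a single name.
        System.out.println(matches("*.pdf", "report.pdf"));
        // "?" matches exactly one character.
        System.out.println(matches("doc?.txt", "doc1.txt"));
        System.out.println(matches("doc?.txt", "doc10.txt"));
        // "[ab]" matches a single character from the bracketed set.
        System.out.println(matches("[ab]*.txt", "a1.txt"));
        System.out.println(matches("[ab]*.txt", "c1.txt"));
    }
}
```

Checking a pattern this way before passing it to from(filepattern) can catch simple mistakes, such as expecting "?" to match more than one character.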
public TikaIO.Read from(ValueProvider<java.lang.String> filepattern)
Same as from(filepattern), but accepting a ValueProvider.
public TikaIO.Read withTikaConfigPath(java.lang.String tikaConfigPath)
public TikaIO.Read withTikaConfigPath(ValueProvider<java.lang.String> tikaConfigPath)
Same as withTikaConfigPath(tikaConfigPath), but accepting a ValueProvider.
public TikaIO.Read withContentTypeHint(java.lang.String contentType)
public TikaIO.Read withInputMetadata(Metadata metadata)
public TikaIO.Read withReadOutputMetadata(java.lang.Boolean value)
public TikaIO.Read withQueuePollTime(java.lang.Long value)
public TikaIO.Read withQueueMaxPollTime(java.lang.Long value)
public TikaIO.Read withMinimumTextlength(java.lang.Integer value)
public TikaIO.Read withParseSynchronously(java.lang.Boolean value)
public TikaIO.Read withOptions(TikaOptions options)
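Each of the with* methods above is documented as returning a new transform instead of modifying the receiver. The sketch below illustrates that convention with a hypothetical stand-in class, `ReadConfig`; it is not the TikaIO.Read implementation, the three fields are chosen only for illustration, and the default values (null, false) are assumptions of the sketch.

```java
public class ImmutableConfigSketch {

    /** Hypothetical stand-in for a transform configured through with* methods. */
    static final class ReadConfig {
        final String filepattern;          // null until from() is called (sketch assumption)
        final String contentTypeHint;      // null when no hint was given (sketch assumption)
        final boolean parseSynchronously;  // false by default (sketch assumption)

        private ReadConfig(String filepattern, String contentTypeHint, boolean parseSynchronously) {
            this.filepattern = filepattern;
            this.contentTypeHint = contentTypeHint;
            this.parseSynchronously = parseSynchronously;
        }

        static ReadConfig create() {
            return new ReadConfig(null, null, false);
        }

        // Each method copies the current settings into a new instance,
        // changing only one value; the receiver is left untouched.
        ReadConfig from(String newFilepattern) {
            return new ReadConfig(newFilepattern, contentTypeHint, parseSynchronously);
        }

        ReadConfig withContentTypeHint(String newHint) {
            return new ReadConfig(filepattern, newHint, parseSynchronously);
        }

        ReadConfig withParseSynchronously(boolean newValue) {
            return new ReadConfig(filepattern, contentTypeHint, newValue);
        }
    }

    public static void main(String[] args) {
        ReadConfig base = ReadConfig.create().from("docs/*.pdf");
        ReadConfig hinted = base.withContentTypeHint("application/pdf");

        System.out.println(base.contentTypeHint);   // still unset on the original
        System.out.println(hinted.contentTypeHint); // set only on the new instance
        System.out.println(hinted.filepattern);     // carried over from base
    }
}
```

One practical consequence of this style is that the return value must be kept: calling a with* method and discarding the result leaves the original configuration unchanged.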
public PCollection<java.lang.String> expand(PBegin input)
Description copied from class: PTransform
Override this method to specify how this PTransform should be expanded on the given InputT.
NOTE: This method should not be called directly. Instead, the PTransform should be applied to the InputT using the apply method.
Composite transforms, which are defined in terms of other transforms, should return the output of one of the composed transforms. Non-composite transforms, which do not apply any transforms internally, should return a new unbound output and register evaluators (via backend-specific registration methods).
Specified by: expand in class PTransform<PBegin,PCollection<java.lang.String>>
public void populateDisplayData(DisplayData.Builder builder)
Description copied from class: PTransform
Register display data for the given transform or component.
populateDisplayData(DisplayData.Builder) is invoked by Pipeline runners to collect display data via DisplayData.from(HasDisplayData). Implementations may call super.populateDisplayData(builder) in order to register display data in the current namespace, but should otherwise use subcomponent.populateDisplayData(builder) to use the namespace of the subcomponent.
By default, does not register any display data. Implementors may override this method to provide their own display data.
Specified by: populateDisplayData in interface HasDisplayData
Overrides: populateDisplayData in class PTransform<PBegin,PCollection<java.lang.String>>
Parameters: builder - The builder to populate with display data.
See Also: HasDisplayData
protected Coder<java.lang.String> getDefaultOutputCoder()
Description copied from class: PTransform
Returns the default Coder to use for the output of this single-output PTransform.
By default, always throws
Overrides: getDefaultOutputCoder in class PTransform<PBegin,PCollection<java.lang.String>>