T
- Type of records represented by the source.public abstract class FileBasedSource<T> extends OffsetBasedSource<T>
Source
s. Extend this class to implement your own
file-based custom source.
A file-based Source
is a Source
backed by a file pattern defined as a Java
glob, a single file, or a offset range for a single file. See OffsetBasedSource
and
RangeTracker
for semantics of offset ranges.
This source stores a String
that is a FileSystems
specification for a file or
file pattern. There should be a FileSystem
registered for the file specification
provided. Please refer to FileSystems
and FileSystem
for more information on
this.
In addition to the methods left abstract from BoundedSource
, subclasses must implement
methods to create a sub-source and a reader for a range of a single file - createForSubrangeOfFile(org.apache.beam.sdk.io.fs.MatchResult.Metadata, long, long)
and createSingleFileReader(org.apache.beam.sdk.options.PipelineOptions)
. Please refer to TextIO.TextSource
for an example implementation of FileBasedSource
.
Modifier and Type | Class and Description |
---|---|
static class |
FileBasedSource.FileBasedReader<T>
A
reader that implements code common to readers of FileBasedSource s. |
static class |
FileBasedSource.Mode
A given
FileBasedSource represents a file resource of one of these types. |
OffsetBasedSource.OffsetBasedReader<T>
BoundedSource.BoundedReader<T>
Source.Reader<T>
Modifier | Constructor and Description |
---|---|
protected |
FileBasedSource(MatchResult.Metadata fileMetadata,
long minBundleSize,
long startOffset,
long endOffset)
Create a
FileBasedSource based on a single file. |
protected |
FileBasedSource(ValueProvider<java.lang.String> fileOrPatternSpec,
EmptyMatchTreatment emptyMatchTreatment,
long minBundleSize)
Create a
FileBaseSource based on a file or a file pattern specification, with the given
strategy for treating filepatterns that do not match any files. |
protected |
FileBasedSource(ValueProvider<java.lang.String> fileOrPatternSpec,
long minBundleSize)
Like
FileBasedSource(ValueProvider, EmptyMatchTreatment, long) , but uses the default
value of EmptyMatchTreatment.DISALLOW . |
Modifier and Type | Method and Description |
---|---|
protected abstract FileBasedSource<T> |
createForSubrangeOfFile(MatchResult.Metadata fileMetadata,
long start,
long end)
Creates and returns a new
FileBasedSource of the same type as the current FileBasedSource backed by a given file and an offset range. |
BoundedSource.BoundedReader<T> |
createReader(PipelineOptions options)
Returns a new
BoundedSource.BoundedReader that reads from this source. |
protected abstract FileBasedSource.FileBasedReader<T> |
createSingleFileReader(PipelineOptions options)
Creates and returns an instance of a
FileBasedReader implementation for the current
source assuming the source represents a single file. |
FileBasedSource<T> |
createSourceForSubrange(long start,
long end)
Returns an
OffsetBasedSource for a subrange of the current source. |
EmptyMatchTreatment |
getEmptyMatchTreatment() |
long |
getEstimatedSizeBytes(PipelineOptions options)
An estimate of the total size (in bytes) of the data that would be read from this source.
|
java.lang.String |
getFileOrPatternSpec() |
ValueProvider<java.lang.String> |
getFileOrPatternSpecProvider() |
long |
getMaxEndOffset(PipelineOptions options)
Returns the actual ending offset of the current source.
|
FileBasedSource.Mode |
getMode() |
protected MatchResult.Metadata |
getSingleFileMetadata()
Returns the information about the single file that this source is reading from.
|
protected boolean |
isSplittable()
Determines whether a file represented by this source is can be split into bundles.
|
void |
populateDisplayData(DisplayData.Builder builder)
Register display data for the given transform or component.
|
java.util.List<? extends FileBasedSource<T>> |
split(long desiredBundleSizeBytes,
PipelineOptions options)
Splits the source into bundles of approximately
desiredBundleSizeBytes . |
java.lang.String |
toString() |
void |
validate()
Checks that this source is valid, before it can be used in a pipeline.
|
getBytesPerOffset, getEndOffset, getMinBundleSize, getStartOffset
getDefaultOutputCoder, getOutputCoder
protected FileBasedSource(ValueProvider<java.lang.String> fileOrPatternSpec, EmptyMatchTreatment emptyMatchTreatment, long minBundleSize)
FileBaseSource
based on a file or a file pattern specification, with the given
strategy for treating filepatterns that do not match any files.protected FileBasedSource(ValueProvider<java.lang.String> fileOrPatternSpec, long minBundleSize)
FileBasedSource(ValueProvider, EmptyMatchTreatment, long)
, but uses the default
value of EmptyMatchTreatment.DISALLOW
.protected FileBasedSource(MatchResult.Metadata fileMetadata, long minBundleSize, long startOffset, long endOffset)
FileBasedSource
based on a single file. This constructor must be used when
creating a new FileBasedSource
for a subrange of a single file. Additionally, this
constructor must be used to create new FileBasedSource
s when subclasses implement the
method createForSubrangeOfFile(org.apache.beam.sdk.io.fs.MatchResult.Metadata, long, long)
.
See OffsetBasedSource
for detailed descriptions of minBundleSize
, startOffset
, and endOffset
.
fileMetadata
- specification of the file represented by the FileBasedSource
, in
suitable form for use with FileSystems.match(List)
.minBundleSize
- minimum bundle size in bytes.startOffset
- starting byte offset.endOffset
- ending byte offset. If the specified value >= #getMaxEndOffset()
it
implies #getMaxEndOffSet()
.protected final MatchResult.Metadata getSingleFileMetadata()
java.lang.IllegalArgumentException
- if this source is in FileBasedSource.Mode.FILEPATTERN
mode.public final java.lang.String getFileOrPatternSpec()
public final ValueProvider<java.lang.String> getFileOrPatternSpecProvider()
public final EmptyMatchTreatment getEmptyMatchTreatment()
public final FileBasedSource.Mode getMode()
public final FileBasedSource<T> createSourceForSubrange(long start, long end)
OffsetBasedSource
OffsetBasedSource
for a subrange of the current source. The subrange [start, end)
must be within the range [startOffset, endOffset)
of the current source,
i.e. startOffset <= start < end <= endOffset
.createSourceForSubrange
in class OffsetBasedSource<T>
protected abstract FileBasedSource<T> createForSubrangeOfFile(MatchResult.Metadata fileMetadata, long start, long end)
FileBasedSource
of the same type as the current FileBasedSource
backed by a given file and an offset range. When current source is being
split, this method is used to generate new sub-sources. When creating the source subclasses
must call the constructor #FileBasedSource(Metadata, long, long, long)
of FileBasedSource
with corresponding parameter values passed here.fileMetadata
- file backing the new FileBasedSource
.start
- starting byte offset of the new FileBasedSource
.end
- ending byte offset of the new FileBasedSource
. May be Long.MAX_VALUE, in
which case it will be inferred using getMaxEndOffset(org.apache.beam.sdk.options.PipelineOptions)
.protected abstract FileBasedSource.FileBasedReader<T> createSingleFileReader(PipelineOptions options)
FileBasedReader
implementation for the current
source assuming the source represents a single file. File patterns will be handled by FileBasedSource
implementation automatically.public final long getEstimatedSizeBytes(PipelineOptions options) throws java.io.IOException
BoundedSource
If there is no way to estimate the size of the source implementations MAY return 0L.
getEstimatedSizeBytes
in class OffsetBasedSource<T>
java.io.IOException
public void populateDisplayData(DisplayData.Builder builder)
Source
populateDisplayData(DisplayData.Builder)
is invoked by Pipeline runners to collect
display data via DisplayData.from(HasDisplayData)
. Implementations may call super.populateDisplayData(builder)
in order to register display data in the current namespace,
but should otherwise use subcomponent.populateDisplayData(builder)
to use the namespace
of the subcomponent.
By default, does not register any display data. Implementors may override this method to provide their own display data.
populateDisplayData
in interface HasDisplayData
populateDisplayData
in class OffsetBasedSource<T>
builder
- The builder to populate with display data.HasDisplayData
public final java.util.List<? extends FileBasedSource<T>> split(long desiredBundleSizeBytes, PipelineOptions options) throws java.lang.Exception
BoundedSource
desiredBundleSizeBytes
.split
in class OffsetBasedSource<T>
java.lang.Exception
protected boolean isSplittable() throws java.lang.Exception
By default, a source in mode FileBasedSource.Mode.FILEPATTERN
is always splittable, because
splitting will involve expanding the file pattern and producing single-file/subrange sources,
which may or may not be splittable themselves.
By default, a source in FileBasedSource.Mode.SINGLE_FILE_OR_SUBRANGE
is splittable if it is on a
file system that supports efficient read seeking.
Subclasses may override to provide different behavior.
java.lang.Exception
public final BoundedSource.BoundedReader<T> createReader(PipelineOptions options) throws java.io.IOException
BoundedSource
BoundedSource.BoundedReader
that reads from this source.createReader
in class BoundedSource<T>
java.io.IOException
public java.lang.String toString()
toString
in class OffsetBasedSource<T>
public void validate()
Source
It is recommended to use Preconditions
for implementing this
method.
validate
in class OffsetBasedSource<T>
public final long getMaxEndOffset(PipelineOptions options) throws java.io.IOException
OffsetBasedSource
[startOffset, endOffset)
such that the range
used is [startOffset, min(endOffset, maxEndOffset))
.
As an example in which OffsetBasedSource
is used to implement a file source, suppose
that this source was constructed with an endOffset
of Long.MAX_VALUE
to
indicate that a file should be read to the end. Then this function should determine the actual,
exact size of the file in bytes and return it.
getMaxEndOffset
in class OffsetBasedSource<T>
java.io.IOException