Class FileBasedSource<T>
- Type Parameters:
T- Type of records represented by the source.
- All Implemented Interfaces:
Serializable,HasDisplayData
- Direct Known Subclasses:
BlockBasedSource,CompressedSource,TextSource,XmlSource
Sources. Extend this class to implement your own
file-based custom source.
A file-based Source is a Source backed by a file pattern defined as a Java
glob, a single file, or a offset range for a single file. See OffsetBasedSource and
RangeTracker for semantics of offset ranges.
This source stores a String that is a FileSystems specification for a file or
file pattern. There should be a FileSystem registered for the file specification
provided. Please refer to FileSystems and FileSystem for more information on
this.
In addition to the methods left abstract from BoundedSource, subclasses must implement
methods to create a sub-source and a reader for a range of a single file - createForSubrangeOfFile(org.apache.beam.sdk.io.fs.MatchResult.Metadata, long, long) and createSingleFileReader(org.apache.beam.sdk.options.PipelineOptions). Please refer to TextIO.TextSource for an example implementation of FileBasedSource.
- See Also:
-
Nested Class Summary
Nested ClassesModifier and TypeClassDescriptionstatic classAreaderthat implements code common to readers ofFileBasedSources.static enumA givenFileBasedSourcerepresents a file resource of one of these types.Nested classes/interfaces inherited from class org.apache.beam.sdk.io.OffsetBasedSource
OffsetBasedSource.OffsetBasedReader<T>Nested classes/interfaces inherited from class org.apache.beam.sdk.io.BoundedSource
BoundedSource.BoundedReader<T>Nested classes/interfaces inherited from class org.apache.beam.sdk.io.Source
Source.Reader<T> -
Constructor Summary
ConstructorsModifierConstructorDescriptionprotectedFileBasedSource(MatchResult.Metadata fileMetadata, long minBundleSize, long startOffset, long endOffset) Create aFileBasedSourcebased on a single file.protectedFileBasedSource(ValueProvider<String> fileOrPatternSpec, long minBundleSize) LikeFileBasedSource(ValueProvider, EmptyMatchTreatment, long), but uses the default value ofEmptyMatchTreatment.DISALLOW.protectedFileBasedSource(ValueProvider<String> fileOrPatternSpec, EmptyMatchTreatment emptyMatchTreatment, long minBundleSize) Create aFileBaseSourcebased on a file or a file pattern specification, with the given strategy for treating filepatterns that do not match any files. -
Method Summary
Modifier and TypeMethodDescriptionprotected abstract FileBasedSource<T> createForSubrangeOfFile(MatchResult.Metadata fileMetadata, long start, long end) Creates and returns a newFileBasedSourceof the same type as the currentFileBasedSourcebacked by a given file and an offset range.final BoundedSource.BoundedReader<T> createReader(PipelineOptions options) Returns a newBoundedSource.BoundedReaderthat reads from this source.protected abstract FileBasedSource.FileBasedReader<T> createSingleFileReader(PipelineOptions options) Creates and returns an instance of aFileBasedReaderimplementation for the current source assuming the source represents a single file.final FileBasedSource<T> createSourceForSubrange(long start, long end) Returns anOffsetBasedSourcefor a subrange of the current source.final EmptyMatchTreatmentfinal longgetEstimatedSizeBytes(PipelineOptions options) An estimate of the total size (in bytes) of the data that would be read from this source.final Stringfinal ValueProvider<String> final longgetMaxEndOffset(PipelineOptions options) Returns the actual ending offset of the current source.final FileBasedSource.ModegetMode()final MatchResult.MetadataReturns the information about the single file that this source is reading from.protected booleanDetermines whether a file represented by this source is can be split into bundles.voidpopulateDisplayData(DisplayData.Builder builder) Register display data for the given transform or component.final List<? extends FileBasedSource<T>> split(long desiredBundleSizeBytes, PipelineOptions options) Splits the source into bundles of approximatelydesiredBundleSizeBytes.toString()voidvalidate()Checks that this source is valid, before it can be used in a pipeline.Methods inherited from class org.apache.beam.sdk.io.OffsetBasedSource
getBytesPerOffset, getEndOffset, getMinBundleSize, getStartOffsetMethods inherited from class org.apache.beam.sdk.io.Source
getDefaultOutputCoder, getOutputCoder
-
Constructor Details
-
FileBasedSource
protected FileBasedSource(ValueProvider<String> fileOrPatternSpec, EmptyMatchTreatment emptyMatchTreatment, long minBundleSize) Create aFileBaseSourcebased on a file or a file pattern specification, with the given strategy for treating filepatterns that do not match any files. -
FileBasedSource
LikeFileBasedSource(ValueProvider, EmptyMatchTreatment, long), but uses the default value ofEmptyMatchTreatment.DISALLOW. -
FileBasedSource
protected FileBasedSource(MatchResult.Metadata fileMetadata, long minBundleSize, long startOffset, long endOffset) Create aFileBasedSourcebased on a single file. This constructor must be used when creating a newFileBasedSourcefor a subrange of a single file. Additionally, this constructor must be used to create newFileBasedSources when subclasses implement the methodcreateForSubrangeOfFile(org.apache.beam.sdk.io.fs.MatchResult.Metadata, long, long).See
OffsetBasedSourcefor detailed descriptions ofminBundleSize,startOffset, andendOffset.- Parameters:
fileMetadata- specification of the file represented by theFileBasedSource, in suitable form for use withFileSystems.match(List).minBundleSize- minimum bundle size in bytes.startOffset- starting byte offset.endOffset- ending byte offset. If the specified value>= #getMaxEndOffset()it implies#getMaxEndOffSet().
-
-
Method Details
-
getSingleFileMetadata
Returns the information about the single file that this source is reading from.- Throws:
IllegalArgumentException- if this source is inFileBasedSource.Mode.FILEPATTERNmode.
-
getFileOrPatternSpec
-
getFileOrPatternSpecProvider
-
getEmptyMatchTreatment
-
getMode
-
createSourceForSubrange
Description copied from class:OffsetBasedSourceReturns anOffsetBasedSourcefor a subrange of the current source. The subrange[start, end)must be within the range[startOffset, endOffset)of the current source, i.e.startOffset <= start < end <= endOffset.- Specified by:
createSourceForSubrangein classOffsetBasedSource<T>
-
createForSubrangeOfFile
protected abstract FileBasedSource<T> createForSubrangeOfFile(MatchResult.Metadata fileMetadata, long start, long end) Creates and returns a newFileBasedSourceof the same type as the currentFileBasedSourcebacked by a given file and an offset range. When current source is being split, this method is used to generate new sub-sources. When creating the source subclasses must call the constructorFileBasedSource(Metadata, long, long, long)ofFileBasedSourcewith corresponding parameter values passed here.- Parameters:
fileMetadata- file backing the newFileBasedSource.start- starting byte offset of the newFileBasedSource.end- ending byte offset of the newFileBasedSource. May be Long.MAX_VALUE, in which case it will be inferred usinggetMaxEndOffset(org.apache.beam.sdk.options.PipelineOptions).
-
createSingleFileReader
protected abstract FileBasedSource.FileBasedReader<T> createSingleFileReader(PipelineOptions options) Creates and returns an instance of aFileBasedReaderimplementation for the current source assuming the source represents a single file. File patterns will be handled byFileBasedSourceimplementation automatically. -
getEstimatedSizeBytes
Description copied from class:BoundedSourceAn estimate of the total size (in bytes) of the data that would be read from this source. This estimate is in terms of external storage size, before any decompression or other processing done by the reader.If there is no way to estimate the size of the source implementations MAY return 0L.
- Overrides:
getEstimatedSizeBytesin classOffsetBasedSource<T>- Throws:
IOException
-
populateDisplayData
Description copied from class:SourceRegister display data for the given transform or component.populateDisplayData(DisplayData.Builder)is invoked by Pipeline runners to collect display data viaDisplayData.from(HasDisplayData). Implementations may callsuper.populateDisplayData(builder)in order to register display data in the current namespace, but should otherwise usesubcomponent.populateDisplayData(builder)to use the namespace of the subcomponent.By default, does not register any display data. Implementors may override this method to provide their own display data.
- Specified by:
populateDisplayDatain interfaceHasDisplayData- Overrides:
populateDisplayDatain classOffsetBasedSource<T>- Parameters:
builder- The builder to populate with display data.- See Also:
-
split
public final List<? extends FileBasedSource<T>> split(long desiredBundleSizeBytes, PipelineOptions options) throws Exception Description copied from class:BoundedSourceSplits the source into bundles of approximatelydesiredBundleSizeBytes.- Overrides:
splitin classOffsetBasedSource<T>- Throws:
Exception
-
isSplittable
Determines whether a file represented by this source is can be split into bundles.By default, a source in mode
FileBasedSource.Mode.FILEPATTERNis always splittable, because splitting will involve expanding the file pattern and producing single-file/subrange sources, which may or may not be splittable themselves.By default, a source in
FileBasedSource.Mode.SINGLE_FILE_OR_SUBRANGEis splittable if it is on a file system that supports efficient read seeking.Subclasses may override to provide different behavior.
- Throws:
Exception
-
createReader
public final BoundedSource.BoundedReader<T> createReader(PipelineOptions options) throws IOException Description copied from class:BoundedSourceReturns a newBoundedSource.BoundedReaderthat reads from this source.- Specified by:
createReaderin classBoundedSource<T>- Throws:
IOException
-
toString
- Overrides:
toStringin classOffsetBasedSource<T>
-
validate
public void validate()Description copied from class:SourceChecks that this source is valid, before it can be used in a pipeline.It is recommended to use
Preconditionsfor implementing this method.- Overrides:
validatein classOffsetBasedSource<T>
-
getMaxEndOffset
Description copied from class:OffsetBasedSourceReturns the actual ending offset of the current source. The value returned by this function will be used to clip the end of the range[startOffset, endOffset)such that the range used is[startOffset, min(endOffset, maxEndOffset)).As an example in which
OffsetBasedSourceis used to implement a file source, suppose that this source was constructed with anendOffsetofLong.MAX_VALUEto indicate that a file should be read to the end. Then this function should determine the actual, exact size of the file in bytes and return it.- Specified by:
getMaxEndOffsetin classOffsetBasedSource<T>- Throws:
IOException
-