Class FileBasedSource<T>

Type Parameters:
T - Type of records represented by the source.
All Implemented Interfaces:
Serializable, HasDisplayData
Direct Known Subclasses:
BlockBasedSource, CompressedSource, TextSource, XmlSource

public abstract class FileBasedSource<T> extends OffsetBasedSource<T>
A common base class for all file-based Sources. Extend this class to implement your own file-based custom source.

A file-based Source is a Source backed by a file pattern defined as a Java glob, a single file, or a offset range for a single file. See OffsetBasedSource and RangeTracker for semantics of offset ranges.

This source stores a String that is a FileSystems specification for a file or file pattern. There should be a FileSystem registered for the file specification provided. Please refer to FileSystems and FileSystem for more information on this.

In addition to the methods left abstract from BoundedSource, subclasses must implement methods to create a sub-source and a reader for a range of a single file - createForSubrangeOfFile(org.apache.beam.sdk.io.fs.MatchResult.Metadata, long, long) and createSingleFileReader(org.apache.beam.sdk.options.PipelineOptions). Please refer to TextIO.TextSource for an example implementation of FileBasedSource.

See Also:
  • Constructor Details

  • Method Details

    • getSingleFileMetadata

      public final MatchResult.Metadata getSingleFileMetadata()
      Returns the information about the single file that this source is reading from.
      Throws:
      IllegalArgumentException - if this source is in FileBasedSource.Mode.FILEPATTERN mode.
    • getFileOrPatternSpec

      public final String getFileOrPatternSpec()
    • getFileOrPatternSpecProvider

      public final ValueProvider<String> getFileOrPatternSpecProvider()
    • getEmptyMatchTreatment

      public final EmptyMatchTreatment getEmptyMatchTreatment()
    • getMode

      public final FileBasedSource.Mode getMode()
    • createSourceForSubrange

      public final FileBasedSource<T> createSourceForSubrange(long start, long end)
      Description copied from class: OffsetBasedSource
      Returns an OffsetBasedSource for a subrange of the current source. The subrange [start, end) must be within the range [startOffset, endOffset) of the current source, i.e. startOffset <= start < end <= endOffset.
      Specified by:
      createSourceForSubrange in class OffsetBasedSource<T>
    • createForSubrangeOfFile

      protected abstract FileBasedSource<T> createForSubrangeOfFile(MatchResult.Metadata fileMetadata, long start, long end)
      Creates and returns a new FileBasedSource of the same type as the current FileBasedSource backed by a given file and an offset range. When current source is being split, this method is used to generate new sub-sources. When creating the source subclasses must call the constructor FileBasedSource(Metadata, long, long, long) of FileBasedSource with corresponding parameter values passed here.
      Parameters:
      fileMetadata - file backing the new FileBasedSource.
      start - starting byte offset of the new FileBasedSource.
      end - ending byte offset of the new FileBasedSource. May be Long.MAX_VALUE, in which case it will be inferred using getMaxEndOffset(org.apache.beam.sdk.options.PipelineOptions).
    • createSingleFileReader

      protected abstract FileBasedSource.FileBasedReader<T> createSingleFileReader(PipelineOptions options)
      Creates and returns an instance of a FileBasedReader implementation for the current source assuming the source represents a single file. File patterns will be handled by FileBasedSource implementation automatically.
    • getEstimatedSizeBytes

      public final long getEstimatedSizeBytes(PipelineOptions options) throws IOException
      Description copied from class: BoundedSource
      An estimate of the total size (in bytes) of the data that would be read from this source. This estimate is in terms of external storage size, before any decompression or other processing done by the reader.

      If there is no way to estimate the size of the source implementations MAY return 0L.

      Overrides:
      getEstimatedSizeBytes in class OffsetBasedSource<T>
      Throws:
      IOException
    • populateDisplayData

      public void populateDisplayData(DisplayData.Builder builder)
      Description copied from class: Source
      Register display data for the given transform or component.

      populateDisplayData(DisplayData.Builder) is invoked by Pipeline runners to collect display data via DisplayData.from(HasDisplayData). Implementations may call super.populateDisplayData(builder) in order to register display data in the current namespace, but should otherwise use subcomponent.populateDisplayData(builder) to use the namespace of the subcomponent.

      By default, does not register any display data. Implementors may override this method to provide their own display data.

      Specified by:
      populateDisplayData in interface HasDisplayData
      Overrides:
      populateDisplayData in class OffsetBasedSource<T>
      Parameters:
      builder - The builder to populate with display data.
      See Also:
    • split

      public final List<? extends FileBasedSource<T>> split(long desiredBundleSizeBytes, PipelineOptions options) throws Exception
      Description copied from class: BoundedSource
      Splits the source into bundles of approximately desiredBundleSizeBytes.
      Overrides:
      split in class OffsetBasedSource<T>
      Throws:
      Exception
    • isSplittable

      protected boolean isSplittable() throws Exception
      Determines whether a file represented by this source is can be split into bundles.

      By default, a source in mode FileBasedSource.Mode.FILEPATTERN is always splittable, because splitting will involve expanding the file pattern and producing single-file/subrange sources, which may or may not be splittable themselves.

      By default, a source in FileBasedSource.Mode.SINGLE_FILE_OR_SUBRANGE is splittable if it is on a file system that supports efficient read seeking.

      Subclasses may override to provide different behavior.

      Throws:
      Exception
    • createReader

      public final BoundedSource.BoundedReader<T> createReader(PipelineOptions options) throws IOException
      Description copied from class: BoundedSource
      Returns a new BoundedSource.BoundedReader that reads from this source.
      Specified by:
      createReader in class BoundedSource<T>
      Throws:
      IOException
    • toString

      public String toString()
      Overrides:
      toString in class OffsetBasedSource<T>
    • validate

      public void validate()
      Description copied from class: Source
      Checks that this source is valid, before it can be used in a pipeline.

      It is recommended to use Preconditions for implementing this method.

      Overrides:
      validate in class OffsetBasedSource<T>
    • getMaxEndOffset

      public final long getMaxEndOffset(PipelineOptions options) throws IOException
      Description copied from class: OffsetBasedSource
      Returns the actual ending offset of the current source. The value returned by this function will be used to clip the end of the range [startOffset, endOffset) such that the range used is [startOffset, min(endOffset, maxEndOffset)).

      As an example in which OffsetBasedSource is used to implement a file source, suppose that this source was constructed with an endOffset of Long.MAX_VALUE to indicate that a file should be read to the end. Then this function should determine the actual, exact size of the file in bytes and return it.

      Specified by:
      getMaxEndOffset in class OffsetBasedSource<T>
      Throws:
      IOException