apache_beam.io.filebasedsource module

A framework for developing sources for new file types.

To create a source for a new file type a sub-class of FileBasedSource should be created. Sub-classes of FileBasedSource must implement the method FileBasedSource.read_records(). Please read the documentation of that method for more details.

For an example implementation of FileBasedSource see _AvroSource.

class apache_beam.io.filebasedsource.FileBasedSource(file_pattern, min_bundle_size=0, compression_type='auto', splittable=True, validate=True)[source]

Bases: BoundedSource

A BoundedSource for reading a file glob of a given type.

Initializes FileBasedSource.

Parameters:
  • file_pattern (str) – the file glob to read a string or a ValueProvider (placeholder to inject a runtime value).

  • min_bundle_size (int) – minimum size of bundles that should be generated when performing initial splitting on this source.

  • compression_type (str) – Used to handle compressed output files. Typical value is CompressionTypes.AUTO, in which case the final file path’s extension will be used to detect the compression.

  • splittable (bool) – whether FileBasedSource should try to logically split a single file into data ranges so that different parts of the same file can be read in parallel. If set to False, FileBasedSource will prevent both initial and dynamic splitting of sources for single files. File patterns that represent multiple files may still get split into sources for individual files. Even if set to True by the user, FileBasedSource may choose to not split the file, for example, for compressed files where currently it is not possible to efficiently read a data range without decompressing the whole file.

  • validate (bool) – Boolean flag to verify that the files exist during the pipeline creation time.

Raises:
  • TypeError – when compression_type is not valid or if file_pattern is not a str or a ValueProvider.

  • ValueError – when compression and splittable files are specified.

  • IOError – when the file pattern specified yields an empty result.

MIN_NUMBER_OF_FILES_TO_STAT = 100
MIN_FRACTION_OF_FILES_TO_STAT = 0.01
display_data()[source]
open_file(file_name)[source]
split(desired_bundle_size=None, start_position=None, stop_position=None)[source]
estimate_size()[source]
read(range_tracker)[source]
get_range_tracker(start_position, stop_position)[source]
read_records(file_name, offset_range_tracker)[source]

Returns a generator of records created by reading file ‘file_name’.

Parameters:
  • file_name – a string that gives the name of the file to be read. Method FileBasedSource.open_file() must be used to open the file and create a seekable file object.

  • offset_range_tracker – a object of type OffsetRangeTracker. This defines the byte range of the file that should be read. See documentation in iobase.BoundedSource.read() for more information on reading records while complying to the range defined by a given RangeTracker.

Returns:

an iterator that gives the records read from the given file.

property splittable