apache_beam.io.filebasedsource module¶
A framework for developing sources for new file types.
To create a source for a new file type a sub-class of FileBasedSource
should be created. Sub-classes of FileBasedSource must implement the
method FileBasedSource.read_records(). Please read the documentation of
that method for more details.
For an example implementation of FileBasedSource see
_AvroSource.
-
class
apache_beam.io.filebasedsource.FileBasedSource(file_pattern, min_bundle_size=0, compression_type='auto', splittable=True, validate=True)[source]¶ Bases:
apache_beam.io.iobase.BoundedSourceA
BoundedSourcefor reading a file glob of a given type.Initializes
FileBasedSource.Parameters: - file_pattern (str) – the file glob to read a string or a
ValueProvider(placeholder to inject a runtime value). - min_bundle_size (int) – minimum size of bundles that should be generated when performing initial splitting on this source.
- compression_type (str) – Used to handle compressed output files.
Typical value is
CompressionTypes.AUTO, in which case the final file path’s extension will be used to detect the compression. - splittable (bool) – whether
FileBasedSourceshould try to logically split a single file into data ranges so that different parts of the same file can be read in parallel. If set toFalse,FileBasedSourcewill prevent both initial and dynamic splitting of sources for single files. File patterns that represent multiple files may still get split into sources for individual files. Even if set toTrueby the user,FileBasedSourcemay choose to not split the file, for example, for compressed files where currently it is not possible to efficiently read a data range without decompressing the whole file. - validate (bool) – Boolean flag to verify that the files exist during the pipeline creation time.
Raises: TypeError– when compression_type is not valid or if file_pattern is not astror aValueProvider.ValueError– when compression and splittable files are specified.IOError– when the file pattern specified yields an empty result.
-
MIN_NUMBER_OF_FILES_TO_STAT= 100¶
-
MIN_FRACTION_OF_FILES_TO_STAT= 0.01¶
-
read_records(file_name, offset_range_tracker)[source]¶ Returns a generator of records created by reading file ‘file_name’.
Parameters: - file_name – a
stringthat gives the name of the file to be read. MethodFileBasedSource.open_file()must be used to open the file and create a seekable file object. - offset_range_tracker – a object of type
OffsetRangeTracker. This defines the byte range of the file that should be read. See documentation iniobase.BoundedSource.read()for more information on reading records while complying to the range defined by a givenRangeTracker.
Returns: an iterator that gives the records read from the given file.
- file_name – a
-
splittable¶
- file_pattern (str) – the file glob to read a string or a