apache_beam.io.filesystems module

FileSystems interface class for accessing the correct filesystem

class apache_beam.io.filesystems.FileSystems[source]

Bases: object

A class that defines the functions that can be performed on a filesystem. All methods are static and access the underlying registered filesystems.

URI_SCHEMA_PATTERN = re.compile('(?P<scheme>[a-zA-Z][-a-zA-Z0-9+.]*)://.*')

classmethod set_options(pipeline_options)[source]

Set filesystem options.

Parameters:

pipeline_options – Instance of PipelineOptions.

static get_scheme(path)[source]
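
get_scheme has no description in this listing. The sketch below shows what it plausibly does, assuming it applies URI_SCHEMA_PATTERN to the path and returns None when the path has no scheme. scheme_of is a hypothetical stand-in written with the standard library only, not the library function itself.

```python
import re

# Same expression as the URI_SCHEMA_PATTERN attribute listed above.
URI_SCHEMA_PATTERN = re.compile('(?P<scheme>[a-zA-Z][-a-zA-Z0-9+.]*)://.*')


def scheme_of(path):
    """Hypothetical helper: return the URI scheme of `path`, or None."""
    # Assumption: surrounding whitespace is ignored before matching.
    found = URI_SCHEMA_PATTERN.match(path.strip())
    return found.group('scheme') if found else None


print(scheme_of('gs://my-bucket/data.csv'))  # gs
print(scheme_of('/tmp/data.csv'))            # None
```

A path without a scheme is typically treated as a local path, which is the case the later examples on this page rely on.
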
static get_filesystem(path: str) → FileSystems[source]

Get the correct filesystem for the specified path.

static join(basepath: str, *paths: str) → str[source]

Join two or more pathname components for the filesystem.

Parameters:
  • basepath – string path of the first component of the path

  • paths – path components to be added

Returns: full path after combining all the passed components

static split(path)[source]

Splits the given path into two parts.

Splits the path into a pair (head, tail) such that tail contains the last component of the path and head contains everything up to that.

For file-systems other than the local file-system, head should include the prefix.

Parameters:

path – path as a string

Returns:

a pair of path components as strings.

static mkdirs(path)[source]

Recursively create directories for the provided path.

Parameters:

path – string path of the directory structure that should be created

Raises:

IOError – if the leaf directory already exists.

static match(patterns, limits=None)[source]

Find all matching paths to the patterns provided.

Pattern matching is done using each filesystem’s match method (e.g. filesystem.FileSystem.match()).

Note

  • Depending on the FileSystem implementation, file listings (the FileSystem._list method) may not be recursive.

  • If the file listing is not recursive, a pattern like scheme://path/*/foo will not be able to match any files.

Pattern syntax:

The pattern syntax is based on the fnmatch syntax, with the following differences:

  • * Is equivalent to [^/\]* rather than .*.

  • ** Is equivalent to .*.
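
The two wildcard rules above can be illustrated with a small translator. This is a sketch using only the standard library, not Beam's implementation: glob_to_regex and matches are hypothetical helpers, and bracket expressions such as [abc] are not handled (they are treated as literal text).

```python
import re


def glob_to_regex(pattern):
    """Hypothetical helper: translate a match()-style pattern to a regex."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith('**', i):
            parts.append('.*')          # ** may cross directory separators
            i += 2
        elif pattern[i] == '*':
            parts.append(r'[^/\\]*')    # * stops at / and \
            i += 1
        elif pattern[i] == '?':
            parts.append('.')           # as in fnmatch: any single character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return ''.join(parts) + r'\Z'


def matches(pattern, path):
    """Hypothetical helper: True if the whole path matches the pattern."""
    return re.match(glob_to_regex(pattern), path) is not None


print(matches('/data/*/foo', '/data/a/foo'))     # True
print(matches('/data/*/foo', '/data/a/b/foo'))   # False
print(matches('/data/**/foo', '/data/a/b/foo'))  # True
```

The second call returns False because a single * cannot span the extra directory level, which is the main practical difference from plain fnmatch.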

Parameters:
  • patterns – list of strings, each a file path pattern to match against

  • limits – list giving, for each pattern, the maximum number of results to fetch

Returns: list of MatchResult objects.

Raises:

BeamIOError – if any of the pattern match operations fail

static create(path, mime_type='application/octet-stream', compression_type='auto') → BinaryIO[source]

Returns a write channel for the given file path.

Parameters:
  • path – string path of the file object to be written to the system

  • mime_type – MIME type to specify the type of content in the file object

  • compression_type – Type of compression to be used for this object. See CompressionTypes for possible values.

Returns: file handle with a close function for the user to use.

static open(path, mime_type='application/octet-stream', compression_type='auto') → BinaryIO[source]

Returns a read channel for the given file path.

Parameters:
  • path – string path of the file object to be read

  • mime_type – MIME type to specify the type of content in the file object

  • compression_type – Type of compression to be used for this object. See CompressionTypes for possible values.

Returns: file handle with a close function for the user to use.

static copy(source_file_names, destination_file_names)[source]

Recursively copy each path in the source list to the corresponding path in the destination list.

Parameters:
  • source_file_names – list of source file objects that need to be copied

  • destination_file_names – list of destination paths for the new objects

Raises:

BeamIOError – if any of the copy operations fail

static rename(source_file_names, destination_file_names)[source]

Rename the files at the source list to the destination list. Source and destination lists should be of the same size.

Parameters:
  • source_file_names – List of file paths that need to be moved

  • destination_file_names – list of destination paths for the files

Raises:

BeamIOError – if any of the rename operations fail

static exists(path)[source]

Check if the provided path exists on the FileSystem.

Parameters:

path – string path that needs to be checked.

Returns: boolean flag indicating if path exists

static last_updated(path)[source]

Get the time the path was last updated on the FileSystem, as UNIX Epoch time in seconds.

Parameters:

path – string path of file.

Returns: float UNIX Epoch time

Raises:

BeamIOError – if path doesn’t exist.

static checksum(path)[source]

Fetch checksum metadata of a file on the FileSystem.

This operation returns checksum metadata as stored in the underlying FileSystem. It should not read any file data. Checksum type and format are FileSystem dependent and are not compatible between FileSystems.

Parameters:

path – string path of a file.

Returns: string containing checksum

Raises:

BeamIOError – if path isn’t a file or doesn’t exist.

static delete(paths)[source]

Deletes files or directories at the provided paths. Directories will be deleted recursively.

Parameters:

paths – list of paths that give the file objects to be deleted

Raises:

BeamIOError – if any of the delete operations fail

static get_chunk_size(path)[source]

Get the correct chunk size for the FileSystem.

Parameters:

path – string path that needs to be checked.

Returns: integer size for parallelization in the FS operations.

static report_source_lineage(path, level=None)[source]

Report source Lineage.

Parameters:
  • path – string path to be reported.

  • level – the level of the file path. Defaults to Lineage.FILE.

static report_sink_lineage(path, level=None)[source]

Report sink Lineage.

Parameters:
  • path – string path to be reported.

  • level – the level of the file path. Defaults to Lineage.FILE.