apache_beam.io.filesystem module

File system abstraction for file-based sources and sinks.

class apache_beam.io.filesystem.CompressionTypes[source]

Bases: object

Enum-like class representing known compression types.

AUTO = 'auto'
BZIP2 = 'bzip2'
GZIP = 'gzip'
UNCOMPRESSED = 'uncompressed'
classmethod is_valid_compression_type(compression_type)[source]

Returns True for valid compression types, False otherwise.

classmethod mime_type(compression_type, default='application/octet-stream')[source]
classmethod detect_compression_type(file_path)[source]

Returns the compression type of a file (based on its suffix).

class apache_beam.io.filesystem.CompressedFile(fileobj, compression_type='gzip', read_size=16777216)[source]

Bases: object

File wrapper for easier handling of compressed files.

readable()[source]
writeable()[source]
write(data)[source]

Write data to file.

read(num_bytes)[source]
readline()[source]

Equivalent to standard file.readline(). Same return conventions apply.

closed()[source]
close()[source]
flush()[source]
seekable
seek(offset, whence=0)[source]

Set the file’s current offset.

Seeking behavior:

  • seeking from the end os.SEEK_END the whole file is decompressed once to determine it’s size. Therefore it is preferred to use os.SEEK_SET or os.SEEK_CUR to avoid the processing overhead
  • seeking backwards from the current position rewinds the file to 0 and decompresses the chunks to the requested offset
  • seeking is only supported in files opened for reading
  • if the new offset is out of bound, it is adjusted to either 0 or EOF.
Parameters:
  • offset (int) – seek offset in the uncompressed content represented as number
  • whence (int) – seek mode. Supported modes are os.SEEK_SET (absolute seek), os.SEEK_CUR (seek relative to the current position), and os.SEEK_END (seek relative to the end, offset should be negative).
Raises:
  • IOError – When this buffer is closed.
  • ValueError – When whence is invalid or the file is not seekable
tell()[source]

Returns current position in uncompressed file.

class apache_beam.io.filesystem.FileMetadata(path, size_in_bytes)[source]

Bases: object

Metadata about a file path that is the output of FileSystem.match

class apache_beam.io.filesystem.MatchResult(pattern, metadata_list)[source]

Bases: object

Result from the FileSystem match operation which contains the list of matched FileMetadata.

class apache_beam.io.filesystem.FileSystem(pipeline_options)[source]

Bases: apache_beam.utils.plugin.BeamPlugin

A class that defines the functions that can be performed on a filesystem.

All methods are abstract and they are for file system providers to implement. Clients should use the FileSystems class to interact with the correct file system based on the provided file pattern scheme.

Parameters:pipeline_options – Instance of PipelineOptions.
CHUNK_SIZE = 1
classmethod scheme()[source]

URI scheme for the FileSystem

join(basepath, *paths)[source]

Join two or more pathname components for the filesystem

Parameters:
  • basepath – string path of the first component of the path
  • paths – path components to be added

Returns: full path after combining all the passed components

split(path)[source]

Splits the given path into two parts.

Splits the path into a pair (head, tail) such that tail contains the last component of the path and head contains everything up to that.

For file-systems other than the local file-system, head should include the prefix.

Parameters:path – path as a string
Returns:a pair of path components as strings.
mkdirs(path)[source]

Recursively create directories for the provided path.

Parameters:path – string path of the directory structure that should be created
Raises:IOError if leaf directory already exists.
match(patterns, limits=None)[source]

Find all matching paths to the patterns provided.

Parameters:
  • patterns – list of string for the file path pattern to match against
  • limits – list of maximum number of responses that need to be fetched

Returns: list of MatchResult objects.

Raises:BeamIOError if any of the pattern match operations fail
create(path, mime_type='application/octet-stream', compression_type='auto')[source]

Returns a write channel for the given file path.

Parameters:
  • path – string path of the file object to be written to the system
  • mime_type – MIME type to specify the type of content in the file object
  • compression_type – Type of compression to be used for this object

Returns: file handle with a close function for the user to use

open(path, mime_type='application/octet-stream', compression_type='auto')[source]

Returns a read channel for the given file path.

Parameters:
  • path – string path of the file object to be read
  • mime_type – MIME type to specify the type of content in the file object
  • compression_type – Type of compression to be used for this object

Returns: file handle with a close function for the user to use

copy(source_file_names, destination_file_names)[source]

Recursively copy the file tree from the source to the destination

Parameters:
  • source_file_names – list of source file objects that needs to be copied
  • destination_file_names – list of destination of the new object
Raises:

BeamIOError if any of the copy operations fail

rename(source_file_names, destination_file_names)[source]

Rename the files at the source list to the destination list. Source and destination lists should be of the same size.

Parameters:
  • source_file_names – List of file paths that need to be moved
  • destination_file_names – List of destination_file_names for the files
Raises:

BeamIOError if any of the rename operations fail

exists(path)[source]

Check if the provided path exists on the FileSystem.

Parameters:path – string path that needs to be checked.

Returns: boolean flag indicating if path exists

delete(paths)[source]

Deletes files or directories at the provided paths. Directories will be deleted recursively.

Parameters:paths – list of paths that give the file objects to be deleted
Raises:BeamIOError if any of the delete operations fail