apache_beam.io.filesystem module¶
File system abstraction for file-based sources and sinks.
- Note to implementors:
“path” arguments will be URLs in the form scheme://foo/bar. The exception is LocalFileSystem, which gets unix-style paths in the form /foo/bar.
- class apache_beam.io.filesystem.CompressionTypes[source]¶
Bases:
object
Enum-like class representing known compression types.
- AUTO = 'auto'¶
- BZIP2 = 'bzip2'¶
- DEFLATE = 'deflate'¶
- ZSTD = 'zstd'¶
- GZIP = 'gzip'¶
- LZMA = 'lzma'¶
- UNCOMPRESSED = 'uncompressed'¶
- class apache_beam.io.filesystem.CompressedFile(fileobj: BinaryIO, compression_type='gzip', read_size=16777216)[source]¶
Bases:
object
File wrapper for easier handling of compressed files.
- seek(offset: int, whence: int = 0) None [source]¶
Set the file’s current offset.
Seeking behavior:
seeking from the end
os.SEEK_END
the whole file is decompressed once to determine its size. Therefore it is preferred to useos.SEEK_SET
oros.SEEK_CUR
to avoid the processing overheadseeking backwards from the current position rewinds the file to
0
and decompresses the chunks to the requested offsetseeking is only supported in files opened for reading
if the new offset is out of bound, it is adjusted to either
0
orEOF
.
- Parameters:
offset (int) – seek offset in the uncompressed content represented as number
whence (int) – seek mode. Supported modes are
os.SEEK_SET
(absolute seek),os.SEEK_CUR
(seek relative to the current position), andos.SEEK_END
(seek relative to the end, offset should be negative).
- Raises:
IOError – When this buffer is closed.
ValueError – When whence is invalid or the file is not seekable
- class apache_beam.io.filesystem.FileMetadata(path: str, size_in_bytes: int, last_updated_in_seconds: float = 0.0)[source]¶
Bases:
object
Metadata about a file path that is the output of FileSystem.match.
- Fields:
path: [Required] file path. size_in_bytes: [Required] file size in bytes. last_updated_in_seconds: [Optional] last modified timestamp of the file, or valued 0.0 if not specified.
- class apache_beam.io.filesystem.FileSystem(pipeline_options)[source]¶
Bases:
BeamPlugin
A class that defines the functions that can be performed on a filesystem.
All methods are abstract and they are for file system providers to implement. Clients should use the FileSystems class to interact with the correct file system based on the provided file pattern scheme.
- Parameters:
pipeline_options – Instance of
PipelineOptions
or dict of options and values (likeRuntimeValueProvider.runtime_options
).
- CHUNK_SIZE = 1¶
- abstract join(basepath: str, *paths: str) str [source]¶
Join two or more pathname components for the filesystem
- Parameters:
basepath – string path of the first component of the path
paths – path components to be added
Returns: full path after combining all the passed components
- abstract split(path: str) Tuple[str, str] [source]¶
Splits the given path into two parts.
Splits the path into a pair (head, tail) such that tail contains the last component of the path and head contains everything up to that.
For file-systems other than the local file-system, head should include the prefix.
- Parameters:
path – path as a string
- Returns:
a pair of path components as strings.
- abstract mkdirs(path)[source]¶
Recursively create directories for the provided path.
- Parameters:
path – string path of the directory structure that should be created
- Raises:
IOError – if leaf directory already exists.
- match_files(file_metas: List[FileMetadata], pattern: str) Iterator[FileMetadata] [source]¶
Filter
FileMetadata
objects by pattern- Parameters:
file_metas (list of
FileMetadata
) – Files to consider when matchingpattern (str) – File pattern
See also
- Returns:
Generator of matching
FileMetadata
- static translate_pattern(pattern: str) str [source]¶
Translate a pattern to a regular expression. There is no way to quote meta-characters.
- Pattern syntax:
The pattern syntax is based on the fnmatch syntax, with the following differences:
*
Is equivalent to[^/\]*
rather than.*
.**
Is equivalent to.*
.
See also
match()
uses this methodThis method is based on Python 2.7’s fnmatch.translate. The code in this method is licensed under PYTHON SOFTWARE FOUNDATION LICENSE VERSION 2.
- match(patterns, limits=None)[source]¶
Find all matching paths to the patterns provided.
See also
Patterns ending with ‘/’ or ‘' will be appended with ‘*’.
- Parameters:
patterns – list of string for the file path pattern to match against
limits – list of maximum number of responses that need to be fetched
Returns: list of
MatchResult
objects.- Raises:
BeamIOError – if any of the pattern match operations fail
- abstract create(path, mime_type='application/octet-stream', compression_type='auto') BinaryIO [source]¶
Returns a write channel for the given file path.
- Parameters:
path – string path of the file object to be written to the system
mime_type – MIME type to specify the type of content in the file object
compression_type – Type of compression to be used for this object
Returns: file handle with a close function for the user to use
- abstract open(path, mime_type='application/octet-stream', compression_type='auto') BinaryIO [source]¶
Returns a read channel for the given file path.
- Parameters:
path – string path of the file object to be read
mime_type – MIME type to specify the type of content in the file object
compression_type – Type of compression to be used for this object
Returns: file handle with a close function for the user to use
- abstract copy(source_file_names, destination_file_names)[source]¶
Recursively copy the file tree from the source to the destination
- Parameters:
source_file_names – list of source file objects that needs to be copied
destination_file_names – list of destination of the new object
- Raises:
BeamIOError – if any of the copy operations fail
- abstract rename(source_file_names, destination_file_names)[source]¶
Rename the files at the source list to the destination list. Source and destination lists should be of the same size.
- Parameters:
source_file_names – List of file paths that need to be moved
destination_file_names – List of destination_file_names for the files
- Raises:
BeamIOError – if any of the rename operations fail
- abstract exists(path: str) bool [source]¶
Check if the provided path exists on the FileSystem.
- Parameters:
path – string path that needs to be checked.
Returns: boolean flag indicating if path exists
- abstract size(path: str) int [source]¶
Get size in bytes of a file on the FileSystem.
- Parameters:
path – string filepath of file.
Returns: int size of file according to the FileSystem.
- Raises:
BeamIOError – if path doesn’t exist.
- abstract last_updated(path)[source]¶
Get UNIX Epoch time in seconds on the FileSystem.
- Parameters:
path – string path of file.
Returns: float UNIX Epoch time
- Raises:
BeamIOError – if path doesn’t exist.
- checksum(path)[source]¶
Fetch checksum metadata of a file on the
FileSystem
.This operation returns checksum metadata as stored in the underlying FileSystem. It should not need to read file data to obtain this value. Checksum type and format are FileSystem dependent and are not compatible between FileSystems. FileSystem implementations may return file size if a checksum isn’t available.
- Parameters:
path – string path of a file.
Returns: string containing checksum
- Raises:
BeamIOError – if path isn’t a file or doesn’t exist.
- abstract metadata(path)[source]¶
Fetch metadata of a file on the
FileSystem
.This operation returns metadata as stored in the underlying FileSystem. It should not need to read file data to obtain this value. For web based file systems, this method should also incur as few as possible requests.
- Parameters:
path – string path of a file.
- Returns:
- Raises:
BeamIOError – if path isn’t a file or doesn’t exist.