apache_beam.io.hadoopfilesystem module

FileSystem implementation for accessing Hadoop Distributed File System files.

class apache_beam.io.hadoopfilesystem.HadoopFileSystem(pipeline_options)[source]

Bases: apache_beam.io.filesystem.FileSystem

FileSystem implementation that supports HDFS.

URL arguments to methods expect strings starting with hdfs://.

Initializes a connection to HDFS.

Connection configuration is done by passing pipeline options. See HadoopFileSystemOptions.

classmethod scheme()[source]
join(base_url, *paths)[source]

Join two or more pathname components.

Parameters:
  • base_url – string path of the first component of the path. Must start with hdfs://.
  • paths – path components to be added
Returns:

Full url after combining all the passed components.

split(url)[source]
mkdirs(url)[source]
has_dirs()[source]
create(url, mime_type='application/octet-stream', compression_type='auto')[source]
Returns:A Python File-like object.
open(url, mime_type='application/octet-stream', compression_type='auto')[source]
Returns:A Python File-like object.
copy(source_file_names, destination_file_names)[source]

It is an error if any file to copy already exists at the destination.

Raises BeamIOError if any error occurred.

Parameters:
  • source_file_names – iterable of URLs.
  • destination_file_names – iterable of URLs.
rename(source_file_names, destination_file_names)[source]
exists(url)[source]

Checks existence of url in HDFS.

Parameters:url – String in the form hdfs://…
Returns:True if url exists as a file or directory in HDFS.
size(url)[source]

Fetches file size for a URL.

Returns:int size of path according to the FileSystem.
Raises:BeamIOError – if url doesn’t exist.
last_updated(url)[source]

Fetches last updated time for a URL.

Parameters:url – string url of file.

Returns: float UNIX Epoch time

Raises:BeamIOError – if path doesn’t exist.
checksum(url)[source]

Fetches a checksum description for a URL.

Returns:String describing the checksum.
Raises:BeamIOError – if url doesn’t exist.
metadata(url)[source]

Fetch metadata fields of a file on the FileSystem.

Parameters:url – string url of a file.
Returns:FileMetadata.
Raises:BeamIOError – if url doesn’t exist.
delete(urls)[source]