apache_beam.io.hadoopfilesystem module

FileSystem implementation for accessing Hadoop Distributed File System files.

class apache_beam.io.hadoopfilesystem.HadoopFileSystem(pipeline_options)[source]

Bases: FileSystem

FileSystem implementation that supports HDFS.

URL arguments to methods expect strings starting with hdfs://.

Initializes a connection to HDFS.

Connection configuration is done by passing pipeline options. See HadoopFileSystemOptions.

classmethod scheme()[source]
join(base_url, *paths)[source]

Join two or more pathname components.

Parameters:
  • base_url – string path of the first component of the path. Must start with hdfs://.

  • paths – path components to be added

Returns:

Full url after combining all the passed components.

split(url)[source]
mkdirs(url)[source]
has_dirs()[source]
create(url, mime_type='application/octet-stream', compression_type='auto') BinaryIO[source]
Returns:

A Python File-like object.

open(url, mime_type='application/octet-stream', compression_type='auto') BinaryIO[source]
Returns:

A Python File-like object.

copy(source_file_names, destination_file_names)[source]

It is an error if any file to copy already exists at the destination.

Raises BeamIOError if any error occurred.

Parameters:
  • source_file_names – iterable of URLs.

  • destination_file_names – iterable of URLs.

rename(source_file_names, destination_file_names)[source]
exists(url: str) bool[source]

Checks existence of url in HDFS.

Parameters:

url – String in the form hdfs://…

Returns:

True if url exists as a file or directory in HDFS.

size(url)[source]

Fetches file size for a URL.

Returns:

int size of path according to the FileSystem.

Raises:

BeamIOError – if url doesn’t exist.

last_updated(url)[source]

Fetches last updated time for a URL.

Parameters:

url – string url of file.

Returns: float UNIX Epoch time

Raises:

BeamIOError – if path doesn’t exist.

checksum(url)[source]

Fetches a checksum description for a URL.

Returns:

String describing the checksum.

Raises:

BeamIOError – if url doesn’t exist.

metadata(url)[source]

Fetch metadata fields of a file on the FileSystem.

Parameters:

url – string url of a file.

Returns:

FileMetadata.

Raises:

BeamIOError – if url doesn’t exist.

delete(urls)[source]