apache_beam.io.hadoopfilesystem module

FileSystem implementation for accessing Hadoop Distributed File System files.

class apache_beam.io.hadoopfilesystem.HadoopFileSystem(pipeline_options)[source]

Bases: apache_beam.io.filesystem.FileSystem

FileSystem implementation that supports HDFS.

URL arguments to methods expect strings starting with hdfs://.

Experimental; TODO(BEAM-3600): Writes are experimental until file rename
retries are better handled.

Initializes a connection to HDFS.

Connection configuration is done by passing pipeline options. See HadoopFileSystemOptions.

classmethod scheme()[source]
join(base_url, *paths)[source]

Join two or more pathname components.

Parameters:
  • base_url – string path of the first component of the path. Must start with hdfs://.
  • paths – path components to be added
Returns:

Full url after combining all the passed components.

split(url)[source]
mkdirs(url)[source]
match(url_patterns, limits=None)[source]
create(url, mime_type='application/octet-stream', compression_type='auto')[source]
Returns:A Python File-like object.
open(url, mime_type='application/octet-stream', compression_type='auto')[source]
Returns:A Python File-like object.
copy(source_file_names, destination_file_names)[source]

It is an error if any file to copy already exists at the destination.

Raises BeamIOError if any error occurred.

Parameters:
  • source_file_names – iterable of URLs.
  • destination_file_names – iterable of URLs.
rename(source_file_names, destination_file_names)[source]
exists(url)[source]

Checks existence of url in HDFS.

Parameters:url – String in the form hdfs://…
Returns:True if url exists as a file or directory in HDFS.
delete(urls)[source]