apache_beam.io.gcp.gcsio module

Google Cloud Storage client.

This library evolved from the Google App Engine GCS client available at https://github.com/GoogleCloudPlatform/appengine-gcs-client.

Updates to the I/O connector code

For any significant updates to this I/O connector, please consider involving corresponding code reviewers mentioned in https://github.com/apache/beam/blob/master/sdks/python/OWNERS

class apache_beam.io.gcp.gcsio.GcsIO(storage_client=None, pipeline_options=None)[source]

Bases: object

Google Cloud Storage I/O client.

get_project_number(bucket)[source]
get_bucket(bucket_name)[source]

Returns an object bucket from its name, or None if it does not exist.

create_bucket(bucket_name, project, kms_key=None, location=None)[source]

Create and return a GCS bucket in a specific project.

open(filename, mode='r', read_buffer_size=16777216, mime_type='application/octet-stream')[source]

Open a GCS file path for reading or writing.

Parameters:
  • filename (str) – GCS file path in the form gs://<bucket>/<object>.
  • mode (str) – 'r' for reading or 'w' for writing.
  • read_buffer_size (int) – Buffer size to use during read operations.
  • mime_type (str) – Mime type to set for write operations.
Returns:

GCS file object.

Raises:

ValueError – Invalid open file mode.
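
The gs://<bucket>/<object> form expected by open can be taken apart with plain string handling. The following is a minimal sketch for illustration only: split_gcs_path is a hypothetical helper, not a function of this module, and its validation rules are assumptions rather than the checks GcsIO itself performs.

```python
def split_gcs_path(path):
    """Split 'gs://<bucket>/<object>' into a (bucket, object) tuple.

    Hypothetical helper for illustration; not part of apache_beam.
    """
    prefix = "gs://"
    if not path.startswith(prefix):
        raise ValueError(f"Not a GCS path: {path!r}")
    # Everything up to the first '/' is the bucket; the rest is the object name.
    bucket, sep, name = path[len(prefix):].partition("/")
    if not bucket or not sep or not name:
        raise ValueError(f"Expected gs://<bucket>/<object>, got {path!r}")
    return bucket, name


bucket, name = split_gcs_path("gs://my-bucket/logs/2024/part-0.txt")
print(bucket, name)
```

Object names may themselves contain slashes, which is why only the first slash after the bucket is treated as a separator.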

delete(path)[source]

Deletes the object at the given GCS path.

Parameters: path – GCS file path pattern in the form gs://<bucket>/<name>.
delete_batch(paths)[source]

Deletes the objects at the given GCS paths.

Parameters: paths – List of GCS file path patterns in the form gs://<bucket>/<name>, not to exceed MAX_BATCH_OPERATION_SIZE in length.
Returns: List of tuples of (path, exception) in the same order as the paths argument, where exception is None if the operation succeeded or the relevant exception if the operation failed.
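
Because delete_batch limits the length of paths and reports failures per path instead of raising, callers typically chunk their input and inspect the returned tuples. The sketch below shows one way to do that. delete_in_batches and FakeGcs are hypothetical names introduced here; the default batch_size of 100 is an assumed stand-in for MAX_BATCH_OPERATION_SIZE, and the fake's treatment of missing objects is an assumption, not documented behavior.

```python
def delete_in_batches(gcs, paths, batch_size=100):
    """Delete paths in chunks and return the failures as {path: exception}.

    `gcs` is any object with a delete_batch(paths) method that returns
    (path, exception) tuples. `batch_size` stands in for
    MAX_BATCH_OPERATION_SIZE; the default of 100 is an assumption.
    """
    failures = {}
    for start in range(0, len(paths), batch_size):
        chunk = paths[start:start + batch_size]
        for path, exception in gcs.delete_batch(chunk):
            if exception is not None:
                failures[path] = exception
    return failures


class FakeGcs:
    """In-memory stand-in used only to exercise the sketch."""

    def __init__(self, existing):
        self.existing = set(existing)
        self.calls = []

    def delete_batch(self, paths):
        self.calls.append(list(paths))
        results = []
        for path in paths:
            if path in self.existing:
                self.existing.remove(path)
                results.append((path, None))
            else:
                # Assumed failure mode for a path that is not present.
                results.append((path, FileNotFoundError(path)))
        return results


fake = FakeGcs(["gs://b/a", "gs://b/b", "gs://b/c"])
failed = delete_in_batches(
    fake, ["gs://b/a", "gs://b/b", "gs://b/missing"], batch_size=2)
print(sorted(failed))
```

With a real client, the first argument would be a GcsIO instance; the helper only relies on the return shape described above.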
copy(src, dest, dest_kms_key_name=None, max_bytes_rewritten_per_call=None)[source]

Copies the given GCS object from src to dest.

Parameters:
  • src – GCS file path pattern in the form gs://<bucket>/<name>.
  • dest – GCS file path pattern in the form gs://<bucket>/<name>.
  • dest_kms_key_name – Experimental. No backwards compatibility guarantees. Encrypt dest with this Cloud KMS key. If None, will use dest bucket encryption defaults.
  • max_bytes_rewritten_per_call – Experimental. No backwards compatibility guarantees. Each rewrite API call will return after this many bytes. Used for testing.
Raises:

TimeoutError – on timeout.

copy_batch(src_dest_pairs, dest_kms_key_name=None, max_bytes_rewritten_per_call=None)[source]

Copies the given GCS objects, each from its src to its dest.

Parameters:
  • src_dest_pairs – List of (src, dest) tuples of gs://<bucket>/<name> file paths to copy from src to dest, not to exceed MAX_BATCH_OPERATION_SIZE in length.
  • dest_kms_key_name – Experimental. No backwards compatibility guarantees. Encrypt dest with this Cloud KMS key. If None, will use dest bucket encryption defaults.
  • max_bytes_rewritten_per_call – Experimental. No backwards compatibility guarantees. Each rewrite call will return after this many bytes. Used primarily for testing.
Returns: List of tuples of (src, dest, exception) in the same order as the src_dest_pairs argument, where exception is None if the operation succeeded or the relevant exception if the operation failed.
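
The src_dest_pairs argument and the (src, dest, exception) results are both plain lists of tuples, so they can be built and inspected with ordinary Python. The sketch below is an illustration under that assumption only: plan_copies and split_copy_results are hypothetical helpers, and the sample results list is made up for the example, not output from a real call.

```python
def plan_copies(paths, src_prefix, dest_prefix):
    """Build (src, dest) pairs that move each path under a new prefix."""
    pairs = []
    for path in paths:
        if not path.startswith(src_prefix):
            raise ValueError(f"{path!r} is not under {src_prefix!r}")
        pairs.append((path, dest_prefix + path[len(src_prefix):]))
    return pairs


def split_copy_results(results):
    """Separate (src, dest, exception) tuples into successes and failures."""
    copied, failed = [], []
    for src, dest, exception in results:
        if exception is None:
            copied.append((src, dest))
        else:
            failed.append((src, dest, exception))
    return copied, failed


pairs = plan_copies(
    ["gs://b/in/x.txt", "gs://b/in/sub/y.txt"], "gs://b/in/", "gs://b/out/")

# Made-up results in the documented shape: the second copy "failed".
sample_results = [
    (pairs[0][0], pairs[0][1], None),
    (pairs[1][0], pairs[1][1], OSError("simulated failure")),
]
copied, failed = split_copy_results(sample_results)
print(len(copied), len(failed))
```

Failed pairs can be passed back as a new src_dest_pairs list if a retry is appropriate.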
copytree(src, dest)[source]

Copies the given GCS “directory” recursively from src to dest.

Parameters:
  • src – GCS file path pattern in the form gs://<bucket>/<name>/.
  • dest – GCS file path pattern in the form gs://<bucket>/<name>/.
rename(src, dest)[source]

Renames the given GCS object from src to dest.

Parameters:
  • src – GCS file path pattern in the form gs://<bucket>/<name>.
  • dest – GCS file path pattern in the form gs://<bucket>/<name>.
exists(path)[source]

Returns whether the given GCS object exists.

Parameters: path – GCS file path pattern in the form gs://<bucket>/<name>.
checksum(path)[source]

Looks up the checksum of a GCS object.

Parameters: path – GCS file path pattern in the form gs://<bucket>/<name>.
size(path)[source]

Returns the size of a single GCS object.

This method does not perform glob expansion. Hence the given path must be for a single GCS object.

Returns: size of the GCS object in bytes.

kms_key(path)[source]

Returns the KMS key of a single GCS object.

This method does not perform glob expansion. Hence the given path must be for a single GCS object.

Returns: KMS key name of the GCS object as a string, or None if it doesn’t have one.
last_updated(path)[source]

Returns the last updated epoch time of a single GCS object.

This method does not perform glob expansion. Hence the given path must be for a single GCS object.

Returns: last updated time of the GCS object, in seconds since the epoch.
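
An epoch time in seconds converts directly to a datetime with the standard library. The sketch below assumes the value is measured from the Unix epoch in UTC; epoch_to_utc and is_stale are hypothetical helpers, and the sample timestamp is arbitrary.

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds):
    """Convert seconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def is_stale(last_updated_seconds, now_seconds, max_age_seconds):
    """Return True if the object was updated more than max_age_seconds ago."""
    return now_seconds - last_updated_seconds > max_age_seconds


# Arbitrary sample value standing in for the result of last_updated(path).
stamp = 86400.0
print(epoch_to_utc(stamp).isoformat())
print(is_stale(stamp, now_seconds=stamp + 7200, max_age_seconds=3600))
```

Keeping the comparison in raw seconds avoids any timezone handling; the datetime conversion is only needed for display.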

list_prefix(path, with_metadata=False)[source]

Lists files matching the prefix.

list_prefix has been deprecated. Use list_files instead, which returns a generator of file information instead of a dict.

Parameters:
  • path – GCS file path pattern in the form gs://<bucket>/[name].
  • with_metadata – Experimental. Whether to return file metadata.
Returns:

If with_metadata is False: dict of file name -> size; if with_metadata is True: dict of file name -> tuple(size, timestamp).

list_files(path, with_metadata=False)[source]

Lists files matching the prefix.

Parameters:
  • path – GCS file path pattern in the form gs://<bucket>/[name].
  • with_metadata – Experimental. Whether to return file metadata.
Returns:

If with_metadata is False: generator of tuple(file name, size); if with_metadata is True: generator of tuple(file name, tuple(size, timestamp)).
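
Since list_files yields tuples lazily, a listing can be aggregated in one pass without holding every entry in memory, and dict() rebuilds the mapping shape that list_prefix returned. The sketch below illustrates both under the tuple shapes described above. summarize_listing and fake_listing are hypothetical names, and the sample entries are made up rather than taken from a real bucket.

```python
def summarize_listing(entries, with_metadata=False):
    """Return (file_count, total_bytes, newest_timestamp_or_None).

    `entries` is any iterable shaped like the output of list_files:
    (name, size) pairs, or (name, (size, timestamp)) pairs when
    with_metadata is True.
    """
    count = 0
    total = 0
    newest = None
    for _name, info in entries:
        if with_metadata:
            size, timestamp = info
            if newest is None or timestamp > newest:
                newest = timestamp
        else:
            size = info
        count += 1
        total += size
    return count, total, newest


def fake_listing(with_metadata=False):
    """Made-up generator standing in for list_files(path, with_metadata)."""
    rows = [("gs://b/a", 10, 100.0), ("gs://b/b", 5, 200.0)]
    for name, size, timestamp in rows:
        yield (name, (size, timestamp)) if with_metadata else (name, size)


print(summarize_listing(fake_listing()))
print(summarize_listing(fake_listing(with_metadata=True), with_metadata=True))

# dict() over the generator recovers the name -> size mapping.
sizes = dict(fake_listing())
print(sizes)
```

A generator can be consumed only once, so a fresh call is needed for each pass over the listing.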