apache_beam.io.gcp.gcsio module
Google Cloud Storage client.
This library evolved from the Google App Engine GCS client available at https://github.com/GoogleCloudPlatform/appengine-gcs-client.
Updates to the I/O connector code
For any significant updates to this I/O connector, please consider involving the corresponding code reviewers listed in https://github.com/apache/beam/blob/master/sdks/python/OWNERS
-
class apache_beam.io.gcp.gcsio.GcsIO(storage_client=None, pipeline_options=None)
Bases: object
Google Cloud Storage I/O client.
-
get_bucket(bucket_name)
Returns an object bucket from its name, or None if it does not exist.
-
create_bucket(bucket_name, project, kms_key=None, location=None)
Create and return a GCS bucket in a specific project.
-
open(filename, mode='r', read_buffer_size=16777216, mime_type='application/octet-stream')
Open a GCS file path for reading or writing.
Returns: GCS file object.
Raises: ValueError – Invalid open file mode.
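As a rough illustration of the two modes, the sketch below writes bytes to a path and reads them back. It assumes `gcs` is an already constructed GcsIO instance (or any object with a compatible `open` method) and that the returned file objects expose `write`, `read`, and `close`; `roundtrip` is a hypothetical helper name, not part of this module.

```python
from contextlib import closing


def roundtrip(gcs, path, data):
    """Write `data` (bytes) to `path`, then read it back and return it.

    Assumption: `gcs.open(filename, mode)` returns a file-like object
    with write()/read()/close(), as a GcsIO instance is expected to.
    """
    # Mode 'w' opens the object for writing; closing() guarantees that
    # close() is called even if write() raises.
    with closing(gcs.open(path, 'w')) as writer:
        writer.write(data)
    # Mode 'r' opens the same object for reading.
    with closing(gcs.open(path, 'r')) as reader:
        return reader.read()
```

Using `contextlib.closing` avoids assuming that the file objects implement the context-manager protocol themselves.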
-
delete(path)
Deletes the object at the given GCS path.
Parameters: path – GCS file path pattern in the form gs://<bucket>/<name>.
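The gs://<bucket>/<name> form used throughout this page can be checked before a path is handed to the client. The helper below is a minimal sketch of such a check; `split_gcs_path` is a hypothetical name introduced here for illustration and is not claimed to match how the module parses paths internally.

```python
def split_gcs_path(path):
    """Split 'gs://<bucket>/<name>' into a (bucket, name) tuple.

    Hypothetical helper: raises ValueError when the path does not
    have both a bucket and an object name.
    """
    prefix = 'gs://'
    if not path.startswith(prefix):
        raise ValueError('Expected a gs:// path, got: %r' % path)
    # Everything up to the first '/' after the scheme is the bucket;
    # the remainder (which may itself contain '/') is the object name.
    bucket, sep, name = path[len(prefix):].partition('/')
    if not bucket or not sep or not name:
        raise ValueError(
            'Path must have the form gs://<bucket>/<name>: %r' % path)
    return bucket, name
```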
-
delete_batch(paths)
Deletes the objects at the given GCS paths.
Parameters: paths – List of GCS file path patterns in the form gs://<bucket>/<name>, not to exceed MAX_BATCH_OPERATION_SIZE in length.
Returns: List of tuples of (path, exception) in the same order as the paths argument, where exception is None if the operation succeeded or the relevant exception if the operation failed.
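Because a single call accepts at most MAX_BATCH_OPERATION_SIZE paths, longer lists have to be split by the caller. The sketch below shows one way to do that and to collect only the failed entries. `delete_in_batches` is a hypothetical helper, `gcs` is assumed to be a GcsIO instance, and the limit is passed in explicitly as `batch_size` (for instance, the module's MAX_BATCH_OPERATION_SIZE) rather than hard-coded.

```python
def delete_in_batches(gcs, paths, batch_size):
    """Delete `paths` in chunks of at most `batch_size` entries.

    Returns a list of (path, exception) tuples for the deletions that
    failed; an empty list means every deletion succeeded.
    """
    if batch_size <= 0:
        raise ValueError('batch_size must be positive')
    failures = []
    for start in range(0, len(paths), batch_size):
        chunk = paths[start:start + batch_size]
        # delete_batch reports one (path, exception) tuple per input
        # path; exception is None when that deletion succeeded.
        for path, exception in gcs.delete_batch(chunk):
            if exception is not None:
                failures.append((path, exception))
    return failures
```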
-
copy(src, dest, dest_kms_key_name=None, max_bytes_rewritten_per_call=None)
Copies the given GCS object from src to dest.
Parameters: - src – GCS file path pattern in the form gs://<bucket>/<name>.
- dest – GCS file path pattern in the form gs://<bucket>/<name>.
- dest_kms_key_name – Experimental. No backwards compatibility guarantees. Encrypt dest with this Cloud KMS key. If None, will use dest bucket encryption defaults.
- max_bytes_rewritten_per_call – Experimental. No backwards compatibility guarantees. Each rewrite API call will return after this many bytes. Used for testing.
Raises: TimeoutError – on timeout.
-
copy_batch(src_dest_pairs, dest_kms_key_name=None, max_bytes_rewritten_per_call=None)
Copies the given GCS objects from src to dest.
Parameters: - src_dest_pairs – List of (src, dest) tuples of gs://<bucket>/<name> file paths to copy from src to dest, not to exceed MAX_BATCH_OPERATION_SIZE in length.
- dest_kms_key_name – Experimental. No backwards compatibility guarantees. Encrypt dest with this Cloud KMS key. If None, will use dest bucket encryption defaults.
- max_bytes_rewritten_per_call – Experimental. No backwards compatibility guarantees. Each rewrite call will return after this many bytes. Used primarily for testing.
Returns: List of tuples of (src, dest, exception) in the same order as the src_dest_pairs argument, where exception is None if the operation succeeded or the relevant exception if the operation failed.
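Since failures are reported per pair in the return value, callers typically need to inspect the result list. A minimal sketch of separating successful copies from failed ones follows; `partition_copy_results` is a hypothetical helper that operates only on the documented (src, dest, exception) tuples.

```python
def partition_copy_results(results):
    """Split copy_batch results into (copied, failed).

    `results` is a list of (src, dest, exception) tuples. `copied`
    holds (src, dest) pairs whose exception is None; `failed` keeps
    the full tuples so the exception can be logged or retried.
    """
    copied, failed = [], []
    for src, dest, exception in results:
        if exception is None:
            copied.append((src, dest))
        else:
            failed.append((src, dest, exception))
    return copied, failed
```

Keeping the failed tuples intact makes it straightforward to build a retry list from `[(src, dest) for src, dest, _ in failed]`.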
-
copytree(src, dest)
Copies the given GCS “directory” recursively from src to dest.
Parameters: - src – GCS file path pattern in the form gs://<bucket>/<name>/.
- dest – GCS file path pattern in the form gs://<bucket>/<name>/.
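Both paths in the documented form end with a trailing slash. The sketch below normalizes user-supplied paths to that form before calling copytree. `as_gcs_dir` and `copytree_normalized` are hypothetical helpers, and `gcs` is assumed to be a GcsIO instance.

```python
def as_gcs_dir(path):
    """Return `path` with exactly one trailing slash."""
    return path.rstrip('/') + '/'


def copytree_normalized(gcs, src, dest):
    """Call gcs.copytree with both paths in gs://<bucket>/<name>/ form."""
    gcs.copytree(as_gcs_dir(src), as_gcs_dir(dest))
```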
-
rename(src, dest)
Renames the given GCS object from src to dest.
Parameters: - src – GCS file path pattern in the form gs://<bucket>/<name>.
- dest – GCS file path pattern in the form gs://<bucket>/<name>.
-
exists(path)
Returns whether the given GCS object exists.
Parameters: path – GCS file path pattern in the form gs://<bucket>/<name>.
-
checksum(path)
Looks up the checksum of a GCS object.
Parameters: path – GCS file path pattern in the form gs://<bucket>/<name>.
-
size(path)
Returns the size of a single GCS object.
This method does not perform glob expansion. Hence the given path must be for a single GCS object.
Returns: size of the GCS object in bytes.
-
kms_key(path)
Returns the KMS key of a single GCS object.
This method does not perform glob expansion. Hence the given path must be for a single GCS object.
Returns: KMS key name of the GCS object as a string, or None if it doesn’t have one.
-
last_updated(path)
Returns the last updated epoch time of a single GCS object.
This method does not perform glob expansion. Hence the given path must be for a single GCS object.
Returns: last updated time of the GCS object in seconds.
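The single-object accessors (size, kms_key, last_updated) can be combined into one summary. The following is a minimal sketch under the assumption that `gcs` is a GcsIO instance and `path` names exactly one object; `describe_object` is a hypothetical helper, and converting the epoch seconds to a UTC datetime is a choice made here for readability.

```python
from datetime import datetime, timezone


def describe_object(gcs, path):
    """Collect size, KMS key and last-updated time for one GCS object.

    None of the underlying methods expand globs, so `path` must refer
    to a single object.
    """
    updated_seconds = gcs.last_updated(path)
    return {
        'path': path,
        'size_bytes': gcs.size(path),
        # None when the object has no KMS key.
        'kms_key': gcs.kms_key(path),
        # last_updated returns seconds since the epoch; convert to an
        # aware UTC datetime.
        'last_updated': datetime.fromtimestamp(
            updated_seconds, tz=timezone.utc),
    }
```

Each accessor is a separate call on the client, so this helper is meant for occasional inspection rather than for large listings.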
-
list_prefix(path, with_metadata=False)
Lists files matching the prefix.
list_prefix has been deprecated. Use list_files instead, which returns a generator of file information instead of a dict.
Parameters: - path – GCS file path pattern in the form gs://<bucket>/[name].
- with_metadata – Experimental. Specifies whether to return file metadata.
Returns: If with_metadata is False: dict of file name -> size; if with_metadata is True: dict of file name -> tuple(size, timestamp).
-
list_files(path, with_metadata=False)
Lists files matching the prefix.
Parameters: - path – GCS file path pattern in the form gs://<bucket>/[name].
- with_metadata – Experimental. Specifies whether to return file metadata.
Returns: If with_metadata is False: generator of tuple(file name, size); if with_metadata is True: generator of tuple(file name, tuple(size, timestamp)).
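Because list_files yields entries lazily, aggregates can be computed without building a dict first. The sketch below consumes both result shapes: (file name, size) tuples by default, and (file name, (size, timestamp)) tuples when with_metadata is True. `total_size` and `newest_file` are hypothetical helpers, and `gcs` is assumed to be a GcsIO instance.

```python
def total_size(gcs, prefix):
    """Sum the sizes, in bytes, of all files under `prefix`."""
    return sum(size for _name, size in gcs.list_files(prefix))


def newest_file(gcs, prefix):
    """Return (file name, timestamp) of the most recently updated file.

    Returns None when nothing matches `prefix`.
    """
    newest = None
    listing = gcs.list_files(prefix, with_metadata=True)
    for name, (_size, timestamp) in listing:
        if newest is None or timestamp > newest[1]:
            newest = (name, timestamp)
    return newest
```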
-