apache_beam.io.gcp.gcsio module
Google Cloud Storage client.
This library evolved from the Google App Engine GCS client available at https://github.com/GoogleCloudPlatform/appengine-gcs-client.
Updates to the I/O connector code
For any significant updates to this I/O connector, please consider involving the corresponding code reviewers listed in https://github.com/apache/beam/blob/master/sdks/python/OWNERS
- class apache_beam.io.gcp.gcsio.GcsIO(storage_client: Client | None = None, pipeline_options: dict | PipelineOptions | None = None)[source]
Bases: object
Google Cloud Storage I/O client.
- get_bucket(bucket_name, **kwargs)[source]
Returns the bucket object with the given name, or None if it does not exist.
- create_bucket(bucket_name, project, kms_key=None, location=None, soft_delete_retention_duration_seconds=0)[source]
Create and return a GCS bucket in a specific project.
- open(filename, mode='r', read_buffer_size=16777216, mime_type='application/octet-stream')[source]
Open a GCS file path for reading or writing.
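- Parameters:
filename – GCS file path in the form gs://<bucket>/<object>.
mode – 'r' for reading or 'w' for writing.
read_buffer_size – Buffer size to use during read operations.
mime_type – MIME type to set for write operations.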
- Returns:
GCS file object.
- Raises:
ValueError – Invalid open file mode.
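As a rough sketch of how open might be used for a write followed by a read: the helper below is hypothetical (it is not part of gcsio) and takes any object with a GcsIO-style open method. It assumes the returned file objects accept and return bytes and can be used as context managers.

```python
def write_then_read(gcs, path, payload):
    """Write `payload` (bytes) to `path`, then read it back.

    `gcs` is assumed to be a GcsIO instance (or anything with a
    compatible `open` method); this helper is illustrative only.
    """
    # Mode 'w' opens the object for writing; leaving the `with` block
    # closes the file, which is when the write is completed.
    with gcs.open(path, 'w', mime_type='text/plain') as f:
        f.write(payload)

    # Mode 'r' (the default) opens the object for reading.
    with gcs.open(path, 'r') as f:
        return f.read()
```

Any other mode string is expected to raise ValueError, as described above.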
- delete(path)[source]
Deletes the object at the given GCS path.
- Parameters:
path – GCS file path pattern in the form gs://<bucket>/<name>.
- delete_batch(paths)[source]
Deletes the objects at the given GCS paths. Warning: any exception during batch delete will NOT be retried.
- Parameters:
paths – List of GCS file path patterns, or dict with GCS file path patterns as keys. Each pattern is in the form gs://<bucket>/<name>. The collection should not exceed MAX_BATCH_OPERATION_SIZE in length.
- Returns:
List of tuples of (path, exception) in the same order as the paths argument, where exception is None if the operation succeeded or the relevant exception if the operation failed.
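One way to stay within the batch size limit and still see which deletes failed is to split the paths into chunks and collect the non-None exceptions. The helper below is a hypothetical sketch, not part of gcsio; chunk_size is supplied by the caller and would typically be the module's MAX_BATCH_OPERATION_SIZE.

```python
def delete_in_chunks(gcs, paths, chunk_size):
    """Delete `paths` in batches of at most `chunk_size`; return failures.

    `gcs` is assumed to be a GcsIO instance. The return value is a list
    of (path, exception) tuples for the deletes that did not succeed.
    """
    paths = list(paths)
    failures = []
    for start in range(0, len(paths), chunk_size):
        chunk = paths[start:start + chunk_size]
        # delete_batch returns (path, exception) tuples; exception is
        # None on success. Failed deletes are not retried for us.
        for path, exc in gcs.delete_batch(chunk):
            if exc is not None:
                failures.append((path, exc))
    return failures
```

Because batch deletes are not retried, the caller decides what to do with the returned failures (log them, retry them, or abort).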
- copy(src, dest)[source]
Copies the given GCS object from src to dest.
- Parameters:
src – GCS file path pattern in the form gs://<bucket>/<name>.
dest – GCS file path pattern in the form gs://<bucket>/<name>.
- Raises:
Any exceptions raised during copying.
- copy_batch(src_dest_pairs)[source]
Copies the given GCS objects from src to dest. Warning: any exception during batch copy will NOT be retried.
- Parameters:
src_dest_pairs – List of (src, dest) tuples of gs://<bucket>/<name> file paths to copy from src to dest, not to exceed MAX_BATCH_OPERATION_SIZE in length.
- Returns:
List of tuples of (src, dest, exception) in the same order as the src_dest_pairs argument, where exception is None if the operation succeeded or the relevant exception if the operation failed.
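Since copy_batch does not retry failures, a caller that wants retries has to re-submit the failed pairs itself. The following is a minimal sketch under stated assumptions: copy_batch_with_retry and its max_attempts parameter are hypothetical, there is no backoff between attempts, and the input is assumed to already fit within MAX_BATCH_OPERATION_SIZE.

```python
def copy_batch_with_retry(gcs, src_dest_pairs, max_attempts=3):
    """Call copy_batch, re-submitting only the pairs that failed.

    `gcs` is assumed to be a GcsIO instance. Returns the
    (src, dest, exception) tuples still failing after the last attempt,
    or an empty list if every copy eventually succeeded.
    """
    pending = list(src_dest_pairs)
    failed = []
    for _ in range(max_attempts):
        if not pending:
            break
        results = gcs.copy_batch(pending)
        # Keep only the entries whose exception slot is not None.
        failed = [(s, d, e) for s, d, e in results if e is not None]
        pending = [(s, d) for s, d, _exc in failed]
    return failed
```

Blindly retrying is only reasonable for transient errors; a real pipeline would probably inspect the exception type before re-submitting.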
- copytree(src, dest)[source]
Copies the given GCS “directory” recursively from src to dest.
- Parameters:
src – GCS file path pattern in the form gs://<bucket>/<name>/.
dest – GCS file path pattern in the form gs://<bucket>/<name>/.
- rename(src, dest)[source]
Renames the given GCS object from src to dest.
- Parameters:
src – GCS file path pattern in the form gs://<bucket>/<name>.
dest – GCS file path pattern in the form gs://<bucket>/<name>.
- exists(path)[source]
Returns whether the given GCS object exists.
- Parameters:
path – GCS file path pattern in the form gs://<bucket>/<name>.
- checksum(path)[source]
Looks up the checksum of a GCS object.
- Parameters:
path – GCS file path pattern in the form gs://<bucket>/<name>.
- size(path)[source]
Returns the size of a single GCS object.
This method does not perform glob expansion. Hence the given path must be for a single GCS object.
Returns: size of the GCS object in bytes.
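The exists and size methods can be combined when a missing object should be treated as a normal case rather than an error. This is a hypothetical helper, not part of gcsio, and it assumes path names exactly one object, since size does not expand globs.

```python
def size_or_none(gcs, path):
    """Return the object's size in bytes, or None if it does not exist.

    `gcs` is assumed to be a GcsIO instance.
    """
    # `path` must name a single object; wildcards are not expanded.
    if not gcs.exists(path):
        return None
    return gcs.size(path)
```

The two calls are separate requests, so the object could in principle be deleted between them; callers that need to handle that case would still have to catch the resulting exception.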
- kms_key(path)[source]
Returns the KMS key of a single GCS object.
This method does not perform glob expansion. Hence the given path must be for a single GCS object.
- Returns:
KMS key name of the GCS object as a string, or None if it doesn’t have one.
- last_updated(path)[source]
Returns the last updated epoch time of a single GCS object.
This method does not perform glob expansion. Hence the given path must be for a single GCS object.
Returns: last updated time of the GCS object, in seconds since the epoch.
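Because last_updated returns an epoch time in seconds, it usually needs converting before it can be displayed or compared with other timestamps. A small sketch, where last_updated_utc is a hypothetical helper name:

```python
from datetime import datetime, timezone


def last_updated_utc(gcs, path):
    """Return the object's last-updated time as a timezone-aware datetime.

    `gcs` is assumed to be a GcsIO instance.
    """
    # last_updated returns seconds since the Unix epoch.
    return datetime.fromtimestamp(gcs.last_updated(path), tz=timezone.utc)
```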
- list_prefix(path, with_metadata=False)[source]
Lists files matching the prefix.
list_prefix has been deprecated. Use list_files instead, which returns a generator of file information instead of a dict.
- Parameters:
path – GCS file path pattern in the form gs://<bucket>/[name].
with_metadata – Experimental. Specifies whether to return file metadata.
- Returns:
If with_metadata is False: dict of file name -> size; if with_metadata is True: dict of file name -> tuple(size, timestamp).
- list_files(path, with_metadata=False)[source]
Lists files matching the prefix.
- Parameters:
path – GCS file path pattern in the form gs://<bucket>/[name].
with_metadata – Experimental. Specifies whether to return file metadata.
- Returns:
If with_metadata is False: generator of tuple(file name, size); if with_metadata is True: generator of tuple(file name, tuple(size, timestamp)).
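The two return shapes of list_files can be consumed as in the sketch below. Both helpers are hypothetical and not part of gcsio; they assume only the tuple layouts described above, and that timestamps are mutually comparable.

```python
def total_size(gcs, prefix):
    """Sum the sizes (in bytes) of all objects under `prefix`.

    `gcs` is assumed to be a GcsIO instance.
    """
    # Without metadata, list_files yields (file name, size) tuples.
    return sum(size for _name, size in gcs.list_files(prefix))


def newest_file(gcs, prefix):
    """Return the name of the most recently updated object, or None."""
    # With metadata, each item is (file name, (size, timestamp)).
    newest_name, newest_ts = None, None
    for name, (_size, ts) in gcs.list_files(prefix, with_metadata=True):
        if newest_ts is None or ts > newest_ts:
            newest_name, newest_ts = name, ts
    return newest_name
```

Since list_files returns a generator, neither helper materializes the full listing in memory.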
- apache_beam.io.gcp.gcsio.create_storage_client(pipeline_options, use_credentials=True)[source]
Creates a GCS client for Beam using the Google Cloud Storage client library.
- Parameters:
pipeline_options (apache_beam.options.pipeline_options.PipelineOptions) – the options of the pipeline.
use_credentials (bool) – whether to create an authenticated client based on pipeline options or an anonymous client.
- Returns:
A google.cloud.storage.client.Client instance.
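To illustrate how create_storage_client and GcsIO fit together, the sketch below builds a GcsIO backed by an anonymous client. The function name make_anonymous_gcsio is hypothetical, and the example assumes Beam is installed with its GCP extras; the imports are deferred so that merely defining the function does not require them.

```python
def make_anonymous_gcsio():
    """Build a GcsIO backed by an anonymous (unauthenticated) client."""
    # Imported lazily so this sketch can be loaded without Beam's GCP
    # dependencies installed; calling the function does require them.
    from apache_beam.io.gcp import gcsio
    from apache_beam.options.pipeline_options import PipelineOptions

    # An empty flag list avoids picking up unrelated command-line args.
    options = PipelineOptions([])
    # use_credentials=False requests an anonymous client rather than one
    # authenticated from the pipeline options.
    client = gcsio.create_storage_client(options, use_credentials=False)
    return gcsio.GcsIO(storage_client=client, pipeline_options=options)
```

An anonymous client is only useful for publicly readable buckets; for anything else, leaving use_credentials at its default of True is the more likely choice.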