apache_beam.io.gcp.datastore.v1new.datastoreio module

A connector for reading from and writing to Google Cloud Datastore.

This module uses the newer google-cloud-datastore client package. Its API was different enough to require extensive changes to this and associated modules.

class apache_beam.io.gcp.datastore.v1new.datastoreio.ReadFromDatastore(query, num_splits=0)[source]

Bases: apache_beam.transforms.ptransform.PTransform

A PTransform for querying Google Cloud Datastore.

To read a PCollection[Entity] from a Cloud Datastore Query, use the ReadFromDatastore transform by providing a query to read from. The project and optional namespace are set in the query. The query will be split into multiple queries to allow for parallelism. The degree of parallelism is automatically determined, but can be overridden by setting num_splits to a value of 1 or greater.

Note: Normally, a runner will read from Cloud Datastore in parallel across many workers. However, when the query is configured with a limit or if the query contains inequality filters like GREATER_THAN, LESS_THAN etc., then all the returned results will be read by a single worker in order to ensure correct data. Since data is read from a single worker, this could have significant impact on the performance of the job. Using a Reshuffle transform after the read in this case might be beneficial for parallelizing work across workers.

The semantics for query splitting is defined below:

1. If num_splits is equal to 0, then the number of splits will be chosen dynamically at runtime based on the query data size.

2. Any value of num_splits greater than ReadFromDatastore._NUM_QUERY_SPLITS_MAX will be capped at that value.

3. If the query has a user limit set, or contains inequality filters, then num_splits will be ignored and no split will be performed.

4. Under certain cases Cloud Datastore is unable to split query to the requested number of splits. In such cases we just use whatever Cloud Datastore returns.

See https://developers.google.com/datastore/ for more details on Google Cloud Datastore.

Initialize the ReadFromDatastore transform.

This transform outputs elements of type Entity.

Parameters:
  • query – (Query) query used to fetch entities.
  • num_splits – (int) (optional) Number of splits for the query.
expand(pcoll)[source]
display_data()[source]
class apache_beam.io.gcp.datastore.v1new.datastoreio.WriteToDatastore(project, throttle_rampup=True, hint_num_workers=500)[source]

Bases: apache_beam.io.gcp.datastore.v1new.datastoreio._Mutate

Writes elements of type Entity to Cloud Datastore.

Entity keys must be complete. The project field in each key must match the project ID passed to this transform. If project field in entity or property key is empty then it is filled with the project ID passed to this transform.

Initialize the WriteToDatastore transform.

Parameters:
  • project – (str) The ID of the project to write entities to.
  • throttle_rampup – Whether to enforce a gradual ramp-up.
  • hint_num_workers – A hint for the expected number of workers, used to estimate appropriate limits during ramp-up throttling.
class apache_beam.io.gcp.datastore.v1new.datastoreio.DeleteFromDatastore(project, throttle_rampup=True, hint_num_workers=500)[source]

Bases: apache_beam.io.gcp.datastore.v1new.datastoreio._Mutate

Deletes elements matching input Key elements from Cloud Datastore.

Keys must be complete. The project field in each key must match the project ID passed to this transform. If project field in key is empty then it is filled with the project ID passed to this transform.

Initialize the DeleteFromDatastore transform.

Parameters:
  • project – (str) The ID of the project from which the entities will be deleted.
  • throttle_rampup – Whether to enforce a gradual ramp-up.
  • hint_num_workers – A hint for the expected number of workers, used to estimate appropriate limits during ramp-up throttling.