apache_beam.io.gcp.datastore.v1 package

Submodules

apache_beam.io.gcp.datastore.v1.datastoreio module

A connector for reading from and writing to Google Cloud Datastore.

class apache_beam.io.gcp.datastore.v1.datastoreio.ReadFromDatastore(project, query, namespace=None, num_splits=0)[source]

Bases: apache_beam.transforms.ptransform.PTransform

A PTransform for reading from Google Cloud Datastore.

To read a PCollection[Entity] from a Cloud Datastore query, use the ReadFromDatastore transform, providing a project ID and the query to read from. You can optionally provide a namespace and/or specify how many splits you want for the query through the num_splits option.
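A minimal usage sketch (the project ID and kind name below are hypothetical, and the query protobuf is assumed to come from the googledatastore bindings used by this connector):

    import apache_beam as beam
    from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore
    from google.cloud.proto.datastore.v1 import query_pb2

    # Build a query over a single kind ('Task' is a hypothetical kind name).
    query = query_pb2.Query()
    query.kind.add().name = 'Task'

    with beam.Pipeline() as p:
      entities = (
          p
          | 'ReadTasks' >> ReadFromDatastore(
              project='my-project',  # hypothetical project ID
              query=query,
              num_splits=0))  # 0: choose the number of splits at runtime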

Note: Normally, a runner will read from Cloud Datastore in parallel across many workers. However, when the query is configured with a limit, or contains inequality filters such as GREATER_THAN or LESS_THAN, all results are read by a single worker in order to ensure correct data. Because the data is then read by a single worker, this can have a significant impact on the performance of the job.

The semantics of query splitting are defined below:

1. If num_splits is equal to 0, then the number of splits will be chosen dynamically at runtime based on the query data size.

2. Any value of num_splits greater than ReadFromDatastore._NUM_QUERY_SPLITS_MAX will be capped at that value.

3. If the query has a user limit set, or contains inequality filters, then num_splits will be ignored and no split will be performed.

4. In certain cases, Cloud Datastore is unable to split the query into the requested number of splits. In such cases, the splits that Cloud Datastore does return are used as-is.

See https://developers.google.com/datastore/ for more details on Google Cloud Datastore.

class ReadFn(project, namespace=None)[source]

Bases: apache_beam.transforms.core.DoFn

A DoFn that reads entities from Cloud Datastore, for a given query.

display_data()[source]
process(query, *args, **kwargs)[source]
start_bundle()[source]
class SplitQueryFn(project, query, namespace, num_splits)[source]

Bases: apache_beam.transforms.core.DoFn

A DoFn that splits a given query into multiple sub-queries.

display_data()[source]
process(query, *args, **kwargs)[source]
start_bundle()[source]
display_data()[source]
expand(pcoll)[source]
static get_estimated_num_splits(project, namespace, query, datastore)[source]

Computes the number of splits to be performed on the given query.

static get_estimated_size_bytes(project, namespace, query, datastore)[source]

Get the estimated size of the data returned by the given query.

Cloud Datastore provides no way to get a good estimate of how large the result of a query is going to be. Hence we use the __Stat_Kind__ system table to get the size of the entire kind as an approximate estimate, assuming exactly one kind is specified in the query. See https://cloud.google.com/datastore/docs/concepts/stats.

static query_latest_statistics_timestamp(project, namespace, datastore)[source]

Fetches the latest timestamp of statistics from Cloud Datastore.

Cloud Datastore system tables with statistics are periodically updated. This method fetches the latest timestamp (in microseconds) of the statistics update using the __Stat_Total__ table.

class apache_beam.io.gcp.datastore.v1.datastoreio.WriteToDatastore(project)[source]

Bases: apache_beam.io.gcp.datastore.v1.datastoreio._Mutate

A PTransform to write a PCollection[Entity] to Cloud Datastore.

static to_upsert_mutation(entity)[source]
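A hedged usage sketch (entity construction via the googledatastore helper is assumed; the kind and project ID are hypothetical):

    import uuid

    import apache_beam as beam
    from apache_beam.io.gcp.datastore.v1.datastoreio import WriteToDatastore
    from google.cloud.proto.datastore.v1 import entity_pb2
    from googledatastore import helper as datastore_helper

    def make_entity(content):
      # Builds an Entity proto keyed by a random name under the
      # hypothetical kind 'Task'.
      entity = entity_pb2.Entity()
      datastore_helper.add_key_path(entity.key, 'Task', str(uuid.uuid4()))
      datastore_helper.add_properties(entity, {'content': content})
      return entity

    with beam.Pipeline() as p:
      _ = (p
           | beam.Create([u'foo', u'bar'])
           | beam.Map(make_entity)
           | WriteToDatastore('my-project'))  # hypothetical project ID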
class apache_beam.io.gcp.datastore.v1.datastoreio.DeleteFromDatastore(project)[source]

Bases: apache_beam.io.gcp.datastore.v1.datastoreio._Mutate

A PTransform to delete a PCollection[Key] from Cloud Datastore.

static to_delete_mutation(key)[source]
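A corresponding sketch for deletes, where the keys are obtained from a read (the query is assumed built as in the ReadFromDatastore example above):

    import apache_beam as beam
    from apache_beam.io.gcp.datastore.v1.datastoreio import (
        DeleteFromDatastore, ReadFromDatastore)

    with beam.Pipeline() as p:
      _ = (p
           | 'Read' >> ReadFromDatastore('my-project', query)
           | 'ToKeys' >> beam.Map(lambda entity: entity.key)
           | 'Delete' >> DeleteFromDatastore('my-project'))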

apache_beam.io.gcp.datastore.v1.fake_datastore module

Fake datastore used for unit testing.

For internal use only; no backwards-compatibility guarantees.

apache_beam.io.gcp.datastore.v1.fake_datastore.create_commit(mutations)[source]

A fake Datastore commit method that writes the mutations to a list.

Parameters: mutations – A list to write mutations to.
Returns: A fake Datastore commit method.
apache_beam.io.gcp.datastore.v1.fake_datastore.create_entities(count, id_or_name=False)[source]

Creates a list of entities with random keys.

apache_beam.io.gcp.datastore.v1.fake_datastore.create_response(entities, end_cursor, finish)[source]

Creates a query response for a given batch of scatter entities.

apache_beam.io.gcp.datastore.v1.fake_datastore.create_run_query(entities, batch_size)[source]

A fake datastore run_query method that returns entities in batches.

Note: the outer function is needed to make entities and batch_size available in the scope of the returned fake_run_query method.

Parameters:
  • entities – the list of entities assumed to be contained in the datastore.
  • batch_size – the number of entities the run_query method returns per request.
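A sketch of how these fakes are typically wired together in a test (the mock-based client below is an assumption, not part of this module):

    from mock import MagicMock

    from apache_beam.io.gcp.datastore.v1 import fake_datastore

    # A fake client whose run_query returns two entities per batch.
    entities = fake_datastore.create_entities(10)
    datastore = MagicMock()
    datastore.run_query.side_effect = fake_datastore.create_run_query(
        entities, 2)

    # Mutations committed through the fake are captured for assertions.
    written_mutations = []
    datastore.commit.side_effect = fake_datastore.create_commit(
        written_mutations)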

apache_beam.io.gcp.datastore.v1.helper module

Cloud Datastore helper functions.

For internal use only; no backwards-compatibility guarantees.

class apache_beam.io.gcp.datastore.v1.helper.QueryIterator(project, namespace, query, datastore)[source]

Bases: object

An iterator class for the entities of a given query.

Entities are read in batches; reads are retried on failure.

apache_beam.io.gcp.datastore.v1.helper.compare_path(p1, p2)[source]

A comparator for key paths.

A path has either an id or a name field defined. The comparison follows these rules:

1. If one path has an id defined while the other doesn't, the one with an id defined is considered smaller.

2. If both paths have ids defined, their ids are compared.

3. If neither path has an id defined, their names are compared.
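These rules can be restated as a plain-Python sketch over a single pair of path elements (an illustration of the documented ordering, not the actual implementation; elements are assumed to expose id and name with proto3 defaults):

    def compare_path_element_sketch(e1, e2):
      # Rule 1: an element with an id sorts before one with only a name.
      if e1.id and not e2.id:
        return -1
      if e2.id and not e1.id:
        return 1
      # Rule 2: both have ids, so compare the ids.
      if e1.id:
        return (e1.id > e2.id) - (e1.id < e2.id)
      # Rule 3: neither has an id, so compare the names.
      return (e1.name > e2.name) - (e1.name < e2.name)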

apache_beam.io.gcp.datastore.v1.helper.fetch_entities(project, namespace, query, datastore)[source]

A helper method to fetch entities from Cloud Datastore.

Parameters:
  • project – Project ID.
  • namespace – Cloud Datastore namespace.
  • query – The query to fetch entities for.
  • datastore – Cloud Datastore client.
Returns: An iterator of entities.
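For example (a hedged sketch; the project ID is hypothetical and the query is assumed built as in the ReadFromDatastore example):

    from apache_beam.io.gcp.datastore.v1 import helper

    datastore = helper.get_datastore('my-project')  # hypothetical project ID
    # None is assumed to select the default namespace.
    for entity in helper.fetch_entities('my-project', None, query, datastore):
      print(entity)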

apache_beam.io.gcp.datastore.v1.helper.get_datastore(project)[source]

Returns a Cloud Datastore client.

apache_beam.io.gcp.datastore.v1.helper.is_key_valid(key)[source]

Returns True if a Cloud Datastore key is complete.

A key is complete if its last element has either an id or a name.
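A short sketch of completeness (the Key proto from the googledatastore bindings is assumed; the kind and name are hypothetical):

    from google.cloud.proto.datastore.v1 import entity_pb2

    from apache_beam.io.gcp.datastore.v1 import helper

    key = entity_pb2.Key()
    elem = key.path.add()
    elem.kind = 'Task'
    print(helper.is_key_valid(key))  # False: last element has no id or name.
    elem.name = 'task-1'
    print(helper.is_key_valid(key))  # True: last element now has a name.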

apache_beam.io.gcp.datastore.v1.helper.key_comparator(k1, k2)[source]

A comparator for Datastore keys.

Comparison is only valid for keys in the same partition. The comparison is between the lists of path elements of the two keys.

apache_beam.io.gcp.datastore.v1.helper.make_kind_stats_query(namespace, kind, latest_timestamp)[source]

Make a Query to fetch the latest kind statistics.

apache_beam.io.gcp.datastore.v1.helper.make_latest_timestamp_query(namespace)[source]

Make a Query to fetch the latest timestamp statistics.

apache_beam.io.gcp.datastore.v1.helper.make_partition(project, namespace)[source]

Make a PartitionId for the given project and namespace.

apache_beam.io.gcp.datastore.v1.helper.make_request(project, namespace, query)[source]

Make a Cloud Datastore request for the given query.

apache_beam.io.gcp.datastore.v1.helper.retry_on_rpc_error(exception)[source]

A retry filter for Cloud Datastore RPCErrors.

apache_beam.io.gcp.datastore.v1.helper.write_mutations(datastore, project, mutations)[source]

A helper function to write a batch of mutations to Cloud Datastore.

If a commit fails, it will be retried up to 5 times. All mutations in the batch will be committed again, even if the commit was partially successful. If the retry limit is exceeded, the last exception from Cloud Datastore will be raised.

apache_beam.io.gcp.datastore.v1.query_splitter module

Implements a Cloud Datastore query splitter.

apache_beam.io.gcp.datastore.v1.query_splitter.get_splits(datastore, query, num_splits, partition=None)[source]

Returns a list of sharded queries for the given Cloud Datastore query.

This will create up to the desired number of splits; however, it may return fewer splits if the desired number is unavailable. This happens when the underlying Datastore provides fewer split points than desired, which occurs when the number of results for the query is too small.

This implementation of the QuerySplitter uses the __scatter__ property to gather random split points for a query.

Note: This implementation is derived from the java query splitter in https://github.com/GoogleCloudPlatform/google-cloud-datastore/blob/master/java/datastore/src/main/java/com/google/datastore/v1/client/QuerySplitterImpl.java

Parameters:
  • datastore – the datastore client.
  • query – the query to split.
  • num_splits – the desired number of splits.
  • partition – the partition the query is running in.
Returns: A list of split queries, with a maximum length of num_splits.
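A hedged sketch of calling the splitter directly (the client and query are assumed as in the earlier examples):

    from apache_beam.io.gcp.datastore.v1 import helper, query_splitter

    datastore = helper.get_datastore('my-project')  # hypothetical project ID
    sub_queries = query_splitter.get_splits(datastore, query, 8)
    # Each sub-query covers a disjoint key range and can run independently.
    for sub_query in sub_queries:
      print(sub_query)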

Module contents