apache_beam.io.gcp.datastore.v1.datastoreio module

A connector for reading from and writing to Google Cloud Datastore.

class apache_beam.io.gcp.datastore.v1.datastoreio.ReadFromDatastore(project, query, namespace=None, num_splits=0)[source]

Bases: apache_beam.transforms.ptransform.PTransform

A PTransform for reading from Google Cloud Datastore.

To read a PCollection[Entity] from a Cloud Datastore query, use the ReadFromDatastore transform, providing a project ID and the query to read from. You can optionally provide a namespace and/or specify the number of splits for the query through the num_splits option.

Note: Normally, a runner reads from Cloud Datastore in parallel across many workers. However, when the query is configured with a limit, or when it contains inequality filters such as GREATER_THAN or LESS_THAN, all returned results are read by a single worker to ensure correctness. Because the data is then read by a single worker, this can significantly impact the performance of the job.

The semantics of query splitting are as follows:

1. If num_splits is equal to 0, then the number of splits will be chosen dynamically at runtime based on the query data size.

2. Any value of num_splits greater than ReadFromDatastore._NUM_QUERY_SPLITS_MAX will be capped at that value.

3. If the query has a user limit set, or contains inequality filters, then num_splits will be ignored and no split will be performed.

4. In some cases Cloud Datastore is unable to split a query into the requested number of splits. In such cases, we simply use whatever splits Cloud Datastore returns.

See https://developers.google.com/datastore/ for more details on Google Cloud Datastore.
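The four splitting rules above can be sketched as a small helper. This is an illustrative, plain-Python sketch only: the value of _NUM_QUERY_SPLITS_MAX and the way the dynamic estimate is passed in are assumptions, not the transform's actual internals.

```python
# Hypothetical cap; Beam's real _NUM_QUERY_SPLITS_MAX may differ.
_NUM_QUERY_SPLITS_MAX = 50000


def choose_num_splits(num_splits, has_limit_or_inequality, dynamic_estimate):
    """Applies splitting rules 1-4 to pick the effective number of splits.

    dynamic_estimate stands in for the runtime estimate based on query
    data size (rule 1); it is an assumed parameter for illustration.
    """
    if has_limit_or_inequality:
        # Rule 3: a user limit or inequality filter disables splitting.
        return 1
    if num_splits == 0:
        # Rule 1: choose the number of splits dynamically at runtime.
        return dynamic_estimate
    # Rule 2: cap any user-requested value at the maximum.
    return min(num_splits, _NUM_QUERY_SPLITS_MAX)
```

Rule 4 is not modeled here, since it depends on what Cloud Datastore actually returns at split time.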

Initialize the ReadFromDatastore transform.

Parameters:
  • project – The ID of the project to read from.
  • query – The Cloud Datastore query to read from.
  • namespace – An optional namespace.
  • num_splits – Number of splits for the query.
expand(pcoll)[source]
display_data()[source]
class SplitQueryFn(project, query, namespace, num_splits)[source]

Bases: apache_beam.transforms.core.DoFn

A DoFn that splits a given query into multiple sub-queries.

start_bundle()[source]
process(query, *args, **kwargs)[source]
display_data()[source]
class ReadFn(project, namespace=None)[source]

Bases: apache_beam.transforms.core.DoFn

A DoFn that reads entities from Cloud Datastore, for a given query.

start_bundle()[source]
process(query, *args, **kwargs)[source]
display_data()[source]
static query_latest_statistics_timestamp(project, namespace, datastore)[source]

Fetches the latest timestamp of statistics from Cloud Datastore.

Cloud Datastore system tables with statistics are periodically updated. This method fetches the latest update timestamp (in microseconds) of the statistics, using the __Stat_Total__ table.

static get_estimated_size_bytes(project, namespace, query, datastore)[source]

Gets the estimated size of the data returned by the given query.

Cloud Datastore provides no way to get an accurate estimate of how large the result of a query will be. We therefore use the __Stat_Kind__ system table to get the size of the entire kind as an approximation, assuming exactly one kind is specified in the query. See https://cloud.google.com/datastore/docs/concepts/stats.
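One detail worth noting: per the Cloud Datastore statistics docs, per-kind sizes live in __Stat_Kind__ for the default namespace, while namespaced statistics use a separate __Stat_Ns_Kind__ entity. A minimal sketch of picking the right statistics kind, assuming that naming convention:

```python
def stats_kind_for(namespace):
    """Returns the Datastore statistics kind that holds per-kind sizes.

    __Stat_Ns_Kind__ for namespaced lookups is taken from the Datastore
    statistics documentation; this helper is illustrative, not Beam's API.
    """
    return '__Stat_Ns_Kind__' if namespace else '__Stat_Kind__'
```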

static get_estimated_num_splits(project, namespace, query, datastore)[source]

Computes the number of splits to be performed on the given query.
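The computation can be sketched as dividing the estimated result size into fixed-size bundles and clamping the result. All three constants below are assumptions chosen for illustration; the module's actual bundle size and bounds may differ.

```python
# Assumed constants, for illustration only.
_DEFAULT_BUNDLE_SIZE_BYTES = 64 * 1024 * 1024  # 64 MB per split (assumed)
_NUM_QUERY_SPLITS_MIN = 12                     # assumed lower bound
_NUM_QUERY_SPLITS_MAX = 50000                  # assumed upper bound


def estimate_num_splits(estimated_size_bytes):
    """Derives a split count from the estimated result size, clamped to bounds."""
    splits = estimated_size_bytes // _DEFAULT_BUNDLE_SIZE_BYTES
    return max(_NUM_QUERY_SPLITS_MIN, min(_NUM_QUERY_SPLITS_MAX, splits))
```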

class apache_beam.io.gcp.datastore.v1.datastoreio.WriteToDatastore(project)[source]

Bases: apache_beam.io.gcp.datastore.v1.datastoreio._Mutate

A PTransform to write a PCollection[Entity] to Cloud Datastore.

static to_upsert_mutation(entity)[source]
class apache_beam.io.gcp.datastore.v1.datastoreio.DeleteFromDatastore(project)[source]

Bases: apache_beam.io.gcp.datastore.v1.datastoreio._Mutate

A PTransform to delete a PCollection[Key] from Cloud Datastore.

static to_delete_mutation(key)[source]
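The two mutation helpers above wrap each element in the appropriate Datastore mutation before it is committed. A minimal plain-Python stand-in for the two shapes, using dicts in place of the real Mutation protobuf (the dict layout is an assumption for illustration, not the actual API):

```python
def to_upsert_mutation(entity):
    """Wraps an entity in an upsert (insert-or-update) mutation."""
    return {'upsert': entity}


def to_delete_mutation(key):
    """Wraps a key in a delete mutation."""
    return {'delete': key}
```

WriteToDatastore applies the upsert form to every Entity in its input PCollection, while DeleteFromDatastore applies the delete form to every Key.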