apache_beam.io.mongodbio module

This module implements IO classes to read and write data on MongoDB.

Read from MongoDB

ReadFromMongoDB is a PTransform that reads from a configured MongoDB source and returns a PCollection of dicts representing MongoDB documents. To configure the MongoDB source, provide the URI of the MongoDB server, the database name, and the collection name.

Example usage:

pipeline | ReadFromMongoDB(uri='mongodb://localhost:27017',
                           db='testdb',
                           coll='input')

To read from MongoDB Atlas, set the bucket_auto option to use the $bucketAuto MongoDB aggregation instead of the splitVector command. splitVector is a high-privilege function that cannot be assigned to any user in Atlas.

Example usage:

pipeline | ReadFromMongoDB(uri='mongodb+srv://user:pwd@cluster0.mongodb.net',
                           db='testdb',
                           coll='input',
                           bucket_auto=True)
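
As a rough illustration of what a $bucketAuto aggregation looks like, the sketch below builds a one-stage aggregation pipeline that asks MongoDB to divide a collection into a given number of ranges by _id. The helper name bucket_auto_pipeline and the bucket count are hypothetical, and the source does not show how ReadFromMongoDB builds its own pipeline internally, so treat this as an approximation of the idea rather than the module's code.

```python
def bucket_auto_pipeline(num_buckets):
    """Build an aggregation pipeline that groups documents into ranges by _id.

    Hypothetical helper for illustration; not part of apache_beam.
    """
    if num_buckets < 1:
        raise ValueError("num_buckets must be at least 1")
    # $bucketAuto distributes documents into the requested number of buckets,
    # choosing the boundaries of the groupBy expression automatically.
    return [{"$bucketAuto": {"groupBy": "$_id", "buckets": num_buckets}}]


pipeline_stages = bucket_auto_pipeline(4)
print(pipeline_stages)
```

With pymongo, a list like this would be passed to Collection.aggregate. Each bucket document that MongoDB returns carries the min and max boundary of the range under _id, plus a count, which is the kind of information a reader needs to split a collection into independent ranges.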

Write to MongoDB

WriteToMongoDB is a PTransform that writes MongoDB documents to the configured sink. The write is performed through a MongoDB bulk_write of ReplaceOne operations. If a document's _id field already exists in the MongoDB collection, the existing document is overwritten; otherwise, a new document is inserted.

Example usage:

pipeline | WriteToMongoDB(uri='mongodb://localhost:27017',
                          db='testdb',
                          coll='output',
                          batch_size=10)
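
The overwrite-or-insert behavior described above can be modeled without a database. The sketch below uses a plain dict keyed by _id as a stand-in for a MongoDB collection and assumes that batch_size is the number of documents sent per bulk write. The helpers batches and bulk_replace are hypothetical names for illustration and are not part of apache_beam or pymongo.

```python
from itertools import islice


def batches(docs, batch_size):
    """Yield successive lists of at most batch_size documents."""
    it = iter(docs)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def bulk_replace(collection, batch):
    """Model a bulk write of ReplaceOne operations keyed on _id.

    An existing _id is overwritten; an unseen _id is inserted.
    """
    for doc in batch:
        collection[doc["_id"]] = dict(doc)


# In-memory stand-in for a collection that already holds one document.
collection = {1: {"_id": 1, "name": "old"}}
docs = [
    {"_id": 1, "name": "new"},
    {"_id": 2, "name": "added"},
    {"_id": 3, "name": "third"},
]

batch_sizes = []
for batch in batches(docs, batch_size=2):
    batch_sizes.append(len(batch))
    bulk_replace(collection, batch)

print(batch_sizes)
print(collection)
```

Because every operation is keyed on _id, applying the same batch twice leaves the collection in the same state, which is what makes the write safe to retry.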

No backward compatibility guarantees. Everything in this module is experimental.

apache_beam.io.mongodbio.ReadFromMongoDB(uri='mongodb://localhost:27017', db=None, coll=None, filter=None, projection=None, extra_client_params=None, bucket_auto=False)

A PTransform to read MongoDB documents into a PCollection.

apache_beam.io.mongodbio.WriteToMongoDB(uri='mongodb://localhost:27017', db=None, coll=None, batch_size=100, extra_client_params=None)

WriteToMongoDB is a PTransform that writes a PCollection of MongoDB documents to the configured MongoDB server.

To make document writes idempotent, so that bundles can be retried without creating duplicates, the PTransform adds two transformations before the final write stage: a GenerateId transform and a Reshuffle transform:

              -----------------------------------------------
Pipeline -->  |GenerateId --> Reshuffle --> WriteToMongoSink|
              -----------------------------------------------
                              (WriteToMongoDB)

The GenerateId transform adds a random, unique _id field to each document that does not already have one, using the same format as the MongoDB default. The Reshuffle transform ensures that no fusion happens between GenerateId and the final write stage, so that the set of documents and their unique IDs are not regenerated if the final write step is retried after a failure. This prevents duplicate writes of the same document with different unique IDs.
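
The effect of keeping ID generation separate from the write can be shown with a small in-memory model. In the sketch below, generate_id and write are hypothetical stand-ins for the GenerateId and write stages, a dict stands in for the collection, and secrets.token_hex(12) is an assumption used in place of a real MongoDB ObjectId (it produces 24 hex characters, the same length as an ObjectId string, but is purely random).

```python
import secrets


def generate_id(doc):
    """Return the document with an _id, adding a random one only if missing."""
    if "_id" in doc:
        return doc
    result = dict(doc)
    # Stand-in for a MongoDB ObjectId: 12 random bytes as 24 hex characters.
    result["_id"] = secrets.token_hex(12)
    return result


def write(collection, docs):
    """Model the final write stage: replace-or-insert keyed on _id."""
    for doc in docs:
        collection[doc["_id"]] = doc


raw = [{"name": "a"}, {"name": "b"}, {"_id": "fixed", "name": "c"}]

# IDs are materialized once, then the write is retried with the same IDs.
# This models what the Reshuffle between the two stages is meant to guarantee.
with_ids = [generate_id(d) for d in raw]
stable = {}
write(stable, with_ids)
write(stable, with_ids)  # retry after a simulated failure

# IDs are regenerated on every attempt, as could happen if the stages fused.
fused = {}
write(fused, [generate_id(d) for d in raw])
write(fused, [generate_id(d) for d in raw])  # retry regenerates the IDs

print(len(stable), len(fused))
```

In the first case the retry rewrites the same three documents. In the second, the two documents without an _id receive fresh IDs on the retry and are stored twice, which is the duplication the Reshuffle step is there to prevent.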