Overview: Developing a new I/O connector
A guide for users who need to connect to a data store that isn’t supported by the Built-in I/O connectors
To connect to a data store that isn’t supported by Beam’s existing I/O connectors, you must create a custom I/O connector. A connector usually consists of a source and a sink. All Beam sources and sinks are composite transforms; however, the implementation of your custom I/O depends on your use case. Here are the recommended steps to get started:
Read this overview and choose your implementation. You can email the Beam dev mailing list with any questions you might have. In addition, you can check if anyone else is working on the same I/O connector.
If you plan to contribute your I/O connector to the Beam community, see the Apache Beam contribution guide.
Read the PTransform style guide for additional style recommendations.
For bounded (batch) sources, there are currently two options for creating a Beam source:
- Use Splittable DoFn.
- Use ParDo and GroupByKey.
Splittable DoFn is the recommended option, as it’s the most recent source framework for both
bounded and unbounded sources. It is meant to replace the older Source APIs (BoundedSource and UnboundedSource)
in the new system. Read the
Splittable DoFn Programming Guide to learn how to write a
Splittable DoFn. For more information, see the
roadmap for multi-SDK connector efforts.
For Java and Python unbounded (streaming) sources, you must use
Splittable DoFn, which
supports features that are useful for streaming pipelines, including checkpointing, controlling
watermarks, and tracking backlog.
When to use the Splittable DoFn interface
If you are not sure whether to use
Splittable DoFn, feel free to email the
Beam dev mailing list and we can discuss the specific pros and cons of your case.
In some cases, implementing a
Splittable DoFn might be necessary or result in better performance:
- Unbounded sources: ParDo does not work for reading from unbounded sources. ParDo does not support checkpointing or mechanisms like de-duping that are useful for streaming data sources.
- Progress and size estimation: ParDo can’t provide hints to runners about progress or the size of data they are reading. Without size estimation of the data or progress on your read, the runner doesn’t have any way to guess how large your read will be. Therefore, if the runner attempts to dynamically allocate workers, it does not have any clues as to how many workers you might need for your pipeline.
- Dynamic work rebalancing: ParDo does not support dynamic work rebalancing, which is used by some readers to improve the processing speed of jobs. Depending on your data source, dynamic work rebalancing might not be possible.
- Splitting initially to increase parallelism: ParDo does not have the ability to perform initial splitting.
For example, a Splittable DoFn is a good fit if you’d like to read from a new file format that contains many records per file, or from a key-value store that supports read operations in sorted key order.
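To make checkpointing, progress reporting, and dynamic splitting more concrete, the following plain-Python sketch models how a splittable read might track a range of record offsets. It is a toy model and not Beam's API: ToyRange, ToyOffsetTracker, and read_range are hypothetical names chosen for illustration, and they only imitate the claim-and-split idea behind a Splittable DoFn.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class ToyRange:
    """Half-open range [start, stop) of record offsets (hypothetical helper)."""
    start: int
    stop: int


class ToyOffsetTracker:
    """Toy model of a restriction tracker; NOT Beam's RestrictionTracker."""

    def __init__(self, restriction: ToyRange):
        self.restriction = restriction
        self.next_offset = restriction.start  # first offset not yet claimed

    def try_claim(self, offset: int) -> bool:
        """Claim an offset before reading it; False means stop reading."""
        if offset >= self.restriction.stop:
            return False
        self.next_offset = offset + 1
        return True

    def progress(self) -> Tuple[int, int]:
        """Return (completed, remaining) so a runner could estimate the work."""
        done = self.next_offset - self.restriction.start
        return done, self.restriction.stop - self.next_offset

    def try_split(self, fraction: float) -> Optional[ToyRange]:
        """Keep `fraction` of the unclaimed work and hand back the rest."""
        remaining = self.restriction.stop - self.next_offset
        split_at = self.next_offset + int(remaining * fraction)
        if split_at >= self.restriction.stop:
            return None  # nothing left to give away
        residual = ToyRange(split_at, self.restriction.stop)
        self.restriction = ToyRange(self.restriction.start, split_at)
        return residual


def read_range(records: Sequence, tracker: ToyOffsetTracker) -> List:
    """Read records by offset, claiming each offset before reading it."""
    out = []
    offset = tracker.next_offset
    while tracker.try_claim(offset):
        out.append(records[offset])
        offset += 1
    return out
```

In this model, a split at fraction 0 behaves like a checkpoint: the reader keeps what it has already claimed, and the residual range can be resumed later or handed to another worker.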
I/O examples using SDFs
- Kafka: An I/O connector for Apache Kafka (an open-source distributed event streaming platform).
- Watch: Uses a polling function producing a growing set of outputs for each input until a per-input termination condition is met.
- Parquet: An I/O connector for Apache Parquet (an open-source columnar storage format).
- HL7v2: An I/O connector for HL7v2 messages (a clinical messaging format that provides data about events that occur inside an organization), which is part of Google’s Cloud Healthcare API.
- BoundedSource wrapper: A wrapper which converts an existing BoundedSource implementation to a splittable DoFn.
- UnboundedSource wrapper: A wrapper which converts an existing UnboundedSource implementation to a splittable DoFn.
- BoundedSourceWrapper: A wrapper which converts an existing BoundedSource implementation to a splittable DoFn.
Using ParDo and GroupByKey
For data stores or file types where the data can be read in parallel, you can think of the process as a mini-pipeline. This often consists of two steps:
- Splitting the data into parts to be read in parallel
- Reading from each of those parts
Each of those steps will be a ParDo, with a GroupByKey in between. The
GroupByKey is an implementation detail, but for most runners
it allows the runner to use different numbers of workers in some situations:
- Determining how to split up the data to be read into chunks
- Reading data, which often benefits from more workers
GroupByKey also allows dynamic work rebalancing to happen on
runners that support the feature.
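The split, shuffle, and read steps described above can be sketched in plain Python. This is a conceptual model and not Beam code: DATA_STORE, split_source, shuffle, read_part, and run_read are hypothetical names, and the shuffle function only imitates the redistribution across workers that a GroupByKey makes possible for a runner.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Chunk = Tuple[int, int]  # half-open (start, stop) index range

# Hypothetical data store: ten records addressable by index.
DATA_STORE = [f"record-{i}" for i in range(10)]


def split_source(total: int, chunk_size: int) -> Iterator[Chunk]:
    """Step 1 (first ParDo): emit chunks that can be read independently."""
    for start in range(0, total, chunk_size):
        yield start, min(start + chunk_size, total)


def shuffle(chunks: Iterable[Chunk], num_workers: int) -> Dict[int, List[Chunk]]:
    """Stand-in for GroupByKey: key each chunk, then group chunks by key."""
    grouped = defaultdict(list)
    for index, chunk in enumerate(chunks):
        grouped[index % num_workers].append(chunk)
    return dict(grouped)


def read_part(chunk: Chunk) -> Iterator[str]:
    """Step 2 (second ParDo): read the records that belong to one chunk."""
    start, stop = chunk
    yield from DATA_STORE[start:stop]


def run_read(chunk_size: int = 3, num_workers: int = 2) -> List[str]:
    """Run both steps; each simulated worker reads only its own chunks."""
    assignments = shuffle(split_source(len(DATA_STORE), chunk_size), num_workers)
    records: List[str] = []
    for worker in sorted(assignments):
        for chunk in assignments[worker]:
            records.extend(read_part(chunk))
    return records
```

The cheap splitting step runs once, while the expensive reading step is spread over however many workers the grouping step assigns chunks to.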
Here are some examples of read transform implementations that use the “reading as a mini-pipeline” model when data can be read in parallel:
Reading from a file glob: For example, reading all files in “~/data/**”.
- Get File Paths ParDo: As input, take in a file glob. Produce a PCollection of strings, each of which is a file path.
- Reading ParDo: Given the PCollection of file paths, read each one, producing a PCollection of records.
Reading from a NoSQL database (such as Apache HBase): These databases often allow reading from ranges in parallel.
- Determine Key Ranges ParDo: As input, receive connection information for the database and the key range to read from. Produce a PCollection of key ranges that can be read in parallel efficiently.
- Read Key Range ParDo: Given the PCollection of key ranges, read the key range, producing a PCollection of records.
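As an illustration of the key-range example, here is a minimal plain-Python sketch of the logic the two ParDo steps might contain. It makes several assumptions: the store is modeled as an in-memory dict with string keys, the boundaries list stands in for whatever split points a real database exposes, and determine_key_ranges and read_key_range are hypothetical helper names.

```python
from typing import Dict, Iterator, List, Tuple

KeyRange = Tuple[str, str]  # half-open [start, stop) in sorted key order


def determine_key_ranges(start: str, stop: str, boundaries: List[str]) -> List[KeyRange]:
    """Split [start, stop) at every boundary key that falls strictly inside it."""
    cuts = [start] + [b for b in sorted(boundaries) if start < b < stop] + [stop]
    return list(zip(cuts[:-1], cuts[1:]))


def read_key_range(store: Dict[str, int], key_range: KeyRange) -> Iterator[Tuple[str, int]]:
    """Yield the (key, value) pairs of one range in sorted key order."""
    low, high = key_range
    for key in sorted(store):
        if low <= key < high:
            yield key, store[key]


# Usage: three ranges that could each be read by a different worker.
store = {"a": 1, "c": 2, "f": 3, "m": 4, "t": 5}
ranges = determine_key_ranges("a", "z", boundaries=["g", "n"])
records = [pair for key_range in ranges for pair in read_key_range(store, key_range)]
```

Because the ranges are half-open and contiguous, every key in the requested span is read exactly once, whichever worker reads each range.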
For data stores or files where reading cannot occur in parallel, reading is a
simple task that can be accomplished with a single ParDo. For example:
- Reading from a database query: Traditional SQL database queries often can only be read in sequence. In this case, the ParDo would establish a connection to the database and read batches of records, producing a PCollection of those records.
- Reading from a gzip file: A gzip file must be read in order, so the read cannot be parallelized. In this case, the ParDo would open the file and read in sequence, producing a PCollection of records from the file.
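The gzip case can be sketched with the Python standard library alone. The function read_gzip_records below shows roughly what the body of such a single ParDo might do. The names read_gzip_records, write_gzip_records, and demo are hypothetical, and the sketch assumes one newline-terminated text record per line.

```python
import gzip
import os
import tempfile
from typing import Iterable, Iterator, List


def read_gzip_records(path: str) -> Iterator[str]:
    """Yield records in file order; the stream is decompressed sequentially."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


def write_gzip_records(path: str, records: Iterable[str]) -> None:
    """Helper used only to create sample input for the demo."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(record + "\n")


def demo() -> List[str]:
    """Write a small gzip file to a temporary directory and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt.gz")
        write_gzip_records(path, ["alpha", "beta", "gamma"])
        return list(read_gzip_records(path))
```

Yielding records one at a time keeps memory use flat for large files, which matters because a single worker has to read the whole file.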
To create a Beam sink, we recommend that you use a
ParDo that writes the
received records to the data store. To develop more complex sinks (for example,
to support data de-duplication when failures are retried by a runner), use
ParDo, GroupByKey, and other available Beam transforms.
Many data services are optimized to write batches of elements at a time,
so it may make sense to group the elements into batches before writing.
Persistent connections can also be initialized once in a DoFn’s setup or start-bundle
method (for example, @Setup or @StartBundle in the Java SDK) rather than upon the receipt of every element.
In a large-scale, distributed system, work can
fail or be retried, so it is preferable to
make the external interactions idempotent when possible.
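The three recommendations above (batching, reusing a connection, and idempotent writes) can be combined in one sketch. The class below is plain Python shaped like a DoFn. In a real pipeline it would subclass apache_beam.DoFn, whose Python lifecycle hooks use the same method names (setup, start_bundle, process, finish_bundle, teardown). The class name BatchingUpsertWriter, the records table, and the use of SQLite as the data store are assumptions made for illustration.

```python
import sqlite3
from typing import List, Optional, Tuple


class BatchingUpsertWriter:
    """Sketch of a sink DoFn; a real one would subclass apache_beam.DoFn."""

    def __init__(self, db_path: str, batch_size: int = 100):
        self.db_path = db_path
        self.batch_size = batch_size
        self.conn: Optional[sqlite3.Connection] = None
        self.buffer: List[Tuple[str, str]] = []

    def setup(self) -> None:
        # Open the connection once per instance, not once per element.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, value TEXT)")

    def start_bundle(self) -> None:
        self.buffer = []

    def process(self, element: Tuple[str, str]) -> None:
        # Buffer elements and write them in batches.
        self.buffer.append(element)
        if len(self.buffer) >= self.batch_size:
            self._flush()

    def finish_bundle(self) -> None:
        self._flush()  # write whatever is left in the buffer

    def teardown(self) -> None:
        if self.conn is not None:
            self.conn.close()
            self.conn = None

    def _flush(self) -> None:
        if not self.buffer:
            return
        # Keyed upsert: a retried batch overwrites rows instead of duplicating them.
        self.conn.executemany(
            "INSERT OR REPLACE INTO records (id, value) VALUES (?, ?)", self.buffer)
        self.conn.commit()
        self.buffer = []
```

Because each row is keyed by a deterministic id, replaying a bundle after a failure leaves the table in the same state as a single successful write.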
For file-based sinks, you can use the
FileBasedSink abstraction that is
provided by both the Java and Python SDKs. Beam’s
FileSystems utility classes
can also be useful for reading and writing files. See the language-specific
implementation guides for Java and Python for more details.
Last updated on 2024/02/23