Beam YAML Transform Index

AnomalyDetection

Performs anomaly detection on a PCollection of data.

This PTransform applies an AnomalyDetector or EnsembleAnomalyDetector to the input data and returns a PCollection of AnomalyResult objects.

Examples

# Run a single anomaly detector
p | AnomalyDetection(ZScore(features=["x1"]))

# Run an ensemble anomaly detector
sub_detectors = [ZScore(features=["x1"]), IQR(features=["x2"])]
p | AnomalyDetection(EnsembleAnomalyDetector(
    sub_detectors, aggregation_strategy=AnyVote()))

Configuration

detector ? (Optional) : The AnomalyDetector or EnsembleAnomalyDetector to use.

Usage

type: AnomalyDetection
input: ...
config:
  detector: detector

AssertEqual

Asserts that the input contains exactly the elements provided.

This is primarily used for testing; it will cause the entire pipeline to fail if the input to this transform is not exactly the set of elements given in the config parameter.

As with Create, YAML/JSON-style mappings are interpreted as Beam rows, e.g.

type: AssertEqual
input: SomeTransform
config:
  elements:
     - {a: 0, b: "foo"}
     - {a: 1, b: "bar"}

would ensure that SomeTransform produced exactly two elements with values (a=0, b="foo") and (a=1, b="bar") respectively.

Configuration

elements Array[?] : The set of elements that should belong to the PCollection. YAML/JSON-style mappings will be interpreted as Beam rows.

Usage

type: AssertEqual
input: ...
config:
  elements:
  - element
  - element
  - ...

AssignTimestamps

Assigns a new timestamp each element of its input.

This can be useful when reading records that have the timestamp embedded in them, for example with various file types or other sources that by default set all timestamps to the infinite past.

Note that the timestamp should only be set forward, as setting it backwards may not cause it to hold back an already advanced watermark and the data could become droppably late.

Supported languages: generic, javascript, python.

Configuration

timestamp ? (Optional) : A field, callable, or expression giving the new timestamp.
language string (Optional) : The language of the timestamp expression.
error_handling Row (Optional) : Whether and how to handle errors during timestamp evaluation.

Row fields:
- output string : Name to use for the output error collection

Usage

type: AssignTimestamps
input: ...
config:
  timestamp: timestamp
  language: "language"
  error_handling:
    output: "output"

Combine

Groups and combines records sharing common fields.

Built-in combine functions are sum, max, min, all, any, mean, count, group, concat but custom aggregation functions can be used as well.

See also the documentation on YAML Aggregation.

Supported languages: calcite, generic, javascript, python, sql.

Configuration

group_by Array[string] : The field(s) to aggregate on.
combine Map[string, Map[string, ?]] : The aggregation function to use.
language string (Optional) : The language used to define (and execute) the custom callables in combine. Defaults to generic.

Usage

type: Combine
input: ...
config:
  group_by:
  - "group_by"
  - "group_by"
  - ...
  combine:
    a:
      a: combine_value_a_value_a
      b: combine_value_a_value_b
      c: ...
    b:
      a: combine_value_b_value_a
      b: combine_value_b_value_b
      c: ...
    c: ...
  language: "language"

Create

Creates a collection containing a specified set of elements.

This transform always produces schema'd data. For example

type: Create
config:
  elements: [1, 2, 3]

will result in an output with three elements with a schema of Row(element=int) whereas YAML/JSON-style mappings will be interpreted directly as Beam rows, e.g.

type: Create
config:
  elements:
     - {first: 0, second: {str: "foo", values: [1, 2, 3]}}
     - {first: 1, second: {str: "bar", values: [4, 5, 6]}}

will result in a schema of the form (int, Row(string, list[int])).

This can also be expressed as YAML

type: Create
config:
  elements:
    - first: 0
      second:
        str: "foo"
         values: [1, 2, 3]
    - first: 1
      second:
        str: "bar"
         values: [4, 5, 6]

Configuration

elements Array[?] : The set of elements that should belong to the PCollection. YAML/JSON-style mappings will be interpreted as Beam rows. Primitives will be mapped to rows with a single "element" field.
reshuffle boolean (Optional) : (optional) Whether to introduce a reshuffle (to possibly redistribute the work) if there is more than one element in the collection. Defaults to True.

Usage

type: Create
config:
  elements:
  - element
  - element
  - ...
  reshuffle: true|false

Enrichment

The Enrichment transform allows one to dynamically enhance elements in a pipeline by performing key-value lookups against external services like APIs or databases.

Example using BigTable:

- type: Enrichment
  config:
    enrichment_handler: 'BigTable'
    handler_config:
      project_id: 'apache-beam-testing'
      instance_id: 'beam-test'
      table_id: 'bigtable-enrichment-test'
      row_key: 'product_id'
    timeout: 30

For more information on Enrichment, see the Beam docs.

Configuration

enrichment_handler string : Specifies the source from where data needs to be extracted into the pipeline for enriching data. One of "BigQuery", "BigTable", "FeastFeatureStore" or "VertexAIFeatureStore".
handler_config Map[string, ?] : Specifies the parameters for the respective enrichment_handler in a YAML/JSON format. To see the full set of handler_config parameters, see their corresponding doc pages:
timeout double (Optional) : Timeout for source requests in seconds. Defaults to 30 seconds.

Usage

type: Enrichment
input: ...
config:
  enrichment_handler: "enrichment_handler"
  handler_config:
    a: handler_config_value_a
    b: handler_config_value_b
    c: ...
  timeout: timeout

Explode

Explodes (aka unnest/flatten) one or more fields producing multiple rows.

Given one or more fields of iterable type, produces multiple rows, one for each value of that field. For example, a row of the form ('a', [1, 2, 3]) would expand to ('a', 1), ('a', 2'), and ('a', 3) when exploded on the second field.

This is akin to a FlatMap when paired with the MapToFields transform.

See more complete documentation on YAML Mapping Functions.

Configuration

fields ? (Optional) : The list of fields to expand.
cross_product boolean (Optional) : If multiple fields are specified, indicates whether the full cross-product of combinations should be produced, or if the first element of the first field corresponds to the first element of the second field, etc. For example, the row (['a', 'b'], [1, 2]) would expand to the four rows ('a', 1), ('a', 2), ('b', 1), and ('b', 2) when cross_product is set to true but only the two rows ('a', 1) and ('b', 2) when it is set to false. Only meaningful (and required) if multiple rows are specified.
error_handling Row (Optional) : Whether and how to handle errors during iteration.

Row fields:
- output string : Name to use for the output error collection

Usage

type: Explode
input: ...
config:
  fields: fields
  cross_product: true|false
  error_handling:
    a: error_handling_value_a
    b: error_handling_value_b
    c: ...

ExtractWindowingInfo

Extracts the implicit windowing information from an element and makes it explicit as field(s) in the element itself.

The following windowing parameter values are supported:

timestamp: The event timestamp of the current element.
window_start: The start of the window iff it is an interval window.
window_end: The (exclusive) end of the window.
window_string: The string representation of the window.
window_type: The type of the window as a string.
winodw_object: The actual window object itself, as a Java or Python object.
pane_info: A schema'd representation of the current pane info, including its index, whether it was the last firing, etc.

As a convenience, a list rather than a mapping of fields may be provided, in which case the fields will be named according to the requested values.

Configuration

fields Map[string, string] (Optional) : A mapping of new field names to various windowing parameters, as documented above. If omitted, defaults to [timestamp, window_start, window_end].

Usage

type: ExtractWindowingInfo
input: ...
config:
  fields:
    a: "fields_value_a"
    b: "fields_value_b"
    c: ...

Filter

Keeps only records that satisfy the given criteria.

See more complete documentation on YAML Filtering.

Supported languages: calcite, generic, java, javascript, python, sql.

Configuration

language string (Optional)
keep Row

Row fields:
- callable string (Optional) : Source code of a public class implementing Function for some schema-compatible T.
- expression string (Optional) : Source code of a java expression in terms of the schema fields.
- name string (Optional) : Fully qualified name of either a class implementing Function (e.g. com.pkg.MyFunction), or a method taking a single Row argument (e.g. com.pkg.MyClass::methodName). If a method is passed, it must either be static or belong to a class with a public nullary constructor.
- path string (Optional) : Path to a jar file implementing the function referenced in name.
error_handling Row (Optional) : This option specifies whether and where to output error rows.

Row fields:
- output string : Name to use for the output error collection

Usage

type: Filter
input: ...
config:
  language: "language"
  keep:
    callable: "callable"
    expression: "expression"
    name: "name"
    path: "path"
  error_handling:
    output: "output"

Flatten

Flattens multiple PCollections into a single PCollection.

The elements of the resulting PCollection will be the (disjoint) union of all the elements of all the inputs.

Note that in YAML transforms can always take a list of inputs which will be implicitly flattened.

Configuration

No configuration parameters.

Usage

type: Flatten
input: ...

Join

Joins two or more inputs using a specified condition.

For example

type: Join
input:
  input1: SomeTransform
  input2: AnotherTransform
  input3: YetAnotherTransform
config:
  type: inner
  equalities:
    - input1: colA
      input2: colB
    - input2: colX
      input3: colY
  fields:
    input1: [colA, colB, colC]
    input2: {new_name: colB}

would perform an inner join on the three inputs satisfying the constraints that input1.colA = input2.colB and input2.colX = input3.colY emitting rows with colA, colB and colC from input1, the values of input2.colB as a field called new_name, and all the fields from input3.

Configuration

equalities ? (Optional) : The condition to join on. A list of sets of columns that should be equal to fulfill the join condition. For the simple scenario of joining on the same column across all inputs where the column name is the same, one can specify the column name as the equality rather than having to list it for every input.
type ? (Optional) : The type of join. Could be a string value in ["inner", "left", "right", "outer"] that specifies the type of join to be performed. For scenarios with multiple inputs to join where different join types are desired, specify the inputs to be outer joined. For example, {outer: [input1, input2]} means that input1 and input2 will be outer joined using the conditions specified, while other inputs will be inner joined.
fields Map[string, ?] (Optional) : The fields to be outputted. A mapping with the input alias as the key and the list of fields in the input to be outputted. The value in the map can either be a dictionary with the new field name as the key and the original field name as the value (e.g new_field_name: field_name), or a list of the fields to be outputted with their original names (e.g [col1, col2, col3]), or an '*' indicating all fields in the input will be outputted. If not specified, all fields from all inputs will be outputted.

Usage

type: Join
input: ...
config:
  equalities: equalities
  type: type
  fields:
    a: fields_value_a
    b: fields_value_b
    c: ...

LogForTesting

Logs each element of its input PCollection.

The output of this transform is a copy of its input for ease of use in chain-style pipelines.

Configuration

level string (Optional) : one of ERROR, INFO, or DEBUG, mapped to a corresponding language-specific logging level
prefix string (Optional) : an optional identifier that will get prepended to the element being logged
error_handling Row (Optional) : This option specifies whether and where to output error rows.

Row fields:
- output string : Name to use for the output error collection

Usage

type: LogForTesting
input: ...
config:
  level: "level"
  prefix: "prefix"
  error_handling:
    output: "output"

MLTransform

Configuration

write_artifact_location string (Optional)
read_artifact_location string (Optional)
transforms Array[?] (Optional)

Usage

type: MLTransform
input: ...
config:
  write_artifact_location: "write_artifact_location"
  read_artifact_location: "read_artifact_location"
  transforms:
  - transforms
  - transforms
  - ...

MapToFields

Creates records with new fields defined in terms of the input fields.

See more complete documentation on YAML Mapping Functions.

Supported languages: calcite, generic, java, javascript, python, sql.

Configuration

fields Map[string, ?] : The output fields to compute, each mapping to the expression or callable that creates them.
append boolean (Optional) : Whether to append the created fields to the set of fields already present, outputting a union of both the new fields and the original fields for each record. Defaults to False.
drop Array[string] (Optional) : If append is true, enumerates a subset of fields from the original record that should not be kept
language string (Optional) : The language used to define (and execute) the expressions and/or callables in fields. Defaults to generic.
dependencies Array[string] (Optional) : An optional list of extra dependencies that are needed for these UDFs. The interpretation of these strings is language-dependent.
error_handling Row (Optional) : Whether and where to output records that throw errors when the above expressions are evaluated.

Row fields:
- output string : Name to use for the output error collection

Usage

type: MapToFields
input: ...
config:
  language: "language"
  append: true|false
  drop:
  - "drop"
  - "drop"
  - ...
  fields:
    a:
      name: "name"
      path: "path"
      expression: "expression"
      callable: "callable"
    b:
      name: "name"
      path: "path"
      expression: "expression"
      callable: "callable"
    c: ...
  error_handling:
    output: "output"

Partition

Splits an input into several distinct outputs.

Each input element will go to a distinct output based on the field or function given in the by configuration parameter.

Supported languages: generic, javascript, python.

Configuration

by ? (Optional) : A field, callable, or expression giving the destination output for this element. Should return a string that is a member of the outputs parameter. If unknown_output is also set, other returns values are accepted as well, otherwise an error will be raised.
outputs Array[string] : The set of outputs into which this input is being partitioned.
unknown_output string (Optional) : (Optional) If set, indicates a destination output for any elements that are not assigned an output listed in the outputs parameter.
error_handling Row (Optional) : (Optional) Whether and how to handle errors during partitioning.

Row fields:
- output string : Name to use for the output error collection
language string : The language of the by expression.

Usage

type: Partition
input: ...
config:
  by: by
  outputs:
  - "outputs"
  - "outputs"
  - ...
  unknown_output: "unknown_output"
  error_handling:
    a: error_handling_value_a
    b: error_handling_value_b
    c: ...
  language: "language"

PyTransform

A Python PTransform identified by fully qualified name.

This allows one to import, construct, and apply any Beam Python transform. This can be useful for using transforms that have not yet been exposed via a YAML interface. Note, however, that conversion may be required if this transform does not accept or produce Beam Rows.

For example

type: PyTransform
config:
   constructor: apache_beam.pkg.mod.SomeClass
   args: [1, 'foo']
   kwargs:
     baz: 3

can be used to access the transform apache_beam.pkg.mod.SomeClass(1, 'foo', baz=3).

See also the documentation on Inlining Python.

Configuration

constructor string : Fully qualified name of a callable used to construct the transform. Often this is a class such as apache_beam.pkg.mod.SomeClass but it can also be a function or any other callable that returns a PTransform.
args Array[?] (Optional) : A list of parameters to pass to the callable as positional arguments.
kwargs Map[string, ?] (Optional) : A list of parameters to pass to the callable as keyword arguments.

Usage

type: PyTransform
input: ...
config:
  constructor: "constructor"
  args:
  - arg
  - arg
  - ...
  kwargs:
    a: kwargs_value_a
    b: kwargs_value_b
    c: ...

RunInference

A transform that takes the input rows, containing examples (or features), for use on an ML model. The transform then appends the inferences (or predictions) for those examples to the input row.

A ModelHandler must be passed to the model_handler parameter. The ModelHandler is responsible for configuring how the ML model will be loaded and how input data will be passed to it. Every ModelHandler has a config tag, similar to how a transform is defined, where the parameters are defined.

For example:

- type: RunInference
  config:
    model_handler:
      type: ModelHandler
      config:
        param_1: arg1
        param_2: arg2
        ...

By default, the RunInference transform will return the input row with a single field appended named by the inference_tag parameter ("inference" by default) that contains the inference directly returned by the underlying ModelHandler, after any optional postprocessing.

For example, if the input had the following:

Row(question="What is a car?")

The output row would look like:

Row(question="What is a car?", inference=...)

where the inference tag can be overridden with the inference_tag parameter.

However, if one specified the following transform config:

- type: RunInference
  config:
    inference_tag: my_inference
    model_handler: ...

The output row would look like:

Row(question="What is a car?", my_inference=...)

See more complete documentation on the underlying RunInference transform.

Preprocessing input data

In most cases, the model will be expecting data in a particular data format, whether it be a Python Dict, PyTorch tensor, etc. However, the outputs of all built-in Beam YAML transforms are Beam Rows. To allow for transforming the Beam Row into a data format the model recognizes, each ModelHandler is equipped with a preprocessing parameter for performing necessary data preprocessing. It is possible for a ModelHandler to define a default preprocessing function, but in most cases, one will need to be specified by the caller.

For example, using callable:

pipeline:
  type: chain

  transforms:
    - type: Create
      config:
        elements:
          - question: "What is a car?"
          - question: "Where is the Eiffel Tower located?"

    - type: RunInference
      config:
        model_handler:
          type: ModelHandler
          config:
            param_1: arg1
            param_2: arg2
            preprocess:
              callable: 'lambda row: {"prompt": row.question}'
            ...

In the above example, the Create transform generates a collection of two Beam Row elements, each with a single field - "question". The model, however, expects a Python Dict with a single key, "prompt". In this case, we can specify a simple Lambda function (alternatively could define a full function), to map the data.

Postprocessing predictions

It is also possible to define a postprocessing function to postprocess the data output by the ModelHandler. See the documentation for the ModelHandler you intend to use (list defined below under model_handler parameter doc).

In many cases, before postprocessing, the object will be a PredictionResult. # pylint: disable=line-too-long This type behaves very similarly to a Beam Row and fields can be accessed using dot notation. However, make sure to check the docs for your ModelHandler to see which fields its PredictionResult contains or if it returns a different object altogether.

For example:

- type: RunInference
  config:
    model_handler:
      type: ModelHandler
      config:
        param_1: arg1
        param_2: arg2
        postprocess:
          callable: |
            def fn(x: PredictionResult):
              return beam.Row(x.example, x.inference, x.model_id)
        ...

The above example demonstrates converting the original output data type (in this case it is PredictionResult), and converts to a Beam Row, which allows for easier mapping in a later transform.

File-based pre/postprocessing functions

For both preprocessing and postprocessing, it is also possible to specify a Python UDF (User-defined function) file that contains the function. This is possible by specifying the path to the file (local file or GCS path) and the name of the function in the file.

For example:

- type: RunInference
  config:
    model_handler:
      type: ModelHandler
      config:
        param_1: arg1
        param_2: arg2
        preprocess:
          path: gs://my-bucket/path/to/preprocess.py
          name: my_preprocess_fn
        postprocess:
          path: gs://my-bucket/path/to/postprocess.py
          name: my_postprocess_fn
        ...

Configuration

model_handler Map[string, ?] : Specifies the parameters for the respective enrichment_handler in a YAML/JSON format. To see the full set of handler_config parameters, see their corresponding doc pages:
- VertexAIModelHandlerJSON # pylint: disable=line-too-long
inference_tag string (Optional) : The tag to use for the returned inference. Default is 'inference'.
inference_args Map[string, ?] (Optional) : Extra arguments for models whose inference call requires extra parameters. Make sure to check the underlying ModelHandler docs to see which args are allowed.

Usage

type: RunInference
input: ...
config:
  model_handler:
    a: model_handler_value_a
    b: model_handler_value_b
    c: ...
  inference_tag: "inference_tag"
  inference_args:
    a: inference_args_value_a
    b: inference_args_value_b
    c: ...

Sql

Configuration

Usage

type: Sql
input: ...
config: ...

StripErrorMetadata

Strips error metadata from outputs returned via error handling.

Generally the error outputs for transformations return information about the error encountered (e.g. error messages and tracebacks) in addition to the failing element itself. This transformation attempts to remove that metadata and returns the bad element alone which can be useful for re-processing.

For example, in the following pipeline snippet

- name: MyMappingTransform
  type: MapToFields
  input: SomeInput
  config:
    language: python
    fields:
      ...
    error_handling:
      output: errors

- name: RecoverOriginalElements
  type: StripErrorMetadata
  input: MyMappingTransform.errors

the output of RecoverOriginalElements will contain exactly those elements from SomeInput that failed to processes (whereas MyMappingTransform.errors would contain those elements paired with error information).

Note that this relies on the preceding transform actually returning the failing input in a schema'd way. Most built-in transformation follow the correct conventions.

Configuration

No configuration parameters.

Usage

type: StripErrorMetadata
input: ...

ValidateWithSchema

Validates each element of a PCollection against a json schema.

Configuration

schema Map[string, ?] : A json schema against which to validate each element.
error_handling Row (Optional) : Whether and how to handle errors during iteration. If this is not set, invalid elements will fail the pipeline, otherwise invalid elements will be passed to the specified error output along with information about how the schema was invalidated.

Row fields:
- output string : Name to use for the output error collection

Usage

type: ValidateWithSchema
input: ...
config:
  schema:
    a: schema_value_a
    b: schema_value_b
    c: ...
  error_handling:
    a: error_handling_value_a
    b: error_handling_value_b
    c: ...

WindowInto

A window transform assigning windows to each element of a PCollection.

The assigned windows will affect all downstream aggregating operations, which will aggregate by window as well as by key.

See the Beam documentation on windowing for more details.

Sizes, offsets, periods and gaps (where applicable) must be defined using a time unit suffix 'ms', 's', 'm', 'h' or 'd' for milliseconds, seconds, minutes, hours or days, respectively. If a time unit is not specified, it will default to 's'.

For example

windowing:
   type: fixed
   size: 30s

Note that any Yaml transform can have a windowing parameter, which is applied to its inputs (if any) or outputs (if there are no inputs) which means that explicit WindowInto operations are not typically needed.

Configuration

windowing ? (Optional) : the type and parameters of the windowing to perform

Usage

type: WindowInto
input: ...
config:
  windowing: windowing

ReadFromAvro

A PTransform for reading records from avro files.

Each record of the resulting PCollection will contain a single record read from a source. Records that are of simple types will be mapped to beam Rows with a single record field containing the records value. Records that are of Avro type RECORD will be mapped to Beam rows that comply with the schema contained in the Avro file that contains those records.

Configuration

path ? (Optional)

Usage

type: ReadFromAvro
config:
  path: path

WriteToAvro

A PTransform for writing avro files.

If the input has a schema, a corresponding avro schema will be automatically generated and used to write the output records.

Configuration

path ? (Optional)

Usage

type: WriteToAvro
input: ...
config:
  path: path

ReadFromBigQuery

Reads data from BigQuery.

Exactly one of table or query must be set. If query is set, neither row_restriction nor fields should be set.

Configuration

query string (Optional) : The SQL query to be executed to read from the BigQuery table.
table string (Optional) : The fully-qualified name of the BigQuery table to read from. Format: [${PROJECT}:]${DATASET}.${TABLE}
fields Array[string] (Optional) : Read only the specified fields (columns) from a BigQuery table. Fields may not be returned in the order specified. If no value is specified, then all fields are returned. Example: "col1, col2, col3"
row_restriction string (Optional) : Read only rows that match this filter, which must be compatible with Google standard SQL. This is not supported when reading via query.

Usage

type: ReadFromBigQuery
config:
  query: "query"
  table: "table"
  fields:
  - "field"
  - "field"
  - ...
  row_restriction: "row_restriction"

WriteToBigQuery

Writes data to BigQuery using the Storage Write API (https://cloud.google.com/bigquery/docs/write-api).

This expects a single PCollection of Beam Rows and outputs two dead-letter queues (DLQ) that contain failed rows. The first DLQ has tag [FailedRows] and contains the failed rows. The second DLQ has tag [FailedRowsWithErrors] and contains failed rows and along with their respective errors.

Configuration

table string : The bigquery table to write to. Format: [${PROJECT}:]${DATASET}.${TABLE}
create_disposition string (Optional) : Optional field that specifies whether the job is allowed to create new tables. The following values are supported: CREATE_IF_NEEDED (the job may create the table), CREATE_NEVER (the job must fail if the table does not exist already).
write_disposition string (Optional) : Specifies the action that occurs if the destination table already exists. The following values are supported: WRITE_TRUNCATE (overwrites the table data), WRITE_APPEND (append the data to the table), WRITE_EMPTY (job must fail if the table is not empty).
error_handling Row (Optional) : This option specifies whether and where to output unwritable rows.

Row fields:
- output string : Name to use for the output error collection
num_streams int32 (Optional) : Specifies the number of write streams that the Storage API sink will use. This parameter is only applicable when writing unbounded data.

Usage

type: WriteToBigQuery
input: ...
config:
  table: "table"
  create_disposition: "create_disposition"
  write_disposition: "write_disposition"
  error_handling:
    output: "output"
  num_streams: num_streams

ReadFromBigTable

Configuration

project string
instance string
table string
flatten boolean (Optional)

Usage

type: ReadFromBigTable
config:
  project: "project"
  instance: "instance"
  table: "table"
  flatten: true|false

WriteToBigTable

Configuration

project string
instance string
table string

Usage

type: WriteToBigTable
input: ...
config:
  project: "project"
  instance: "instance"
  table: "table"

ReadFromCsv

A PTransform for reading comma-separated values (csv) files into a PCollection.

Configuration

path string : The file path to read from. The path can contain glob characters such as * and ?.
delimiter ? (Optional)
comment ? (Optional)
filename_column string (Optional) : If not None, the name of the column to add to each record, containing the filename of the source file.

Usage

type: ReadFromCsv
config:
  path: "path"
  delimiter: delimiter
  comment: comment
  filename_column: "filename_column"

WriteToCsv

A PTransform for writing a schema'd PCollection as a (set of) comma-separated values (csv) files.

Configuration

path string : The file path to write to. The files written will begin with this prefix, followed by a shard identifier (see num_shards) according to the file_naming parameter.
delimiter ? (Optional)

Usage

type: WriteToCsv
input: ...
config:
  delimiter: "delimiter"
  path: "path"

ReadFromIceberg

Reads an Apache Iceberg table.

Configuration

table string : The identifier of the Apache Iceberg table. Example: "db.table1".
filter string (Optional) : SQL-like predicate to filter data at scan time. Example: "id > 5 AND status = 'ACTIVE'". Uses Apache Calcite syntax: https://calcite.apache.org/docs/reference.html
keep Array[string] (Optional) : A subset of column names to read exclusively. If null or empty, all columns will be read.
drop Array[string] (Optional) : A subset of column names to exclude from reading. If null or empty, all columns will be read.
catalog_name string (Optional) : The name of the catalog. Example: "local".
catalog_properties Map[string, string] (Optional) : A map of configuration properties for the Apache Iceberg catalog. The required properties depend on the catalog. For more information, see CatalogUtil in the Apache Iceberg documentation.
config_properties Map[string, string] (Optional) : An optional set of Hadoop configuration properties. For more information, see CatalogUtil in the Apache Iceberg documentation.

Usage

type: ReadFromIceberg
config:
  table: "table"
  filter: "filter"
  keep:
  - "keep"
  - "keep"
  - ...
  drop:
  - "drop"
  - "drop"
  - ...
  catalog_name: "catalog_name"
  catalog_properties:
    a: "catalog_properties_value_a"
    b: "catalog_properties_value_b"
    c: ...
  config_properties:
    a: "config_properties_value_a"
    b: "config_properties_value_b"
    c: ...

WriteToIceberg

Writes to an Apache Iceberg table.

See also the Apache Iceberg Beam documentation including the dynamic destinations section for use of the keep, drop, and only parameters.

Configuration

table string : The identifier of the Apache Iceberg table. Example: "db.table1".
catalog_name string (Optional) : The name of the catalog. Example: "local".
catalog_properties Map[string, string] (Optional) : A map of configuration properties for the Apache Iceberg catalog. The required properties depend on the catalog. For more information, see CatalogUtil in the Apache Iceberg documentation.
config_properties Map[string, string] (Optional) : An optional set of Hadoop configuration properties. For more information, see CatalogUtil in the Apache Iceberg documentation.
partition_fields Array[string] (Optional) : Fields used to create a partition spec that is applied when tables are created. For a field 'foo', the available partition transforms are:
- foo
- truncate(foo, N)
- bucket(foo, N)
- hour(foo)
- day(foo)
- month(foo)
- year(foo)
- void(foo) For more information on partition transforms, please visit https://iceberg.apache.org/spec/#partition-transforms.
table_properties Map[string, string] (Optional) : Iceberg table properties to be set on the table when it is created. For more information on table properties, please visit https://iceberg.apache.org/docs/latest/configuration/#table-properties.
triggering_frequency_seconds int64 (Optional) : For streaming write pipelines, the frequency at which the sink attempts to produce snapshots, in seconds.
keep Array[string] (Optional) : An optional list of field names to keep when writing to the destination. Other fields are dropped. Mutually exclusive with drop and only.
drop Array[string] (Optional) : An optional list of field names to drop before writing to the destination. Mutually exclusive with keep and only.
only string (Optional) : The name of exactly one field to keep as the top level record when writing to the destination. All other fields are dropped. This field must be of row type. Mutually exclusive with drop and keep.

Usage

type: WriteToIceberg
input: ...
config:
  table: "table"
  catalog_name: "catalog_name"
  catalog_properties:
    a: "catalog_properties_value_a"
    b: "catalog_properties_value_b"
    c: ...
  config_properties:
    a: "config_properties_value_a"
    b: "config_properties_value_b"
    c: ...
  partition_fields:
  - "partition_fields"
  - "partition_fields"
  - ...
  table_properties:
    a: "table_properties_value_a"
    b: "table_properties_value_b"
    c: ...
  triggering_frequency_seconds: triggering_frequency_seconds
  keep:
  - "keep"
  - "keep"
  - ...
  drop:
  - "drop"
  - "drop"
  - ...
  only: "only"

ReadFromJdbc

Read from a JDBC source using a SQL query or by directly accessing a single table.

This transform can be used to read from a JDBC source using either a given JDBC driver jar and class name, or by using one of the default packaged drivers given a jdbc_type.

Using a default driver

This transform comes packaged with drivers for several popular JDBC distributions. The following distributions can be declared as the jdbc_type: mysql, oracle, postgres, mssql.

For example, reading a MySQL source using a SQL query:

- type: ReadFromJdbc
  config:
    jdbc_type: mysql
    url: "jdbc:mysql://my-host:3306/database"
    query: "SELECT * FROM table"

Note: See the following transforms which are built on top of this transform and simplify this logic for several popular JDBC distributions:

ReadFromMySql
ReadFromPostgres
ReadFromOracle
ReadFromSqlServer

Declaring custom JDBC drivers

If reading from a JDBC source not listed above, or if it is necessary to use a custom driver not packaged with Beam, one must define a JDBC driver and class name.

For example, reading a MySQL source table:

- type: ReadFromJdbc
  config:
    driver_jars: "path/to/some/jdbc.jar"
    driver_class_name: "com.mysql.jdbc.Driver"
    url: "jdbc:mysql://my-host:3306/database"
    table: "my-table"

Connection Properties

Connection properties are properties sent to the Driver used to connect to the JDBC source. For example, to set the character encoding to UTF-8, one could write:

- type: ReadFromJdbc
  config:
    connectionProperties: "characterEncoding=UTF-8;"
    ...

All properties should be semi-colon-delimited (e.g. "key1=value1;key2=value2;")

Configuration

url string : Connection URL for the JDBC source.
connection_init_sql Array[string] (Optional) : Sets the connection init sql statements used by the Driver. Only MySQL and MariaDB support this.
connection_properties string (Optional) : Used to set connection properties passed to the JDBC driver not already defined as standalone parameter (e.g. username and password can be set using parameters above accordingly). Format of the string must be "key1=value1;key2=value2;".
disable_auto_commit boolean (Optional) : Whether to disable auto commit on read. Defaults to true if not provided. The need for this config varies depending on the database platform. Informix requires this to be set to false while Postgres requires this to be set to true.
driver_class_name string (Optional) : Name of a Java Driver class to use to connect to the JDBC source. For example, "com.mysql.jdbc.Driver".
driver_jars string (Optional) : Comma separated path(s) for the JDBC driver jar(s). This can be a local path or GCS (gs://) path.
fetch_size int32 (Optional) : This method is used to override the size of the data that is going to be fetched and loaded in memory per every database call. It should ONLY be used if the default value throws memory errors.
output_parallelization boolean (Optional) : Whether to reshuffle the resulting PCollection so results are distributed to all workers.
password string (Optional) : Password for the JDBC source.
query string (Optional) : SQL query used to query the JDBC source.
table string (Optional) : Name of the table to read from.
partition_column string (Optional) : Name of a column of numeric type that will be used for partitioning.
num_partitions int32 (Optional) : The number of partitions
type string (Optional) : Type of JDBC source. When specified, an appropriate default Driver will be packaged with the transform. One of mysql, postgres, oracle, or mssql.
username string (Optional) : Username for the JDBC source.

Usage

type: ReadFromJdbc
config:
  url: "url"
  connection_init_sql:
  - "connection_init_sql"
  - "connection_init_sql"
  - ...
  connection_properties: "connection_properties"
  disable_auto_commit: true|false
  driver_class_name: "driver_class_name"
  driver_jars: "driver_jars"
  fetch_size: fetch_size
  output_parallelization: true|false
  password: "password"
  query: "query"
  table: "table"
  partition_column: "partition_column"
  num_partitions: num_partitions
  type: "type"
  username: "username"

WriteToJdbc

Write to a JDBC sink using a SQL query or by directly accessing a single table.

This transform can be used to write to a JDBC sink using either a given JDBC driver jar and class name, or by using one of the default packaged drivers given a jdbc_type.

Using a default driver

This transform comes packaged with drivers for several popular JDBC distributions. The following distributions can be declared as the jdbc_type: mysql, oracle, postgres, mssql.

For example, writing to a MySQL sink using a SQL query:

- type: WriteToJdbc
  config:
    jdbc_type: mysql
    url: "jdbc:mysql://my-host:3306/database"
    query: "INSERT INTO table VALUES(?, ?)"

Note: See the following transforms which are built on top of this transform and simplify this logic for several popular JDBC distributions:

WriteToMySql
WriteToPostgres
WriteToOracle
WriteToSqlServer

Declaring custom JDBC drivers

If writing to a JDBC sink not listed above, or if it is necessary to use a custom driver not packaged with Beam, one must define a JDBC driver and class name.

For example, writing to a MySQL table:

- type: WriteToJdbc
  config:
    driver_jars: "path/to/some/jdbc.jar"
    driver_class_name: "com.mysql.jdbc.Driver"
    url: "jdbc:mysql://my-host:3306/database"
    table: "my-table"

Connection Properties

Connection properties are properties sent to the Driver used to connect to the JDBC source. For example, to set the character encoding to UTF-8, one could write:

- type: WriteToJdbc
  config:
    connectionProperties: "characterEncoding=UTF-8;"
    ...

All properties should be semi-colon-delimited (e.g. "key1=value1;key2=value2;")

Configuration

url string : Connection URL for the JDBC sink.
auto_sharding boolean (Optional) : If true, enables using a dynamically determined number of shards to write.
connection_init_sql Array[string] (Optional) : Sets the connection init sql statements used by the Driver. Only MySQL and MariaDB support this.
connection_properties string (Optional) : Used to set connection properties passed to the JDBC driver not already defined as standalone parameter (e.g. username and password can be set using parameters above accordingly). Format of the string must be "key1=value1;key2=value2;".
driver_class_name string (Optional) : Name of a Java Driver class to use to connect to the JDBC source. For example, "com.mysql.jdbc.Driver".
driver_jars string (Optional) : Comma separated path(s) for the JDBC driver jar(s). This can be a local path or GCS (gs://) path.
password string (Optional) : Password for the JDBC source.
table string (Optional) : Name of the table to write to.
batch_size int64 (Optional)
type string (Optional) : Type of JDBC source. When specified, an appropriate default Driver will be packaged with the transform. One of mysql, postgres, oracle, or mssql.
username string (Optional) : Username for the JDBC source.
query string (Optional) : SQL query used to insert records into the JDBC sink.

Usage

type: WriteToJdbc
input: ...
config:
  url: "url"
  auto_sharding: true|false
  connection_init_sql:
  - "connection_init_sql"
  - "connection_init_sql"
  - ...
  connection_properties: "connection_properties"
  driver_class_name: "driver_class_name"
  driver_jars: "driver_jars"
  password: "password"
  table: "table"
  batch_size: batch_size
  type: "type"
  username: "username"
  query: "query"

ReadFromJson

A PTransform for reading json values from files into a PCollection.

Configuration

path string : The file path to read from. The path can contain glob characters such as * and ?.

Usage

type: ReadFromJson
config:
  path: "path"

WriteToJson

A PTransform for writing a PCollection as json values to files.

Configuration

path string : The file path to write to. The files written will begin with this prefix, followed by a shard identifier (see num_shards) according to the file_naming parameter.

Usage

type: WriteToJson
input: ...
config:
  path: "path"

ReadFromKafka

Configuration

schema string (Optional) : The schema in which the data is encoded in the Kafka topic. For AVRO data, this is a schema defined with AVRO schema syntax (https://avro.apache.org/docs/1.10.2/spec.html#schemas). For JSON data, this is a schema defined with JSON-schema syntax (https://json-schema.org/). If a URL to Confluent Schema Registry is provided, then this field is ignored, and the schema is fetched from Confluent Schema Registry.
consumer_config Map[string, string] (Optional) : A list of key-value pairs that act as configuration parameters for Kafka consumers. Most of these configurations will not be needed, but if you need to customize your Kafka consumer, you may use this. See a detailed list: https://docs.confluent.io/platform/current/installation/configuration/consumer-configs.html
format string (Optional) : The encoding format for the data stored in Kafka. Valid options are: RAW,STRING,AVRO,JSON,PROTO
topic string
bootstrap_servers string : A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. The client will make use of all servers irrespective of which servers are specified here for bootstrapping—this list only impacts the initial hosts used to discover the full set of servers. This list should be in the form host1:port1,host2:port2,...
confluent_schema_registry_url string (Optional)
confluent_schema_registry_subject string (Optional)
auto_offset_reset_config string (Optional) : What to do when there is no initial offset in Kafka or if the current offset does not exist any more on the server. (1) earliest: automatically reset the offset to the earliest offset. (2) latest: automatically reset the offset to the latest offset (3) none: throw exception to the consumer if no previous offset is found for the consumer’s group
error_handling Row (Optional) : This option specifies whether and where to output unwritable rows.

Row fields:
- output string : Name to use for the output error collection
file_descriptor_path string (Optional) : The path to the Protocol Buffer File Descriptor Set file. This file is used for schema definition and message serialization.
message_name string (Optional) : The name of the Protocol Buffer message to be used for schema extraction and data conversion.

Usage

type: ReadFromKafka
config:
  schema: "schema"
  consumer_config:
    a: "consumer_config_value_a"
    b: "consumer_config_value_b"
    c: ...
  format: "format"
  topic: "topic"
  bootstrap_servers: "bootstrap_servers"
  confluent_schema_registry_url: "confluent_schema_registry_url"
  confluent_schema_registry_subject: "confluent_schema_registry_subject"
  auto_offset_reset_config: "auto_offset_reset_config"
  error_handling:
    output: "output"
  file_descriptor_path: "file_descriptor_path"
  message_name: "message_name"

WriteToKafka

Configuration

format string : The encoding format for the data stored in Kafka. Valid options are: RAW,JSON,AVRO,PROTO
topic string
bootstrap_servers string : A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. The client will make use of all servers irrespective of which servers are specified here for bootstrapping—this list only impacts the initial hosts used to discover the full set of servers. | Format: host1:port1,host2:port2,...
producer_config_updates Map[string, string] (Optional) : A list of key-value pairs that act as configuration parameters for Kafka producers. Most of these configurations will not be needed, but if you need to customize your Kafka producer, you may use this. See a detailed list: https://docs.confluent.io/platform/current/installation/configuration/producer-configs.html
error_handling Row (Optional) : This option specifies whether and where to output unwritable rows.

Row fields:
- output string : Name to use for the output error collection
file_descriptor_path string (Optional) : The path to the Protocol Buffer File Descriptor Set file. This file is used for schema definition and message serialization.
message_name string (Optional) : The name of the Protocol Buffer message to be used for schema extraction and data conversion.
schema string (Optional)

Usage

type: WriteToKafka
input: ...
config:
  format: "format"
  topic: "topic"
  bootstrap_servers: "bootstrap_servers"
  producer_config_updates:
    a: "producer_config_updates_value_a"
    b: "producer_config_updates_value_b"
    c: ...
  error_handling:
    output: "output"
  file_descriptor_path: "file_descriptor_path"
  message_name: "message_name"
  schema: "schema"

ReadFromMySql

Read from a MySQL source using a SQL query or by directly accessing a single table.

This is a special case of ReadFromJdbc that includes the necessary MySQL Driver and classes.

An example of using ReadFromMySql with SQL query:

- type: ReadFromMySql
  config:
    url: "jdbc:mysql://my-host:3306/database"
    query: "SELECT * FROM table"

It is also possible to read a table by specifying a table name. For example, the following configuration will perform a read on an entire table:

- type: ReadFromMySql
  config:
    url: "jdbc:mysql://my-host:3306/database"
    table: "my-table"

Advanced Usage

It might be necessary to use a custom JDBC driver that is not packaged with this transform. If that is the case, see ReadFromJdbc which allows for more custom configuration.

Configuration

url string : Connection URL for the JDBC source.
connection_init_sql Array[string] (Optional) : Sets the connection init sql statements used by the Driver. Only MySQL and MariaDB support this.
connection_properties string (Optional) : Used to set connection properties passed to the JDBC driver not already defined as standalone parameter (e.g. username and password can be set using parameters above accordingly). Format of the string must be "key1=value1;key2=value2;".
disable_auto_commit boolean (Optional) : Whether to disable auto commit on read. Defaults to true if not provided. The need for this config varies depending on the database platform. Informix requires this to be set to false while Postgres requires this to be set to true.
fetch_size int32 (Optional) : This method is used to override the size of the data that is going to be fetched and loaded in memory per every database call. It should ONLY be used if the default value throws memory errors.
output_parallelization boolean (Optional) : Whether to reshuffle the resulting PCollection so results are distributed to all workers.
password string (Optional) : Password for the JDBC source.
query string (Optional) : SQL query used to query the JDBC source.
table string (Optional) : Name of the table to read from.
partition_column string (Optional) : Name of a column of numeric type that will be used for partitioning.
num_partitions int32 (Optional) : The number of partitions
username string (Optional) : Username for the JDBC source.

Usage

type: ReadFromMySql
config:
  url: "url"
  connection_init_sql:
  - "connection_init_sql"
  - "connection_init_sql"
  - ...
  connection_properties: "connection_properties"
  disable_auto_commit: true|false
  fetch_size: fetch_size
  output_parallelization: true|false
  password: "password"
  query: "query"
  table: "table"
  partition_column: "partition_column"
  num_partitions: num_partitions
  username: "username"

WriteToMySql

Write to a MySQL sink using a SQL query or by directly accessing a single table.

This is a special case of WriteToJdbc that includes the necessary MySQL Driver and classes.

An example of using WriteToMySql with SQL query:

- type: WriteToMySql
  config:
    url: "jdbc:mysql://my-host:3306/database"
    query: "INSERT INTO table VALUES(?, ?)"

It is also possible to read a table by specifying a table name. For example, the following configuration will perform a read on an entire table:

- type: WriteToMySql
  config:
    url: "jdbc:mysql://my-host:3306/database"
    table: "my-table"

Advanced Usage

It might be necessary to use a custom JDBC driver that is not packaged with this transform. If that is the case, see WriteToJdbc which allows for more custom configuration.

Configuration

url string : Connection URL for the JDBC sink.
auto_sharding boolean (Optional) : If true, enables using a dynamically determined number of shards to write.
connection_init_sql Array[string] (Optional) : Sets the connection init sql statements used by the Driver. Only MySQL and MariaDB support this.
connection_properties string (Optional) : Used to set connection properties passed to the JDBC driver not already defined as standalone parameter (e.g. username and password can be set using parameters above accordingly). Format of the string must be "key1=value1;key2=value2;".
password string (Optional) : Password for the JDBC source.
table string (Optional) : Name of the table to write to.
batch_size int64 (Optional)
username string (Optional) : Username for the JDBC source.
query string (Optional) : SQL query used to insert records into the JDBC sink.

Usage

type: WriteToMySql
input: ...
config:
  url: "url"
  auto_sharding: true|false
  connection_init_sql:
  - "connection_init_sql"
  - "connection_init_sql"
  - ...
  connection_properties: "connection_properties"
  password: "password"
  table: "table"
  batch_size: batch_size
  username: "username"
  query: "query"

ReadFromOracle

Read from a Oracle source using a SQL query or by directly accessing a single table.

This is a special case of ReadFromJdbc that includes the necessary Oracle Driver and classes.

An example of using ReadFromOracle with SQL query:

- type: ReadFromOracle
  config:
    url: "jdbc:oracle://my-host:1521/database"
    query: "SELECT * FROM table"

It is also possible to read a table by specifying a table name. For example, the following configuration will perform a read on an entire table:

- type: ReadFromOracle
  config:
    url: "jdbc:oracle://my-host:1521/database"
    table: "my-table"

Advanced Usage

It might be necessary to use a custom JDBC driver that is not packaged with this transform. If that is the case, see ReadFromJdbc which allows for more custom configuration.

Configuration

url string : Connection URL for the JDBC source.
connection_properties string (Optional) : Used to set connection properties passed to the JDBC driver not already defined as standalone parameter (e.g. username and password can be set using parameters above accordingly). Format of the string must be "key1=value1;key2=value2;".
disable_auto_commit boolean (Optional) : Whether to disable auto commit on read. Defaults to true if not provided. The need for this config varies depending on the database platform. Informix requires this to be set to false while Postgres requires this to be set to true.
fetch_size int32 (Optional) : This method is used to override the size of the data that is going to be fetched and loaded in memory per every database call. It should ONLY be used if the default value throws memory errors.
output_parallelization boolean (Optional) : Whether to reshuffle the resulting PCollection so results are distributed to all workers.
password string (Optional) : Password for the JDBC source.
query string (Optional) : SQL query used to query the JDBC source.
table string (Optional) : Name of the table to read from.
partition_column string (Optional) : Name of a column of numeric type that will be used for partitioning.
num_partitions int32 (Optional) : The number of partitions
username string (Optional) : Username for the JDBC source.

Usage

type: ReadFromOracle
config:
  url: "url"
  connection_properties: "connection_properties"
  disable_auto_commit: true|false
  fetch_size: fetch_size
  output_parallelization: true|false
  password: "password"
  query: "query"
  table: "table"
  partition_column: "partition_column"
  num_partitions: num_partitions
  username: "username"

WriteToOracle

Write to a Oracle sink using a SQL query or by directly accessing a single table.

This is a special case of WriteToJdbc that includes the necessary Oracle Driver and classes.

An example of using WriteToOracle with SQL query:

- type: WriteToOracle
  config:
    url: "jdbc:oracle://my-host:1521/database"
    query: "INSERT INTO table VALUES(?, ?)"

It is also possible to read a table by specifying a table name. For example, the following configuration will perform a read on an entire table:

- type: WriteToOracle
  config:
    url: "jdbc:oracle://my-host:1521/database"
    table: "my-table"

Advanced Usage

It might be necessary to use a custom JDBC driver that is not packaged with this transform. If that is the case, see WriteToJdbc which allows for more custom configuration.

Configuration

url string : Connection URL for the JDBC sink.
auto_sharding boolean (Optional) : If true, enables using a dynamically determined number of shards to write.
connection_properties string (Optional) : Used to set connection properties passed to the JDBC driver not already defined as standalone parameter (e.g. username and password can be set using parameters above accordingly). Format of the string must be "key1=value1;key2=value2;".
password string (Optional) : Password for the JDBC source.
table string (Optional) : Name of the table to write to.
batch_size int64 (Optional)
username string (Optional) : Username for the JDBC source.
query string (Optional) : SQL query used to insert records into the JDBC sink.

Usage

type: WriteToOracle
input: ...
config:
  url: "url"
  auto_sharding: true|false
  connection_properties: "connection_properties"
  password: "password"
  table: "table"
  batch_size: batch_size
  username: "username"
  query: "query"

ReadFromParquet

A PTransform for reading Parquet files.

Configuration

path ? (Optional)

Usage

type: ReadFromParquet
config:
  path: path

WriteToParquet

A PTransform for writing parquet files.

Configuration

path ? (Optional)

Usage

type: WriteToParquet
input: ...
config:
  path: path

ReadFromPostgres

Read from a Postgres source using a SQL query or by directly accessing a single table.

This is a special case of ReadFromJdbc that includes the necessary Postgres Driver and classes.

An example of using ReadFromPostgres with SQL query:

- type: ReadFromPostgres
  config:
    url: "jdbc:postgresql://my-host:5432/database"
    query: "SELECT * FROM table"

It is also possible to read a table by specifying a table name. For example, the following configuration will perform a read on an entire table:

- type: ReadFromPostgres
  config:
    url: "jdbc:postgresql://my-host:5432/database"
    table: "my-table"

Advanced Usage

It might be necessary to use a custom JDBC driver that is not packaged with this transform. If that is the case, see ReadFromJdbc which allows for more custom configuration.

Configuration

url string : Connection URL for the JDBC source.
connection_properties string (Optional) : Used to set connection properties passed to the JDBC driver not already defined as standalone parameter (e.g. username and password can be set using parameters above accordingly). Format of the string must be "key1=value1;key2=value2;".
disable_auto_commit boolean (Optional) : Whether to disable auto commit on read. Defaults to true if not provided. The need for this config varies depending on the database platform. Informix requires this to be set to false while Postgres requires this to be set to true.
fetch_size int32 (Optional) : This method is used to override the size of the data that is going to be fetched and loaded in memory per every database call. It should ONLY be used if the default value throws memory errors.
output_parallelization boolean (Optional) : Whether to reshuffle the resulting PCollection so results are distributed to all workers.
password string (Optional) : Password for the JDBC source.
query string (Optional) : SQL query used to query the JDBC source.
table string (Optional) : Name of the table to read from.
partition_column string (Optional) : Name of a column of numeric type that will be used for partitioning.
num_partitions int32 (Optional) : The number of partitions
username string (Optional) : Username for the JDBC source.

Usage

type: ReadFromPostgres
config:
  url: "url"
  connection_properties: "connection_properties"
  disable_auto_commit: true|false
  fetch_size: fetch_size
  output_parallelization: true|false
  password: "password"
  query: "query"
  table: "table"
  partition_column: "partition_column"
  num_partitions: num_partitions
  username: "username"

WriteToPostgres

Write to a Postgres sink using a SQL query or by directly accessing a single table.

This is a special case of WriteToJdbc that includes the necessary Postgres Driver and classes.

An example of using WriteToPostgres with SQL query:

- type: WriteToPostgres
  config:
    url: "jdbc:postgresql://my-host:5432/database"
    query: "INSERT INTO table VALUES(?, ?)"

It is also possible to read a table by specifying a table name. For example, the following configuration will perform a read on an entire table:

- type: WriteToPostgres
  config:
    url: "jdbc:postgresql://my-host:5432/database"
    table: "my-table"

Advanced Usage

It might be necessary to use a custom JDBC driver that is not packaged with this transform. If that is the case, see WriteToJdbc which allows for more custom configuration.

Configuration

url string : Connection URL for the JDBC sink.
auto_sharding boolean (Optional) : If true, enables using a dynamically determined number of shards to write.
connection_properties string (Optional) : Used to set connection properties passed to the JDBC driver not already defined as standalone parameter (e.g. username and password can be set using parameters above accordingly). Format of the string must be "key1=value1;key2=value2;".
password string (Optional) : Password for the JDBC source.
table string (Optional) : Name of the table to write to.
batch_size int64 (Optional)
username string (Optional) : Username for the JDBC source.
query string (Optional) : SQL query used to insert records into the JDBC sink.

Usage

type: WriteToPostgres
input: ...
config:
  url: "url"
  auto_sharding: true|false
  connection_properties: "connection_properties"
  password: "password"
  table: "table"
  batch_size: batch_size
  username: "username"
  query: "query"

ReadFromPubSub

Reads messages from Cloud Pub/Sub.

Configuration

topic string (Optional) : Cloud Pub/Sub topic in the form "projects//topics/". If provided, subscription must be None.
subscription string (Optional) : Existing Cloud Pub/Sub subscription to use in the form "projects//subscriptions/". If not specified, a temporary subscription will be created from the specified topic. If provided, topic must be None.
format string : The expected format of the message payload. Currently suported formats are
- RAW: Produces records with a single payload field whose contents are the raw bytes of the pubsub message.
- STRING: Like RAW, but the bytes are decoded as a UTF-8 string.
- AVRO: Parses records with a given Avro schema.
- JSON: Parses records with a given JSON schema.
schema ? (Optional) : Schema specification for the given format.
attributes Array[string] (Optional) : List of attribute keys whose values will be flattened into the output message as additional fields. For example, if the format is raw and attributes is ["a", "b"] then this read will produce elements of the form Row(payload=..., a=..., b=...).
attributes_map string (Optional)
id_attribute string (Optional) : The attribute on incoming Pub/Sub messages to use as a unique record identifier. When specified, the value of this attribute (which can be any string that uniquely identifies the record) will be used for deduplication of messages. If not provided, we cannot guarantee that no duplicate data will be delivered on the Pub/Sub stream. In this case, deduplication of the stream will be strictly best effort.
timestamp_attribute string (Optional) : Message value to use as element timestamp. If None, uses message publishing time as the timestamp.

Timestamp values should be in one of two formats:
- A numerical value representing the number of milliseconds since the Unix epoch.
- A string in RFC 3339 format, UTC timezone. Example: 2015-10-29T23:41:41.123Z. The sub-second component of the timestamp is optional, and digits beyond the first three (i.e., time units smaller than milliseconds) may be ignored.
error_handling Row (Optional) : This option specifies whether and where to output error rows.

Row fields:
- output string : Name to use for the output error collection

Usage

type: ReadFromPubSub
config:
  topic: "topic"
  subscription: "subscription"
  format: "format"
  schema: schema
  attributes:
  - "attribute"
  - "attribute"
  - ...
  attributes_map: "attributes_map"
  id_attribute: "id_attribute"
  timestamp_attribute: "timestamp_attribute"
  error_handling:
    output: "output"

WriteToPubSub

Writes messages to Cloud Pub/Sub.

Configuration

topic string : Cloud Pub/Sub topic in the form "/topics//".
format string : How to format the message payload. Currently suported formats are
- RAW: Expects a message with a single field (excluding attribute-related fields) whose contents are used as the raw bytes of the pubsub message.
- AVRO: Encodes records with a given Avro schema, which may be inferred from the input PCollection schema.
- JSON: Formats records with a given JSON schema, which may be inferred from the input PCollection schema.
schema ? (Optional) : Schema specification for the given format.
attributes Array[string] (Optional) : List of attribute keys whose values will be pulled out as PubSub message attributes. For example, if the format is raw and attributes is ["a", "b"] then elements of the form Row(any_field=..., a=..., b=...) will result in PubSub messages whose payload has the contents of any_field and whose attribute will be populated with the values of a and b.
attributes_map string (Optional)
id_attribute string (Optional) : If set, will set an attribute for each Cloud Pub/Sub message with the given name and a unique value. This attribute can then be used in a ReadFromPubSub PTransform to deduplicate messages.
timestamp_attribute string (Optional) : If set, will set an attribute for each Cloud Pub/Sub message with the given name and the message's publish time as the value.
error_handling Row (Optional) : This option specifies whether and where to output error rows.

Row fields:
- output string : Name to use for the output error collection

Usage

type: WriteToPubSub
input: ...
config:
  topic: "topic"
  format: "format"
  schema: schema
  attributes:
  - "attribute"
  - "attribute"
  - ...
  attributes_map: "attributes_map"
  id_attribute: "id_attribute"
  timestamp_attribute: "timestamp_attribute"
  error_handling:
    output: "output"

ReadFromPubSubLite

Performs a read from Google Pub/Sub Lite.

Note: This provider is deprecated. See Pub/Sub Lite documentation for more information.

Configuration

project string (Optional) : The GCP project where the Pubsub Lite reservation resides. This can be a project number of a project ID.
schema string (Optional) : The schema in which the data is encoded in the Pubsub Lite topic. For AVRO data, this is a schema defined with AVRO schema syntax (https://avro.apache.org/docs/1.10.2/spec.html#schemas). For JSON data, this is a schema defined with JSON-schema syntax (https://json-schema.org/).
format string : The encoding format for the data stored in Pubsub Lite. Valid options are: RAW,AVRO,JSON,PROTO
subscription_name string : The name of the subscription to consume data. This will be concatenated with the project and location parameters to build a full subscription path.
location string : The region or zone where the Pubsub Lite reservation resides.
attributes Array[string] (Optional) : List of attribute keys whose values will be flattened into the output message as additional fields. For example, if the format is RAW and attributes is ["a", "b"] then this read will produce elements of the form Row(payload=..., a=..., b=...)
attribute_map string (Optional) : Name of a field in which to store the full set of attributes associated with this message. For example, if the format is RAW and attribute_map is set to "attrs" then this read will produce elements of the form Row(payload=..., attrs=...) where attrs is a Map type of string to string. If both attributes and attribute_map are set, the overlapping attribute values will be present in both the flattened structure and the attribute map.
attribute_id string (Optional) : The attribute on incoming Pubsub Lite messages to use as a unique record identifier. When specified, the value of this attribute (which can be any string that uniquely identifies the record) will be used for deduplication of messages. If not provided, we cannot guarantee that no duplicate data will be delivered on the Pub/Sub stream. In this case, deduplication of the stream will be strictly best effort.
error_handling Row (Optional) : This option specifies whether and where to output unwritable rows.

Row fields:
- output string : Name to use for the output error collection
file_descriptor_path string (Optional) : The path to the Protocol Buffer File Descriptor Set file. This file is used for schema definition and message serialization.
message_name string (Optional) : The name of the Protocol Buffer message to be used for schema extraction and data conversion.

Usage

type: ReadFromPubSubLite
config:
  project: "project"
  schema: "schema"
  format: "format"
  subscription_name: "subscription_name"
  location: "location"
  attributes:
  - "attribute"
  - "attribute"
  - ...
  attribute_map: "attribute_map"
  attribute_id: "attribute_id"
  error_handling:
    output: "output"
  file_descriptor_path: "file_descriptor_path"
  message_name: "message_name"

WriteToPubSubLite

Performs a write to Google Pub/Sub Lite.

Note: This provider is deprecated. See Pub/Sub Lite documentation for more information.

Configuration

project string : The GCP project where the Pubsub Lite reservation resides. This can be a project number of a project ID.
format string : The encoding format for the data stored in Pubsub Lite. Valid options are: RAW,JSON,AVRO,PROTO
topic_name string : The name of the topic to publish data into. This will be concatenated with the project and location parameters to build a full topic path.
location string : The region or zone where the Pubsub Lite reservation resides.
attributes Array[string] (Optional) : List of attribute keys whose values will be pulled out as Pubsub Lite message attributes. For example, if the format is JSON and attributes is ["a", "b"] then elements of the form Row(any_field=..., a=..., b=...) will result in Pubsub Lite messages whose payload has the contents of any_field and whose attribute will be populated with the values of a and b.
attribute_id string (Optional) : If set, will set an attribute for each Pubsub Lite message with the given name and a unique value. This attribute can then be used in a ReadFromPubSubLite PTransform to deduplicate messages.
error_handling Row (Optional) : This option specifies whether and where to output unwritable rows.

Row fields:
- output string : Name to use for the output error collection
file_descriptor_path string (Optional) : The path to the Protocol Buffer File Descriptor Set file. This file is used for schema definition and message serialization.
message_name string (Optional) : The name of the Protocol Buffer message to be used for schema extraction and data conversion.
schema string (Optional)

Usage

type: WriteToPubSubLite
input: ...
config:
  project: "project"
  format: "format"
  topic_name: "topic_name"
  location: "location"
  attributes:
  - "attribute"
  - "attribute"
  - ...
  attribute_id: "attribute_id"
  error_handling:
    output: "output"
  file_descriptor_path: "file_descriptor_path"
  message_name: "message_name"
  schema: "schema"

ReadFromSpanner

Performs a Bulk read from Google Cloud Spanner using a specified SQL query or by directly accessing a single table and its columns.

Both Query and Read APIs are supported. See more information about reading from Cloud Spanner.

Example configuration for performing a read using a SQL query:

- type: ReadFromSpanner
  config:
    instance_id: 'my-instance-id'
    database_id: 'my-database'
    query: 'SELECT * FROM table'

It is also possible to read a table by specifying a table name and a list of columns. For example, the following configuration will perform a read on an entire table:

- type: ReadFromSpanner
  config:
    instance_id: 'my-instance-id'
    database_id: 'my-database'
    table: 'my-table'
    columns: ['col1', 'col2']

Additionally, to read using a Secondary Index, specify the index name:

- type: ReadFromSpanner
  config:
    instance_id: 'my-instance-id'
    database_id: 'my-database'
    table: 'my-table'
    index: 'my-index'
    columns: ['col1', 'col2']

Advanced Usage

Reads by default use the PartitionQuery API which enforces some limitations on the type of queries that can be used so that the data can be read in parallel. If the query is not supported by the PartitionQuery API, then you can specify a non-partitioned read by setting batching to false.

For example:

- type: ReadFromSpanner
  config:
    batching: false
    ...

Note: See SpannerIO for more advanced information.

Configuration

project string (Optional) : Specifies the GCP project ID.
instance string : Specifies the Cloud Spanner instance.
database string : Specifies the Cloud Spanner database.
table string (Optional) : Specifies the Cloud Spanner table.
query string (Optional) : Specifies the SQL query to execute.
columns Array[string] (Optional) : Specifies the columns to read from the table. This parameter is required when table is specified.
index string (Optional) : Specifies the Index to read from. This parameter can only be specified when using table.
batching boolean (Optional) : Set to false to disable batching. Useful when using a query that is not compatible with the PartitionQuery API. Defaults to true.
error_handling Row (Optional) : This option specifies whether and where to output unwritable rows.

Row fields:
- output string : Name to use for the output error collection

Usage

type: ReadFromSpanner
config:
  project: "project"
  instance: "instance"
  database: "database"
  table: "table"
  query: "query"
  columns:
  - "columns"
  - "columns"
  - ...
  index: "index"
  batching: true|false
  error_handling:
    output: "output"

WriteToSpanner

Performs a bulk write to a Google Cloud Spanner table.

Example configuration for performing a write to a single table:

- type: ReadFromSpanner
  config:
    project_id: 'my-project-id'
    instance_id: 'my-instance-id'
    database_id: 'my-database'
    table: 'my-table'

Note: See SpannerIO for more advanced information.

Configuration

project string (Optional) : Specifies the GCP project.
instance string : Specifies the Cloud Spanner instance.
database string : Specifies the Cloud Spanner database.
table string : Specifies the Cloud Spanner table.
error_handling Row (Optional) : Whether and how to handle write errors.

Row fields:
- output string : Name to use for the output error collection

Usage

type: WriteToSpanner
input: ...
config:
  project: "project"
  instance: "instance"
  database: "database"
  table: "table"
  error_handling:
    output: "output"

ReadFromSqlServer

Read from a SQL Server source using a SQL query or by directly accessing a single table.

This is a special case of ReadFromJdbc that includes the necessary SQL Server Driver and classes.

An example of using ReadFromSqlServer with SQL query:

- type: ReadFromSqlServer
  config:
    url: "jdbc:sqlserver://my-host:1433/database"
    query: "SELECT * FROM table"

It is also possible to read a table by specifying a table name. For example, the following configuration will perform a read on an entire table:

- type: ReadFromSqlServer
  config:
    url: "jdbc:sqlserver://my-host:1433/database"
    table: "my-table"

Advanced Usage

It might be necessary to use a custom JDBC driver that is not packaged with this transform. If that is the case, see ReadFromJdbc which allows for more custom configuration.

Configuration

url string : Connection URL for the JDBC source.
connection_properties string (Optional) : Used to set connection properties passed to the JDBC driver not already defined as standalone parameter (e.g. username and password can be set using parameters above accordingly). Format of the string must be "key1=value1;key2=value2;".
disable_auto_commit boolean (Optional) : Whether to disable auto commit on read. Defaults to true if not provided. The need for this config varies depending on the database platform. Informix requires this to be set to false while Postgres requires this to be set to true.
fetch_size int32 (Optional) : This method is used to override the size of the data that is going to be fetched and loaded in memory per every database call. It should ONLY be used if the default value throws memory errors.
output_parallelization boolean (Optional) : Whether to reshuffle the resulting PCollection so results are distributed to all workers.
password string (Optional) : Password for the JDBC source.
query string (Optional) : SQL query used to query the JDBC source.
table string (Optional) : Name of the table to read from.
partition_column string (Optional) : Name of a column of numeric type that will be used for partitioning.
num_partitions int32 (Optional) : The number of partitions
username string (Optional) : Username for the JDBC source.

Usage

type: ReadFromSqlServer
config:
  url: "url"
  connection_properties: "connection_properties"
  disable_auto_commit: true|false
  fetch_size: fetch_size
  output_parallelization: true|false
  password: "password"
  query: "query"
  table: "table"
  partition_column: "partition_column"
  num_partitions: num_partitions
  username: "username"

WriteToSqlServer

Write to a SQL Server sink using a SQL query or by directly accessing a single table.

This is a special case of WriteToJdbc that includes the necessary SQL Server Driver and classes.

An example of using WriteToSqlServer with SQL query:

- type: WriteToSqlServer
  config:
    url: "jdbc:sqlserver://my-host:1433/database"
    query: "INSERT INTO table VALUES(?, ?)"

It is also possible to read a table by specifying a table name. For example, the following configuration will perform a read on an entire table:

- type: WriteToSqlServer
  config:
    url: "jdbc:sqlserver://my-host:1433/database"
    table: "my-table"

Advanced Usage

It might be necessary to use a custom JDBC driver that is not packaged with this transform. If that is the case, see WriteToJdbc which allows for more custom configuration.

Configuration

url string : Connection URL for the JDBC sink.
auto_sharding boolean (Optional) : If true, enables using a dynamically determined number of shards to write.
connection_properties string (Optional) : Used to set connection properties passed to the JDBC driver not already defined as standalone parameter (e.g. username and password can be set using parameters above accordingly). Format of the string must be "key1=value1;key2=value2;".
password string (Optional) : Password for the JDBC source.
table string (Optional) : Name of the table to write to.
batch_size int64 (Optional)
username string (Optional) : Username for the JDBC source.
query string (Optional) : SQL query used to insert records into the JDBC sink.

Usage

type: WriteToSqlServer
input: ...
config:
  url: "url"
  auto_sharding: true|false
  connection_properties: "connection_properties"
  password: "password"
  table: "table"
  batch_size: batch_size
  username: "username"
  query: "query"

ReadFromTFRecord

Reads data from TFRecord.

Configuration

file_pattern string : A file glob pattern to read TFRecords from.
coder ? (Optional) : Coder used to decode each record.
compression_type string : Used to handle compressed input files. Default value is CompressionTypes.AUTO, in which case the file_path's extension will be used to detect the compression.
validate boolean (Optional) : Boolean flag to verify that the files exist during the pipeline creation time.

Usage

type: ReadFromTFRecord
config:
  file_pattern: "file_pattern"
  coder: coder
  compression_type: "compression_type"
  validate: true|false

WriteToTFRecord

Writes data to TFRecord.

Configuration

file_path_prefix string : The file path to write to. The files written will begin with this prefix, followed by a shard identifier (see num_shards), and end in a common extension, if given by file_name_suffix.
coder ? (Optional) : Coder used to encode each record.
file_name_suffix string (Optional) : Suffix for the files written.
num_shards int64 (Optional) : The number of files (shards) used for output. If not set, the default value will be used.
shard_name_template string (Optional) : A template string containing placeholders for the shard number and shard count. When constructing a filename for a particular shard number, the upper-case letters 'S' and 'N' are replaced with the 0-padded shard number and shard count respectively. This argument can be '' in which case it behaves as if num_shards was set to 1 and only one file will be generated. The default pattern used is '-SSSSS-of-NNNNN' if None is passed as the shard_name_template.
compression_type string : Used to handle compressed output files. Typical value is CompressionTypes.AUTO, in which case the file_path's extension will be used to detect the compression.

Usage

type: WriteToTFRecord
input: ...
config:
  file_path_prefix: "file_path_prefix"
  coder: coder
  file_name_suffix: "file_name_suffix"
  num_shards: num_shards
  shard_name_template: "shard_name_template"
  compression_type: "compression_type"

ReadFromText

Reads lines from a text files.

The resulting PCollection consists of rows with a single string field named "line."

Configuration

path string : The file path to read from. The path can contain glob characters such as * and ?.

Usage

type: ReadFromText
config:
  path: "path"

WriteToText

Writes a PCollection to a (set of) text files(s).

The input must be a PCollection whose schema has exactly one field.

Configuration

path string : The file path to write to. The files written will begin with this prefix, followed by a shard identifier.

Usage

type: WriteToText
input: ...
config:
  path: "path"