Class PubsubSchemaIOProvider

java.lang.Object
org.apache.beam.sdk.io.gcp.pubsub.PubsubSchemaIOProvider
All Implemented Interfaces:
SchemaIOProvider

@Internal @AutoService(SchemaIOProvider.class) public class PubsubSchemaIOProvider extends Object implements SchemaIOProvider
An implementation of SchemaIOProvider for reading and writing JSON/AVRO payloads with PubsubIO.

Schema

The data schema passed to from(String, Row, Schema) must either be of the nested or flat style.

Nested style

If nested structure is used, the required fields included in the Pubsub message model are 'event_timestamp', 'attributes', and 'payload'.

Flat style

If flat structure is used, the required fields include just 'event_timestamp'. Every other field is assumed part of the payload. See PubsubMessageToRow for details.

Configuration

configurationSchema() consists of two attributes, timestampAttributeKey and deadLetterQueue.

timestampAttributeKey

An optional attribute key of the Pubsub message from which to extract the event timestamp. If not specified, the message publish time will be used as event timestamp.

This attribute has to conform to the same requirements as in PubsubIO.Read.withTimestampAttribute(String)

Short version: it has to be either millis since epoch or string in RFC 3339 format.

If the attribute is specified then event timestamps will be extracted from the specified attribute. If it is not specified then message publish timestamp will be used.

deadLetterQueue

deadLetterQueue is an optional topic path which will be used as a dead letter queue.

Messages that cannot be processed will be sent to this topic. If it is not specified then exception will be thrown for errors during processing causing the pipeline to crash.

  • Field Details

    • ATTRIBUTE_MAP_FIELD_TYPE

      public static final Schema.FieldType ATTRIBUTE_MAP_FIELD_TYPE
    • ATTRIBUTE_ARRAY_ENTRY_SCHEMA

      public static final Schema ATTRIBUTE_ARRAY_ENTRY_SCHEMA
    • ATTRIBUTE_ARRAY_FIELD_TYPE

      public static final Schema.FieldType ATTRIBUTE_ARRAY_FIELD_TYPE
  • Constructor Details

    • PubsubSchemaIOProvider

      public PubsubSchemaIOProvider()
  • Method Details

    • identifier

      public String identifier()
      Returns an id that uniquely represents this IO.
      Specified by:
      identifier in interface SchemaIOProvider
    • configurationSchema

      public Schema configurationSchema()
      Returns the expected schema of the configuration object. Note this is distinct from the schema of the data source itself.
      Specified by:
      configurationSchema in interface SchemaIOProvider
    • from

      public org.apache.beam.sdk.io.gcp.pubsub.PubsubSchemaIOProvider.PubsubSchemaIO from(String location, Row configuration, Schema dataSchema)
      Produce a SchemaIO given a String representing the data's location, the schema of the data that resides there, and some IO-specific configuration object.
      Specified by:
      from in interface SchemaIOProvider
    • requiresDataSchema

      public boolean requiresDataSchema()
      Description copied from interface: SchemaIOProvider
      Indicates whether the dataSchema value is necessary.
      Specified by:
      requiresDataSchema in interface SchemaIOProvider
    • isBounded

      public PCollection.IsBounded isBounded()
      Specified by:
      isBounded in interface SchemaIOProvider