Class PTransform<InputT extends PInput,OutputT extends POutput>

java.lang.Object
org.apache.beam.sdk.transforms.PTransform<InputT,OutputT>
Type Parameters:
InputT - the type of the input to this PTransform
OutputT - the type of the output of this PTransform
All Implemented Interfaces:
Serializable, HasDisplayData
Direct Known Subclasses:
AddFields.Inner, AddUuidsTransform, AmqpIO.Read, AmqpIO.Write, AnnotateText, ApproximateCountDistinct.Globally, ApproximateCountDistinct.PerKey, ApproximateDistinct.GloballyDistinct, ApproximateDistinct.PerKeyDistinct, ApproximateUnique.Globally, ApproximateUnique.PerKey, AsJsons, AsJsons.AsJsonsWithFailures, AvroIO.Parse, AvroIO.ParseAll, AvroIO.ParseFiles, AvroIO.Read, AvroIO.ReadAll, AvroIO.ReadFiles, AvroIO.TypedWrite, AvroIO.Write, BeamJoinTransforms.JoinAsLookup, BeamRowToBigtableMutation, BeamSetOperatorRelBase, BigQueryIO.Read, BigQueryIO.TypedRead, BigQueryIO.Write, BigtableIO.Read, BigtableIO.ReadChangeStream, BigtableIO.Write, BigtableIO.WriteWithResults, BigtableRowToBeamRow, BigtableRowToBeamRowFlat, BoundedReadFromUnboundedSource, CassandraIO.Read, CassandraIO.ReadAll, CassandraIO.Write, Cast, CdapIO.Read, CdapIO.Write, ClickHouseIO.Write, CloudVision.AnnotateImagesFromBytes, CloudVision.AnnotateImagesFromBytesWithContext, CloudVision.AnnotateImagesFromGcsUri, CloudVision.AnnotateImagesFromGcsUriWithContext, CoGroup.ExpandCrossProduct, CoGroup.Impl, CoGroupByKey, Combine.Globally, Combine.GloballyAsSingletonView, Combine.GroupedValues, Combine.PerKey, Combine.PerKeyWithHotKeyFanout, CombineAsIterable, ConsoleIO.Write.Unbound, ContextualTextIO.Read, ContextualTextIO.ReadFiles, CosmosIO.Read, Create.OfValueProvider, Create.TimestampedValues, Create.Values, Create.WindowedValues, CreateDataflowView, CreateStream, CreateStreamingSparkView, CreateStreamingSparkView.CreateSparkPCollectionView, CreateTables, CsvIO.Write, CsvIOParse, DataflowGroupByKey, DataframeTransform, DataGeneratorPTransform, DatastoreV1.DeleteEntity, DatastoreV1.DeleteEntityWithSummary, DatastoreV1.DeleteKey, DatastoreV1.DeleteKeyWithSummary, DatastoreV1.Read, DatastoreV1.Write, DatastoreV1.WriteWithSummary, DeadLetteredTransform, DebeziumIO.Read, Deduplicate.KeyedValues, Deduplicate.Values, Deduplicate.WithRepresentativeValues, DicomIO.ReadStudyMetadata, Distinct, Distinct.WithRepresentativeValues, DLPDeidentifyText, DLPInspectText, DLPReidentifyText, DropFields.Inner, DynamoDBIO.Read, DynamoDBIO.Write, ElasticsearchIO.BulkIO, ElasticsearchIO.DocToBulk, ElasticsearchIO.Read, ElasticsearchIO.Write, EntityToRow, ErrorHandler.PTransformErrorHandler.WriteErrorMetrics, FhirIO.Deidentify, FhirIO.ExecuteBundles, FhirIO.Export, FhirIO.Read, FhirIO.Search, FhirIO.Write, FhirIOPatientEverything, FileIO.Match, FileIO.MatchAll, FileIO.ReadMatches, FileIO.Write, FillGaps, Filter, Filter.Inner, FirestoreV1.BatchGetDocuments, FirestoreV1.BatchWriteWithDeadLetterQueue, FirestoreV1.BatchWriteWithSummary, FirestoreV1.ListCollectionIds, FirestoreV1.ListDocuments, FirestoreV1.PartitionQuery, FirestoreV1.RunQuery, FlatMapElements, FlatMapElements.FlatMapWithFailures, Flatten.Iterables, Flatten.PCollections, org.apache.beam.sdk.util.construction.ForwardingPTransform, GenerateSequence, GoogleAdsV19.Read, GoogleAdsV19.ReadAll, Group.AggregateCombiner, Group.CombineGlobally, Group.Global, GroupByKey, GroupIntoBatches, GroupIntoBatches.WithShardedKey, HadoopFormatIO.Read, HadoopFormatIO.Write, HBaseIO.Read, HBaseIO.ReadAll, HBaseIO.Write, HBaseIO.WriteRowMutations, HCatalogIO.Read, HCatalogIO.Write, HL7v2IO.HL7v2Read, HL7v2IO.HL7v2Read.FetchHL7v2Message, HL7v2IO.ListHL7v2Messages, HL7v2IO.Read, HL7v2IO.Read.FetchHL7v2Message, HL7v2IO.Write, IcebergIO.ReadRows, IcebergIO.WriteRows, Impulse, InfluxDbIO.Read, InfluxDbIO.Write, JdbcIO.Read, JdbcIO.ReadAll, JdbcIO.ReadRows, JdbcIO.ReadWithPartitions, JdbcIO.Write, JdbcIO.WriteVoid, 
JdbcIO.WriteWithResults, JmsIO.Read, JmsIO.Write, Join.FullOuterJoin, Join.Impl, Join.InnerJoin, Join.LeftOuterJoin, Join.RightOuterJoin, JsonIO.Write, JsonToRow.JsonToRowWithErrFn, KafkaCommitOffset, KafkaIO.Read, KafkaIO.ReadSourceDescriptors, KafkaIO.TypedWithoutMetadata, KafkaIO.Write, KafkaIO.WriteRecords, Keys, KinesisIO.Read, KinesisIO.Write, KinesisTransformRegistrar.KinesisReadToBytes, KuduIO.Read, KuduIO.Write, KvSwap, Managed.ManagedTransform, MapElements, MapElements.MapWithFailures, MapKeys, MapValues, MongoDbGridFSIO.Read, MongoDbGridFSIO.Write, MongoDbIO.Read, MongoDbIO.Write, MongoDbTable.DocumentToRow, MongoDbTable.RowToDocument, MqttIO.Read, MqttIO.Write, Neo4jIO.ReadAll, Neo4jIO.WriteUnwind, OrderedEventProcessor, ParDo.MultiOutput, ParDo.SingleOutput, ParquetIO.Parse, ParquetIO.ParseFiles, ParquetIO.Read, ParquetIO.ReadFiles, ParseJsons, ParseJsons.ParseJsonsWithFailures, Partition, PAssert.DefaultConcludeTransform, PAssert.GroupThenAssert, PAssert.GroupThenAssertForSingleton, PAssert.OneSideInputAssert, PeriodicImpulse, PeriodicSequence, PrepareWrite, ProtoFromBytes, ProtoToBytes, PubsubIO.Read, PubsubIO.Write, PubsubLiteWriteSchemaTransformProvider.SetUuidFromPubSubMessage, PubsubUnboundedSink, PubsubUnboundedSource, PulsarIO.Read, PulsarIO.Write, PythonExternalTransform, PythonMap, RabbitMqIO.Read, RabbitMqIO.Write, Read.Bounded, Read.Unbounded, ReadAllViaFileBasedSourceTransform, RecommendationAICreateCatalogItem, RecommendationAIImportCatalogItems, RecommendationAIImportUserEvents, RecommendationAIPredict, RecommendationAIWriteUserEvent, RedisIO.Read, RedisIO.ReadKeyPatterns, RedisIO.Write, RedisIO.WriteStreams, Redistribute.RedistributeArbitrarily, Redistribute.RedistributeByKey, Regex.AllMatches, Regex.Find, Regex.FindAll, Regex.FindKV, Regex.FindName, Regex.FindNameKV, Regex.Matches, Regex.MatchesKV, Regex.MatchesName, Regex.MatchesNameKV, Regex.ReplaceAll, Regex.ReplaceFirst, Regex.Split, ReifyAsIterable, RenameFields.Inner, RequestResponseIO, Reshuffle, Reshuffle.ViaRandomKey, RowToEntity, RunInference, SchemaTransform, Select.Fields, Select.Flattened, SingleStoreIO.Read, SingleStoreIO.ReadWithPartitions, SingleStoreIO.Write, SketchFrequencies.GlobalSketch, SketchFrequencies.PerKeySketch, SnowflakeIO.Read, SnowflakeIO.Write, SnsIO.Write, SolaceIO.Read, SolaceIO.Write, SolrIO.Read, SolrIO.ReadAll, SolrIO.Write, SortValues, SpannerIO.CreateTransaction, SpannerIO.Read, SpannerIO.ReadAll, SpannerIO.ReadChangeStream, SpannerIO.Write, SpannerIO.WriteGrouped, SparkReceiverIO.Read, SplunkIO.Write, SqlTransform, SqsIO.Read, SqsIO.Write, SqsIO.WriteBatches, StorageApiConvertMessages, StorageApiLoads, StorageApiWriteRecordsInconsistent, StorageApiWritesShardedRecords, StorageApiWriteUnshardedRecords, StreamingInserts, StreamingWriteTables, SubscribeTransform, TDigestQuantiles.GlobalDigest, TDigestQuantiles.PerKeyDigest, Tee, TestStream, TextIO.Read, TextIO.ReadAll, TextIO.ReadFiles, TextIO.TypedWrite, TextIO.Write, TextTableProvider.CsvToRow, TextTableProvider.LinesReadConverter, TextTableProvider.LinesWriteConverter, TFRecordIO.Read, TFRecordIO.ReadFiles, TFRecordIO.Write, ThriftIO.ReadFiles, TikaIO.Parse, TikaIO.ParseFiles, ToJson, UuidDeduplicationTransform, Values, VideoIntelligence.AnnotateVideoFromBytes, VideoIntelligence.AnnotateVideoFromBytesWithContext, VideoIntelligence.AnnotateVideoFromUri, VideoIntelligence.AnnotateVideoFromURIWithContext, View.AsIterable, View.AsList, View.AsMap, View.AsMultimap, View.AsSingleton, View.CreatePCollectionView, Wait.OnSignal, 
Watch.Growth, Window, Window.Assign, WithKeys, WithKeys, WithTimestamps, WordCount.CountWords, WriteFiles, XmlIO.Read, XmlIO.ReadFiles, XmlIO.Write, YamlTransform

public abstract class PTransform<InputT extends PInput,OutputT extends POutput> extends Object implements Serializable, HasDisplayData
A PTransform<InputT, OutputT> is an operation that takes an InputT (some subtype of PInput) and produces an OutputT (some subtype of POutput).

Common PTransforms include root PTransforms like TextIO.Read, Create, processing and conversion operations like ParDo, GroupByKey, CoGroupByKey, Combine, and Count, and outputting PTransforms like TextIO.Write. Users also define their own application-specific composite PTransforms.

Each PTransform<InputT, OutputT> has a single InputT type and a single OutputT type. Many PTransforms conceptually transform one input value to one output value, and in this case InputT and OutputT are typically instances of PCollection. A root PTransform conceptually has no input; in this case, conventionally a PBegin object produced by calling Pipeline.begin() is used as the input. An outputting PTransform conceptually has no output; in this case, conventionally PDone is used as its output type. Some PTransforms conceptually have multiple inputs and/or outputs; in these cases special "bundling" classes like PCollectionList and PCollectionTuple are used to combine multiple values into a single bundle for passing into or returning from the PTransform.
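For example, a minimal sketch of these conventions (the file paths and variable names are placeholders for illustration only):


 Pipeline p = Pipeline.create(options);

 // Root PTransform: PBegin in, PCollection out.
 PCollection<String> lines = p.apply(TextIO.read().from("/path/to/input*.txt"));

 // One-to-one PTransform: PCollection in, PCollection out.
 PCollection<Long> lineCount = lines.apply(Count.globally());

 // Outputting PTransform: PCollection in, PDone out.
 lines.apply(TextIO.write().to("/path/to/output"));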

A PTransform<InputT, OutputT> is invoked by calling apply() on its InputT, returning its OutputT. Calls can be chained to concisely create linear pipeline segments. For example:


 PCollection<T1> pc1 = ...;
 PCollection<T2> pc2 =
     pc1.apply(ParDo.of(new MyDoFn<T1,KV<K,V>>()))
        .apply(GroupByKey.<K, V>create())
        .apply(Combine.perKey(new MyKeyedCombineFn<K,V>()))
        .apply(ParDo.of(new MyDoFn2<KV<K,V>,T2>()));
 

PTransform operations have unique names, which are used by the system when explaining what's going on during optimization and execution. Each PTransform gets a system-provided default name, but it's a good practice to specify a more informative explicit name when applying the transform. For example:


 ...
 .apply("Step1", ParDo.of(new MyDoFn3()))
 ...
 

Each PCollection output produced by a PTransform, either directly or within a "bundling" class, automatically gets its own name derived from the name of its producing PTransform.

Each PCollection output produced by a PTransform also records a Coder that specifies how the elements of that PCollection are to be encoded as a byte string, if necessary. The PTransform may provide a default Coder for any of its outputs, for instance by deriving it from the PTransform input's Coder. If the PTransform does not specify the Coder for an output PCollection, the system will attempt to infer a Coder for it, based on what's known at run-time about the Java type of the output's elements. The enclosing Pipeline's CoderRegistry (accessible via Pipeline.getCoderRegistry()) defines the mapping from Java types to the default Coder to use, for a standard set of Java types; users can extend this mapping for additional types, via CoderRegistry.registerCoderProvider(org.apache.beam.sdk.coders.CoderProvider).

If this inference process fails, either because the Java type was not known at run-time (e.g., due to Java's "erasure" of generic types) or because there was no default Coder registered, then the Coder should be specified manually by calling PCollection.setCoder(org.apache.beam.sdk.coders.Coder<T>) on the output PCollection. The Coder of every output PCollection must be determined one way or another before that output is used as an input to another PTransform, or before the enclosing Pipeline is run.
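For instance, a hedged sketch of both mechanisms described above (ExtractStringsFn, MyType, and MyTypeCoder are hypothetical stand-ins; StringUtf8Coder and CoderProviders are standard Beam classes):


 // Set an explicit Coder on an output PCollection.
 PCollection<String> output = input.apply(ParDo.of(new ExtractStringsFn()));
 output.setCoder(StringUtf8Coder.of());

 // Or extend the CoderRegistry so inference succeeds for a custom type.
 pipeline.getCoderRegistry().registerCoderProvider(
     CoderProviders.forCoder(TypeDescriptor.of(MyType.class), MyTypeCoder.of()));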

A small number of PTransforms are implemented natively by the Apache Beam SDK; such PTransforms simply return an output value as their apply implementation. The majority of PTransforms are implemented as composites of other PTransforms. Such a PTransform subclass typically just implements expand(InputT), computing its OutputT value from its InputT value. User programs are encouraged to use this mechanism to modularize their own code; an illustrative sketch follows. Such composite abstractions get their own name, and navigating through the composition hierarchy of PTransforms is supported by the monitoring interface. Examples of composite PTransforms can be found in the Beam SDK's transforms package and in the Beam examples module. From the caller's point of view, there is no distinction between a PTransform implemented natively and one implemented in terms of other PTransforms; both kinds of PTransform are invoked in the same way, using apply().
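For illustration, a sketch of a user-defined composite PTransform in the style of the WordCount example (the class and its internals are illustrative, not part of this API):


 public static class CountWords
     extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
   @Override
   public PCollection<KV<String, Long>> expand(PCollection<String> lines) {
     // A composite simply applies other transforms and returns their output.
     return lines
         .apply(FlatMapElements.into(TypeDescriptors.strings())
             .via((String line) -> Arrays.asList(line.split("\\s+"))))
         .apply(Count.perElement());
   }
 }

 // Invoked like any other PTransform:
 PCollection<KV<String, Long>> wordCounts = lines.apply(new CountWords());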

Note on Serialization

PTransform doesn't actually support serialization, despite implementing Serializable.

PTransform is marked Serializable solely because it is common for an anonymous DoFn instance to be created within the apply() method of a composite PTransform.

Each of those *Fns is Serializable, but unfortunately its instance state will contain a reference to the enclosing PTransform instance, so serializing the *Fn will also attempt to serialize the PTransform instance, even though the *Fn never references anything about the enclosing PTransform.

To allow such anonymous *Fns to be written conveniently, PTransform is marked as Serializable, and includes dummy writeObject() and readObject() operations that do not save or restore any state.
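For example, the common pattern that motivates this design looks like the following sketch (the anonymous DoFn is illustrative):


 public PCollection<String> expand(PCollection<String> input) {
   // The anonymous DoFn is Serializable and captures a reference to the
   // enclosing PTransform, so the PTransform must tolerate serialization.
   return input.apply(ParDo.of(new DoFn<String, String>() {
     @ProcessElement
     public void processElement(ProcessContext c) {
       c.output(c.element().trim());
     }
   }));
 }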


  • Constructor Details

    • PTransform

      protected PTransform()
    • PTransform

      protected PTransform(@Nullable String name)
  • Method Details

    • expand

      public abstract OutputT expand(InputT input)
      Override this method to specify how this PTransform should be expanded on the given InputT.

      NOTE: This method should not be called directly. Instead, the PTransform should be applied to the InputT using the apply method.

      Composite transforms, which are defined in terms of other transforms, should return the output of one of the composed transforms. Non-composite transforms, which do not apply any transforms internally, should return a new unbound output and register evaluators (via backend-specific registration methods).
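      For example (MyTransform is a hypothetical PTransform subclass):

       PCollection<Integer> out = in.apply(new MyTransform()); // correct: use apply
       // new MyTransform().expand(in);                        // wrong: do not call expand directly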

    • validate

      public void validate(@Nullable PipelineOptions options)
      Called before running the Pipeline to verify this transform is fully and correctly specified.

      By default, does nothing.

    • validate

      public void validate(@Nullable PipelineOptions options, Map<TupleTag<?>,PCollection<?>> inputs, Map<TupleTag<?>,PCollection<?>> outputs)
      Called before running the Pipeline to verify this transform, its inputs, and outputs are fully and correctly specified.

      By default, delegates to validate(PipelineOptions).

    • getAdditionalInputs

      public Map<TupleTag<?>,PValue> getAdditionalInputs()
      Returns all PValues that are consumed as inputs to this PTransform independent of the expansion of the PTransform within expand(PInput).

      For example, this can contain any side input consumed by this PTransform.
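      A hedged sketch of a composite that consumes a side input and reports it here (JoinWithLookup and LookupFn are hypothetical; PCollectionViews.toAdditionalInputs is a Beam utility):

       public static class JoinWithLookup
           extends PTransform<PCollection<String>, PCollection<String>> {
         private final PCollectionView<Map<String, String>> lookup;

         JoinWithLookup(PCollectionView<Map<String, String>> lookup) {
           this.lookup = lookup;
         }

         @Override
         public PCollection<String> expand(PCollection<String> input) {
           return input.apply(ParDo.of(new LookupFn(lookup)).withSideInputs(lookup));
         }

         @Override
         public Map<TupleTag<?>, PValue> getAdditionalInputs() {
           // Report the side input so the runner can track the dependency.
           return PCollectionViews.toAdditionalInputs(
               Collections.<PCollectionView<?>>singletonList(lookup));
         }
       }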

    • getName

      public String getName()
      Returns the transform name.

      This name is provided by the transform creator and is not required to be unique.

    • setResourceHints

      public PTransform<InputT,OutputT> setResourceHints(@NonNull ResourceHints resourceHints)
      Sets resource hints for the transform.
      Parameters:
      resourceHints - a ResourceHints instance.
      Returns:
      a reference to the same transform instance.

      For example:

      
       Pipeline p = ...
       ...
       p.apply(new SomeTransform().setResourceHints(ResourceHints.create().withMinRam("6 GiB")))
       ...
      
       
    • getResourceHints

      public ResourceHints getResourceHints()
      Returns resource hints set on the transform.
    • setDisplayData

      public PTransform<InputT,OutputT> setDisplayData(@NonNull List<DisplayData.ItemSpec<?>> displayData)
      Set display data for your PTransform.
      Parameters:
      displayData - a list of DisplayData.ItemSpec instances.
      Returns:
      a reference to the same transform instance.

      For example:

      
       Pipeline p = ...
       ...
       p.apply(new SomeTransform().setDisplayData(ImmutableList.of(DisplayData.item("userFn", userFn.getClass()))))
       ...
      
       
    • getAnnotations

      public Map<String,byte[]> getAnnotations()
      Returns annotations map to provide additional hints to the runner.
    • addAnnotation

      public PTransform<InputT,OutputT> addAnnotation(@NonNull String annotationType, byte @NonNull [] annotation)
      Adds an annotation to the transform. Annotations supply additional hints to the runner (see getAnnotations()).
    • toString

      public String toString()
      Overrides:
      toString in class Object
    • getKindString

      protected String getKindString()
      Returns the name to use by default for this PTransform (not including the names of any enclosing PTransforms).

      By default, returns the base name of this PTransform's class.

      The caller is responsible for ensuring that names of applied PTransforms are unique, e.g., by adding a uniquifying suffix when needed.

    • getDefaultOutputCoder

      @Deprecated protected Coder<?> getDefaultOutputCoder() throws CannotProvideCoderException
      Deprecated.
      Instead, the PTransform should explicitly call PCollection.setCoder(org.apache.beam.sdk.coders.Coder<T>) on the returned PCollection.
      Returns the default Coder to use for the output of this single-output PTransform.

      By default, always throws.

      Throws:
      CannotProvideCoderException - if no coder can be inferred
    • getDefaultOutputCoder

      @Deprecated protected Coder<?> getDefaultOutputCoder(InputT input) throws CannotProvideCoderException
      Deprecated.
      Instead, the PTransform should explicitly call PCollection.setCoder(org.apache.beam.sdk.coders.Coder<T>) on the returned PCollection.
      Returns the default Coder to use for the output of this single-output PTransform when applied to the given input.

      By default, always throws.

      Throws:
      CannotProvideCoderException - if none can be inferred.
    • getDefaultOutputCoder

      @Deprecated public <T> Coder<T> getDefaultOutputCoder(InputT input, PCollection<T> output) throws CannotProvideCoderException
      Deprecated.
      Instead, the PTransform should explicitly call PCollection.setCoder(org.apache.beam.sdk.coders.Coder<T>) on the returned PCollection.
      Returns the default Coder to use for the given output of this single-output PTransform when applied to the given input.

      By default, always throws.

      Throws:
      CannotProvideCoderException - if none can be inferred.
    • populateDisplayData

      public void populateDisplayData(DisplayData.Builder builder)
      Register display data for the given transform or component.

      populateDisplayData(DisplayData.Builder) is invoked by Pipeline runners to collect display data via DisplayData.from(HasDisplayData). Implementations may call super.populateDisplayData(builder) in order to register display data in the current namespace, but should otherwise use subcomponent.populateDisplayData(builder) to use the namespace of the subcomponent.

      By default, does not register any display data. Implementors may override this method to provide their own display data.

      Specified by:
      populateDisplayData in interface HasDisplayData
      Parameters:
      builder - The builder to populate with display data.
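      For example, a sketch of a typical override (the "pattern" item and field are illustrative):

       @Override
       public void populateDisplayData(DisplayData.Builder builder) {
         super.populateDisplayData(builder); // register inherited items in this namespace
         builder.add(DisplayData.item("pattern", pattern).withLabel("Filter pattern"));
       }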
    • compose

      public static <InputT extends PInput, OutputT extends POutput> PTransform<InputT,OutputT> compose(SerializableFunction<InputT,OutputT> fn)
      For a SerializableFunction<InputT, OutputT> fn, returns a PTransform given by applying fn.apply(v) to the input InputT.

      Allows users to define a concise composite transform using a Java 8 lambda expression. For example:

      
       PCollection<String> words = wordsAndErrors.apply(PTransform.compose(
         (PCollectionTuple input) -> {
           input.get(errorsTag).apply(new WriteErrorOutput());
           return input.get(wordsTag);
         }));
       
    • compose

      public static <InputT extends PInput, OutputT extends POutput> PTransform<InputT,OutputT> compose(String name, SerializableFunction<InputT,OutputT> fn)
      Like compose(SerializableFunction), but with a custom name.