Class PTransform<InputT extends PInput, OutputT extends POutput>
- Type Parameters:
InputT - the type of the input to this PTransform
OutputT - the type of the output of this PTransform
- All Implemented Interfaces:
Serializable, HasDisplayData
- Direct Known Subclasses:
AddFields.Inner
,AddUuidsTransform
,AmqpIO.Read
,AmqpIO.Write
,AnnotateText
,ApproximateCountDistinct.Globally
,ApproximateCountDistinct.PerKey
,ApproximateDistinct.GloballyDistinct
,ApproximateDistinct.PerKeyDistinct
,ApproximateUnique.Globally
,ApproximateUnique.PerKey
,AsJsons
,AsJsons.AsJsonsWithFailures
,AvroIO.Parse
,AvroIO.ParseAll
,AvroIO.ParseFiles
,AvroIO.Read
,AvroIO.ReadAll
,AvroIO.ReadFiles
,AvroIO.TypedWrite
,AvroIO.Write
,BeamJoinTransforms.JoinAsLookup
,BeamRowToBigtableMutation
,BeamSetOperatorRelBase
,BigQueryIO.Read
,BigQueryIO.TypedRead
,BigQueryIO.Write
,BigtableIO.Read
,BigtableIO.ReadChangeStream
,BigtableIO.Write
,BigtableIO.WriteWithResults
,BigtableRowToBeamRow
,BigtableRowToBeamRowFlat
,BoundedReadFromUnboundedSource
,CassandraIO.Read
,CassandraIO.ReadAll
,CassandraIO.Write
,Cast
,CdapIO.Read
,CdapIO.Write
,ClickHouseIO.Write
,CloudVision.AnnotateImagesFromBytes
,CloudVision.AnnotateImagesFromBytesWithContext
,CloudVision.AnnotateImagesFromGcsUri
,CloudVision.AnnotateImagesFromGcsUriWithContext
,CoGroup.ExpandCrossProduct
,CoGroup.Impl
,CoGroupByKey
,Combine.Globally
,Combine.GloballyAsSingletonView
,Combine.GroupedValues
,Combine.PerKey
,Combine.PerKeyWithHotKeyFanout
,CombineAsIterable
,ConsoleIO.Write.Unbound
,ContextualTextIO.Read
,ContextualTextIO.ReadFiles
,CosmosIO.Read
,Create.OfValueProvider
,Create.TimestampedValues
,Create.Values
,Create.WindowedValues
,CreateDataflowView
,CreateStream
,CreateStreamingSparkView
,CreateStreamingSparkView.CreateSparkPCollectionView
,CreateTables
,CsvIO.Write
,CsvIOParse
,DataflowGroupByKey
,DataframeTransform
,DataGeneratorPTransform
,DatastoreV1.DeleteEntity
,DatastoreV1.DeleteEntityWithSummary
,DatastoreV1.DeleteKey
,DatastoreV1.DeleteKeyWithSummary
,DatastoreV1.Read
,DatastoreV1.Write
,DatastoreV1.WriteWithSummary
,DeadLetteredTransform
,DebeziumIO.Read
,Deduplicate.KeyedValues
,Deduplicate.Values
,Deduplicate.WithRepresentativeValues
,DicomIO.ReadStudyMetadata
,Distinct
,Distinct.WithRepresentativeValues
,DLPDeidentifyText
,DLPInspectText
,DLPReidentifyText
,DropFields.Inner
,DynamoDBIO.Read
,DynamoDBIO.Write
,ElasticsearchIO.BulkIO
,ElasticsearchIO.DocToBulk
,ElasticsearchIO.Read
,ElasticsearchIO.Write
,EntityToRow
,ErrorHandler.PTransformErrorHandler.WriteErrorMetrics
,FhirIO.Deidentify
,FhirIO.ExecuteBundles
,FhirIO.Export
,FhirIO.Read
,FhirIO.Search
,FhirIO.Write
,FhirIOPatientEverything
,FileIO.Match
,FileIO.MatchAll
,FileIO.ReadMatches
,FileIO.Write
,FillGaps
,Filter
,Filter.Inner
,FirestoreV1.BatchGetDocuments
,FirestoreV1.BatchWriteWithDeadLetterQueue
,FirestoreV1.BatchWriteWithSummary
,FirestoreV1.ListCollectionIds
,FirestoreV1.ListDocuments
,FirestoreV1.PartitionQuery
,FirestoreV1.RunQuery
,FlatMapElements
,FlatMapElements.FlatMapWithFailures
,Flatten.Iterables
,Flatten.PCollections
,org.apache.beam.sdk.util.construction.ForwardingPTransform
,GenerateSequence
,GoogleAdsV19.Read
,GoogleAdsV19.ReadAll
,Group.AggregateCombiner
,Group.CombineGlobally
,Group.Global
,GroupByKey
,GroupIntoBatches
,GroupIntoBatches.WithShardedKey
,HadoopFormatIO.Read
,HadoopFormatIO.Write
,HBaseIO.Read
,HBaseIO.ReadAll
,HBaseIO.Write
,HBaseIO.WriteRowMutations
,HCatalogIO.Read
,HCatalogIO.Write
,HL7v2IO.HL7v2Read
,HL7v2IO.HL7v2Read.FetchHL7v2Message
,HL7v2IO.ListHL7v2Messages
,HL7v2IO.Read
,HL7v2IO.Read.FetchHL7v2Message
,HL7v2IO.Write
,IcebergIO.ReadRows
,IcebergIO.WriteRows
,Impulse
,InfluxDbIO.Read
,InfluxDbIO.Write
,JdbcIO.Read
,JdbcIO.ReadAll
,JdbcIO.ReadRows
,JdbcIO.ReadWithPartitions
,JdbcIO.Write
,JdbcIO.WriteVoid
,JdbcIO.WriteWithResults
,JmsIO.Read
,JmsIO.Write
,Join.FullOuterJoin
,Join.Impl
,Join.InnerJoin
,Join.LeftOuterJoin
,Join.RightOuterJoin
,JsonIO.Write
,JsonToRow.JsonToRowWithErrFn
,KafkaCommitOffset
,KafkaIO.Read
,KafkaIO.ReadSourceDescriptors
,KafkaIO.TypedWithoutMetadata
,KafkaIO.Write
,KafkaIO.WriteRecords
,Keys
,KinesisIO.Read
,KinesisIO.Write
,KinesisTransformRegistrar.KinesisReadToBytes
,KuduIO.Read
,KuduIO.Write
,KvSwap
,Managed.ManagedTransform
,MapElements
,MapElements.MapWithFailures
,MapKeys
,MapValues
,MongoDbGridFSIO.Read
,MongoDbGridFSIO.Write
,MongoDbIO.Read
,MongoDbIO.Write
,MongoDbTable.DocumentToRow
,MongoDbTable.RowToDocument
,MqttIO.Read
,MqttIO.Write
,Neo4jIO.ReadAll
,Neo4jIO.WriteUnwind
,OrderedEventProcessor
,ParDo.MultiOutput
,ParDo.SingleOutput
,ParquetIO.Parse
,ParquetIO.ParseFiles
,ParquetIO.Read
,ParquetIO.ReadFiles
,ParseJsons
,ParseJsons.ParseJsonsWithFailures
,Partition
,PAssert.DefaultConcludeTransform
,PAssert.GroupThenAssert
,PAssert.GroupThenAssertForSingleton
,PAssert.OneSideInputAssert
,PeriodicImpulse
,PeriodicSequence
,PrepareWrite
,ProtoFromBytes
,ProtoToBytes
,PubsubIO.Read
,PubsubIO.Write
,PubsubLiteWriteSchemaTransformProvider.SetUuidFromPubSubMessage
,PubsubUnboundedSink
,PubsubUnboundedSource
,PulsarIO.Read
,PulsarIO.Write
,PythonExternalTransform
,PythonMap
,RabbitMqIO.Read
,RabbitMqIO.Write
,Read.Bounded
,Read.Unbounded
,ReadAllViaFileBasedSourceTransform
,RecommendationAICreateCatalogItem
,RecommendationAIImportCatalogItems
,RecommendationAIImportUserEvents
,RecommendationAIPredict
,RecommendationAIWriteUserEvent
,RedisIO.Read
,RedisIO.ReadKeyPatterns
,RedisIO.Write
,RedisIO.WriteStreams
,Redistribute.RedistributeArbitrarily
,Redistribute.RedistributeByKey
,Regex.AllMatches
,Regex.Find
,Regex.FindAll
,Regex.FindKV
,Regex.FindName
,Regex.FindNameKV
,Regex.Matches
,Regex.MatchesKV
,Regex.MatchesName
,Regex.MatchesNameKV
,Regex.ReplaceAll
,Regex.ReplaceFirst
,Regex.Split
,ReifyAsIterable
,RenameFields.Inner
,RequestResponseIO
,Reshuffle
,Reshuffle.ViaRandomKey
,RowToEntity
,RunInference
,SchemaTransform
,Select.Fields
,Select.Flattened
,SingleStoreIO.Read
,SingleStoreIO.ReadWithPartitions
,SingleStoreIO.Write
,SketchFrequencies.GlobalSketch
,SketchFrequencies.PerKeySketch
,SnowflakeIO.Read
,SnowflakeIO.Write
,SnsIO.Write
,SolaceIO.Read
,SolaceIO.Write
,SolrIO.Read
,SolrIO.ReadAll
,SolrIO.Write
,SortValues
,SpannerIO.CreateTransaction
,SpannerIO.Read
,SpannerIO.ReadAll
,SpannerIO.ReadChangeStream
,SpannerIO.Write
,SpannerIO.WriteGrouped
,SparkReceiverIO.Read
,SplunkIO.Write
,SqlTransform
,SqsIO.Read
,SqsIO.Write
,SqsIO.WriteBatches
,StorageApiConvertMessages
,StorageApiLoads
,StorageApiWriteRecordsInconsistent
,StorageApiWritesShardedRecords
,StorageApiWriteUnshardedRecords
,StreamingInserts
,StreamingWriteTables
,SubscribeTransform
,TDigestQuantiles.GlobalDigest
,TDigestQuantiles.PerKeyDigest
,Tee
,TestStream
,TextIO.Read
,TextIO.ReadAll
,TextIO.ReadFiles
,TextIO.TypedWrite
,TextIO.Write
,TextTableProvider.CsvToRow
,TextTableProvider.LinesReadConverter
,TextTableProvider.LinesWriteConverter
,TFRecordIO.Read
,TFRecordIO.ReadFiles
,TFRecordIO.Write
,ThriftIO.ReadFiles
,TikaIO.Parse
,TikaIO.ParseFiles
,ToJson
,UuidDeduplicationTransform
,Values
,VideoIntelligence.AnnotateVideoFromBytes
,VideoIntelligence.AnnotateVideoFromBytesWithContext
,VideoIntelligence.AnnotateVideoFromUri
,VideoIntelligence.AnnotateVideoFromURIWithContext
,View.AsIterable
,View.AsList
,View.AsMap
,View.AsMultimap
,View.AsSingleton
,View.CreatePCollectionView
,Wait.OnSignal
,Watch.Growth
,Window
,Window.Assign
,WithKeys
,WithTimestamps
,WordCount.CountWords
,WriteFiles
,XmlIO.Read
,XmlIO.ReadFiles
,XmlIO.Write
,YamlTransform
PTransform<InputT, OutputT> is an operation that takes an InputT (some subtype of PInput) and produces an OutputT (some subtype of POutput).
Common PTransforms include root PTransforms like TextIO.Read and Create, processing and conversion operations like ParDo, GroupByKey, CoGroupByKey, Combine, and Count, and outputting PTransforms like TextIO.Write. Users also define their own application-specific composite PTransforms.
Each PTransform<InputT, OutputT> has a single InputT type and a single OutputT type. Many PTransforms conceptually transform one input value to one output value, and in this case InputT and OutputT are typically instances of PCollection. A root PTransform conceptually has no input; in this case, conventionally a PBegin object produced by calling Pipeline.begin() is used as the input. An outputting PTransform conceptually has no output; in this case, conventionally PDone is used as its output type.
Some PTransforms conceptually have multiple inputs and/or outputs; in these cases special "bundling" classes like PCollectionList and PCollectionTuple are used to combine multiple values into a single bundle for passing into or returning from the PTransform.
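For instance, a ParDo with one main and one additional output returns its results bundled in a PCollectionTuple, from which each PCollection is extracted by tag. In this sketch, MyDoFn and the tag names are hypothetical; ParDo.withOutputTags and TupleTagList are the standard Beam API:

```java
// Sketch of a multi-output ParDo; MyDoFn and the tag names are illustrative.
final TupleTag<String> mainTag = new TupleTag<String>() {};
final TupleTag<String> errorTag = new TupleTag<String>() {};

PCollectionTuple results =
    input.apply(
        ParDo.of(new MyDoFn(mainTag, errorTag))
            .withOutputTags(mainTag, TupleTagList.of(errorTag)));

// Individual PCollections are then extracted from the bundle by tag.
PCollection<String> mainOutput = results.get(mainTag);
PCollection<String> errorOutput = results.get(errorTag);
```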
A PTransform<InputT, OutputT> is invoked by calling apply() on its InputT, returning its OutputT. Calls can be chained to concisely create linear pipeline segments. For example:
PCollection<T1> pc1 = ...;
PCollection<T2> pc2 =
pc1.apply(ParDo.of(new MyDoFn<T1,KV<K,V>>()))
.apply(GroupByKey.<K, V>create())
.apply(Combine.perKey(new MyKeyedCombineFn<K,V>()))
.apply(ParDo.of(new MyDoFn2<KV<K,V>,T2>()));
PTransform operations have unique names, which are used by the system when explaining what's going on during optimization and execution. Each PTransform gets a system-provided default name, but it's a good practice to specify a more informative explicit name when applying the transform. For example:
...
.apply("Step1", ParDo.of(new MyDoFn3()))
...
Each PCollection output produced by a PTransform, either directly or within a "bundling" class, automatically gets its own name derived from the name of its producing PTransform.
Each PCollection output produced by a PTransform also records a Coder
that specifies how the elements of that PCollection are to be
encoded as a byte string, if necessary. The PTransform may provide a default Coder for any of its
outputs, for instance by deriving it from the PTransform input's Coder. If the PTransform does
not specify the Coder for an output PCollection, the system will attempt to infer a Coder for it,
based on what's known at run-time about the Java type of the output's elements. The enclosing
Pipeline
's CoderRegistry
(accessible via Pipeline.getCoderRegistry()
) defines the mapping from Java types to the default Coder to use, for
a standard set of Java types; users can extend this mapping for additional types, via CoderRegistry.registerCoderProvider(org.apache.beam.sdk.coders.CoderProvider)
. If this inference process fails,
either because the Java type was not known at run-time (e.g., due to Java's "erasure" of generic
types) or there was no default Coder registered, then the Coder should be specified manually by
calling PCollection.setCoder(org.apache.beam.sdk.coders.Coder<T>)
on the output PCollection. The Coder of every output
PCollection must be determined one way or another before that output is used as an input to
another PTransform, or before the enclosing Pipeline is run.
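When inference fails, the Coder can be supplied directly on the output PCollection as described above. In this sketch, ExtractCountsFn is a hypothetical DoFn whose KV output type is erased at run time; KvCoder, StringUtf8Coder, and VarLongCoder are standard Beam coders:

```java
// Explicitly set a Coder when it cannot be inferred from the DoFn's
// erased generic output type. ExtractCountsFn is illustrative.
PCollection<KV<String, Long>> counts =
    input
        .apply(ParDo.of(new ExtractCountsFn()))
        .setCoder(KvCoder.of(StringUtf8Coder.of(), VarLongCoder.of()));
```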
A small number of PTransforms are implemented natively by the Apache Beam SDK; such
PTransforms simply return an output value as their apply implementation. The majority of
PTransforms are implemented as composites of other PTransforms. Such a PTransform subclass
typically just implements expand(InputT), computing its OutputT value from its InputT
value. User programs are encouraged to use this mechanism to modularize their own code. Such
composite abstractions get their own name, and navigating through the composition hierarchy of
PTransforms is supported by the monitoring interface. Examples of composite PTransforms can be
found in this directory and in examples. From the caller's point of view, there is no distinction
between a PTransform implemented natively and one implemented in terms of other PTransforms; both
kinds of PTransform are invoked in the same way, using apply()
.
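A minimal composite might look like the following sketch; the class and step names are illustrative, but the pattern is the one described above: override expand(InputT) and build the output by applying other transforms.

```java
// A user-defined composite PTransform: its expand() just applies other
// transforms and returns the final output. Names here are illustrative.
public class ExtractAndCountWords
    extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
  @Override
  public PCollection<KV<String, Long>> expand(PCollection<String> lines) {
    return lines
        .apply("ExtractWords",
            FlatMapElements.into(TypeDescriptors.strings())
                .via((String line) -> Arrays.asList(line.split("\\W+"))))
        .apply("CountWords", Count.perElement());
  }
}

// Applied like any built-in transform:
// PCollection<KV<String, Long>> counts = lines.apply(new ExtractAndCountWords());
```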
Note on Serialization
PTransform
doesn't actually support serialization, despite implementing
Serializable
.
PTransform is marked Serializable solely because it is common for an anonymous DoFn instance to be created within an apply() method of a composite PTransform.
Each of those *Fns is Serializable, but unfortunately its instance state will contain a reference to the enclosing PTransform instance, and so serializing the *Fn will also attempt to serialize the PTransform instance, even though the *Fn instance never references anything about the enclosing PTransform.
To allow such anonymous *Fns to be written conveniently, PTransform is marked as Serializable, and includes dummy writeObject() and readObject() operations that do not save or restore any state.
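Concretely, an anonymous inner DoFn declared inside expand() carries an implicit reference to the enclosing composite, which is what drags the PTransform into serialization. The transform below is an illustrative sketch of that situation:

```java
// An anonymous (inner) DoFn holds an implicit reference to the enclosing
// PTransform, pulling it into serialization. Names are illustrative.
public class MyComposite
    extends PTransform<PCollection<String>, PCollection<Integer>> {
  @Override
  public PCollection<Integer> expand(PCollection<String> input) {
    return input.apply(ParDo.of(
        new DoFn<String, Integer>() {  // captures MyComposite.this
          @ProcessElement
          public void processElement(ProcessContext c) {
            c.output(c.element().length());
          }
        }));
  }
}
```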
-
Field Summary
Fields
Modifier and Type    Field    Description
protected @NonNull List<DisplayData.ItemSpec<?>>    displayData
protected String    name    The base name of this PTransform, e.g., from defaults, or null if not yet assigned.
protected @NonNull ResourceHints    resourceHints
-
Constructor Summary
Constructors -
Method Summary
Modifier and Type    Method    Description
PTransform<InputT, OutputT>    addAnnotation(@NonNull String annotationType, byte @NonNull [] annotation)
static <InputT extends PInput, OutputT extends POutput> PTransform<InputT, OutputT>    compose(String name, SerializableFunction<InputT, OutputT> fn)
    Like compose(SerializableFunction), but with a custom name.
static <InputT extends PInput, OutputT extends POutput> PTransform<InputT, OutputT>    compose(SerializableFunction<InputT, OutputT> fn)
    For a SerializableFunction<InputT, OutputT> fn, returns a PTransform given by applying fn.apply(v) to the input PCollection<InputT>.
abstract OutputT    expand(InputT input)
    Override this method to specify how this PTransform should be expanded on the given InputT.
Map<TupleTag<?>, PValue>    getAdditionalInputs()
    Returns all PValues that are consumed as inputs to this PTransform that are independent of the expansion of the PTransform within expand(PInput).
Map<String, byte[]>    getAnnotations()
    Returns annotations map to provide additional hints to the runner.
protected Coder<?>    getDefaultOutputCoder()
    Deprecated.
protected Coder<?>    getDefaultOutputCoder(InputT input)
    Deprecated. Instead, the PTransform should explicitly call PCollection.setCoder(org.apache.beam.sdk.coders.Coder<T>) on the returned PCollection.
<T> Coder<T>    getDefaultOutputCoder(InputT input, PCollection<T> output)
    Deprecated. Instead, the PTransform should explicitly call PCollection.setCoder(org.apache.beam.sdk.coders.Coder<T>) on the returned PCollection.
protected String    getKindString()
    Returns the name to use by default for this PTransform (not including the names of any enclosing PTransforms).
String    getName()
    Returns the transform name.
ResourceHints    getResourceHints()
    Returns resource hints set on the transform.
void    populateDisplayData(DisplayData.Builder builder)
    Register display data for the given transform or component.
PTransform<InputT, OutputT>    setDisplayData(@NonNull List<DisplayData.ItemSpec<?>> displayData)
    Set display data for your PTransform.
PTransform<InputT, OutputT>    setResourceHints(@NonNull ResourceHints resourceHints)
    Sets resource hints for the transform.
String    toString()
void    validate(@Nullable PipelineOptions options)
    Called before running the Pipeline to verify this transform is fully and correctly specified.
void    validate(@Nullable PipelineOptions options, Map<TupleTag<?>, PCollection<?>> inputs, Map<TupleTag<?>, PCollection<?>> outputs)
    Called before running the Pipeline to verify this transform, its inputs, and outputs are fully and correctly specified.
-
Field Details
-
name
The base name of this PTransform, e.g., from defaults, or null if not yet assigned. -
resourceHints
-
annotations
-
displayData
-
-
Constructor Details
-
PTransform
protected PTransform() -
PTransform
-
-
Method Details
-
expand
Override this method to specify how this PTransform should be expanded on the given InputT.
NOTE: This method should not be called directly. Instead, the PTransform should be applied to the InputT using the apply method.
Composite transforms, which are defined in terms of other transforms, should return the output of one of the composed transforms. Non-composite transforms, which do not apply any transforms internally, should return a new unbound output and register evaluators (via backend-specific registration methods).
-
validate
Called before running the Pipeline to verify this transform is fully and correctly specified.
By default, does nothing.
-
validate
public void validate(@Nullable PipelineOptions options, Map<TupleTag<?>, PCollection<?>> inputs, Map<TupleTag<?>, PCollection<?>> outputs)
Called before running the Pipeline to verify this transform, its inputs, and outputs are fully and correctly specified.
By default, delegates to validate(PipelineOptions). -
getAdditionalInputs
Returns all PValues that are consumed as inputs to this PTransform that are independent of the expansion of the PTransform within expand(PInput).
For example, this can contain any side input consumed by this PTransform. -
getName
Returns the transform name.
This name is provided by the transform creator and is not required to be unique.
-
setResourceHints
Sets resource hints for the transform.
- Parameters:
resourceHints - a ResourceHints instance.
- Returns:
- a reference to the same transform instance.
For example:
Pipeline p = ...;
...
p.apply(new SomeTransform().setResourceHints(ResourceHints.create().withMinRam("6 GiB")));
...
-
getResourceHints
Returns resource hints set on the transform. -
setDisplayData
public PTransform<InputT, OutputT> setDisplayData(@NonNull List<DisplayData.ItemSpec<?>> displayData)
Set display data for your PTransform.
- Parameters:
displayData - a list of DisplayData.ItemSpec instances.
- Returns:
- a reference to the same transform instance.
For example:
Pipeline p = ...;
...
p.apply(new SomeTransform().setDisplayData(
    ImmutableList.of(DisplayData.item("userFn", userFn.getClass()))));
...
-
getAnnotations
Returns annotations map to provide additional hints to the runner. -
addAnnotation
-
toString
-
getKindString
Returns the name to use by default for this PTransform (not including the names of any enclosing PTransforms).
By default, returns the base name of this PTransform's class.
The caller is responsible for ensuring that names of applied PTransforms are unique, e.g., by adding a uniquifying suffix when needed. -
getDefaultOutputCoder
getDefaultOutputCoder
Deprecated. Instead, the PTransform should explicitly call PCollection.setCoder(org.apache.beam.sdk.coders.Coder<T>) on the returned PCollection.
Returns the default Coder to use for the output of this single-output PTransform.
By default, always throws.
- Throws:
CannotProvideCoderException
- if no coder can be inferred
-
getDefaultOutputCoder
@Deprecated protected Coder<?> getDefaultOutputCoder(InputT input) throws CannotProvideCoderException
Deprecated. Instead, the PTransform should explicitly call PCollection.setCoder(org.apache.beam.sdk.coders.Coder<T>) on the returned PCollection.
Returns the default Coder to use for the output of this single-output PTransform when applied to the given input.
By default, always throws.
- Throws:
CannotProvideCoderException
- if none can be inferred.
-
getDefaultOutputCoder
@Deprecated public <T> Coder<T> getDefaultOutputCoder(InputT input, PCollection<T> output) throws CannotProvideCoderException
Deprecated. Instead, the PTransform should explicitly call PCollection.setCoder(org.apache.beam.sdk.coders.Coder<T>) on the returned PCollection.
Returns the default Coder to use for the given output of this single-output PTransform when applied to the given input.
By default, always throws.
- Throws:
CannotProvideCoderException
- if none can be inferred.
-
populateDisplayData
Register display data for the given transform or component.
populateDisplayData(DisplayData.Builder) is invoked by Pipeline runners to collect display data via DisplayData.from(HasDisplayData). Implementations may call super.populateDisplayData(builder) in order to register display data in the current namespace, but should otherwise use subcomponent.populateDisplayData(builder) to use the namespace of the subcomponent.
By default, does not register any display data. Implementors may override this method to provide their own display data.
- Specified by:
populateDisplayData
in interfaceHasDisplayData
- Parameters:
builder
- The builder to populate with display data.- See Also:
-
compose
public static <InputT extends PInput, OutputT extends POutput> PTransform<InputT, OutputT> compose(SerializableFunction<InputT, OutputT> fn)
For a SerializableFunction<InputT, OutputT> fn, returns a PTransform given by applying fn.apply(v) to the input PCollection<InputT>.
Allows users to define a concise composite transform using a Java 8 lambda expression. For example:
PCollection<String> words = wordsAndErrors.apply(
    (PCollectionTuple input) -> {
      input.get(errorsTag).apply(new WriteErrorOutput());
      return input.get(wordsTag);
    });
-
compose
public static <InputT extends PInput, OutputT extends POutput> PTransform<InputT, OutputT> compose(String name, SerializableFunction<InputT, OutputT> fn)
Like compose(SerializableFunction), but with a custom name.