Package org.apache.beam.sdk.transforms
@DefaultAnnotation(org.checkerframework.checker.nullness.qual.NonNull.class)
package org.apache.beam.sdk.transforms
Defines
PTransform
s for transforming data in a pipeline.
A PTransform
is an operation that takes an
InputT
(some subtype of PInput
) and produces an
OutputT
(some subtype of POutput
).
Common PTransforms include root PTransforms like TextIO.Read
and Create
, processing and conversion operations like
ParDo
, GroupByKey
,
CoGroupByKey
, Combine
, and Count
, and
outputting PTransforms like TextIO.Write
.
New PTransforms can be created by composing existing PTransforms. Most PTransforms in this package are composites, and users can also create composite PTransforms for their own application-specific logic.
-
ClassDescription
PTransform
s for getting an idea of aPCollection
's data distribution using approximateN
-tiles (e.g.ApproximateQuantiles.ApproximateQuantilesCombineFn<T,ComparatorT extends Comparator<T> & Serializable> TheApproximateQuantilesCombineFn
combiner gives an idea of the distribution of a collection of values using approximateN
-tiles.Deprecated.CombineFn
that computes an estimate of the number of distinct values that were combined.A heap utility class to efficiently track the largest added elements.PTransform
for estimating the number of distinct elements in aPCollection
.PTransform
for estimating the number of distinct values associated with each key in aPCollection
ofKV
s.PTransform
s for combiningPCollection
elements globally and per-key.Combine.AccumulatingCombineFn<InputT,AccumT extends Combine.AccumulatingCombineFn.Accumulator<InputT, AccumT, OutputT>, OutputT> ACombineFn
that uses a subclass ofCombine.AccumulatingCombineFn.Accumulator
as its accumulator type.Combine.AccumulatingCombineFn.Accumulator<InputT,AccumT, OutputT> The type of mutable accumulator values used by thisAccumulatingCombineFn
.An abstract subclass ofCombine.CombineFn
for implementing combiners that are more easily and efficiently expressed as binary operations ondouble
s.An abstract subclass ofCombine.CombineFn
for implementing combiners that are more easily expressed as binary operations.An abstract subclass ofCombine.CombineFn
for implementing combiners that are more easily and efficiently expressed as binary operations onint
sAn abstract subclass ofCombine.CombineFn
for implementing combiners that are more easily and efficiently expressed as binary operations onlong
s.Combine.CombineFn<InputT extends @Nullable Object,AccumT extends @Nullable Object, OutputT extends @Nullable Object> ACombineFn<InputT, AccumT, OutputT>
specifies how to combine a collection of input values of typeInputT
into a single output value of typeOutputT
.Combine.Globally<InputT,OutputT> Combine.Globally<InputT, OutputT>
takes aPCollection<InputT>
and returns aPCollection<OutputT>
whose elements are the result of combining all the elements in each window of the inputPCollection
, using a specifiedCombineFn<InputT, AccumT, OutputT>
.Combine.GloballyAsSingletonView<InputT,OutputT> Combine.GloballyAsSingletonView<InputT, OutputT>
takes aPCollection<InputT>
and returns aPCollectionView<OutputT>
whose elements are the result of combining all the elements in each window of the inputPCollection
, using a specifiedCombineFn<InputT, AccumT, OutputT>
.Combine.GroupedValues<K,InputT, OutputT> GroupedValues<K, InputT, OutputT>
takes aPCollection<KV<K, Iterable<InputT>>>
, such as the result ofGroupByKey
, applies a specifiedCombineFn<InputT, AccumT, OutputT>
to each of the inputKV<K, Iterable<InputT>>
elements to produce a combined outputKV<K, OutputT>
element, and returns aPCollection<KV<K, OutputT>>
containing all the combined output elements.Holds a single value value of typeV
which may or may not be present.Combine.PerKey<K,InputT, OutputT> PerKey<K, InputT, OutputT>
takes aPCollection<KV<K, InputT>>
, groups it by key, applies a combining function to theInputT
values associated with each key to produce a combinedOutputT
value, and returns aPCollection<KV<K, OutputT>>
representing a map from each distinct key of the inputPCollection
to the corresponding combined value.Combine.PerKeyWithHotKeyFanout<K,InputT, OutputT> LikeCombine.PerKey
, but sharding the combining of hot keys.Deprecated.For internal use only; no backwards-compatibility guarantees.CombineFnBase.GlobalCombineFn<InputT,AccumT, OutputT> For internal use only; no backwards-compatibility guarantees.Static utility methods that create combine function instances.A tuple of outputs produced by a composed combine functions.A builder class to construct a composedCombineFnBase.GlobalCombineFn
.CombineFns.ComposedCombineFn<DataT>A composedCombine.CombineFn
that applies multipleCombineFns
.A composedCombineWithContext.CombineFnWithContext
that applies multipleCombineFnWithContexts
.This class contains combine functions that have access toPipelineOptions
and side inputs throughCombineWithContext.Context
.CombineWithContext.CombineFnWithContext<InputT,AccumT, OutputT> A combine function that has access toPipelineOptions
and side inputs throughCombineWithContext.Context
.Information accessible to all methods inCombineFnWithContext
andKeyedCombineFnWithContext
.An internal interface for signaling that aGloballyCombineFn
or aPerKeyCombineFn
needs to accessCombineWithContext.Context
.Contextful<ClosureT>Pair of a bit of user code (a "closure") and theRequirements
needed to run it.Contextful.Fn<InputT,OutputT> A function from an input to an output that may additionally accessContextful.Fn.Context
when computing the result.An accessor for additional capabilities available inContextful.Fn.apply(InputT, org.apache.beam.sdk.transforms.Contextful.Fn.Context)
.PTransforms
to count the elements in aPCollection
.Create<T>Create<T>
takes a collection of elements of typeT
known when the pipeline is constructed and returns aPCollection<T>
containing the elements.APTransform
that creates aPCollection
whose elements have associated timestamps.APTransform
that creates aPCollection
from a set of in-memory objects.APTransform
that creates aPCollection
whose elements have associated windowing metadata.A set ofPTransform
s which deduplicate input records over a time domain and threshold.Deduplicates keyed values using the key over a specified time domain and threshold.Deduplicates values over a specified time domain and threshold.APTransform
that uses aSerializableFunction
to obtain a representative value for each input element used for deduplication.Distinct<T>Distinct<T>
takes aPCollection<T>
and returns aPCollection<T>
that has all distinct elements of the input.ADistinct
PTransform
that uses aSerializableFunction
to obtain a representative value for each input element.The argument toParDo
providing the code to use to process elements of the inputPCollection
.Annotation for declaring that a state parameter is always fetched.Annotation on a splittableDoFn
specifying that theDoFn
performs a bounded amount of work per input element, so applying it to a boundedPCollection
will produce also a boundedPCollection
.A parameter that is accessible during@StartBundle
,@ProcessElement
and@FinishBundle
that allows the caller to register a callback that will be invoked after the bundle has been successfully completed and the runner has commit the output.An instance of a function that will be invoked after bundle finalization.Parameter annotation for the input element forDoFn.ProcessElement
,DoFn.GetInitialRestriction
,DoFn.GetSize
,DoFn.SplitRestriction
,DoFn.GetInitialWatermarkEstimatorState
,DoFn.NewWatermarkEstimator
, andDoFn.NewTracker
methods.Annotation for specifying specific fields that are accessed in a Schema PCollection.Annotation for the method to use to finish processing a batch of elements.Annotation for the method that maps an element to an initial restriction for a splittableDoFn
.Annotation for the method that maps an element and restriction to initial watermark estimator state for a splittableDoFn
.Annotation for the method that returns the coder to use for the restriction of a splittableDoFn
.Annotation for the method that returns the corresponding size for an element and restriction pair.Annotation for the method that returns the coder to use for the watermark estimator state of a splittableDoFn
.Parameter annotation for dereferencing input element key inKV
pair.Receives tagged output for a multi-output function.Annotation for the method that creates a newRestrictionTracker
for the restriction of a splittableDoFn
.Annotation for the method that creates a newWatermarkEstimator
for the watermark state of a splittableDoFn
.Annotation for registering a callback for a timer.Annotation for registering a callback for a timerFamily.Annotation for the method to use for performing actions on window expiration.Receives values of the given type.When used as a return value ofDoFn.ProcessElement
, indicates whether there is more work to be done for the current element.Annotation for the method to use for processing elements.Annotation that may be added to aDoFn.ProcessElement
,DoFn.OnTimer
, orDoFn.OnWindowExpiration
method to indicate that the runner must ensure that the observable contents of the inputPCollection
or mutable state must be stable upon retries.Annotation that may be added to aDoFn.ProcessElement
method to indicate that the runner must ensure that the observable contents of the inputPCollection
is sorted by time, in ascending order.Parameter annotation for the restriction forDoFn.GetSize
,DoFn.SplitRestriction
,DoFn.GetInitialWatermarkEstimatorState
,DoFn.NewWatermarkEstimator
, andDoFn.NewTracker
methods.Annotation for the method to use to prepare an instance for processing bundles of elements.Parameter annotation for the SideInput for aDoFn.ProcessElement
method.Annotation for the method that splits restriction of a splittableDoFn
into multiple parts to be processed in parallel.Annotation for the method to use to prepare an instance for processing a batch of elements.Annotation for declaring and dereferencing state cells.Annotation for the method to use to clean up this instance before it is discarded.Parameter annotation for the TimerMap for aDoFn.ProcessElement
method.Annotation for declaring and dereferencing timers.Parameter annotation for the input element timestamp forDoFn.ProcessElement
,DoFn.GetInitialRestriction
,DoFn.GetSize
,DoFn.SplitRestriction
,DoFn.GetInitialWatermarkEstimatorState
,DoFn.NewWatermarkEstimator
, andDoFn.NewTracker
methods.Annotation for the method that truncates the restriction of a splittableDoFn
into a bounded one.Annotation on a splittableDoFn
specifying that theDoFn
performs an unbounded amount of work per input element, so applying it to a boundedPCollection
will produce an unboundedPCollection
.Parameter annotation for the watermark estimator state for theDoFn.NewWatermarkEstimator
method.CommonDoFn.OutputReceiver
andDoFn.MultiOutputReceiver
classes.Represents information about how a DoFn extracts schemas.The builder object.DoFnTester<InputT,OutputT> Deprecated.UseTestPipeline
with theDirectRunner
.Deprecated.UseTestPipeline
with theDirectRunner
.ExternalTransformBuilder<ConfigT,InputT extends PInput, OutputT extends POutput> An interface for building a transform from an externally provided configuration.Filter<T>PTransform
s for filtering from aPCollection
the elements satisfying a predicate, or satisfying an inequality with a given value based on the elements' natural ordering.FlatMapElements<InputT,OutputT> PTransform
s for mapping a simple function that returns iterables over the elements of aPCollection
and merging the results.FlatMapElements.FlatMapWithFailures<InputT,OutputT, FailureT> APTransform
that adds exception handling toFlatMapElements
.Flatten<T>
takes multiplePCollection<T>
s bundled into aPCollectionList<T>
and returns a singlePCollection<T>
containing all the elements in all the inputPCollection
s.FlattenIterables<T>
takes aPCollection<Iterable<T>>
and returns aPCollection<T>
that contains all the elements from each iterable.APTransform
that flattens aPCollectionList
into aPCollection
containing all the elements of all thePCollection
s in its input.GroupByKey<K,V> GroupByKey<K, V>
takes aPCollection<KV<K, V>>
, groups the values by key and windows, and returns aPCollection<KV<K, Iterable<V>>>
representing a map from each distinct key and window of the inputPCollection
to anIterable
over all the values associated with that key in the input per window.GroupIntoBatches<K,InputT> APTransform
that batches inputs to a desired batch size.GroupIntoBatches.BatchingParams<InputT>Wrapper class for batching parameters supplied by users.For internal use only; no backwards-compatibility guarantees.InferableFunction<InputT,OutputT> AProcessFunction
which is not a functional interface.The result of aJsonToRow.withExceptionReporting(Schema)
transform.Keys<K>Keys<K>
takes aPCollection
ofKV<K, V>
s and returns aPCollection<K>
of the keys.KvSwap<K,V> KvSwap<K, V>
takes aPCollection<KV<K, V>>
and returns aPCollection<KV<V, K>>
, where all the keys and values have been swapped.MapElements<InputT,OutputT> PTransform
s for mapping a simple function over the elements of aPCollection
.MapElements.MapWithFailures<InputT,OutputT, FailureT> APTransform
that adds exception handling toMapElements
.MapKeys<K1,K2, V> MapKeys
maps aSerializableFunction<K1,K2>
over keys of aPCollection<KV<K1,V>>
and returns aPCollection<KV<K2, V>>
.MapValues<K,V1, V2> MapValues
maps aSerializableFunction<V1,V2>
over values of aPCollection<KV<K,V1>>
and returns aPCollection<KV<K, V2>>
.For internal use only; no backwards-compatibility guarantees.For internal use only; no backwards-compatibility guarantees.Represents thePrimitiveViewT
supplied to theViewFn
when it declares to use theiterable materialization
.Represents thePrimitiveViewT
supplied to theViewFn
when it declares to use themultimap materialization
.PTransform
s for computing the maximum of the elements in aPCollection
, or the maximum of the values associated with each key in aPCollection
ofKV
s.PTransform
s for computing the arithmetic mean (a.k.a.PTransform
s for computing the minimum of the elements in aPCollection
, or the minimum of the values associated with each key in aPCollection
ofKV
s.ParDo
is the core element-wise transform in Apache Beam, invoking a user-specified function on each of the elements of the inputPCollection
to produce zero or more output elements, all of which are collected into the outputPCollection
.ParDo.MultiOutput<InputT,OutputT> APTransform
that, when applied to aPCollection<InputT>
, invokes a user-specifiedDoFn<InputT, OutputT>
on all its elements, which can emit elements to any of thePTransform
's outputPCollection
s, which are bundled into a resultPCollectionTuple
.ParDo.SingleOutput<InputT,OutputT> APTransform
that, when applied to aPCollection<InputT>
, invokes a user-specifiedDoFn<InputT, OutputT>
on all its elements, with all its outputs collected into an outputPCollection<OutputT>
.Partition<T>Partition
takes aPCollection<T>
and aPartitionFn
, uses thePartitionFn
to split the elements of the inputPCollection
intoN
partitions, and returns aPCollectionList<T>
that bundlesN
PCollection<T>
s containing the split elements.A function object that chooses an output partition for an element.A function object that chooses an output partition for an element.APTransform
which produces a sequence of elements at fixed runtime intervals.APTransform
which generates a sequence of timestamped elements at given runtime intervals.ProcessFunction<InputT,OutputT> A function that computes an output value of typeOutputT
from an input value of typeInputT
and isSerializable
.A family ofPTransforms
that returns aPCollection
equivalent to its input but functions as an operational hint to a runner that redistributing the data in some way is likely useful.Noop transform that hints to the runner to try to redistribute the work evenly, or via whatever clever strategy the runner comes up with.Registers translators for the Redistribute family of transforms.PTransform
s to use Regular Expressions to process elements in aPCollection
.Regex.MatchesName<String>
takes aPCollection<String>
and returns aPCollection<List<String>>
representing the value extracted from all the Regex groups of the inputPCollection
to the number of times that element occurs in the input.Regex.Find<String>
takes aPCollection<String>
and returns aPCollection<String>
representing the value extracted from the Regex groups of the inputPCollection
to the number of times that element occurs in the input.Regex.Find<String>
takes aPCollection<String>
and returns aPCollection<List<String>>
representing the value extracted from the Regex groups of the inputPCollection
to the number of times that element occurs in the input.Regex.MatchesKV<KV<String, String>>
takes aPCollection<String>
and returns aPCollection<KV<String, String>>
representing the key and value extracted from the Regex groups of the inputPCollection
to the number of times that element occurs in the input.Regex.Find<String>
takes aPCollection<String>
and returns aPCollection<String>
representing the value extracted from the Regex groups of the inputPCollection
to the number of times that element occurs in the input.Regex.MatchesKV<KV<String, String>>
takes aPCollection<String>
and returns aPCollection<KV<String, String>>
representing the key and value extracted from the Regex groups of the inputPCollection
to the number of times that element occurs in the input.Regex.Matches<String>
takes aPCollection<String>
and returns aPCollection<String>
representing the value extracted from the Regex groups of the inputPCollection
to the number of times that element occurs in the input.Regex.MatchesKV<KV<String, String>>
takes aPCollection<String>
and returns aPCollection<KV<String, String>>
representing the key and value extracted from the Regex groups of the inputPCollection
to the number of times that element occurs in the input.Regex.MatchesName<String>
takes aPCollection<String>
and returns aPCollection<String>
representing the value extracted from the Regex groups of the inputPCollection
to the number of times that element occurs in the input.Regex.MatchesNameKV<KV<String, String>>
takes aPCollection<String>
and returns aPCollection<KV<String, String>>
representing the key and value extracted from the Regex groups of the inputPCollection
to the number of times that element occurs in the input.Regex.ReplaceAll<String>
takes aPCollection<String>
and returns aPCollection<String>
with all Strings that matched the Regex being replaced with the replacement string.Regex.ReplaceFirst<String>
takes aPCollection<String>
and returns aPCollection<String>
with the first Strings that matched the Regex being replaced with the replacement string.Regex.Split<String>
takes aPCollection<String>
and returns aPCollection<String>
with the input string split into individual items in a list.PTransforms
for converting between explicit and implicit form of various Beam values.Describes the run-time requirements of aContextful
, such as access to side inputs.Reshuffle<K,V> For internal use only; no backwards compatibility guarantees.Implementation ofReshuffle.viaRandomKey()
.PTransform
s for taking samples of the elements in aPCollection
, or samples of the values associated with each key in aPCollection
ofKV
s.CombineFn
that computes a fixed-size sample of a collection of values.SerializableBiConsumer<FirstInputT,SecondInputT> A union of theBiConsumer
andSerializable
interfaces.SerializableBiFunction<FirstInputT,SecondInputT, OutputT> A union of theBiFunction
andSerializable
interfaces.AComparator
that is alsoSerializable
.SerializableFunction<InputT,OutputT> A function that computes an output value of typeOutputT
from an input value of typeInputT
, isSerializable
, and does not allow checked exceptions to be declared.UsefulSerializableFunction
overrides.ThePTransform
s that allow to compute different set functions acrossPCollection
s.SimpleFunction<InputT,OutputT> ASerializableFunction
which is not a functional interface.PTransform
s for computing the sum of the elements in aPCollection
, or the sum of the values associated with each key in aPCollection
ofKV
s.Tee<T>A PTransform that returns its input, but also applies its input to an auxiliary PTransform, akin to the shelltee
command, which is named after the T-splitter used in plumbing.ToJson<T>Creates aPTransform
that serializes UTF-8 JSON objects from aSchema
-aware PCollection (i.e.PTransform
s for finding the largest (or smallest) set of elements in aPCollection
, or the largest (or smallest) set of values associated with each key in aPCollection
ofKV
s.Top.Largest<T extends Comparable<? super T>>Deprecated.useTop.Natural
insteadTop.Natural<T extends Comparable<? super T>>ASerializable
Comparator
that that uses the compared elements' natural ordering.Top.Reversed<T extends Comparable<? super T>>Serializable
Comparator
that that uses the reverse of the compared elements' natural ordering.Top.Smallest<T extends Comparable<? super T>>Deprecated.useTop.Reversed
insteadTop.TopCombineFn<T,ComparatorT extends Comparator<T> & Serializable> CombineFn
forTop
transforms that combines a bunch ofT
s into a singlecount
-longList<T>
, usingcompareFn
to choose the largestT
s.PTransforms
for converting aPCollection<?>
,PCollection<KV<?,?>>
, orPCollection<Iterable<?>>
to aPCollection<String>
.Values<V>Values<V>
takes aPCollection
ofKV<K, V>
s and returns aPCollection<V>
of the values.Transforms for creatingPCollectionViews
fromPCollections
(to read them as side inputs).For internal use only; no backwards-compatibility guarantees.View.AsList<T>For internal use only; no backwards-compatibility guarantees.View.AsMap<K,V> For internal use only; no backwards-compatibility guarantees.View.AsMultimap<K,V> For internal use only; no backwards-compatibility guarantees.For internal use only; no backwards-compatibility guarantees.View.CreatePCollectionView<ElemT,ViewT> For internal use only; no backwards-compatibility guarantees.Provides an index to value mapping using a random starting index and also provides an offset range for each window seen.ViewFn<PrimitiveViewT,ViewT> For internal use only; no backwards-compatibility guarantees.Delays processing of each window in aPCollection
until signaled.Implementation ofWait.on(org.apache.beam.sdk.values.PCollection<?>...)
.Given a "poll function" that produces a potentially growing set of outputs for an input, this transform simultaneously continuously watches the growth of output sets of all inputs, until a per-input termination condition is reached.Watch.Growth<InputT,OutputT, KeyT> Watch.Growth.PollFn<InputT,OutputT> A function that computes the current set of outputs for the given input, in the form of aWatch.Growth.PollResult
.Watch.Growth.PollResult<OutputT>The result of a single invocation of aWatch.Growth.PollFn
.Watch.Growth.TerminationCondition<InputT,StateT> A strategy for determining whether it is time to stop polling the current input regardless of whether its output is complete or not.Watch.WatchGrowthFn<InputT,OutputT, KeyT, TerminationStateT> A collection of utilities for writing transforms that can handle exceptions raised during processing of elements.A simple handler that extracts information from an exception to aMap<String, String>
and returns aKV
where the key is the input element that failed processing, and the value is the map of exception attributes.The value type passed as input to exception handlers.An intermediate output type for PTransforms that allows an output collection to live alongside a collection of elements that failed the transform.WithKeys<K,V> WithKeys<K, V>
takes aPCollection<V>
, and either a constant key of typeK
or a function fromV
toK
, and returns aPCollection<KV<K, V>>
, where each of the values in the inputPCollection
has been paired with either the constant key or a key computed from the value.APTransform
for assigning timestamps to all the elements of aPCollection
.
ApproximateCountDistinct
in thezetasketch
extension module, which makes use of theHllCount
implementation.