Class VarianceFn<T extends Number>
- All Implemented Interfaces: Serializable, CombineFnBase.GlobalCombineFn<T, org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator, T>, HasDisplayData
Combine.CombineFn for Variance on Number types.
Calculates Population Variance and Sample Variance using incremental formulas described, for example, by Chan, Golub, and LeVeque in "Algorithms for computing the sample variance: analysis and recommendations", The American Statistician, 37 (1983) pp. 242--247.
If variance is defined like this (that is, as the sum of squared deviations, before any division by the element count):
- Input elements: (x[1], ... , x[n])
- Sum of elements: sum(x) = x[1] + ... + x[n]
- Average of all elements in the input: mean(x) = sum(x) / n
- Deviation of the i-th element from the current mean: deviation(x, i) = x[i] - mean(x)
- Variance: variance(x) = deviation(x, 1)^2 + ... + deviation(x, n)^2
Then the variance of the combined input of two samples (x[1], ... , x[n]) and (y[1], ... , y[m]) is calculated using this formula:
variance(concat(x,y)) = variance(x) + variance(y) + increment, where: increment = m/(n(m+n)) * (sum(x) - n/m * sum(y))^2
This also applies to a single-element increment, assuming that the variance of a single-element input is zero.
To implement the above formula, we keep track of the current variance, sum, and count of elements, and then apply the formula whenever a new element arrives or the variances of two samples need to be merged.
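The bookkeeping described here can be sketched in plain Java. This is a minimal illustration, not Beam's actual VarianceAccumulator: the class name VarianceMergeSketch, the nested Acc type, and its fields are hypothetical, and it uses double rather than BigDecimal. The increment is written in the equivalent mean-based form n*m/(n+m) * (mean(x) - mean(y))^2, where x has n elements and y has m elements.

```java
/** Hypothetical sketch of the incremental update; not Beam's implementation. */
final class VarianceMergeSketch {

    /** Tracks the element count, sum(x), and the sum of squared deviations. */
    static final class Acc {
        final long count;
        final double sum;
        final double m2; // "variance" in the sense defined above (not yet divided by n)

        Acc(long count, double sum, double m2) {
            this.count = count;
            this.sum = sum;
            this.m2 = m2;
        }

        static Acc empty() {
            return new Acc(0, 0.0, 0.0);
        }

        /** A single-element input has zero variance. */
        static Acc of(double value) {
            return new Acc(1, value, 0.0);
        }

        /** Merges two samples: variance(x) + variance(y) + increment. */
        Acc combine(Acc other) {
            if (count == 0) {
                return other;
            }
            if (other.count == 0) {
                return this;
            }
            double n = count;
            double m = other.count;
            double meanDiff = sum / n - other.sum / m;
            double increment = n * m / (n + m) * meanDiff * meanDiff;
            return new Acc(count + other.count, sum + other.sum, m2 + other.m2 + increment);
        }
    }

    /** Adds elements one at a time, treating each as a single-element sample. */
    static Acc accumulate(double... values) {
        Acc acc = Acc.empty();
        for (double v : values) {
            acc = acc.combine(Acc.of(v));
        }
        return acc;
    }

    public static void main(String[] args) {
        Acc merged = accumulate(1.0).combine(accumulate(2.0, 3.0));
        System.out.println(merged.count + " elements, sum of squared deviations " + merged.m2);
    }
}
```

Adding one element and merging two partial results go through the same combine step, which is why a single formula covers both cases.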
Method Summary
- org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator addInput(org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator currentVariance, T rawInput): Adds the given input value to the given accumulator, returning the new accumulator value.
- org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator createAccumulator(): Returns a new, mutable accumulator value, representing the accumulation of zero input values.
- T extractOutput(org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator accumulator): Returns the output value that is the result of combining all the input values represented by the given accumulator.
- TypeVariable<?> getAccumTVariable(): Returns the TypeVariable of AccumT.
- Coder<org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator> getAccumulatorCoder(CoderRegistry registry, Coder<T> inputCoder): Returns the Coder to use for accumulator AccumT values, or null if it is not able to be inferred.
- Coder<T> getDefaultOutputCoder(CoderRegistry registry, Coder<T> inputCoder): Returns the Coder to use by default for output OutputT values, or null if it is not able to be inferred.
- String getIncompatibleGlobalWindowErrorMessage(): Returns the error message for not supported default values in Combine.globally().
- TypeVariable<?> getInputTVariable(): Returns the TypeVariable of InputT.
- TypeVariable<?> getOutputTVariable(): Returns the TypeVariable of OutputT.
- org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator mergeAccumulators(Iterable<org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator> variances): Returns an accumulator representing the accumulation of all the input values accumulated in the merging accumulators.
- static VarianceFn newPopulation(Schema.TypeName typeName)
- static <V extends Number> VarianceFn newPopulation(SerializableFunction<BigDecimal, V> decimalConverter)
- static VarianceFn newSample(Schema.TypeName typeName)
- static <V extends Number> VarianceFn newSample(SerializableFunction<BigDecimal, V> decimalConverter)
- void populateDisplayData(DisplayData.Builder builder): Register display data for the given transform or component.
Methods inherited from class org.apache.beam.sdk.transforms.Combine.CombineFn: apply, compact, defaultValue, getInputType, getOutputType
Method Details
newPopulation
public static VarianceFn newPopulation(Schema.TypeName typeName)
newPopulation
public static <V extends Number> VarianceFn newPopulation(SerializableFunction<BigDecimal, V> decimalConverter)
newSample
public static VarianceFn newSample(Schema.TypeName typeName)
newSample
public static <V extends Number> VarianceFn newSample(SerializableFunction<BigDecimal, V> decimalConverter)
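As a rough illustration of how the population and sample variants would differ, the sketch below assumes the usual textbook definitions: both start from the same sum of squared deviations, and the population variance divides it by n while the sample variance divides it by n - 1. The class and method names are hypothetical and are not part of this API; how VarianceFn performs the final division internally is not shown on this page.

```java
/** Hypothetical helper contrasting the two standard variance definitions. */
final class VarianceKindSketch {

    /** Sum of squared deviations from the mean (two-pass, for clarity only). */
    static double sumOfSquaredDeviations(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        double mean = sum / values.length;
        double m2 = 0.0;
        for (double v : values) {
            double deviation = v - mean;
            m2 += deviation * deviation;
        }
        return m2;
    }

    /** Population variance: divide by n. */
    static double populationVariance(double[] values) {
        return sumOfSquaredDeviations(values) / values.length;
    }

    /** Sample variance: divide by n - 1 (requires at least two elements). */
    static double sampleVariance(double[] values) {
        if (values.length < 2) {
            throw new IllegalArgumentException("sample variance needs at least two elements");
        }
        return sumOfSquaredDeviations(values) / (values.length - 1);
    }

    public static void main(String[] args) {
        double[] values = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.println("population: " + populationVariance(values));
        System.out.println("sample:     " + sampleVariance(values));
    }
}
```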
createAccumulator
public org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator createAccumulator()
Description copied from class: Combine.CombineFn
Returns a new, mutable accumulator value, representing the accumulation of zero input values.
- Specified by: createAccumulator in class Combine.CombineFn<T extends Number, org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator, T extends Number>
addInput
public org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator addInput(org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator currentVariance, T rawInput)
Description copied from class: Combine.CombineFn
Adds the given input value to the given accumulator, returning the new accumulator value.
mergeAccumulators
public org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator mergeAccumulators(Iterable<org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator> variances)
Description copied from class: Combine.CombineFn
Returns an accumulator representing the accumulation of all the input values accumulated in the merging accumulators.
- Specified by: mergeAccumulators in class Combine.CombineFn<T extends Number, org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator, T extends Number>
- Parameters: variances - only the first accumulator may be modified and returned for efficiency; the other accumulators should not be mutated, because they may be shared with other code and mutating them could lead to incorrect results or data corruption.
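The mutation rule for mergeAccumulators can be illustrated with a small stand-alone sketch. It uses a hypothetical mutable accumulator that only tracks a count and a sum; MergeContractSketch and MutableAcc are illustrative names, not Beam types, and VarianceFn itself may well return a fresh accumulator instead of reusing the first one.

```java
import java.util.Arrays;
import java.util.Iterator;

/** Hypothetical sketch of the "only the first accumulator may be mutated" rule. */
final class MergeContractSketch {

    /** Minimal mutable accumulator used only for this illustration. */
    static final class MutableAcc {
        long count;
        double sum;
    }

    static MutableAcc mergeAccumulators(Iterable<MutableAcc> accumulators) {
        Iterator<MutableAcc> it = accumulators.iterator();
        if (!it.hasNext()) {
            return new MutableAcc(); // nothing to merge: return an empty accumulator
        }
        MutableAcc first = it.next(); // the only accumulator this method may modify
        while (it.hasNext()) {
            MutableAcc other = it.next(); // read-only: may be shared with other code
            first.count += other.count;
            first.sum += other.sum;
        }
        return first;
    }

    public static void main(String[] args) {
        MutableAcc a = new MutableAcc();
        a.count = 1;
        a.sum = 1.0;
        MutableAcc b = new MutableAcc();
        b.count = 2;
        b.sum = 5.0;
        MutableAcc merged = mergeAccumulators(Arrays.asList(a, b));
        System.out.println(merged.count + " elements, sum " + merged.sum);
    }
}
```

Reusing the first accumulator avoids an allocation per merge, while leaving the remaining accumulators untouched keeps the result correct if they are shared elsewhere.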
getAccumulatorCoder
public Coder<org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator> getAccumulatorCoder(CoderRegistry registry, Coder<T> inputCoder)
Description copied from interface: CombineFnBase.GlobalCombineFn
Returns the Coder to use for accumulator AccumT values, or null if it is not able to be inferred.
By default, uses the knowledge of the Coder being used for InputT values and the enclosing Pipeline's CoderRegistry to try to infer the Coder for AccumT values.
This is the Coder used to send data through a communication-intensive shuffle step, so a compact and efficient representation may have significant performance benefits.
- Specified by: getAccumulatorCoder in interface CombineFnBase.GlobalCombineFn<T extends Number, org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator, T extends Number>
extractOutput
public T extractOutput(org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator accumulator)
Description copied from class: Combine.CombineFn
Returns the output value that is the result of combining all the input values represented by the given accumulator.
- Specified by: extractOutput in class Combine.CombineFn<T extends Number, org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator, T extends Number>
- Parameters: accumulator - can be modified for efficiency
getDefaultOutputCoder
public Coder<T> getDefaultOutputCoder(CoderRegistry registry, Coder<T> inputCoder) throws CannotProvideCoderException
Description copied from interface: CombineFnBase.GlobalCombineFn
Returns the Coder to use by default for output OutputT values, or null if it is not able to be inferred.
By default, uses the knowledge of the Coder being used for input InputT values and the enclosing Pipeline's CoderRegistry to try to infer the Coder for OutputT values.
- Specified by: getDefaultOutputCoder in interface CombineFnBase.GlobalCombineFn<InputT, AccumT, OutputT>
- Throws: CannotProvideCoderException
getIncompatibleGlobalWindowErrorMessage
public String getIncompatibleGlobalWindowErrorMessage()
Description copied from interface: CombineFnBase.GlobalCombineFn
Returns the error message for not supported default values in Combine.globally().
- Specified by: getIncompatibleGlobalWindowErrorMessage in interface CombineFnBase.GlobalCombineFn<InputT, AccumT, OutputT>
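getInputTVariable
public TypeVariable<?> getInputTVariable()
Returns the TypeVariable of InputT.
getAccumTVariable
public TypeVariable<?> getAccumTVariable()
Returns the TypeVariable of AccumT.
getOutputTVariable
public TypeVariable<?> getOutputTVariable()
Returns the TypeVariable of OutputT.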
populateDisplayData
public void populateDisplayData(DisplayData.Builder builder)
Register display data for the given transform or component.
populateDisplayData(DisplayData.Builder) is invoked by Pipeline runners to collect display data via DisplayData.from(HasDisplayData). Implementations may call super.populateDisplayData(builder) in order to register display data in the current namespace, but should otherwise use subcomponent.populateDisplayData(builder) to use the namespace of the subcomponent.
By default, does not register any display data. Implementors may override this method to provide their own display data.
- Specified by: populateDisplayData in interface HasDisplayData
- Parameters: builder - The builder to populate with display data.