org.apache.beam.sdk.extensions.sql.impl.transform.agg

## Class VarianceFn<T extends java.lang.Number>

• All Implemented Interfaces:
java.io.Serializable, CombineFnBase.GlobalCombineFn<T,org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator,T>, HasDisplayData

@Internal
public class VarianceFn<T extends java.lang.Number>
extends Combine.CombineFn<T,org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator,T>
Combine.CombineFn for Variance on Number types.

Calculates Population Variance and Sample Variance using incremental formulas described, for example, by Chan, Golub, and LeVeque in "Algorithms for computing the sample variance: analysis and recommendations", The American Statistician, 37 (1983) pp. 242-247.

If variance is defined like this:

• Input elements: (x[1], ... , x[n])
• Sum of elements: sum(x) = x[1] + ... + x[n]
• Average of all elements in the input: mean(x) = sum(x) / n
• Deviation of the ith element from the current mean: deviation(x, i) = x[i] - mean(x)
• Variance (here, the sum of squared deviations, not yet divided by the element count): variance(x) = deviation(x, 1)^2 + ... + deviation(x, n)^2

Then variance of combined input of 2 samples (x[1], ... , x[n]) and (y[1], ... , y[m]) is calculated using this formula:

• variance(concat(x,y)) = variance(x) + variance(y) + increment, where:
• increment = n/(m(m+n)) * (m/n * sum(x) - sum(y))^2

This also applies to a single-element increment, assuming that the variance of a single-element input is zero.
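As a worked instance of the single-element case, take x = (1, 2, 3) and append the single element 5 as y. The numbers are illustrative only, and the increment is written in the mean-difference form nm/(n+m) * (mean(x) - mean(y))^2, the usual way of stating the pairwise update, with n and m the sizes of x and y:

```latex
\begin{aligned}
\mathrm{variance}(x) &= (1-2)^2 + (2-2)^2 + (3-2)^2 = 2, \qquad n = 3,\ \mathrm{mean}(x) = 2 \\
\mathrm{variance}(y) &= 0, \qquad m = 1,\ \mathrm{mean}(y) = 5 \\
\mathrm{increment} &= \frac{n m}{n + m}\,\bigl(\mathrm{mean}(x) - \mathrm{mean}(y)\bigr)^2 = \frac{3}{4} \cdot 9 = 6.75 \\
\mathrm{variance}(\mathrm{concat}(x, y)) &= 2 + 0 + 6.75 = 8.75
\end{aligned}
```

Computing directly over (1, 2, 3, 5), whose mean is 2.75, gives the same sum of squared deviations: 3.0625 + 0.5625 + 0.0625 + 5.0625 = 8.75.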

To implement the above formula, we keep track of the current variance, sum, and count of elements, and then apply the formula whenever a new element arrives or the variances of 2 samples need to be merged.
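The bookkeeping described above can be sketched in plain Java. This is a minimal illustration, not the Beam implementation: the class `RunningVariance` and its members are hypothetical names, it uses `double` arithmetic for brevity, and it writes the increment in the mean-difference form nm/(n+m) * (mean(x) - mean(y))^2.

```java
/** Hypothetical accumulator; names are illustrative, not the Beam VarianceAccumulator API. */
public final class RunningVariance {
    private long count;        // number of elements seen so far
    private double sum;        // sum of the elements
    private double sumSquares; // sum of squared deviations from the mean

    /** Adds one element by merging a single-element sample whose variance is zero. */
    public RunningVariance add(double value) {
        RunningVariance single = new RunningVariance();
        single.count = 1;
        single.sum = value;
        return merge(single);
    }

    /** Merges another sample into this accumulator and returns this accumulator. */
    public RunningVariance merge(RunningVariance other) {
        if (other.count == 0) {
            return this; // nothing to merge
        }
        if (count == 0) {
            // This side is empty: adopt the other sample's state.
            count = other.count;
            sum = other.sum;
            sumSquares = other.sumSquares;
            return this;
        }
        double meanDiff = sum / count - other.sum / other.count;
        double increment =
            (double) count * other.count / (count + other.count) * meanDiff * meanDiff;
        sumSquares += other.sumSquares + increment;
        sum += other.sum;
        count += other.count;
        return this;
    }

    public long count() {
        return count;
    }

    public double sumSquares() {
        return sumSquares;
    }
}
```

Adding elements one at a time and merging two partial samples go through the same `merge` path, which mirrors the observation that a single-element increment is a merge with a zero-variance sample.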

• ### Method Summary

Modifier and Type Method and Description
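org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator addInput(org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator currentVariance, T rawInput)
Adds the given input value to the given accumulator, returning the new accumulator value.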
org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator createAccumulator()
Returns a new, mutable accumulator value, representing the accumulation of zero input values.
T extractOutput(org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator accumulator)
Returns the output value that is the result of combining all the input values represented by the given accumulator.
java.lang.reflect.TypeVariable<?> getAccumTVariable()
Returns the TypeVariable of AccumT.
Coder<org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator> getAccumulatorCoder(CoderRegistry registry, Coder<T> inputCoder)
Returns the Coder to use for accumulator AccumT values, or null if it is not able to be inferred.
Coder<OutputT> getDefaultOutputCoder(CoderRegistry registry, Coder<InputT> inputCoder)
Returns the Coder to use by default for output OutputT values, or null if it is not able to be inferred.
java.lang.String getIncompatibleGlobalWindowErrorMessage()
Returns the error message for not supported default values in Combine.globally().
java.lang.reflect.TypeVariable<?> getInputTVariable()
Returns the TypeVariable of InputT.
java.lang.reflect.TypeVariable<?> getOutputTVariable()
Returns the TypeVariable of OutputT.
org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator mergeAccumulators(java.lang.Iterable<org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator> variances)
Returns an accumulator representing the accumulation of all the input values accumulated in the merging accumulators.
static <V extends java.lang.Number> VarianceFn newPopulation(Schema.TypeName typeName)
static <V extends java.lang.Number> VarianceFn newPopulation(SerializableFunction<java.math.BigDecimal,V> decimalConverter)
static <V extends java.lang.Number> VarianceFn newSample(Schema.TypeName typeName)
static <V extends java.lang.Number> VarianceFn newSample(SerializableFunction<java.math.BigDecimal,V> decimalConverter)
void populateDisplayData(DisplayData.Builder builder)
Register display data for the given transform or component.
• ### Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
• ### Method Detail

• #### newPopulation

public static <V extends java.lang.Number> VarianceFn newPopulation(Schema.TypeName typeName)
• #### newPopulation

public static <V extends java.lang.Number> VarianceFn newPopulation(SerializableFunction<java.math.BigDecimal,V> decimalConverter)
• #### newSample

public static <V extends java.lang.Number> VarianceFn newSample(Schema.TypeName typeName)
• #### newSample

public static <V extends java.lang.Number> VarianceFn newSample(SerializableFunction<java.math.BigDecimal,V> decimalConverter)
• #### createAccumulator

public org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator createAccumulator()
Description copied from class: Combine.CombineFn
Returns a new, mutable accumulator value, representing the accumulation of zero input values.
Specified by:
createAccumulator in class Combine.CombineFn<T extends java.lang.Number,org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator,T extends java.lang.Number>

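• #### addInput

public org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator addInput(org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator currentVariance,
T rawInput)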
Description copied from class: Combine.CombineFn
Adds the given input value to the given accumulator, returning the new accumulator value.
Specified by:
addInput in class Combine.CombineFn<T extends java.lang.Number,org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator,T extends java.lang.Number>
Parameters:
currentVariance - may be modified and returned for efficiency
rawInput - should not be mutated
• #### mergeAccumulators

public org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator mergeAccumulators(java.lang.Iterable<org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator> variances)
Description copied from class: Combine.CombineFn
Returns an accumulator representing the accumulation of all the input values accumulated in the merging accumulators.
Specified by:
mergeAccumulators in class Combine.CombineFn<T extends java.lang.Number,org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator,T extends java.lang.Number>
Parameters:
variances - only the first accumulator may be modified and returned for efficiency; the other accumulators should not be mutated, because they may be shared with other code and mutating them could lead to incorrect results or data corruption.
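One way to satisfy the mutation contract above is to fold the inputs into a fresh result and leave every input untouched. The sketch below assumes a hypothetical immutable `Summary` type holding count, sum, and sum of squared deviations; it is not the Beam `VarianceAccumulator`, and it uses `double` arithmetic for brevity.

```java
import java.util.Arrays;
import java.util.List;

public final class MergeSketch {
    /** Hypothetical immutable summary of one sample. */
    public static final class Summary {
        final long count;        // number of elements
        final double sum;        // sum of the elements
        final double sumSquares; // sum of squared deviations from the mean

        Summary(long count, double sum, double sumSquares) {
            this.count = count;
            this.sum = sum;
            this.sumSquares = sumSquares;
        }
    }

    /** Combines two summaries into a new one without modifying either input. */
    static Summary combine(Summary a, Summary b) {
        if (a.count == 0) {
            return b;
        }
        if (b.count == 0) {
            return a;
        }
        double meanDiff = a.sum / a.count - b.sum / b.count;
        double increment =
            (double) a.count * b.count / (a.count + b.count) * meanDiff * meanDiff;
        return new Summary(
            a.count + b.count, a.sum + b.sum, a.sumSquares + b.sumSquares + increment);
    }

    /** Folds any number of summaries, starting from an empty one. */
    public static Summary mergeAll(Iterable<Summary> summaries) {
        Summary result = new Summary(0, 0.0, 0.0);
        for (Summary s : summaries) {
            result = combine(result, s);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Summary> parts = Arrays.asList(
            new Summary(3, 6.0, 2.0),   // elements 1, 2, 3
            new Summary(2, 12.0, 2.0)); // elements 5, 7
        Summary merged = mergeAll(parts);
        System.out.println(merged.count + " " + merged.sumSquares);
    }
}
```

Allocating a new object per step is simpler to reason about than mutating the first accumulator in place; the in-place variant permitted by the contract trades that simplicity for fewer allocations.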
• #### getAccumulatorCoder

public Coder<org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator> getAccumulatorCoder(CoderRegistry registry,
Coder<T> inputCoder)
Description copied from interface: CombineFnBase.GlobalCombineFn
Returns the Coder to use for accumulator AccumT values, or null if it is not able to be inferred.

By default, uses the knowledge of the Coder being used for InputT values and the enclosing Pipeline's CoderRegistry to try to infer the Coder for AccumT values.

This is the Coder used to send data through a communication-intensive shuffle step, so a compact and efficient representation may have significant performance benefits.

Specified by:
getAccumulatorCoder in interface CombineFnBase.GlobalCombineFn<T extends java.lang.Number,org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator,T extends java.lang.Number>
• #### extractOutput

public T extractOutput(org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator accumulator)
Description copied from class: Combine.CombineFn
Returns the output value that is the result of combining all the input values represented by the given accumulator.
Specified by:
extractOutput in class Combine.CombineFn<T extends java.lang.Number,org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator,T extends java.lang.Number>
Parameters:
accumulator - can be modified for efficiency
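A hedged sketch of how an output value could be derived from an accumulated sum of squared deviations, using the conventional definitions: population variance divides by n, sample variance by n - 1. Whether VarianceFn performs exactly these steps is an assumption. The class `ExtractSketch`, the method `extract`, and the use of `java.util.function.Function` as a stand-in for `SerializableFunction` are all illustrative.

```java
import java.math.BigDecimal;
import java.math.MathContext;
import java.util.function.Function;

public final class ExtractSketch {
    /**
     * Divides the sum of squared deviations by n (population) or n - 1 (sample),
     * then converts the BigDecimal result to the requested output type.
     * Throws ArithmeticException when the divisor is zero (an empty input, or a
     * sample variance over a single element).
     */
    public static <V extends Number> V extract(
            BigDecimal sumSquares,
            long count,
            boolean sample,
            Function<BigDecimal, V> decimalConverter) {
        BigDecimal divisor = BigDecimal.valueOf(sample ? count - 1 : count);
        BigDecimal variance = sumSquares.divide(divisor, MathContext.DECIMAL128);
        return decimalConverter.apply(variance);
    }

    public static void main(String[] args) {
        BigDecimal sumSquares = new BigDecimal("23.2"); // for elements 1, 2, 3, 5, 7
        System.out.println(extract(sumSquares, 5, false, BigDecimal::doubleValue));
        System.out.println(extract(sumSquares, 5, true, BigDecimal::doubleValue));
    }
}
```

Passing the converter as a parameter keeps the arithmetic in `BigDecimal` until the final step, so the same division serves any numeric output type.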
• #### getInputTVariable

public java.lang.reflect.TypeVariable<?> getInputTVariable()
Returns the TypeVariable of InputT.
• #### getAccumTVariable

public java.lang.reflect.TypeVariable<?> getAccumTVariable()
Returns the TypeVariable of AccumT.
• #### getOutputTVariable

public java.lang.reflect.TypeVariable<?> getOutputTVariable()
Returns the TypeVariable of OutputT.
• #### populateDisplayData

public void populateDisplayData(DisplayData.Builder builder)
Register display data for the given transform or component.

populateDisplayData(DisplayData.Builder) is invoked by Pipeline runners to collect display data via DisplayData.from(HasDisplayData). Implementations may call super.populateDisplayData(builder) in order to register display data in the current namespace, but should otherwise use subcomponent.populateDisplayData(builder) to use the namespace of the subcomponent.

By default, does not register any display data. Implementors may override this method to provide their own display data.

Specified by:
populateDisplayData in interface HasDisplayData
Parameters:
builder - The builder to populate with display data.