Class VarianceFn<T extends Number>
- All Implemented Interfaces:
Serializable, CombineFnBase.GlobalCombineFn<T, org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator, T>, HasDisplayData
Combine.CombineFn for Variance on Number types.
Calculates Population Variance and Sample Variance using incremental formulas described, for example, by Chan, Golub, and LeVeque in "Algorithms for computing the sample variance: analysis and recommendations", The American Statistician, 37 (1983) pp. 242--247.
If variance is defined like this:
- Input elements: (x[1], ... , x[n])
- Sum of elements: sum(x) = x[1] + ... + x[n]
- Average of all elements in the input: mean(x) = sum(x) / n
- Deviation of the ith element from the current mean: deviation(x, i) = x[i] - mean(x)
- Variance: variance(x) = deviation(x, 1)^2 + ... + deviation(x, n)^2
Then the variance of the combined input of 2 samples (x[1], ... , x[n]) and (y[1], ... , y[m]) is calculated using this formula:
variance(concat(x,y)) = variance(x) + variance(y) + increment, where:
increment = n/(m(m+n)) * (m/n * sum(x) - sum(y))^2
This is also applicable to a single-element increment, assuming that the variance of a single-element input is zero.
To implement the above formula we keep track of the current variance, sum, and count of elements, and then apply the formula whenever a new element arrives or we need to merge the variances of 2 samples.
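The bookkeeping described above can be sketched in plain Java, without any Beam dependency. This is a minimal illustration, not the actual VarianceAccumulator: the class name `VarianceSketch`, its fields, and the use of `double` are assumptions made for the example. The increment is written in the mean-based form n*m/(n+m) * (mean(x) - mean(y))^2, the standard pairwise-update identity for sums of squared deviations.

```java
/** Hypothetical stand-in for an accumulator; not the Beam class. */
final class VarianceSketch {
    final long count;   // number of elements seen so far
    final double sum;   // sum of the elements
    final double m2;    // "variance" as defined above: sum of squared deviations, not divided by n

    VarianceSketch(long count, double sum, double m2) {
        this.count = count;
        this.sum = sum;
        this.m2 = m2;
    }

    /** Accumulator for zero inputs. */
    static VarianceSketch empty() {
        return new VarianceSketch(0, 0.0, 0.0);
    }

    /** Accumulator for a single element, whose variance is zero. */
    static VarianceSketch of(double value) {
        return new VarianceSketch(1, value, 0.0);
    }

    /** Combines two samples without mutating either accumulator. */
    VarianceSketch merge(VarianceSketch other) {
        if (count == 0) {
            return other;
        }
        if (other.count == 0) {
            return this;
        }
        double n = count;
        double m = other.count;
        double meanDiff = sum / n - other.sum / m;
        double increment = n * m / (n + m) * meanDiff * meanDiff;
        return new VarianceSketch(count + other.count, sum + other.sum, m2 + other.m2 + increment);
    }

    /** Adding one element is a merge with a single-element sample. */
    VarianceSketch add(double value) {
        return merge(of(value));
    }

    public static void main(String[] args) {
        VarianceSketch left = empty().add(1).add(2).add(3);
        VarianceSketch right = empty().add(4).add(5);
        VarianceSketch all = left.merge(right);
        System.out.println("count=" + all.count + " sum=" + all.sum + " m2=" + all.m2);
    }
}
```

Because the merge depends only on count, sum, and the running sum of squared deviations, partial results computed on separate partitions can be combined in any grouping.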
Method Summary
Modifier and Type, Method, Description:
- org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator addInput(org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator currentVariance, T rawInput)
  Adds the given input value to the given accumulator, returning the new accumulator value.
- org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator createAccumulator()
  Returns a new, mutable accumulator value, representing the accumulation of zero input values.
- T extractOutput(org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator accumulator)
  Returns the output value that is the result of combining all the input values represented by the given accumulator.
- TypeVariable<?> getAccumTVariable()
  Returns the TypeVariable of AccumT.
- Coder<org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator> getAccumulatorCoder(CoderRegistry registry, Coder<T> inputCoder)
  Returns the Coder to use for accumulator AccumT values, or null if it is not able to be inferred.
- Coder<T> getDefaultOutputCoder(CoderRegistry registry, Coder<T> inputCoder)
  Returns the Coder to use by default for output OutputT values, or null if it is not able to be inferred.
- String getIncompatibleGlobalWindowErrorMessage()
  Returns the error message for not supported default values in Combine.globally().
- TypeVariable<?> getInputTVariable()
  Returns the TypeVariable of InputT.
- TypeVariable<?> getOutputTVariable()
  Returns the TypeVariable of OutputT.
- org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator mergeAccumulators(Iterable<org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator> variances)
  Returns an accumulator representing the accumulation of all the input values accumulated in the merging accumulators.
- static VarianceFn newPopulation(Schema.TypeName typeName)
- static <V extends Number> VarianceFn newPopulation(SerializableFunction<BigDecimal, V> decimalConverter)
- static VarianceFn newSample(Schema.TypeName typeName)
- static <V extends Number> VarianceFn newSample(SerializableFunction<BigDecimal, V> decimalConverter)
- void populateDisplayData(DisplayData.Builder builder)
  Register display data for the given transform or component.
Methods inherited from class org.apache.beam.sdk.transforms.Combine.CombineFn
apply, compact, defaultValue, getInputType, getOutputType
-
Method Details
-
newPopulation
public static VarianceFn newPopulation(Schema.TypeName typeName)
-
newPopulation
public static <V extends Number> VarianceFn newPopulation(SerializableFunction<BigDecimal, V> decimalConverter)
-
newSample
public static VarianceFn newSample(Schema.TypeName typeName)
-
newSample
public static <V extends Number> VarianceFn newSample(SerializableFunction<BigDecimal, V> decimalConverter)
-
createAccumulator
public org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator createAccumulator()
Description copied from class: Combine.CombineFn
Returns a new, mutable accumulator value, representing the accumulation of zero input values.
- Specified by: createAccumulator in class Combine.CombineFn<T extends Number, org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator, T extends Number>
-
addInput
public org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator addInput(org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator currentVariance, T rawInput)
Description copied from class: Combine.CombineFn
Adds the given input value to the given accumulator, returning the new accumulator value.
-
mergeAccumulators
public org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator mergeAccumulators(Iterable<org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator> variances)
Description copied from class: Combine.CombineFn
Returns an accumulator representing the accumulation of all the input values accumulated in the merging accumulators.
- Specified by: mergeAccumulators in class Combine.CombineFn<T extends Number, org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator, T extends Number>
- Parameters: variances - only the first accumulator may be modified and returned for efficiency; the other accumulators should not be mutated, because they may be shared with other code and mutating them could lead to incorrect results or data corruption.
-
getAccumulatorCoder
public Coder<org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator> getAccumulatorCoder(CoderRegistry registry, Coder<T> inputCoder)
Description copied from interface: CombineFnBase.GlobalCombineFn
Returns the Coder to use for accumulator AccumT values, or null if it is not able to be inferred.
By default, uses the knowledge of the Coder being used for InputT values and the enclosing Pipeline's CoderRegistry to try to infer the Coder for AccumT values.
This is the Coder used to send data through a communication-intensive shuffle step, so a compact and efficient representation may have significant performance benefits.
- Specified by: getAccumulatorCoder in interface CombineFnBase.GlobalCombineFn<T extends Number, org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator, T extends Number>
-
extractOutput
public T extractOutput(org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator accumulator)
Description copied from class: Combine.CombineFn
Returns the output value that is the result of combining all the input values represented by the given accumulator.
- Specified by: extractOutput in class Combine.CombineFn<T extends Number, org.apache.beam.sdk.extensions.sql.impl.transform.agg.VarianceAccumulator, T extends Number>
- Parameters: accumulator - can be modified for efficiency
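As an illustration of how an output value could be derived for the population and sample cases, the sketch below divides a sum of squared deviations (the undivided "variance" from the class description) by n or by n - 1. This page does not show how VarianceFn does it internally, so the class `VarianceOutputSketch` and its methods are hypothetical, and the behavior for fewer than two elements is left unhandled on purpose.

```java
/** Hypothetical helper, not part of Beam: turns a sum of squared deviations into a variance. */
final class VarianceOutputSketch {

    /** Sum of squared deviations from the mean, i.e. the undivided quantity tracked while accumulating. */
    static double sumOfSquaredDeviations(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        double mean = sum / xs.length;
        double m2 = 0.0;
        for (double x : xs) {
            m2 += (x - mean) * (x - mean);
        }
        return m2;
    }

    /** Population variance: divide by the number of elements. */
    static double populationVariance(double m2, long n) {
        return m2 / n;
    }

    /** Sample variance: divide by n - 1 (assumes n >= 2; smaller inputs are not handled here). */
    static double sampleVariance(double m2, long n) {
        return m2 / (n - 1);
    }

    public static void main(String[] args) {
        double[] xs = {1, 2, 3, 4, 5};
        double m2 = sumOfSquaredDeviations(xs);
        System.out.println("population=" + populationVariance(m2, xs.length));
        System.out.println("sample=" + sampleVariance(m2, xs.length));
    }
}
```

The two results differ only in the final divisor, which is why a single accumulator type can serve both newPopulation and newSample.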
-
getDefaultOutputCoder
public Coder<T> getDefaultOutputCoder(CoderRegistry registry, Coder<T> inputCoder) throws CannotProvideCoderException
Description copied from interface: CombineFnBase.GlobalCombineFn
Returns the Coder to use by default for output OutputT values, or null if it is not able to be inferred.
By default, uses the knowledge of the Coder being used for input InputT values and the enclosing Pipeline's CoderRegistry to try to infer the Coder for OutputT values.
- Specified by: getDefaultOutputCoder in interface CombineFnBase.GlobalCombineFn<InputT, AccumT, OutputT>
- Throws: CannotProvideCoderException
-
getIncompatibleGlobalWindowErrorMessage
public String getIncompatibleGlobalWindowErrorMessage()
Description copied from interface: CombineFnBase.GlobalCombineFn
Returns the error message for not supported default values in Combine.globally().
- Specified by: getIncompatibleGlobalWindowErrorMessage in interface CombineFnBase.GlobalCombineFn<InputT, AccumT, OutputT>
-
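getInputTVariable
public TypeVariable<?> getInputTVariable()
Returns the TypeVariable of InputT.
-
getAccumTVariable
public TypeVariable<?> getAccumTVariable()
Returns the TypeVariable of AccumT.
-
getOutputTVariable
public TypeVariable<?> getOutputTVariable()
Returns the TypeVariable of OutputT.
-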
populateDisplayData
public void populateDisplayData(DisplayData.Builder builder)
Register display data for the given transform or component.
populateDisplayData(DisplayData.Builder) is invoked by Pipeline runners to collect display data via DisplayData.from(HasDisplayData). Implementations may call super.populateDisplayData(builder) in order to register display data in the current namespace, but should otherwise use subcomponent.populateDisplayData(builder) to use the namespace of the subcomponent.
By default, does not register any display data. Implementors may override this method to provide their own display data.
- Specified by: populateDisplayData in interface HasDisplayData
- Parameters: builder - The builder to populate with display data.