Class EvaluationContext
java.lang.Object
org.apache.beam.runners.spark.translation.EvaluationContext
The EvaluationContext allows us to define pipeline instructions and translate between
PObject<T>s or PCollection<T>s and Ts or DStreams/RDDs of Ts.-
Constructor Summary
ConstructorsConstructorDescriptionEvaluationContext(org.apache.spark.api.java.JavaSparkContext jsc, Pipeline pipeline, PipelineOptions options) EvaluationContext(org.apache.spark.api.java.JavaSparkContext jsc, Pipeline pipeline, PipelineOptions options, org.apache.spark.streaming.api.java.JavaStreamingContext jssc) -
Method Summary
Modifier and TypeMethodDescriptionborrowDataset(PTransform<? extends PValue, ?> transform) borrowDataset(PValue pvalue) voidComputes the outputs for all RDDs that are leaves in the DAG and do not have any actions (like saving to a file) registered on them (i.e.<T> TRetrieve an object of Type T associated with the PValue passed in.Get the map of cache candidates hold by the evaluation context.Map<GroupByKey<?, ?>, String> Get the map of GBK transforms to their full names, which are candidates for group by key and window translation which aims to reduce memory usage.org.apache.beam.sdk.runners.AppliedPTransform<?, ?, ?> <T extends PValue>
TgetInput(PTransform<T, ?> transform) <T> Map<TupleTag<?>, PCollection<?>> getInputs(PTransform<?, ?> transform) <T extends PValue>
TgetOutput(PTransform<?, T> transform) Map<TupleTag<?>, PCollection<?>> getOutputs(PTransform<?, ?> transform) Map<PCollection<?>, Integer> Get the map ofPCollectionto the number ofPTransformconsuming it.Return the current views creates in the pipeline.org.apache.beam.runners.core.construction.SerializablePipelineOptionsorg.apache.spark.api.java.JavaSparkContextorg.apache.spark.streaming.api.java.JavaStreamingContext<K,V> boolean isCandidateForGroupByKeyAndWindow(GroupByKey<K, V> transform) Returns if given GBK transform can be considered as candidate for group by key and window translation aiming to reduce memory usage.booleanisLeaf(PCollection<?> pCollection) Check if givenPCollectionis a leaf or not.booleanChecks if any of the side inputs in the pipeline are streaming side inputs.voidputDataset(PTransform<?, ? extends PValue> transform, Dataset dataset) Add single output of transform to context map and possibly cache if it conformsshouldCache(PTransform, PValue).voidputDataset(PValue pvalue, Dataset dataset) Add output of transform to context map and possibly cache if it conformsshouldCache(PTransform, PValue).voidputPView(PCollectionView<?> view, Iterable<WindowedValue<?>> value, Coder<Iterable<WindowedValue<?>>> coder) Adds/Replaces a view to the current views creates in the pipeline.voidreportPCollectionConsumed(PCollection<?> pCollection) Reports that givenPCollectionis consumed by aPTransformin the pipeline.voidreportPCollectionProduced(PCollection<?> pCollection) Reports that givenPCollectionis consumed by aPTransformin the pipeline.voidsetCurrentTransform(org.apache.beam.sdk.runners.AppliedPTransform<?, ?, ?> transform) booleanshouldCache(PTransform<?, ? extends PValue> transform, PValue pvalue) Cache PCollection if SparkPipelineOptions.isCacheDisabled is false or transform isn't GroupByKey transformation and PCollection is used more then once in Pipeline.voidMarks that the pipeline contains at least one streaming side input.
-
Constructor Details
-
EvaluationContext
public EvaluationContext(org.apache.spark.api.java.JavaSparkContext jsc, Pipeline pipeline, PipelineOptions options) -
EvaluationContext
public EvaluationContext(org.apache.spark.api.java.JavaSparkContext jsc, Pipeline pipeline, PipelineOptions options, org.apache.spark.streaming.api.java.JavaStreamingContext jssc)
-
-
Method Details
-
getSparkContext
public org.apache.spark.api.java.JavaSparkContext getSparkContext() -
getStreamingContext
public org.apache.spark.streaming.api.java.JavaStreamingContext getStreamingContext() -
getPipeline
-
getOptions
-
getSerializableOptions
public org.apache.beam.runners.core.construction.SerializablePipelineOptions getSerializableOptions() -
setCurrentTransform
public void setCurrentTransform(org.apache.beam.sdk.runners.AppliedPTransform<?, ?, ?> transform) -
getCurrentTransform
public org.apache.beam.sdk.runners.AppliedPTransform<?,?, getCurrentTransform()?> -
getInput
-
getInputs
-
getOutput
-
getOutputs
-
getOutputCoders
-
shouldCache
Cache PCollection if SparkPipelineOptions.isCacheDisabled is false or transform isn't GroupByKey transformation and PCollection is used more then once in Pipeline.PCollection is not cached in GroupByKey transformation, because Spark automatically persists some intermediate data in shuffle operations, even without users calling persist.
- Parameters:
transform- the transform to checkpvalue- output of transform- Returns:
- if PCollection will be cached
-
putDataset
Add single output of transform to context map and possibly cache if it conformsshouldCache(PTransform, PValue).- Parameters:
transform- from which Dataset was createddataset- created Dataset from transform
-
putDataset
Add output of transform to context map and possibly cache if it conformsshouldCache(PTransform, PValue). Used when PTransform has multiple outputs.- Parameters:
pvalue- one of multiple outputs of transformdataset- created Dataset from transform
-
borrowDataset
-
borrowDataset
-
computeOutputs
public void computeOutputs()Computes the outputs for all RDDs that are leaves in the DAG and do not have any actions (like saving to a file) registered on them (i.e. they are performed for side effects). -
get
Retrieve an object of Type T associated with the PValue passed in.- Type Parameters:
T- Type of object to return.- Parameters:
value- PValue to retrieve associated data for.- Returns:
- Native object.
-
getPViews
Return the current views creates in the pipeline.- Returns:
- SparkPCollectionView
-
putPView
public void putPView(PCollectionView<?> view, Iterable<WindowedValue<?>> value, Coder<Iterable<WindowedValue<?>>> coder) Adds/Replaces a view to the current views creates in the pipeline.- Parameters:
view- - Identifier of the viewvalue- - Actual value of the viewcoder- - Coder of the value
-
getCacheCandidates
Get the map of cache candidates hold by the evaluation context.- Returns:
- The current
Mapof cache candidates.
-
getCandidatesForGroupByKeyAndWindowTranslation
Get the map of GBK transforms to their full names, which are candidates for group by key and window translation which aims to reduce memory usage.- Returns:
- The current
Mapof candidates
-
isCandidateForGroupByKeyAndWindow
Returns if given GBK transform can be considered as candidate for group by key and window translation aiming to reduce memory usage.- Type Parameters:
K- type of GBK keyV- type of GBK value- Parameters:
transform- to evaluate- Returns:
- true if given transform is a candidate; false otherwise
-
reportPCollectionConsumed
Reports that givenPCollectionis consumed by aPTransformin the pipeline.- See Also:
-
reportPCollectionProduced
Reports that givenPCollectionis consumed by aPTransformin the pipeline.- See Also:
-
getPCollectionConsumptionMap
Get the map ofPCollectionto the number ofPTransformconsuming it.- Returns:
-
isLeaf
Check if givenPCollectionis a leaf or not.PCollectionis a leaf when there is no otherPTransformconsuming it / depending on it.- Parameters:
pCollection- to be checked if it is a leaf- Returns:
- true if pCollection is leaf; otherwise false
-
storageLevel
-
isStreamingSideInput
public boolean isStreamingSideInput()Checks if any of the side inputs in the pipeline are streaming side inputs.If at least one of the side inputs is a streaming side input, this method returns true. When streaming side inputs are present, the
CachedSideInputReaderwill not be used.- Returns:
- true if any of the side inputs in the pipeline are streaming side inputs, false otherwise
-
useStreamingSideInput
public void useStreamingSideInput()Marks that the pipeline contains at least one streaming side input.When this method is called, it sets the streamingSideInput flag to true, indicating that the
CachedSideInputReadershould not be used for processing side inputs.
-