Class EvaluationContext

java.lang.Object
org.apache.beam.runners.spark.translation.EvaluationContext

public class EvaluationContext extends Object
The EvaluationContext allows us to define pipeline instructions and translate between PObject<T>s or PCollection<T>s and Ts or DStreams/RDDs of Ts.
  • Constructor Details

    • EvaluationContext

      public EvaluationContext(org.apache.spark.api.java.JavaSparkContext jsc, Pipeline pipeline, PipelineOptions options)
    • EvaluationContext

      public EvaluationContext(org.apache.spark.api.java.JavaSparkContext jsc, Pipeline pipeline, PipelineOptions options, org.apache.spark.streaming.api.java.JavaStreamingContext jssc)
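      A minimal construction sketch for both constructors, assuming a local Spark master, default pipeline options, and a one-second batch interval (the SparkConf settings and the class/variable names are illustrative, not part of this API):

      import org.apache.beam.runners.spark.translation.EvaluationContext;
      import org.apache.beam.sdk.Pipeline;
      import org.apache.beam.sdk.options.PipelineOptions;
      import org.apache.beam.sdk.options.PipelineOptionsFactory;
      import org.apache.spark.SparkConf;
      import org.apache.spark.api.java.JavaSparkContext;
      import org.apache.spark.streaming.Durations;
      import org.apache.spark.streaming.api.java.JavaStreamingContext;

      public class EvaluationContextSetup {
        public static void main(String[] args) {
          SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("beam-spark-translation");
          JavaSparkContext jsc = new JavaSparkContext(conf);
          PipelineOptions options = PipelineOptionsFactory.create();
          Pipeline pipeline = Pipeline.create(options);

          // Batch: only the JavaSparkContext is needed.
          EvaluationContext batchContext = new EvaluationContext(jsc, pipeline, options);

          // Streaming: also pass the JavaStreamingContext the runner will use.
          JavaStreamingContext jssc = new JavaStreamingContext(jsc, Durations.seconds(1));
          EvaluationContext streamingContext = new EvaluationContext(jsc, pipeline, options, jssc);
        }
      }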
  • Method Details

    • getSparkContext

      public org.apache.spark.api.java.JavaSparkContext getSparkContext()
    • getStreamingContext

      public org.apache.spark.streaming.api.java.JavaStreamingContext getStreamingContext()
    • getPipeline

      public Pipeline getPipeline()
    • getOptions

      public PipelineOptions getOptions()
    • getSerializableOptions

      public org.apache.beam.runners.core.construction.SerializablePipelineOptions getSerializableOptions()
    • setCurrentTransform

      public void setCurrentTransform(org.apache.beam.sdk.runners.AppliedPTransform<?,?,?> transform)
    • getCurrentTransform

      public org.apache.beam.sdk.runners.AppliedPTransform<?,?,?> getCurrentTransform()
    • getInput

      public <T extends PValue> T getInput(PTransform<T,?> transform)
    • getInputs

      public <T> Map<TupleTag<?>,PCollection<?>> getInputs(PTransform<?,?> transform)
    • getOutput

      public <T extends PValue> T getOutput(PTransform<?,T> transform)
    • getOutputs

      public Map<TupleTag<?>,PCollection<?>> getOutputs(PTransform<?,?> transform)
    • getOutputCoders

      public Map<TupleTag<?>,Coder<?>> getOutputCoders()
    • shouldCache

      public boolean shouldCache(PTransform<?,? extends PValue> transform, PValue pvalue)
      Cache the PCollection if SparkPipelineOptions.isCacheDisabled is false, the transform is not a GroupByKey transformation, and the PCollection is used more than once in the Pipeline.

      PCollection is not cached in GroupByKey transformation, because Spark automatically persists some intermediate data in shuffle operations, even without users calling persist.

      Parameters:
      transform - the transform to check
      pvalue - output of transform
      Returns:
      if PCollection will be cached
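      A hedged sketch of how a runner-internal translator might consult this check for the output it is about to register (the helper class and the unchecked cast are illustrative assumptions, not part of this API):

      import org.apache.beam.runners.spark.translation.EvaluationContext;
      import org.apache.beam.sdk.runners.AppliedPTransform;
      import org.apache.beam.sdk.transforms.PTransform;
      import org.apache.beam.sdk.values.PValue;

      class CacheCheckSketch {
        @SuppressWarnings("unchecked")
        static boolean outputNeedsCaching(EvaluationContext context, PValue output) {
          AppliedPTransform<?, ?, ?> applied = context.getCurrentTransform();
          PTransform<?, ? extends PValue> transform =
              (PTransform<?, ? extends PValue>) applied.getTransform();
          // true only if caching is enabled, the transform is not a GroupByKey,
          // and the output PCollection is consumed more than once.
          return context.shouldCache(transform, output);
        }
      }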
    • putDataset

      public void putDataset(PTransform<?,? extends PValue> transform, Dataset dataset)
      Adds the single output of the transform to the context map and caches it if it satisfies shouldCache(PTransform, PValue).
      Parameters:
      transform - from which Dataset was created
      dataset - created Dataset from transform
    • putDataset

      public void putDataset(PValue pvalue, Dataset dataset)
      Adds one output of the transform to the context map and caches it if it satisfies shouldCache(PTransform, PValue). Used when the PTransform has multiple outputs. A usage sketch follows the borrowDataset entries below.
      Parameters:
      pvalue - one of multiple outputs of transform
      dataset - created Dataset from transform
    • borrowDataset

      public Dataset borrowDataset(PTransform<? extends PValue,?> transform)
    • borrowDataset

      public Dataset borrowDataset(PValue pvalue)
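      A sketch of the register-then-borrow flow across putDataset and borrowDataset (the helper class, parameter names, and the assumption that the Dataset was produced by an earlier translation step are illustrative):

      import org.apache.beam.runners.spark.translation.Dataset;
      import org.apache.beam.runners.spark.translation.EvaluationContext;
      import org.apache.beam.sdk.transforms.PTransform;
      import org.apache.beam.sdk.values.PValue;

      class DatasetFlowSketch {
        static void registerAndReuse(
            EvaluationContext context,
            PTransform<?, ? extends PValue> producer,
            PTransform<? extends PValue, ?> consumer,
            Dataset translated) {
          // Register the producer's single output; the context caches it when
          // shouldCache(producer, output) returns true.
          context.putDataset(producer, translated);

          // Later, the consumer's translator borrows the same Dataset as its input.
          Dataset input = context.borrowDataset(consumer);
        }
      }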
    • computeOutputs

      public void computeOutputs()
      Computes the outputs for all RDDs that are leaves in the DAG and do not have any actions (like saving to a file) registered on them (i.e. they are performed for side effects).
    • get

      public <T> T get(PValue value)
      Retrieve an object of Type T associated with the PValue passed in.
      Type Parameters:
      T - Type of object to return.
      Parameters:
      value - PValue to retrieve associated data for.
      Returns:
      Native object.
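      A sketch combining computeOutputs() and get(); the assumption that the returned native object is the Spark RDD of windowed values backing the PCollection in the batch path is illustrative and version-dependent:

      import org.apache.beam.runners.spark.translation.EvaluationContext;
      import org.apache.beam.sdk.values.PCollection;

      class ResultRetrievalSketch {
        static Object materialize(EvaluationContext context, PCollection<String> result) {
          // Force evaluation of leaf RDDs that have no actions registered on them.
          context.computeOutputs();
          // In the batch path this is typically the Spark RDD of windowed values
          // backing the PCollection (assumption; the API only promises a native object).
          return context.get(result);
        }
      }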
    • getPViews

      public SparkPCollectionView getPViews()
      Returns the current views created in the pipeline.
      Returns:
      SparkPCollectionView
    • putPView

      public void putPView(PCollectionView<?> view, Iterable<WindowedValue<?>> value, Coder<Iterable<WindowedValue<?>>> coder)
      Adds or replaces a view in the current views created in the pipeline.
      Parameters:
      view - Identifier of the view
      value - Actual value of the view
      coder - Coder of the value
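      A hedged sketch of registering a singleton view's materialized contents (the helper class, the use of the global window, and the caller-supplied coder are assumptions; import paths assume a typical Beam release):

      import java.util.Collections;
      import org.apache.beam.runners.spark.translation.EvaluationContext;
      import org.apache.beam.sdk.coders.Coder;
      import org.apache.beam.sdk.util.WindowedValue;
      import org.apache.beam.sdk.values.PCollectionView;

      class ViewRegistrationSketch {
        static void registerSingleton(
            EvaluationContext context,
            PCollectionView<?> view,
            Object singletonValue,
            Coder<Iterable<WindowedValue<?>>> contentsCoder) {
          // Wrap the value in the global window and register it; a later call with
          // the same view replaces the previously registered contents.
          WindowedValue<?> windowed = WindowedValue.valueInGlobalWindow(singletonValue);
          Iterable<WindowedValue<?>> contents = Collections.<WindowedValue<?>>singletonList(windowed);
          context.putPView(view, contents, contentsCoder);
        }
      }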
    • getCacheCandidates

      public Map<PCollection,Long> getCacheCandidates()
      Get the map of cache candidates held by the evaluation context.
      Returns:
      The current Map of cache candidates.
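      A sketch that inspects the candidate map; the assumption that the Long value is the number of consumers (so values above 1 make a PCollection eligible for caching) matches shouldCache above but is stated here as an assumption:

      import java.util.Map;
      import org.apache.beam.runners.spark.translation.EvaluationContext;
      import org.apache.beam.sdk.values.PCollection;

      class CacheCandidateReport {
        @SuppressWarnings("rawtypes")
        static void print(EvaluationContext context) {
          for (Map.Entry<PCollection, Long> entry : context.getCacheCandidates().entrySet()) {
            if (entry.getValue() > 1) {
              // Consumed more than once: eligible for caching unless caching is
              // disabled or the producer is a GroupByKey.
              System.out.println(entry.getKey().getName() + " used " + entry.getValue() + " times");
            }
          }
        }
      }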
    • getCandidatesForGroupByKeyAndWindowTranslation

      public Map<GroupByKey<?,?>,String> getCandidatesForGroupByKeyAndWindowTranslation()
      Get the map of GBK transforms to their full names that are candidates for group-by-key-and-window translation, which aims to reduce memory usage.
      Returns:
      The current Map of candidates
    • isCandidateForGroupByKeyAndWindow

      public <K, V> boolean isCandidateForGroupByKeyAndWindow(GroupByKey<K,V> transform)
      Returns whether the given GBK transform can be considered a candidate for group-by-key-and-window translation, which aims to reduce memory usage.
      Type Parameters:
      K - type of GBK key
      V - type of GBK value
      Parameters:
      transform - to evaluate
      Returns:
      true if given transform is a candidate; false otherwise
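      A sketch of how a GroupByKey translator might branch on this check (the helper class is illustrative and the alternate translations themselves are elided):

      import org.apache.beam.runners.spark.translation.EvaluationContext;
      import org.apache.beam.sdk.transforms.GroupByKey;

      class GroupByKeyTranslationSketch {
        static <K, V> void translate(EvaluationContext context, GroupByKey<K, V> gbk) {
          if (context.isCandidateForGroupByKeyAndWindow(gbk)) {
            // use the memory-friendly group-by-key-and-window translation
          } else {
            // fall back to the default GroupByKey translation
          }
        }
      }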
    • reportPCollectionConsumed

      public void reportPCollectionConsumed(PCollection<?> pCollection)
      Reports that the given PCollection is consumed by a PTransform in the pipeline.
    • reportPCollectionProduced

      public void reportPCollectionProduced(PCollection<?> pCollection)
      Reports that the given PCollection is produced by a PTransform in the pipeline.
    • getPCollectionConsumptionMap

      public Map<PCollection<?>,Integer> getPCollectionConsumptionMap()
      Get the map of each PCollection to the number of PTransforms consuming it.
      Returns:
      The map of PCollection to the number of consuming PTransforms.
    • isLeaf

      public boolean isLeaf(PCollection<?> pCollection)
      Check whether the given PCollection is a leaf. A PCollection is a leaf when no other PTransform consumes or depends on it.
      Parameters:
      pCollection - to be checked if it is a leaf
      Returns:
      true if pCollection is leaf; otherwise false
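      A sketch of how consumption reporting feeds the leaf check; the assumption that a produced but unconsumed PCollection registers as a leaf follows from the descriptions above:

      import org.apache.beam.runners.spark.translation.EvaluationContext;
      import org.apache.beam.sdk.values.PCollection;

      class LeafTrackingSketch {
        static boolean trackAndCheck(EvaluationContext context, PCollection<String> words) {
          context.reportPCollectionProduced(words);
          // No consumer reported yet, so 'words' should be a leaf and will be
          // forced to compute by computeOutputs().
          boolean leafBefore = context.isLeaf(words);

          context.reportPCollectionConsumed(words);
          // After a consumer is reported, 'words' is no longer a leaf.
          return leafBefore && !context.isLeaf(words);
        }
      }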
    • storageLevel

      public String storageLevel()
    • isStreamingSideInput

      public boolean isStreamingSideInput()
      Checks if any of the side inputs in the pipeline are streaming side inputs.

      If at least one of the side inputs is a streaming side input, this method returns true. When streaming side inputs are present, the CachedSideInputReader will not be used.

      Returns:
      true if any of the side inputs in the pipeline are streaming side inputs, false otherwise
    • useStreamingSideInput

      public void useStreamingSideInput()
      Marks that the pipeline contains at least one streaming side input.

      When this method is called, it sets the streamingSideInput flag to true, indicating that the CachedSideInputReader should not be used for processing side inputs.
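      A sketch of the flag's intended use during streaming translation; the boundedness check shown here is an illustrative stand-in for however the translator detects a streaming side input:

      import org.apache.beam.runners.spark.translation.EvaluationContext;
      import org.apache.beam.sdk.values.PCollection;
      import org.apache.beam.sdk.values.PCollectionView;

      class StreamingSideInputSketch {
        static void noteSideInput(EvaluationContext context, PCollectionView<?> view) {
          PCollection<?> backing = view.getPCollection();
          if (backing != null && backing.isBounded() == PCollection.IsBounded.UNBOUNDED) {
            // Mark the whole pipeline so CachedSideInputReader is bypassed.
            context.useStreamingSideInput();
          }
        }

        static boolean canUseCachedSideInputReader(EvaluationContext context) {
          return !context.isStreamingSideInput();
        }
      }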