Class EvaluationContext

java.lang.Object
org.apache.beam.runners.spark.translation.EvaluationContext

public class EvaluationContext extends Object
The EvaluationContext allows us to define pipeline instructions and translate between PObject<T>s or PCollection<T>s and Ts or DStreams/RDDs of Ts.
  • Constructor Details

    • EvaluationContext

      public EvaluationContext(org.apache.spark.api.java.JavaSparkContext jsc, Pipeline pipeline, PipelineOptions options)
    • EvaluationContext

      public EvaluationContext(org.apache.spark.api.java.JavaSparkContext jsc, Pipeline pipeline, PipelineOptions options, org.apache.spark.streaming.api.java.JavaStreamingContext jssc)
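      A minimal construction sketch for both constructors, assuming a local Spark master, default pipeline options, and a one-second batch interval (the SparkConf settings and the class/variable names are illustrative, not part of this API):

      import org.apache.beam.runners.spark.translation.EvaluationContext;
      import org.apache.beam.sdk.Pipeline;
      import org.apache.beam.sdk.options.PipelineOptions;
      import org.apache.beam.sdk.options.PipelineOptionsFactory;
      import org.apache.spark.SparkConf;
      import org.apache.spark.api.java.JavaSparkContext;
      import org.apache.spark.streaming.Durations;
      import org.apache.spark.streaming.api.java.JavaStreamingContext;

      public class EvaluationContextSetup {
        public static void main(String[] args) {
          SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("beam-spark-translation");
          JavaSparkContext jsc = new JavaSparkContext(conf);
          PipelineOptions options = PipelineOptionsFactory.create();
          Pipeline pipeline = Pipeline.create(options);

          // Batch: only the JavaSparkContext is needed.
          EvaluationContext batchContext = new EvaluationContext(jsc, pipeline, options);

          // Streaming: also pass the JavaStreamingContext the runner will use.
          JavaStreamingContext jssc = new JavaStreamingContext(jsc, Durations.seconds(1));
          EvaluationContext streamingContext = new EvaluationContext(jsc, pipeline, options, jssc);
        }
      }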
  • Method Details

    • getSparkContext

      public org.apache.spark.api.java.JavaSparkContext getSparkContext()
    • getStreamingContext

      public org.apache.spark.streaming.api.java.JavaStreamingContext getStreamingContext()
    • getPipeline

      public Pipeline getPipeline()
    • getOptions

      public PipelineOptions getOptions()
    • getSerializableOptions

      public org.apache.beam.runners.core.construction.SerializablePipelineOptions getSerializableOptions()
    • setCurrentTransform

      public void setCurrentTransform(org.apache.beam.sdk.runners.AppliedPTransform<?,?,?> transform)
    • getCurrentTransform

      public org.apache.beam.sdk.runners.AppliedPTransform<?,?,?> getCurrentTransform()
    • getInput

      public <T extends PValue> T getInput(PTransform<T,?> transform)
    • getInputs

      public <T> Map<TupleTag<?>,PCollection<?>> getInputs(PTransform<?,?> transform)
    • getOutput

      public <T extends PValue> T getOutput(PTransform<?,T> transform)
    • getOutputs

      public Map<TupleTag<?>,PCollection<?>> getOutputs(PTransform<?,?> transform)
    • getOutputCoders

      public Map<TupleTag<?>,Coder<?>> getOutputCoders()
    • shouldCache

      public boolean shouldCache(PTransform<?,? extends PValue> transform, PValue pvalue)
      Cache the PCollection if SparkPipelineOptions.isCacheDisabled is false, the transform is not a GroupByKey transformation, and the PCollection is used more than once in the Pipeline.

      PCollection is not cached in GroupByKey transformation, because Spark automatically persists some intermediate data in shuffle operations, even without users calling persist.

      Parameters:
      transform - the transform to check
      pvalue - output of transform
      Returns:
      if PCollection will be cached
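      A hedged sketch of how a runner-internal translator might consult this check for the output it is about to register (the helper class and the unchecked cast are illustrative assumptions, not part of this API):

      import org.apache.beam.runners.spark.translation.EvaluationContext;
      import org.apache.beam.sdk.runners.AppliedPTransform;
      import org.apache.beam.sdk.transforms.PTransform;
      import org.apache.beam.sdk.values.PValue;

      class CacheCheckSketch {
        @SuppressWarnings("unchecked")
        static boolean outputNeedsCaching(EvaluationContext context, PValue output) {
          AppliedPTransform<?, ?, ?> applied = context.getCurrentTransform();
          PTransform<?, ? extends PValue> transform =
              (PTransform<?, ? extends PValue>) applied.getTransform();
          // true only if caching is enabled, the transform is not a GroupByKey,
          // and the output PCollection is consumed more than once.
          return context.shouldCache(transform, output);
        }
      }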
    • putDataset

      public void putDataset(PTransform<?,? extends PValue> transform, Dataset dataset)
      Adds the single output of the transform to the context map and caches it if it satisfies shouldCache(PTransform, PValue).
      Parameters:
      transform - from which Dataset was created
      dataset - created Dataset from transform
    • putDataset

      public void putDataset(PValue pvalue, Dataset dataset)
      Adds one output of the transform to the context map and caches it if it satisfies shouldCache(PTransform, PValue). Used when the PTransform has multiple outputs. A usage sketch follows the borrowDataset entries below.
      Parameters:
      pvalue - one of multiple outputs of transform
      dataset - created Dataset from transform
    • borrowDataset

      public Dataset borrowDataset(PTransform<? extends PValue,?> transform)
    • borrowDataset

      public Dataset borrowDataset(PValue pvalue)
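      A sketch of the register-then-borrow flow across putDataset and borrowDataset (the helper class, parameter names, and the assumption that the Dataset was produced by an earlier translation step are illustrative):

      import org.apache.beam.runners.spark.translation.Dataset;
      import org.apache.beam.runners.spark.translation.EvaluationContext;
      import org.apache.beam.sdk.transforms.PTransform;
      import org.apache.beam.sdk.values.PValue;

      class DatasetFlowSketch {
        static void registerAndReuse(
            EvaluationContext context,
            PTransform<?, ? extends PValue> producer,
            PTransform<? extends PValue, ?> consumer,
            Dataset translated) {
          // Register the producer's single output; the context caches it when
          // shouldCache(producer, output) returns true.
          context.putDataset(producer, translated);

          // Later, the consumer's translator borrows the same Dataset as its input.
          Dataset input = context.borrowDataset(consumer);
        }
      }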
    • computeOutputs

      public void computeOutputs()
      Computes the outputs for all RDDs that are leaves in the DAG and do not have any actions (like saving to a file) registered on them (i.e. they are performed for side effects).
    • get

      public <T> T get(PValue value)
      Retrieve an object of Type T associated with the PValue passed in.
      Type Parameters:
      T - Type of object to return.
      Parameters:
      value - PValue to retrieve associated data for.
      Returns:
      Native object.
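      A sketch combining computeOutputs() and get(); the assumption that the returned native object is the Spark RDD of windowed values backing the PCollection in the batch path is illustrative and version-dependent:

      import org.apache.beam.runners.spark.translation.EvaluationContext;
      import org.apache.beam.sdk.values.PCollection;

      class ResultRetrievalSketch {
        static Object materialize(EvaluationContext context, PCollection<String> result) {
          // Force evaluation of leaf RDDs that have no actions registered on them.
          context.computeOutputs();
          // In the batch path this is typically the Spark RDD of windowed values
          // backing the PCollection (assumption; the API only promises a native object).
          return context.get(result);
        }
      }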
    • getPViews

      public SparkPCollectionView getPViews()
      Returns the current views created in the pipeline.
      Returns:
      SparkPCollectionView
    • putPView

      public void putPView(PCollectionView<?> view, Iterable<WindowedValue<?>> value, Coder<Iterable<WindowedValue<?>>> coder)
      Adds or replaces a view in the current views created in the pipeline.
      Parameters:
      view - Identifier of the view
      value - Actual value of the view
      coder - Coder of the value
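      A hedged sketch of registering a singleton view's materialized contents (the helper class, the use of the global window, and the caller-supplied coder are assumptions; import paths assume a typical Beam release):

      import java.util.Collections;
      import org.apache.beam.runners.spark.translation.EvaluationContext;
      import org.apache.beam.sdk.coders.Coder;
      import org.apache.beam.sdk.util.WindowedValue;
      import org.apache.beam.sdk.values.PCollectionView;

      class ViewRegistrationSketch {
        static void registerSingleton(
            EvaluationContext context,
            PCollectionView<?> view,
            Object singletonValue,
            Coder<Iterable<WindowedValue<?>>> contentsCoder) {
          // Wrap the value in the global window and register it; a later call with
          // the same view replaces the previously registered contents.
          WindowedValue<?> windowed = WindowedValue.valueInGlobalWindow(singletonValue);
          Iterable<WindowedValue<?>> contents = Collections.<WindowedValue<?>>singletonList(windowed);
          context.putPView(view, contents, contentsCoder);
        }
      }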
    • getCacheCandidates

      public Map<PCollection,Long> getCacheCandidates()
      Get the map of cache candidates held by the evaluation context.
      Returns:
      The current Map of cache candidates.
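      A sketch that inspects the candidate map; the assumption that the Long value is the number of consumers (so values above 1 make a PCollection eligible for caching) matches shouldCache above but is stated here as an assumption:

      import java.util.Map;
      import org.apache.beam.runners.spark.translation.EvaluationContext;
      import org.apache.beam.sdk.values.PCollection;

      class CacheCandidateReport {
        @SuppressWarnings("rawtypes")
        static void print(EvaluationContext context) {
          for (Map.Entry<PCollection, Long> entry : context.getCacheCandidates().entrySet()) {
            if (entry.getValue() > 1) {
              // Consumed more than once: eligible for caching unless caching is
              // disabled or the producer is a GroupByKey.
              System.out.println(entry.getKey().getName() + " used " + entry.getValue() + " times");
            }
          }
        }
      }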
    • getCandidatesForGroupByKeyAndWindowTranslation

      public Map<GroupByKey<?,?>,String> getCandidatesForGroupByKeyAndWindowTranslation()
      Get the map of GBK transforms to their full names that are candidates for group-by-key-and-window translation, which aims to reduce memory usage.
      Returns:
      The current Map of candidates
    • isCandidateForGroupByKeyAndWindow

      public <K, V> boolean isCandidateForGroupByKeyAndWindow(GroupByKey<K,V> transform)
      Returns whether the given GBK transform can be considered a candidate for group-by-key-and-window translation, which aims to reduce memory usage.
      Type Parameters:
      K - type of GBK key
      V - type of GBK value
      Parameters:
      transform - to evaluate
      Returns:
      true if given transform is a candidate; false otherwise
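      A sketch of how a GroupByKey translator might branch on this check (the helper class is illustrative and the alternate translations themselves are elided):

      import org.apache.beam.runners.spark.translation.EvaluationContext;
      import org.apache.beam.sdk.transforms.GroupByKey;

      class GroupByKeyTranslationSketch {
        static <K, V> void translate(EvaluationContext context, GroupByKey<K, V> gbk) {
          if (context.isCandidateForGroupByKeyAndWindow(gbk)) {
            // use the memory-friendly group-by-key-and-window translation
          } else {
            // fall back to the default GroupByKey translation
          }
        }
      }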
    • reportPCollectionConsumed

      public void reportPCollectionConsumed(PCollection<?> pCollection)
      Reports that the given PCollection is consumed by a PTransform in the pipeline.
    • reportPCollectionProduced

      public void reportPCollectionProduced(PCollection<?> pCollection)
      Reports that the given PCollection is produced by a PTransform in the pipeline.
    • getPCollectionConsumptionMap

      public Map<PCollection<?>,Integer> getPCollectionConsumptionMap()
      Get the map of each PCollection to the number of PTransforms consuming it.
      Returns:
      The map of PCollection to the number of consuming PTransforms.
    • isLeaf

      public boolean isLeaf(PCollection<?> pCollection)
      Check whether the given PCollection is a leaf. A PCollection is a leaf when no other PTransform consumes or depends on it.
      Parameters:
      pCollection - to be checked if it is a leaf
      Returns:
      true if pCollection is leaf; otherwise false
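      A sketch of how consumption reporting feeds the leaf check; the assumption that a produced but unconsumed PCollection registers as a leaf follows from the descriptions above:

      import org.apache.beam.runners.spark.translation.EvaluationContext;
      import org.apache.beam.sdk.values.PCollection;

      class LeafTrackingSketch {
        static boolean trackAndCheck(EvaluationContext context, PCollection<String> words) {
          context.reportPCollectionProduced(words);
          // No consumer reported yet, so 'words' should be a leaf and will be
          // forced to compute by computeOutputs().
          boolean leafBefore = context.isLeaf(words);

          context.reportPCollectionConsumed(words);
          // After a consumer is reported, 'words' is no longer a leaf.
          return leafBefore && !context.isLeaf(words);
        }
      }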
    • storageLevel

      public String storageLevel()
    • isStreamingSideInput

      public boolean isStreamingSideInput()
      Checks if any of the side inputs in the pipeline are streaming side inputs.

      If at least one of the side inputs is a streaming side input, this method returns true. When streaming side inputs are present, the CachedSideInputReader will not be used.

      Returns:
      true if any of the side inputs in the pipeline are streaming side inputs, false otherwise
    • useStreamingSideInput

      public void useStreamingSideInput()
      Marks that the pipeline contains at least one streaming side input.

      When this method is called, it sets the streamingSideInput flag to true, indicating that the CachedSideInputReader should not be used for processing side inputs.
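      A sketch of the flag's intended use during streaming translation; the boundedness check shown here is an illustrative stand-in for however the translator detects a streaming side input:

      import org.apache.beam.runners.spark.translation.EvaluationContext;
      import org.apache.beam.sdk.values.PCollection;
      import org.apache.beam.sdk.values.PCollectionView;

      class StreamingSideInputSketch {
        static void noteSideInput(EvaluationContext context, PCollectionView<?> view) {
          PCollection<?> backing = view.getPCollection();
          if (backing != null && backing.isBounded() == PCollection.IsBounded.UNBOUNDED) {
            // Mark the whole pipeline so CachedSideInputReader is bypassed.
            context.useStreamingSideInput();
          }
        }

        static boolean canUseCachedSideInputReader(EvaluationContext context) {
          return !context.isStreamingSideInput();
        }
      }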