Class SparkRunner

java.lang.Object
org.apache.beam.sdk.PipelineRunner<SparkPipelineResult>
org.apache.beam.runners.spark.SparkRunner

public final class SparkRunner extends PipelineRunner<SparkPipelineResult>
The SparkRunner translates operations defined on a pipeline to a representation executable by Spark, and then submits the job to Spark to be executed. To run a Beam pipeline with the default options of a single-threaded Spark instance in local mode, we would do the following:

Pipeline p = [logic for pipeline creation]
SparkPipelineResult result = (SparkPipelineResult) p.run();

To create a pipeline runner that runs against a different Spark cluster, with a custom master URL, we would do the following:

Pipeline p = [logic for pipeline creation]
SparkPipelineOptions options = SparkPipelineOptionsFactory.create();
options.setSparkMaster("spark://host:port");
SparkPipelineResult result = (SparkPipelineResult) p.run();
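The two usages above differ mainly in the master URL that Spark receives: a local-mode value versus a `spark://host:port` address. As a rough illustration of these two forms, the sketch below classifies a master string. `SparkMasterKind`, `Kind`, and `classify` are hypothetical helpers written for this example only; they are not part of Beam or Spark, and Spark accepts further forms (for example `yarn`) that the sketch simply reports as OTHER.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Beam or Spark.
final class SparkMasterKind {
    enum Kind { LOCAL, STANDALONE, OTHER }

    // "local", "local[4]" or "local[*]": Spark runs inside the current JVM.
    private static final Pattern LOCAL =
        Pattern.compile("local(\\[(\\d+|\\*)\\])?");

    // "spark://host:port" with a numeric port: a standalone cluster master.
    private static final Pattern STANDALONE =
        Pattern.compile("spark://[^:/\\s]+:\\d+");

    static Kind classify(String master) {
        if (LOCAL.matcher(master).matches()) {
            return Kind.LOCAL;
        }
        if (STANDALONE.matcher(master).matches()) {
            return Kind.STANDALONE;
        }
        // Anything else (e.g. "yarn") is outside the scope of this sketch.
        return Kind.OTHER;
    }

    public static void main(String[] args) {
        String[] samples = {"local", "local[4]", "local[*]", "spark://host:7077", "yarn"};
        for (String master : samples) {
            System.out.println(master + " -> " + classify(master));
        }
    }
}
```

A check like this could run before a value is passed to `setSparkMaster`, so that a mistyped URL is caught before the job is submitted.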

  • Method Details

    • create

      public static SparkRunner create()
      Creates and returns a new SparkRunner with default options, that is, one that runs against a Spark instance in local mode.
      Returns:
      A pipeline runner with default options.
    • create

      public static SparkRunner create(SparkPipelineOptions options)
      Creates and returns a new SparkRunner with the specified options.
      Parameters:
      options - The SparkPipelineOptions to use when executing the job.
      Returns:
      A pipeline runner that will execute with specified options.
    • fromOptions

      public static SparkRunner fromOptions(PipelineOptions options)
      Creates and returns a new SparkRunner with the specified options.
      Parameters:
      options - The PipelineOptions to use when executing the job.
      Returns:
      A pipeline runner that will execute with specified options.
    • run

      public SparkPipelineResult run(Pipeline pipeline)
      Description copied from class: PipelineRunner
      Processes the given Pipeline, potentially asynchronously, returning a runner-specific type of result.
      Specified by:
      run in class PipelineRunner<SparkPipelineResult>
    • initAccumulators

      public static void initAccumulators(SparkPipelineOptions opts, org.apache.spark.api.java.JavaSparkContext jsc)
      Initializes the Metrics/Aggregators accumulators. This method is idempotent.
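Idempotent here means that calling the method repeatedly has the same effect as calling it once. The following is a minimal sketch of one common way to get that behaviour, lazy initialization guarded by a lock. It is not Beam's actual implementation: `AccumulatorHolder`, `init`, `getInstance`, and the `AtomicLong` standing in for an accumulator are all hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for illustration only; not Beam's implementation.
final class AccumulatorHolder {
    // Stand-in for a real accumulator; created at most once.
    private static volatile AtomicLong instance;

    // Safe to call any number of times: only the first call creates the accumulator.
    static void init() {
        if (instance == null) {
            synchronized (AccumulatorHolder.class) {
                if (instance == null) {
                    instance = new AtomicLong();
                }
            }
        }
    }

    static AtomicLong getInstance() {
        if (instance == null) {
            throw new IllegalStateException("init() has not been called");
        }
        return instance;
    }

    public static void main(String[] args) {
        init();
        AtomicLong first = getInstance();
        first.incrementAndGet();
        long before = first.get();

        init(); // second call is a no-op: same object, same accumulated value

        System.out.println(first == getInstance());
        System.out.println(getInstance().get() == before);
    }
}
```

The second `init()` call neither replaces the existing object nor resets its value, which is what lets callers invoke it defensively without coordinating who goes first.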
    • updateCacheCandidates

      public static void updateCacheCandidates(Pipeline pipeline, SparkPipelineTranslator translator, EvaluationContext evaluationContext)
      Evaluator that updates/populates the cache candidates.
    • updateDependentTransforms

      public static void updateDependentTransforms(Pipeline pipeline, SparkPipelineTranslator translator, EvaluationContext evaluationContext)
      Evaluator that updates/populates information about dependent transforms for PCollections.
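This page does not describe how updateCacheCandidates and updateDependentTransforms do their work. As a rough sketch of the general idea only, the example below walks a toy pipeline description, records for each collection the transforms that read it, and then treats any collection read by more than one transform as a cache candidate. That criterion is an assumption made for this example, not something stated here, and all names (`DependencySketch`, `dependents`, `cacheCandidates`) are hypothetical. The code does not use Beam's `EvaluationContext` or `SparkPipelineTranslator`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch for illustration only; not Beam's implementation.
final class DependencySketch {

    // Input: transform name -> names of the collections it reads.
    // Output: collection name -> transforms that depend on it, in visit order.
    static Map<String, List<String>> dependents(Map<String, List<String>> inputsByTransform) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        inputsByTransform.forEach((transform, inputs) -> {
            for (String input : inputs) {
                result.computeIfAbsent(input, k -> new ArrayList<>()).add(transform);
            }
        });
        return result;
    }

    // Assumption: a collection read by more than one transform is worth caching.
    static List<String> cacheCandidates(Map<String, List<String>> dependents) {
        return dependents.entrySet().stream()
            .filter(e -> e.getValue().size() > 1)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, List<String>> pipeline = new LinkedHashMap<>();
        pipeline.put("Parse", List.of("lines"));
        pipeline.put("CountWords", List.of("words"));
        pipeline.put("LongestWord", List.of("words"));

        Map<String, List<String>> deps = dependents(pipeline);
        System.out.println(deps);
        System.out.println(cacheCandidates(deps));
    }
}
```

In this toy pipeline, `words` feeds two transforms, so it is the only collection reported as a candidate; `lines` has a single reader and would gain nothing from being kept around.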