Self Link:  https://s.apache.org/beam-design-docs

Documents by category

Project Incubation (2016)

  • Original Drive Folder for Incubation Docs [Google Drive folder]
  • Technical Vision [doc], [slides]
  • Repository Structure [doc]
  • Flink runner: Current status and development roadmap [doc]
  • Spark Runner Technical Vision [doc]
  • PPMC deep dive [slides]

Beam Model

  • Checkpoints [doc]
  • A New DoFn [doc], [slides]
  • Proposed Splittable DoFn API changes [doc]
  • Splittable DoFn (Obsoletes Source API) [doc]
    • Reimplementing Beam API classes on top of Splittable DoFn on top of Source API [doc]
    • New TextIO features based on SDF [doc]
    • Watch transform [doc]
    • Bundles w/ SplittableDoFns [doc]
    • Custom Runner-issued Checkpoint [doc]
  • State and Timers for DoFn [doc]
    • Portable OrderedListState [doc]
  • ContextFn [doc]
  • Static Display Data [doc]
  • Lateness (and Panes) in Apache Beam [doc]
  • Triggers in Apache Beam [doc]
  • Triggering is for sinks [doc] (not implemented)
  • Guard against “Trigger Finishing” [doc]
  • Pipeline Drain [doc]
  • Pipelines Considered Harmful [doc]
  • Side-Channel Inputs [doc]
  • Dynamic Pipeline Options [doc]
  • SDK Support for Reading Dynamic PipelineOptions [doc]
  • Fine-grained Resource Configuration in Beam [doc]
  • External Join with KV Stores [doc]
  • Error Reporting Callback (WIP) [doc]
  • Snapshotting and Updating Beam Pipelines [doc]
  • Requiring PTransform to set a coder on its resulting collections [mail]
  • Support of @RequiresStableInput annotation [doc], [mail]
  • [PROPOSAL] @OnWindowExpiration [mail]
  • AutoValue Coding and Row Support [doc]
  • HyperLogLog++ Integration with Apache Beam [doc]
  • Retractions [doc]
  • @RequiresTimeSortedInput annotation for stateful DoFns [doc]
  • GroupIntoBatches with Runner Determined Sharding [doc]
  • Runner and Fn API StateSpec Mismatch [doc]

IO / Filesystem

  • IOChannelFactory Redesign [doc]
  • Configurable BeamFileSystem [doc]
  • New API for writing files in Beam [doc]
  • Dynamic file-based sinks [doc]
  • Beam GCP Debuggability Metrics [doc]
  • KafkaIO
    • Event Time and Watermarks in KafkaIO [doc]
    • Exactly-once Kafka sink [doc]
    • KafkaIO Dynamic Read [doc]
  • CDAP IO [doc]
  • Schema Aware Beam IOs [doc]
  • Client-Side Throttling Overview [doc]

Metrics

  • Defining and Adding SDK Metrics via FN API [doc]
  • Histogram Style Metrics [doc]
  • Get Metrics API: Metric Extraction via proto RPC API. [doc]
  • Metrics API [doc]
  • I/O Metrics [doc]
  • Metrics extraction independent from runners / execution engines [doc]
  • Watermark Metrics [doc]
  • Support Dropwizard Metrics in Beam [doc]
  • Beam GCP Debuggability Metrics [doc]

Runners

  • Runner Authoring Guide [doc] (obsoletes [doc] and [doc])
  • Composite PInputs, POutputs, and the Runner API [doc]
  • Side Input Architecture for Apache Beam [doc]
  • Runner supported features plugin [doc]
  • Structured streaming Spark Runner [doc]
  • SDF Self-checkpoint Support on Portable Flink [doc]

SQL / Schema

  • Streams and Tables [doc]
  • Streaming SQL [doc]
  • Schema-Aware PCollections [doc]
  • Pubsub to Beam SQL [doc]
  • Apache Beam Proposal: design of DSL SQL interface [doc]
  • Calcite/Beam SQL Windowing [doc]
  • Reject Unsupported Windowing Strategies in JOIN [doc]
  • Beam DSL_SQL branch API review [doc]
  • Complex Types Support for Beam SQL DDL [mail]
  • [SQL] Reject unsupported inputs to Joins [mail]
  • Integrating runners & IO [doc]
  • Beam SQL Pipeline Options [doc]
  • Unbounded limit [doc]
  • Portable Beam Schemas [doc]
  • Cost Based Optimizer [doc1, doc2]
  • ZetaSQL as a dialect in BeamSQL [doc]
  • Project and predicate push-down [doc]

Portability

  • Portability Framework
    • The model protos contain all aspects of the portability API and are the source of truth. The proto definitions supersede any design documents. The main design documents are the following:
    • Runner API. Pipeline representation and discussion of primitive/composite transforms and optimizations.
    • Job API. Job submission and management protocol.
    • Fn API. Execution-side control and data protocols and overview.
    • Container contract. Execution-side Docker container invocation and provisioning protocols. See CONTAINERS.md for how to build container images.
    • Cross language. Options and tradeoffs for how to handle various kinds of multi-language/multi-SDK pipelines.
  • Fn API
    • Apache Beam Fn API Overview [doc]
    • Processing a Bundle [doc]
    • Progress [doc]
    • Graphical view of progress [doc]
    • Fn State API and Bundle Processing [doc]
    • Checkpointing and splitting of Beam bundles over the Fn API, with application to SDF [doc]
    • How to send and receive data [doc]
    • Defining and adding SDK Metrics [doc]
    • SDK harness container contract [doc]
    • Structure and Lifting of Combines [doc]
  • SDK X with Runner Y using Runner API [doc]
  • Flink Portable Runner Overview [doc]
  • Launching portable pipeline on Flink Runner [doc]
  • Portability support [table]
  • Portability Prototype [doc]
  • Portable Artifact Staging [doc]
  • Portable Beam on Flink [doc]
  • Portability API: How to Checkpoint and Split Bundles [doc]
  • Portability API: How to Finalize Bundles [doc]
  • Side Input in Universal Reference Runner [doc]
  • Spark Portable Runner Overview [doc]
  • Cross-Language
  • Environment Resources and Annotations [doc]

Build / Testing

  • More Expressive PAsserts [doc]
  • Mergebot design document [doc]
  • Performance tests for commonly used file-based I/O PTransforms [doc]
  • Performance tests results analysis and basic regression detection [doc]
  • Eventual PAssert [doc]
  • Testing I/O Transforms in Apache Beam [doc]
  • Reproducible Environment for Jenkins Tests By Using Container [doc]
  • Keeping precommit times fast [doc]
  • Increase Beam post-commit tests stability [doc]
  • Beam-Site Automation Reliability [doc]
  • Managing outdated dependencies [doc]
  • Automation For Beam Dependency Check [doc]
  • Test performance of core Apache Beam operations [doc]
  • Add static code analysis quality gates to Beam [doc]
  • Portable batch & streaming load tests in all SDKs [doc]
  • Storing, displaying and detecting anomalies in test results [doc]
  • Add ARM Support to Beam SDK Container Images [doc]

Deployment

  • Beam on Flink on Kubernetes [doc]

Python

  • Beam Python User State and Timer APIs [doc]
  • Python Kafka connector [doc]
  • Python 3 support [doc]
  • Splittable DoFn for Python SDK [doc]
  • Parquet IO for Python SDK [doc]
  • Building Python Wheels [doc]
  • Beam Type Hints for Python 3 [doc]
  • Pandas Dataframe API for Beam [doc]
  • Batched DoFns [doc]
  • PEP 585 Type Hints for Python 3.9+ [doc]
  • The Current State of Beam Python Type Hinting (as of 2.52.0) [doc]
  • Enrichment transform [doc]

Go

  • Apache Beam Go SDK design [doc]
  • Go SDK Vanity Import Path [doc] (unimplemented)
    • Needs to be adjusted to account for Go Modules.
  • Go SDK Integration Tests [doc]
  • Design RFC
    • Assumes Beam knowledge, but points out how Go's features informed the SDK design.
  • User Defined Coders + Original Schema Sketch 
  • Splittable DoFns for the Go SDK [doc]
  • Self-Checkpointing SDFs for the Go SDK [doc]
  • Bundle Finalization in the Go SDK [doc]
  • Watermark Estimation in the Go SDK [doc]
  • State and Timers in the Go SDK [doc]
  • Using Generics for Registration [doc]
  • Side Input Window Mapping [doc]
  • MultiMap Side Input Support [doc]
  • One-Pagers:
    • Investigation: Go Expansion Service Auto-Startup for Dev Environments [doc]

Machine Learning

  • Custom Inference Functions [doc]
  • Model Updates using Side Inputs [doc]
  • RunInference: ML Inference in Beam [doc]
  • beam.MLTransform [doc]
  • Embeddings in MLTransform [doc]
  • TensorFlow Model Handler [doc]
  • Hugging Face Model Handler [doc]
  • Per Key Inference [doc]
  • Benchmarking RunInference with Multi-Process Shared Models [doc]

Other

  • Euphoria - High-Level Java 8 DSL [doc]
  • Apache Beam Code Review Guide [doc]
  • Nexmark - Nexmark
  • Slowly Changing Side Inputs (or Slowly Changing Dimensions Support) [doc]

Some of the documents are available on this Google Drive.

To add a new design document, it is recommended to use this design document template.
