Welcome to our learning resources. This page collects resources that will help you get started with Apache Beam and use it. If you’re just starting, you can treat this as a guided tour; otherwise, you can jump straight to any section that interests you.
If you have additional material that you would like to see here, please let us know at email@example.com!
- Getting Started
- Interactive Labs
- Code Examples
- API Reference
- Feedback and Suggestions
- How to Contribute
- Java Quickstart - How to set up and run a WordCount pipeline on the Java SDK.
- Python Quickstart - How to set up and run a WordCount pipeline on the Python SDK.
- Go Quickstart - How to set up and run a WordCount pipeline on the Go SDK.
- Java Development Environment - Setting up a Java development environment for Apache Beam using IntelliJ and Maven.
- Python Development Environment - Setting up a Python development environment for Apache Beam using PyCharm.
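The quickstarts above each run a WordCount pipeline. As a rough sketch of the computation that pipeline performs, the following plain-Python example splits lines into words and counts each word. It deliberately uses only the standard library and is not the Beam API; the function name `count_words`, the word-splitting regex, and the sample input are assumptions made for illustration.

```python
import re
from collections import Counter


def count_words(lines):
    """Count word occurrences across an iterable of text lines.

    Plain-Python illustration only; a real Beam pipeline would express
    these steps as transforms applied to a PCollection.
    """
    counts = Counter()
    for line in lines:
        # Step 1: split each line into words (the regex is an illustrative choice).
        words = re.findall(r"[A-Za-z']+", line)
        # Step 2: normalize case and add each word to the running totals.
        counts.update(word.lower() for word in words)
    return dict(counts)


# Hypothetical sample input, not taken from the quickstarts.
sample_lines = ["To be or not to be", "that is the question"]
word_counts = count_words(sample_lines)
```

In a Beam pipeline the same steps run as distributed transforms, so the input can be far larger than a single machine could hold in memory.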
Learning the Basics
- WordCount - Walks you through the code of a simple WordCount pipeline, which is intended to show the most basic concepts of data processing. WordCount is the “Hello World” of data processing.
- Mobile Gaming - Introduces how to consider time while processing data, user-defined transforms, windowing, filtering data, streaming pipelines, triggers, and session analysis. This is a great place to start once you get the hang of WordCount.
- Programming Guide - The Programming Guide contains more in-depth information on most topics in the Apache Beam SDK, including descriptions of how everything works and code snippets that show how to use each part. It can be used as a reference guide.
- The world beyond batch: Streaming 101 - Covers some basic background information, terminology, time domains, batch processing, and streaming.
- The world beyond batch: Streaming 102 - Tour of the unified batch and streaming programming model in Beam, along with an example to explain many of the concepts.
- Apache Beam Execution Model - Explanation of how runners execute an Apache Beam pipeline. This includes why serialization is important, and how a runner might distribute the work in parallel to multiple machines.
- Common Use Case Patterns Part 1 - Common patterns such as writing data to multiple storage locations, slowly-changing lookup cache, calling external services, dealing with bad data, and starting jobs through a REST endpoint.
- Common Use Case Patterns Part 2 - Common patterns such as GroupBy using multiple data properties, joining two PCollections on a common key, streaming large lookup tables, merging two streams with different window lengths, and threshold detection with time-series data.
- Retry Policy - Adding a retry policy to a
- Predicting news social engagement - Using multiple data sources, many common design patterns, and sentiment analysis to get insights into different news articles for TensorFlow and Dataflow.
- Processing IoT Data - IoT sensors continuously stream data to the cloud. Learn how to handle the sensor data, which can be useful for real-time monitoring, alerts, long-term data storage for analysis, performance improvement, and model training.
- Oracle Database to Google BigQuery - Migrate data from an Oracle Database into BigQuery using Dataprep.
- Google BigQuery to Google Datastore - Migrate data from a BigQuery table into Datastore without thinking of its schema.
- SAP HANA to Google BigQuery - Migrate data from a SAP HANA in-memory database into BigQuery.
- Machine Learning Preprocessing and Prediction - Predict the molecular energy from data stored in the Spatial Data File (SDF) format. Train a TensorFlow model with tf.Transform for preprocessing in Python. This also shows how to create batch and streaming prediction pipelines in Apache Beam.
- Machine Learning Preprocessing - Find the optimal parameter settings for simulated physical machines like a bottle filler or cookie machine. The goal of each simulated machine is to have the same input/output as the actual machine, making it a “digital twin”. This uses tf.Transform for preprocessing.
- Running on AppEngine - Use a Dataflow template to launch a pipeline from Google AppEngine, and run the pipeline periodically via a cron job.
- Stateful Processing - Learn how to access persistent mutable state while processing input elements, which allows for side effects in a DoFn. This can be used for arbitrary-but-consistent index assignment, where you want to assign a unique incrementing index to each incoming element and order doesn’t matter.
- Timely and Stateful Processing - An example of how to make batched RPC calls. The call requests are stored in mutable state as they are received. Once there are enough requests or a certain amount of time has passed, the batch of requests is sent.
- Running External Libraries - Call an external library written in a language, such as C++, that does not have a native SDK in Apache Beam.
- Big Data Text Processing Pipeline (40m) - Run a word count pipeline on the Dataflow runner.
- Real Time Machine Learning (45m) - Create a real-time flight delay prediction service using historical data on internal flights in the United States.
- Visualize Real-Time Geospatial Data (60m) - Process real-time streaming data from a real-world historical data set, store the results in BigQuery, and visualize the geospatial data in Data Studio.
- Processing Time Windowed Data (90m) - Implement time-windowed aggregation to augment the raw data in order to produce consistent training and test datasets for a machine learning model.
- Python Qwik Start (30m) - Run a word count pipeline on the Dataflow runner.
- NDVI from Landsat Images (45m) - Process Landsat satellite data in a distributed environment to compute the Normalized Difference Vegetation Index (NDVI).
- Simulate historic flights (60m) - Simulate real-time historic internal flights in the United States and store the resulting simulated data in BigQuery.
- Snippets 1 - Commonly used data analysis patterns such as using BigQuery, applying a CombinePerKey transform, removing duplicate lines in files, filtering, joining PCollections, getting the maximum value of a PCollection, etc.
- Snippets 2 - Additional examples on common tasks such as configuring BigQuery, PubSub, writing one file per window, etc.
- Complete Examples - End-to-end example pipelines such as an auto complete, a streaming word extract, calculating the Term Frequency-Inverse Document Frequency (TF-IDF), getting the top Wikipedia sessions, traffic max lane flow, traffic routes, etc.
- Snippets - Commonly-used data analysis patterns such as how to use BigQuery, Datastore, coders, combiners, filters, custom PTransforms, etc.
- Complete Examples - End-to-end example pipelines such as an auto complete, getting mobile gaming statistics, calculating the Julia set, solving distributed optimization tasks, estimating PI, calculating the Term Frequency-Inverse Document Frequency (TF-IDF), getting the top Wikipedia sessions, etc.
- Java API Reference - Official API Reference for the Java SDK.
- Python API Reference - Official API Reference for the Python SDK.
- Go API Reference - Official API Reference for the Go SDK.
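The Timely and Stateful Processing entry above describes buffering requests in mutable state and sending them as a batch once enough have accumulated or enough time has passed. The following plain-Python sketch simulates that flush-on-size-or-timeout logic outside of Beam. It is not the Beam state and timers API; the class `RequestBatcher`, its parameters, and the explicit `now` timestamps are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class RequestBatcher:
    """Buffers requests and flushes them by size or by elapsed time.

    Hypothetical illustration; in Beam the buffer would live in per-key
    state and the timeout would be driven by a timer.
    """
    send_batch: Callable[[List[Any]], None]  # called with each full batch
    max_size: int = 3                        # flush when this many are buffered
    max_wait_seconds: float = 10.0           # or when the oldest waited this long
    _buffer: List[Any] = field(default_factory=list)
    _first_arrival: Optional[float] = None

    def add(self, request: Any, now: float) -> None:
        # Remember when the first request of the current batch arrived.
        if not self._buffer:
            self._first_arrival = now
        self._buffer.append(request)
        if len(self._buffer) >= self.max_size:
            self.flush()

    def on_timer(self, now: float) -> None:
        # Flush a partial batch once the oldest request has waited long enough.
        if self._buffer and now - self._first_arrival >= self.max_wait_seconds:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            self.send_batch(list(self._buffer))
            self._buffer.clear()
            self._first_arrival = None


# Usage: collect the batches that would be sent to the external service.
sent = []
batcher = RequestBatcher(send_batch=sent.append, max_size=3, max_wait_seconds=10.0)
for t, req in enumerate(["a", "b", "c", "d"]):
    batcher.add(req, now=float(t))
batcher.on_timer(now=5.0)   # "d" arrived at t=3 and has waited only 2s
batcher.on_timer(now=13.0)  # "d" has now waited 10s, so it is flushed
```

Passing the clock in as `now` keeps the sketch deterministic; a real pipeline would rely on the runner to fire the timer.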
Feedback and Suggestions
We are open to feedback and suggestions. You can find different ways to reach out to the community on the Contact Us page.
If you have a bug report or want to suggest a new feature, you can let us know by submitting a new issue.
How to Contribute
We welcome contributions from everyone! To learn more about how to contribute, check our Contribution Guide.