Apache Beam SDK for Python
Apache Beam provides a simple, powerful programming model for building both batch and streaming parallel data processing pipelines.
The Apache Beam SDK for Python provides access to Apache Beam capabilities from the Python programming language.
Status
The SDK is still early in its development, and significant changes should be expected before the first stable version.
Overview
The key concepts in this programming model are:
- PCollection: represents a collection of data, which could be bounded or unbounded in size.
- PTransform: represents a computation that transforms input PCollections into output PCollections.
- Pipeline: manages a directed acyclic graph of PTransforms and PCollections that is ready for execution.
- PipelineRunner: specifies where and how the pipeline should execute.
- Read: read from an external source.
- Write: write to an external data sink.
Typical usage
At the top of your source file:
import apache_beam as beam
After this import statement:
- Transform classes are available as beam.FlatMap, beam.GroupByKey, etc.
- Pipeline class is available as beam.Pipeline.
- Text read/write transforms are available as beam.io.ReadFromText, beam.io.WriteToText.
Examples
The examples subdirectory contains sample pipelines.
Subpackages
- apache_beam.coders package
- apache_beam.internal package
- apache_beam.io package
  - Subpackages
    - apache_beam.io.gcp package
      - Subpackages
        - apache_beam.io.gcp.datastore package
          - Subpackages
            - apache_beam.io.gcp.datastore.v1 package
              - Submodules
                - apache_beam.io.gcp.datastore.v1.adaptive_throttler module
                - apache_beam.io.gcp.datastore.v1.datastoreio module
                - apache_beam.io.gcp.datastore.v1.fake_datastore module
                - apache_beam.io.gcp.datastore.v1.helper module
                - apache_beam.io.gcp.datastore.v1.query_splitter module
                - apache_beam.io.gcp.datastore.v1.util module
      - Submodules
        - apache_beam.io.gcp.big_query_query_to_table_pipeline module
        - apache_beam.io.gcp.bigquery module
        - apache_beam.io.gcp.bigquery_io_read_pipeline module
        - apache_beam.io.gcp.datastore_write_it_pipeline module
        - apache_beam.io.gcp.gcsfilesystem module
        - apache_beam.io.gcp.gcsio module
        - apache_beam.io.gcp.pubsub module
        - apache_beam.io.gcp.pubsub_it_pipeline module
  - Submodules
    - apache_beam.io.avroio module
    - apache_beam.io.concat_source module
    - apache_beam.io.filebasedsink module
    - apache_beam.io.filebasedsource module
    - apache_beam.io.filesystem module
    - apache_beam.io.filesystemio module
    - apache_beam.io.filesystems module
    - apache_beam.io.hadoopfilesystem module
    - apache_beam.io.iobase module
    - apache_beam.io.localfilesystem module
    - apache_beam.io.range_trackers module
    - apache_beam.io.restriction_trackers module
    - apache_beam.io.source_test_utils module
    - apache_beam.io.textio module
    - apache_beam.io.tfrecordio module
    - apache_beam.io.utils module
    - apache_beam.io.vcfio module
- apache_beam.metrics package
- apache_beam.options package
- apache_beam.portability package
  - Subpackages
    - apache_beam.portability.api package
      - Submodules
        - apache_beam.portability.api.beam_artifact_api_pb2_grpc module
        - apache_beam.portability.api.beam_fn_api_pb2_grpc module
        - apache_beam.portability.api.beam_job_api_pb2_grpc module
        - apache_beam.portability.api.beam_provision_api_pb2_grpc module
        - apache_beam.portability.api.beam_runner_api_pb2_grpc module
        - apache_beam.portability.api.endpoints_pb2_grpc module
        - apache_beam.portability.api.standard_window_fns_pb2_grpc module
  - Submodules
- apache_beam.runners package
  - Subpackages
    - apache_beam.runners.dataflow package
    - apache_beam.runners.direct package
      - Submodules
        - apache_beam.runners.direct.bundle_factory module
        - apache_beam.runners.direct.clock module
        - apache_beam.runners.direct.consumer_tracking_pipeline_visitor module
        - apache_beam.runners.direct.direct_metrics module
        - apache_beam.runners.direct.direct_runner module
        - apache_beam.runners.direct.direct_userstate module
        - apache_beam.runners.direct.evaluation_context module
        - apache_beam.runners.direct.executor module
        - apache_beam.runners.direct.helper_transforms module
        - apache_beam.runners.direct.sdf_direct_runner module
        - apache_beam.runners.direct.test_direct_runner module
        - apache_beam.runners.direct.transform_evaluator module
        - apache_beam.runners.direct.util module
        - apache_beam.runners.direct.watermark_manager module
    - apache_beam.runners.interactive package
    - apache_beam.runners.job package
  - Submodules
- apache_beam.testing package
- apache_beam.tools package
- apache_beam.transforms package
  - Submodules
    - apache_beam.transforms.combiners module
    - apache_beam.transforms.core module
    - apache_beam.transforms.create_source module
    - apache_beam.transforms.display module
    - apache_beam.transforms.ptransform module
    - apache_beam.transforms.sideinputs module
    - apache_beam.transforms.timeutil module
    - apache_beam.transforms.trigger module
    - apache_beam.transforms.userstate module
    - apache_beam.transforms.util module
    - apache_beam.transforms.window module
- apache_beam.typehints package
- apache_beam.utils package