Apache Beam SDK for Python¶
Apache Beam provides a simple, powerful programming model for building both batch and streaming parallel data processing pipelines.
The Apache Beam SDK for Python provides access to Apache Beam capabilities from the Python programming language.
Status¶
The SDK is still early in its development, and significant changes should be expected before the first stable version.
Overview¶
The key concepts in this programming model are
- PCollection: represents a collection of data, which could be bounded or unbounded in size.
- PTransform: represents a computation that transforms input PCollections into output PCollections.
- Pipeline: manages a directed acyclic graph of- PTransforms and- PCollections that is ready for execution.
- PipelineRunner: specifies where and how the pipeline should execute.
- Read: read from an external source.
- Write: write to an external data sink.
Typical usage¶
At the top of your source file:
import apache_beam as beam
After this import statement
- Transform classes are available as
beam.FlatMap,beam.GroupByKey, etc.
- Pipeline class is available as
beam.Pipeline
- Text read/write transforms are available as
beam.io.ReadFromText,beam.io.WriteToText.
Examples
The examples subdirectory has some examples.
Subpackages¶
- apache_beam.coders package
- apache_beam.dataframe package- Submodules- apache_beam.dataframe.convert module
- apache_beam.dataframe.doctests module
- apache_beam.dataframe.expressions module
- apache_beam.dataframe.frame_base module
- apache_beam.dataframe.frames module
- apache_beam.dataframe.io module
- apache_beam.dataframe.pandas_top_level_functions module
- apache_beam.dataframe.partitionings module
- apache_beam.dataframe.schemas module
- apache_beam.dataframe.transforms module
 
 
- Submodules
- apache_beam.io package- Subpackages- apache_beam.io.aws package
- apache_beam.io.azure package
- apache_beam.io.external package
- apache_beam.io.flink package
- apache_beam.io.gcp package- Subpackages- apache_beam.io.gcp.datastore package- Subpackages- apache_beam.io.gcp.datastore.v1new package- Submodules- apache_beam.io.gcp.datastore.v1new.adaptive_throttler module
- apache_beam.io.gcp.datastore.v1new.datastore_write_it_pipeline module
- apache_beam.io.gcp.datastore.v1new.datastoreio module
- apache_beam.io.gcp.datastore.v1new.helper module
- apache_beam.io.gcp.datastore.v1new.query_splitter module
- apache_beam.io.gcp.datastore.v1new.rampup_throttling_fn module
- apache_beam.io.gcp.datastore.v1new.types module
- apache_beam.io.gcp.datastore.v1new.util module
 
 
- Submodules
 
- apache_beam.io.gcp.datastore.v1new package
 
- Subpackages
- apache_beam.io.gcp.experimental package
- apache_beam.io.gcp.pubsublite package
 
- apache_beam.io.gcp.datastore package
- Submodules- apache_beam.io.gcp.big_query_query_to_table_pipeline module
- apache_beam.io.gcp.bigquery module
- apache_beam.io.gcp.bigquery_avro_tools module
- apache_beam.io.gcp.bigquery_file_loads module
- apache_beam.io.gcp.bigquery_io_metadata module
- apache_beam.io.gcp.bigquery_io_read_pipeline module
- apache_beam.io.gcp.bigquery_read_internal module
- apache_beam.io.gcp.bigquery_schema_tools module
- apache_beam.io.gcp.bigquery_tools module
- apache_beam.io.gcp.bigtableio module
- apache_beam.io.gcp.dicomclient module
- apache_beam.io.gcp.dicomio module
- apache_beam.io.gcp.gce_metadata_util module
- apache_beam.io.gcp.gcsfilesystem module
- apache_beam.io.gcp.gcsio module
- apache_beam.io.gcp.gcsio_overrides module
- apache_beam.io.gcp.pubsub module
- apache_beam.io.gcp.pubsub_it_pipeline module
- apache_beam.io.gcp.resource_identifiers module
- apache_beam.io.gcp.spanner module
 
 
- Subpackages
 
- Submodules- apache_beam.io.avroio module
- apache_beam.io.concat_source module
- apache_beam.io.debezium module
- apache_beam.io.filebasedsink module
- apache_beam.io.filebasedsource module
- apache_beam.io.fileio module
- apache_beam.io.filesystem module
- apache_beam.io.filesystemio module
- apache_beam.io.filesystems module
- apache_beam.io.hadoopfilesystem module
- apache_beam.io.iobase module
- apache_beam.io.jdbc module
- apache_beam.io.kafka module
- apache_beam.io.kinesis module
- apache_beam.io.localfilesystem module
- apache_beam.io.mongodbio module
- apache_beam.io.parquetio module
- apache_beam.io.range_trackers module
- apache_beam.io.restriction_trackers module
- apache_beam.io.snowflake module
- apache_beam.io.source_test_utils module
- apache_beam.io.textio module
- apache_beam.io.tfrecordio module
- apache_beam.io.utils module
- apache_beam.io.watermark_estimators module
 
 
- Subpackages
- apache_beam.metrics package
- apache_beam.ml package
- apache_beam.options package
- apache_beam.portability package- Subpackages- apache_beam.portability.api package- Subpackages- apache_beam.portability.api.org package- Subpackages- apache_beam.portability.api.org.apache package- Subpackages- apache_beam.portability.api.org.apache.beam package- Subpackages- apache_beam.portability.api.org.apache.beam.model package- Subpackages- apache_beam.portability.api.org.apache.beam.model.fn_execution package
- apache_beam.portability.api.org.apache.beam.model.interactive package
- apache_beam.portability.api.org.apache.beam.model.job_management package- Subpackages- apache_beam.portability.api.org.apache.beam.model.job_management.v1 package- Submodules- apache_beam.portability.api.org.apache.beam.model.job_management.v1.beam_artifact_api_pb2_grpc module
- apache_beam.portability.api.org.apache.beam.model.job_management.v1.beam_artifact_api_pb2_urns module
- apache_beam.portability.api.org.apache.beam.model.job_management.v1.beam_expansion_api_pb2_grpc module
- apache_beam.portability.api.org.apache.beam.model.job_management.v1.beam_job_api_pb2_grpc module
 
 
- Submodules
 
- apache_beam.portability.api.org.apache.beam.model.job_management.v1 package
 
- Subpackages
- apache_beam.portability.api.org.apache.beam.model.pipeline package- Subpackages- apache_beam.portability.api.org.apache.beam.model.pipeline.v1 package- Submodules- apache_beam.portability.api.org.apache.beam.model.pipeline.v1.beam_runner_api_pb2_grpc module
- apache_beam.portability.api.org.apache.beam.model.pipeline.v1.beam_runner_api_pb2_urns module
- apache_beam.portability.api.org.apache.beam.model.pipeline.v1.endpoints_pb2_grpc module
- apache_beam.portability.api.org.apache.beam.model.pipeline.v1.external_transforms_pb2_grpc module
- apache_beam.portability.api.org.apache.beam.model.pipeline.v1.external_transforms_pb2_urns module
- apache_beam.portability.api.org.apache.beam.model.pipeline.v1.metrics_pb2_grpc module
- apache_beam.portability.api.org.apache.beam.model.pipeline.v1.metrics_pb2_urns module
- apache_beam.portability.api.org.apache.beam.model.pipeline.v1.schema_pb2_grpc module
- apache_beam.portability.api.org.apache.beam.model.pipeline.v1.schema_pb2_urns module
- apache_beam.portability.api.org.apache.beam.model.pipeline.v1.standard_window_fns_pb2_grpc module
- apache_beam.portability.api.org.apache.beam.model.pipeline.v1.standard_window_fns_pb2_urns module
 
 
- Submodules
 
- apache_beam.portability.api.org.apache.beam.model.pipeline.v1 package
 
- Subpackages
 
 
- Subpackages
 
- apache_beam.portability.api.org.apache.beam.model package
 
- Subpackages
 
- apache_beam.portability.api.org.apache.beam package
 
- Subpackages
 
- apache_beam.portability.api.org.apache package
 
- Subpackages
 
- apache_beam.portability.api.org package
 
- Subpackages
 
- apache_beam.portability.api package
- Submodules
 
- Subpackages
- apache_beam.runners package- Subpackages- apache_beam.runners.dask package
- apache_beam.runners.dataflow package- Subpackages
- Submodules- apache_beam.runners.dataflow.dataflow_exercise_metrics_pipeline module
- apache_beam.runners.dataflow.dataflow_exercise_streaming_metrics_pipeline module
- apache_beam.runners.dataflow.dataflow_job_service module
- apache_beam.runners.dataflow.dataflow_metrics module
- apache_beam.runners.dataflow.dataflow_runner module
- apache_beam.runners.dataflow.ptransform_overrides module
- apache_beam.runners.dataflow.test_dataflow_runner module
 
 
- apache_beam.runners.direct package- Submodules- apache_beam.runners.direct.bundle_factory module
- apache_beam.runners.direct.clock module
- apache_beam.runners.direct.consumer_tracking_pipeline_visitor module
- apache_beam.runners.direct.direct_metrics module
- apache_beam.runners.direct.direct_runner module
- apache_beam.runners.direct.direct_userstate module
- apache_beam.runners.direct.evaluation_context module
- apache_beam.runners.direct.executor module
- apache_beam.runners.direct.helper_transforms module
- apache_beam.runners.direct.sdf_direct_runner module
- apache_beam.runners.direct.test_direct_runner module
- apache_beam.runners.direct.test_stream_impl module
- apache_beam.runners.direct.transform_evaluator module
- apache_beam.runners.direct.util module
- apache_beam.runners.direct.watermark_manager module
 
 
- Submodules
- apache_beam.runners.interactive package- Subpackages- apache_beam.runners.interactive.caching package- Submodules- apache_beam.runners.interactive.caching.cacheable module
- apache_beam.runners.interactive.caching.expression_cache module
- apache_beam.runners.interactive.caching.read_cache module
- apache_beam.runners.interactive.caching.reify module
- apache_beam.runners.interactive.caching.streaming_cache module
- apache_beam.runners.interactive.caching.write_cache module
 
 
- Submodules
- apache_beam.runners.interactive.dataproc package
- apache_beam.runners.interactive.display package- Submodules- apache_beam.runners.interactive.display.display_manager module
- apache_beam.runners.interactive.display.interactive_pipeline_graph module
- apache_beam.runners.interactive.display.pcoll_visualization module
- apache_beam.runners.interactive.display.pipeline_graph module
- apache_beam.runners.interactive.display.pipeline_graph_renderer module
 
 
- Submodules
- apache_beam.runners.interactive.messaging package
- apache_beam.runners.interactive.options package
- apache_beam.runners.interactive.sql package
- apache_beam.runners.interactive.testing package
 
- apache_beam.runners.interactive.caching package
- Submodules- apache_beam.runners.interactive.augmented_pipeline module
- apache_beam.runners.interactive.background_caching_job module
- apache_beam.runners.interactive.cache_manager module
- apache_beam.runners.interactive.interactive_beam module
- apache_beam.runners.interactive.interactive_environment module
- apache_beam.runners.interactive.interactive_runner module
- apache_beam.runners.interactive.pipeline_fragment module
- apache_beam.runners.interactive.pipeline_instrument module
- apache_beam.runners.interactive.recording_manager module
- apache_beam.runners.interactive.user_pipeline_tracker module
- apache_beam.runners.interactive.utils module
 
 
- Subpackages
- apache_beam.runners.job package
 
- Submodules
 
- Subpackages
- apache_beam.testing package- Subpackages- apache_beam.testing.benchmarks package- Subpackages- apache_beam.testing.benchmarks.nexmark package- Subpackages- apache_beam.testing.benchmarks.nexmark.models package
- apache_beam.testing.benchmarks.nexmark.queries package- Submodules- apache_beam.testing.benchmarks.nexmark.queries.nexmark_query_util module
- apache_beam.testing.benchmarks.nexmark.queries.query0 module
- apache_beam.testing.benchmarks.nexmark.queries.query1 module
- apache_beam.testing.benchmarks.nexmark.queries.query10 module
- apache_beam.testing.benchmarks.nexmark.queries.query11 module
- apache_beam.testing.benchmarks.nexmark.queries.query12 module
- apache_beam.testing.benchmarks.nexmark.queries.query2 module
- apache_beam.testing.benchmarks.nexmark.queries.query3 module
- apache_beam.testing.benchmarks.nexmark.queries.query4 module
- apache_beam.testing.benchmarks.nexmark.queries.query5 module
- apache_beam.testing.benchmarks.nexmark.queries.query6 module
- apache_beam.testing.benchmarks.nexmark.queries.query7 module
- apache_beam.testing.benchmarks.nexmark.queries.query8 module
- apache_beam.testing.benchmarks.nexmark.queries.query9 module
- apache_beam.testing.benchmarks.nexmark.queries.winning_bids module
 
 
- Submodules
 
- Submodules
 
- Subpackages
 
- apache_beam.testing.benchmarks.nexmark package
 
- Subpackages
- apache_beam.testing.load_tests package
 
- apache_beam.testing.benchmarks package
- Submodules- apache_beam.testing.datatype_inference module
- apache_beam.testing.extra_assertions module
- apache_beam.testing.metric_result_matchers module
- apache_beam.testing.pipeline_verifiers module
- apache_beam.testing.synthetic_pipeline module
- apache_beam.testing.test_pipeline module
- apache_beam.testing.test_stream module
- apache_beam.testing.test_stream_service module
- apache_beam.testing.test_utils module
- apache_beam.testing.util module
 
 
- Subpackages
- apache_beam.transforms package- Submodules- apache_beam.transforms.combinefn_lifecycle_pipeline module
- apache_beam.transforms.combiners module
- apache_beam.transforms.core module
- apache_beam.transforms.create_source module
- apache_beam.transforms.deduplicate module
- apache_beam.transforms.display module
- apache_beam.transforms.environments module
- apache_beam.transforms.external module
- apache_beam.transforms.external_java module
- apache_beam.transforms.fully_qualified_named_transform module
- apache_beam.transforms.periodicsequence module
- apache_beam.transforms.ptransform module
- apache_beam.transforms.resources module
- apache_beam.transforms.sideinputs module
- apache_beam.transforms.sql module
- apache_beam.transforms.stats module
- apache_beam.transforms.timeutil module
- apache_beam.transforms.trigger module
- apache_beam.transforms.userstate module
- apache_beam.transforms.util module
- apache_beam.transforms.window module
 
 
- Submodules
- apache_beam.typehints package- Submodules- apache_beam.typehints.arrow_batching_microbenchmark module
- apache_beam.typehints.arrow_type_compatibility module
- apache_beam.typehints.batch module
- apache_beam.typehints.decorators module
- apache_beam.typehints.native_type_compatibility module
- apache_beam.typehints.opcodes module
- apache_beam.typehints.pandas_type_compatibility module
- apache_beam.typehints.pytorch_type_compatibility module
- apache_beam.typehints.row_type module
- apache_beam.typehints.schema_registry module
- apache_beam.typehints.schemas module
- apache_beam.typehints.sharded_key_type module
- apache_beam.typehints.trivial_inference module
- apache_beam.typehints.typecheck module
- apache_beam.typehints.typehints module
 
 
- Submodules
- apache_beam.utils package- Submodules- apache_beam.utils.annotations module
- apache_beam.utils.histogram module
- apache_beam.utils.interactive_utils module
- apache_beam.utils.multi_process_shared module
- apache_beam.utils.plugin module
- apache_beam.utils.processes module
- apache_beam.utils.profiler module
- apache_beam.utils.proto_utils module
- apache_beam.utils.python_callable module
- apache_beam.utils.retry module
- apache_beam.utils.sentinel module
- apache_beam.utils.sharded_key module
- apache_beam.utils.shared module
- apache_beam.utils.subprocess_server module
- apache_beam.utils.thread_pool_executor module
- apache_beam.utils.timestamp module
- apache_beam.utils.urns module
 
 
- Submodules