2024/06/26
Apache Beam 2.57.0
Kenneth Knowles [@KennKnowles]
We are happy to present the new 2.57.0 release of Beam. This release includes both improvements and new functionality. See the download page for this release.
For more information on changes in 2.57.0, check out the detailed release notes.
Highlights
I/Os
- Ensure that BigtableIO closes the reader streams (#31477).
New Features / Improvements
- Added Feast feature store handler for enrichment transform (Python) (#30957).
- BigQuery per-worker metrics are reported by default for Streaming Dataflow Jobs (Java) (#31015)
- Adds inMemory() variant of Java List and Map side inputs for more efficient lookups when the entire side input fits into memory.
- Beam YAML now supports the jinja templating syntax. Template variables can be passed with the (json-formatted) --jinja_variables flag.
- DataFrame API now supports pandas 2.1.x and adds 12 more string functions for Series (#31185).
- Added BigQuery handler for enrichment transform (Python) (#31295)
- Disable soft delete policy when creating the default bucket for a project (Java) (#31324).
- Added DoFn.SetupContextParam and DoFn.BundleContextParam which can be used as a Python DoFn.process, Map, or FlatMap parameter to invoke a context manager per DoFn setup or bundle (analogous to using setup/teardown or start_bundle/finish_bundle respectively).
- Go SDK Prism Runner
  - Pre-built Prism binaries are now part of the release and are available via the GitHub release page (#29697).
  - ProcessingTime is now handled synthetically with TestStream pipelines and Non-TestStream pipelines, for fast test pipeline execution by default (#30083).
    - Prism does NOT yet support “real time” execution for this release.
- Improve processing for large elements to reduce the chances for exceeding 2GB protobuf limits (Python) (#31607).
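As a rough illustration of the json-formatted --jinja_variables flag mentioned above, the sketch below builds the flag value from a Python dict. The variable names, the pipeline.yaml file name, and the apache_beam.yaml.main launcher with its --yaml_pipeline_file flag are assumptions for illustration and should be checked against your own Beam YAML setup.

```python
import json
import shlex


def jinja_variables_flag(variables):
    """Return a --jinja_variables=<json> argument for the given mapping."""
    # sort_keys keeps the generated argument deterministic across runs.
    return "--jinja_variables=" + json.dumps(variables, sort_keys=True)


# Hypothetical template variables; a template would refer to them
# with jinja syntax such as {{ input_path }}.
variables = {"input_path": "gs://my-bucket/input.csv", "min_score": 10}
flag = jinja_variables_flag(variables)

# Assumed launcher module and pipeline-file flag (not taken from these notes).
command = [
    "python", "-m", "apache_beam.yaml.main",
    "--yaml_pipeline_file=pipeline.yaml",
    flag,
]

# shlex.join quotes the JSON so it survives a POSIX shell intact.
print(shlex.join(command))
```

Generating the JSON with json.dumps rather than typing it by hand avoids quoting mistakes when values contain spaces or quotes.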
Breaking Changes
- Java’s View.asList() side inputs are now optimized for iterating rather than indexing when in the global window. This new implementation still supports all (immutable) List methods as before, but some of the random access methods like get() and size() will be slower. To use the old implementation one can use View.asList().withRandomAccess().
- SchemaTransforms implemented with TypedSchemaTransformProvider now produce a configuration Schema with snake_case naming convention (#31374). This will make the following cases problematic:
  - Running a pre-2.57.0 remote SDK pipeline containing a 2.57.0+ Java SchemaTransform.
  - Running a 2.57.0+ remote SDK pipeline containing a pre-2.57.0 Java SchemaTransform.
  - All direct uses of Python’s SchemaAwareExternalTransform should be updated to use new snake_case parameter names.
- Upgraded Jackson Databind to 2.15.4 (Java) (#26743). jackson-2.15 has known breaking changes. An important one is that it imposes a buffer limit on the parser. If your custom PTransforms or DoFns are affected, refer to #31580 for mitigation.
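For the snake_case change to SchemaTransform configuration, one way to update existing keyword arguments is to rename them mechanically. The helper below is a minimal sketch, not Beam's own conversion: the function names and the example parameters are hypothetical, and edge cases such as acronyms may be handled differently by Beam, so the result should be checked against the transform's actual configuration schema.

```python
import re


def to_snake_case(name):
    """Convert a camelCase identifier to snake_case."""
    # Put an underscore before each capital that follows a lowercase
    # letter or digit, then lowercase the whole name.
    with_breaks = re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name)
    return with_breaks.lower()


def snake_case_kwargs(kwargs):
    """Return a copy of kwargs with every key renamed to snake_case."""
    return {to_snake_case(key): value for key, value in kwargs.items()}


# Hypothetical pre-2.57.0 style parameters, for illustration only.
old_kwargs = {
    "bootstrapServers": "localhost:9092",
    "topic": "events",
    "autoOffsetReset": "earliest",
}
new_kwargs = snake_case_kwargs(old_kwargs)
print(new_kwargs)
```

The renamed dictionary could then be passed as keyword arguments where the camelCase names were used before.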
Known Issues
- Python pipelines that run with 2.53.0-2.58.0 SDKs and read data from GCS might be affected by a data corruption issue (#32169). The issue will be fixed in 2.59.0 (#32135). To work around this, update the google-cloud-storage package to version 2.18.2 or newer.
- BigQuery Enrichment (Python): The following issues are present when using the BigQuery enrichment transform (#32780):
  - Duplicate Rows: Multiple conditions may be applied incorrectly, leading to the duplication of rows in the output.
  - Incorrect Results with Batched Requests: Conditions may not be correctly scoped to individual rows within the batch, potentially causing inaccurate results.
  - Fixed in 2.61.0.
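To check whether an environment falls under the GCS data corruption issue above, a small version check along these lines may help. It is a sketch under stated assumptions: the helper names are hypothetical, versions are compared by their leading numeric components only, and every SDK version from 2.53.0 up to (but not including) 2.59.0 is treated as affected.

```python
import re
from importlib import metadata


def parse_version(text):
    """Parse the leading numeric components of a version string."""
    numbers = []
    for part in text.split("."):
        match = re.match(r"\d+", part)
        if match is None:
            break  # stop at suffixes such as "dev"
        numbers.append(int(match.group()))
    return tuple(numbers)


def needs_gcs_upgrade(beam_version, gcs_version):
    """Return True if the workaround from the release notes applies."""
    beam = parse_version(beam_version)
    # Assumption: anything from 2.53.0 up to the 2.59.0 fix is affected.
    affected = (2, 53, 0) <= beam < (2, 59, 0)
    return affected and parse_version(gcs_version) < (2, 18, 2)


def installed_version(package):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


beam_version = installed_version("apache-beam")
gcs_version = installed_version("google-cloud-storage")
if beam_version and gcs_version and needs_gcs_upgrade(beam_version, gcs_version):
    print("Upgrade google-cloud-storage to 2.18.2 or newer.")
```

If either package is not installed, the check is skipped silently.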
For the most up to date list of known issues, see https://github.com/apache/beam/blob/master/CHANGES.md
List of Contributors
According to git shortlog, the following people contributed to the 2.57.0 release. Thank you to all contributors!
Ahmed Abualsaud
Ahmet Altay
Alexey Romanenko
Andrey Devyatkin
Anody Zhang
Arvind Ram
Ben Konz
Bruno Volpato
Celeste Zeng
Chamikara Jayalath
Claire McGinty
Colm O hEigeartaigh
Damon
Danny McCormick
Evan Galpin
Ferran Fernández Garrido
Florent Biville
Jack Dingilian
Jack McCluskey
Jan Lukavský
JayajP
Jeff Kinard
Jeffrey Kinard
John Casey
Justin Uang
Kenneth Knowles
Kevin Zhou
Liam Miller-Cushon
Maarten Vercruysse
Maciej Szwaja
Maja Kontrec Rönn
Marc hurabielle
Martin Trieu
Mattie Fu
Min Zhu
Naireen Hussain
Nick Anikin
Pablo Rodriguez Defino
Paul King
Priyans Desai
Radosław Stankiewicz
Rebecca Szper
Ritesh Ghorse
Robert Bradshaw
Robert Burke
Rodrigo Bozzolo
RyuSA
Sam Rohde
Sam Whittle
Sergei Lilichenko
Shahar Epstein
Shunping Huang
Svetak Sundhar
Tomo Suzuki
Tony Tang
Valentyn Tymofieiev
Vincent Stollenwerk
Vineet Kumar
Vitaly Terentyev
Vlado Djerek
XQ Hu
Yi Hu
akashorabek
bzablocki
kberezin