“Lyft Marketplace team aims to improve our business efficiency by being nimble to real-world dynamics. Apache Beam has enabled us to meet the goal of having a robust and scalable ML infrastructure for improving model accuracy with features in real-time. These real-time features support critical functions like Forecasting, Primetime, Dispatch.”
Real-time ML with Beam at Lyft
Lyft, Inc. is an American mobility-as-a-service provider that offers ride-hailing, car and motorized scooter rentals, bicycle-sharing, food delivery, and business transportation solutions. Lyft is based in San Francisco, California, and operates in 644 cities in the United States and 12 cities in Canada.
As you might expect from a company as large as Lyft, connecting drivers and riders in space and time at such a scale requires a powerful real-time streaming infrastructure. Ravi Kiran Magham, Software Engineer at Lyft, shared the story of how Apache Beam has become a mission-critical and integral real-time data processing technology for Lyft by enabling large-scale streaming data processing and machine learning pipelines.
Democratizing Stream Processing
Lyft originally built streaming ETL pipelines to transform, enrich, and sink events generated by application services to their data lake in AWS S3 using Amazon Kinesis and Apache Flink. Apache Flink is the foundation of Lyft’s streaming architecture and was chosen over Apache Spark due to its robust, fault-tolerant, and intuitive API for distributed stateful stream processing, exactly-once processing, and variety of I/O connectors.
Lyft’s popularity and growth were bringing new demands to data streaming infrastructure: more teams with diverse programming language preferences wanted to explore event-driven streaming applications, and build streaming features for real-time machine learning models to make business more efficient, enhance customer experiences, and provide time-sensitive compliance operations. The Data Platform team looked into improving the prime time (surge pricing) computation for the Marketplace team, which had a service orchestrating an ensemble of ML models, exchanging data over Redis. The teams aimed at reducing code complexity and improving latency (from 5 to < 1 min end to end). With Python being a prerequisite by the Marketplace team and Java being heavily used by the Data Platform team, Lyft started exploring the Apache Beam portability framework in 2019 to democratize streaming for all teams.
The Apache Beam portability and multi-language capabilities were the key pique and the primary reason for us to start exploring Beam in a bigger way.
Apache Beam provides a solution to the programming language and data processing engine dilemma, as it offers a variety of runners (including the Beam Flink runner for Apache Flink) and a variety of programming language SDKs. Apache Beam offers an ultimate level of portability with its concept of “write once, run anywhere” and its ability to create multi-language pipelines - data pipelines that use transforms from more than one programming language.
Leveraging Apache Beam has been a “win-win” decision for us because our data infra teams use Java but we are able to offer Python SDK for our product teams, as it has been the de-facto language that they prefer. We write streaming pipelines with ease and comfort and run them on the Beam Flink runner.
The Data Platform team built a control plane of in-house services and FlinkK8sOperator to manage Flink applications on a Kubernetes cluster and deploy streaming Apache Beam and Apache Flink jobs. Lyft uses a blue/green deployment strategy on critical pipelines to minimize any downtime and uses custom macros for improved observability and seamless integration of the CI/CD deployments. To improve developer productivity, the Data Platform team offers a lightweight, YAML-based DSL to abstract the source and sink configurations, and provides reusable Apache Beam PTransforms for filtering and enrichment of incoming events.
Powering Real-time Machine Learning Pipelines
Lyft Marketplace plays a pivotal role in optimizing fleet demand and supply prediction, dynamic pricing, ETA calculation, and more. The Apache Beam Python SDK and Flink Runner enable the team to be nimble to change and support the demands for real-time ML – streaming feature generation and model execution. The Data Platform team has extended the streaming infrastructure to support Continual Learning use cases. Apache Beam powers continuous training of ML models with real-time data over larger windows of 2 hours to identify and fine-tune biases in cost and ETA.
Lyft separated Feature Generation and ML Model Execution into multiple streaming pipelines. The streaming Apache Beam pipeline generates features in real-time and writes them to a Kafka topic to be consumed by the model execution pipeline. Based on user configuration, the features are replicated and keyed out by model ID to stateful ParDo transforms, which leverage timers and/or data (feature) availability to invoke ML models. Features are stored in a global window and the state is explicitly cleaned up. The ML models run as part of the Model Serving infrastructure and their output can be an input feature to another ML model. To support this DAG workflow, Apache Beam pipelines write the output to Kafka and feed it to the model execution streaming pipeline for processing, in addition to writing it to Redis.
The complex real-time Feature Generation involves processing ~4 million events of 1KB per minute with sub-second latency, generating ~100 features on multiple event attributes across space and time granularities (1 and 5 minutes). Apache Beam allowed the Lyft Marketplace team to reduce latency by 60%, significantly simplify the code, and onboard many teams and use cases onto streaming.
The Marketplace team are heavy users of Apache Beam for real-time feature computation and model executions. Processing events in real-time with a sub-second latency allows our ML models to understand marketplace dynamics early and make informed decisions.
Amplifying Use Cases
Lyft has leveraged Apache Beam for more than 60 use cases and enabled them to complete critical business commitments and improve real-time user experiences.
For example, Lyft’s Map Data Delivery team moved from a batch process to a streaming pipeline for identifying road closures in real-time. Their Routing Engine uses this information to determine the best routes, improve ETA and provide a better driver and customer experience. The job processes ~400k events per second, conflates streams of data coming from 3rd party road closures and real-time traffic data to determine actual closures and publish them as events to Kafka. A custom S3 PTransform allows for the job to regularly publish a snapshot of closures for downstream batch processing.
Apache Beam enabled Lyft to optimize a very specific use case that relates to reporting pick-ups and drop-offs at airports. Airports require mobility applications to report every pick-up and drop-off and match them with the time of fleet entry and exit. Failing to do so results in a lower compliance score and even risk of being penalized. Originally, Lyft had a complicated implementation using the KCL library to consume events and store them in Redis. Python worker processes ran at regular intervals to consume data from Redis, join and enrich the data with service API calls, and send the output to airport applications. With that implementation, late-arriving updates and out-of-order events significantly impacted the completeness score. Lyft migrated the use case to a streaming Apache Beam pipeline with state and timers to keep events in a global window and manage sessions. Apache Beam helped Lyft achieve a top compliance score by improving the latency of event reporting from 5 to 2 seconds and reducing missing entry/exit data to 1.3%.
Like many companies shaking up standard business models, Lyft relies on open-source software and likes to give back to the community. Many of the big data frameworks, tools, and implementations developed by Lyft are open-sourced on their GitHub. Lyft has been an ample Apache Beam contributor since 2018, and Lyft engineers have presented their Apache Beam integrations at various events, such as Beam Summit North America, Berlin Buzzwords, O’Reilly Strata Data & AI, and more.
The portability of the Apache Beam model is the key to distributed execution. It enabled Lyft to run mission-critical data pipelines written in a non-JVM language on a JVM-based runner. Thus, they avoided code rewrites and sidestepped the potential cost of many API styles and runtime environments, reducing pipeline development time from multiple days to just hours. Full isolation of user code and native CPython execution without library restrictions resulted in easy onboarding and adoption. Apache Beam’s multi-language and cross-language capabilities solved Lyft’s programming language dilemma. With the unified programming model, Lyft is no longer tied to a specific technology stack.
Apache Beam enabled Lyft to switch from batch ML model training to real-time ML training with granular control of data freshness using windowing. Their data engineering and product teams can use both Python and Java, based on the appropriateness for a particular task or their preference. Apache Beam has helped Lyft successfully build and scale 60+ streaming pipelines processing events at very low latencies in near-real-time. New use cases keep coming, and Lyft is planning on leveraging Beam SQL and the Go SDK to provide a full range of Apache Beam multi-language capabilities for their teams.
Was this information useful?