Preprocess data with MLTransform
This page explains how to use the
MLTransform class to preprocess data for machine learning (ML)
workflows. Apache Beam provides a set of data processing transforms for
preprocessing data for training and inference. The
MLTransform class wraps the
various transforms in one class, simplifying your workflow. For a full list of
available transforms, see the Transforms section on this page.
Why use MLTransform
- With MLTransform, you can use the same preprocessing steps for both training and inference, which ensures consistent results.
- Generate embeddings on text data using large language models (LLMs).
- MLTransform can do a full pass on the dataset, which is useful when you need to transform a single element only after analyzing the entire dataset. For example, with MLTransform, you can complete the following tasks:
- Normalize an input value by using the minimum and maximum value of the entire dataset.
- Convert floats to ints by assigning them buckets, based on the observed data distribution.
- Convert strings to ints by generating vocabulary over the entire dataset.
- Count the occurrences of words in all the documents to calculate TF-IDF weights.
Support and limitations
- Available in the Apache Beam Python SDK versions 2.53.0 and later.
- Supports Python 3.8, 3.9, and 3.10.
- Only available for pipelines that use default windows.
Transforms
You can use MLTransform to generate text embeddings and to perform various data processing transforms.
Text embedding transforms
You can use MLTransform to generate embeddings that you can use to push data into vector databases or to run inference.
| Transform | Description |
| --- | --- |
| SentenceTransformerEmbeddings | Uses the Hugging Face sentence-transformers models to generate text embeddings. |
| VertexAITextEmbeddings | Uses models from the Vertex AI text-embeddings API to generate text embeddings. |
Data processing transforms that use TFT
The following set of transforms available in the MLTransform class come from the TensorFlow Transform (TFT) library. TFT offers specialized processing modules for machine learning tasks. For more information about these transforms, see Module: tft in the TensorFlow documentation.

| Transform | More information |
| --- | --- |
| ApplyBuckets | tft.apply_buckets in the TensorFlow documentation |
| Bucketize | tft.bucketize in the TensorFlow documentation |
| ComputeAndApplyVocabulary | tft.compute_and_apply_vocabulary in the TensorFlow documentation |
| NGrams | tft.ngrams in the TensorFlow documentation |
| ScaleByMinMax | tft.scale_by_min_max in the TensorFlow documentation |
| ScaleTo01 | tft.scale_to_0_1 in the TensorFlow documentation |
| ScaleToZScore | tft.scale_to_z_score in the TensorFlow documentation |
| TFIDF | tft.tfidf in the TensorFlow documentation |
- Input to the MLTransform class must be a dictionary.
- MLTransform outputs a Beam Row object with transformed elements.
- The output PCollection is a schema PCollection. The output schema contains the transformed columns.
Artifacts
Artifacts are additional data elements created by data transformations. Examples of artifacts are the minimum and maximum values from a ScaleTo01 transformation, or the mean and variance from a ScaleToZScore transformation. In the MLTransform class, the write_artifact_location and read_artifact_location parameters determine whether the MLTransform class creates artifacts or retrieves artifacts.
Write mode
When you use the write_artifact_location parameter, the MLTransform class runs the specified transformations on the dataset and then creates artifacts from these transformations. The artifacts are stored in the location that you specify in the write_artifact_location parameter.
Write mode is useful when you want to store the results of your transformations for future use. For example, if you apply the same transformations on a different dataset, use write mode to ensure that the transformation parameters remain consistent.
The following examples demonstrate how write mode works.
- The ComputeAndApplyVocabulary transform generates a vocabulary file that contains the vocabulary generated over the entire dataset. The vocabulary file is stored in the location specified by the write_artifact_location parameter value. The ComputeAndApplyVocabulary transform outputs the indices of the vocabulary to the vocabulary file.
- The ScaleToZScore transform calculates the mean and variance over the entire dataset and then normalizes the entire dataset using the mean and variance. When you use the write_artifact_location parameter, these values are stored as a TensorFlow graph in the location specified by the write_artifact_location parameter value. You can reuse the values in read mode to ensure that future transformations use the same mean and variance for normalization.
Read mode
When you use the read_artifact_location parameter, the MLTransform class expects the artifacts to exist in the location provided in the read_artifact_location parameter value. In this mode, MLTransform retrieves the artifacts and uses them in the transform. Because the transformations are stored in the artifacts when you use read mode, you don't need to specify the transformations.
The following scenario provides an example use case for artifacts.
Before training a machine learning model, you use MLTransform with the write_artifact_location parameter. When you run MLTransform, it applies transformations that preprocess the dataset. The transformation produces artifacts that are stored in the location specified by the write_artifact_location parameter value.
After preprocessing, you use the transformed data to train the machine learning model.
After training, you run inference. You use new test data and use the read_artifact_location parameter. By using this setting, you ensure that the test data undergoes the same preprocessing steps as the training data. In read mode, MLTransform fetches the transformation artifacts from the location specified in the read_artifact_location parameter value. MLTransform applies these artifacts to the test data.
This workflow provides consistency in preprocessing steps for both training and test data. This consistency ensures that the model can accurately evaluate the test data and maintain the integrity of the model’s performance.
Preprocess data with MLTransform
To use the MLTransform transform to preprocess data, add the following code to your pipeline:

```python
import tempfile

import apache_beam as beam
from apache_beam.ml.transforms.base import MLTransform
from apache_beam.ml.transforms.tft import <TRANSFORM_NAME>

data = [
    {
        <DATA>
    },
]

artifact_location = tempfile.mkdtemp()
<TRANSFORM_FUNCTION_NAME> = <TRANSFORM_NAME>(columns=['x'])

with beam.Pipeline() as p:
  transformed_data = (
      p
      | beam.Create(data)
      | MLTransform(write_artifact_location=artifact_location).with_transform(
          <TRANSFORM_FUNCTION_NAME>)
      | beam.Map(print))
```

Replace the following values:
- TRANSFORM_NAME: The name of the transform to use.
- DATA: The input data to transform.
- TRANSFORM_FUNCTION_NAME: The name that you assign to your transform function in your code.
Last updated on 2024/03/01