RunInference Metrics
This example demonstrates and explains different metrics that are available when using the RunInference transform to perform inference using a machine learning model. The example uses a pipeline that reads a list of sentences, tokenizes the text, and uses the transformer-based model distilbert-base-uncased-finetuned-sst-2-english with RunInference to classify the pieces of text into two classes.
When you run the pipeline with the Dataflow runner, different RunInference metrics are available with CPU and with GPU. This example demonstrates both types of metrics.
- You can find the full code for this example on GitHub.
- You can see RunInference benchmarks on the Performance test metrics page.
The following diagram shows the file structure for the entire pipeline.
runinference_metrics/
├── pipeline/
│ ├── __init__.py
│ ├── options.py
│ └── transformations.py
├── __init__.py
├── config.py
├── main.py
└── setup.py
pipeline/transformations.py contains the code for beam.DoFn and additional functions that are used for the pipeline.
pipeline/options.py contains the pipeline options to configure the Dataflow pipeline.
config.py defines variables that are used multiple times, like the Google Cloud PROJECT_ID and NUM_WORKERS.
setup.py defines the packages and requirements for the pipeline to run.
main.py contains the pipeline code and additional functions used for running the pipeline.
Run the Pipeline
Install the required packages. For this example, you need access to a Google Cloud project, and you need to configure the Google Cloud variables, like PROJECT_ID, REGION, and others, in the config.py file. To use GPUs, follow the setup instructions in the PyTorch GPU minimal pipeline example on GitHub.
- Dataflow with CPU:
python main.py --mode cloud --device CPU
- Dataflow with GPU:
python main.py --mode cloud --device GPU
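The two commands above differ only in the value of --device. As a minimal sketch of how a script like main.py could read these flags, the following uses argparse from the standard library. The parse_args helper, the defaults, and the "local" choice for --mode are assumptions for illustration; only "cloud", "CPU", and "GPU" appear in the commands above.

```python
import argparse


def parse_args(argv=None):
    """Parse the --mode and --device flags (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="RunInference metrics example")
    # "local" is an assumed alternative to the documented "cloud" mode.
    parser.add_argument("--mode", choices=["local", "cloud"], default="local")
    # CPU and GPU are the two device values used in the commands above.
    parser.add_argument("--device", choices=["CPU", "GPU"], default="CPU")
    return parser.parse_args(argv)


# Equivalent to: python main.py --mode cloud --device GPU
args = parse_args(["--mode", "cloud", "--device", "GPU"])
print(args.mode, args.device)
```

Restricting each flag with choices makes a mistyped value fail at startup instead of after the job is submitted.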
The pipeline includes the following steps:
- Create a list of texts to use as an input using beam.Create.
- Tokenize the text.
- Use RunInference to do inference.
- Postprocess the output of RunInference.
with beam.Pipeline(options=pipeline_options) as pipeline:
    _ = (
        pipeline
        | "Create inputs" >> beam.Create(inputs)
        | "Tokenize" >> beam.ParDo(Tokenize(cfg.TOKENIZER_NAME))
        | "Inference" >>
        RunInference(model_handler=KeyedModelHandler(model_handler))
        | "Decode Predictions" >> beam.ParDo(PostProcessor()))
RunInference Metrics
As mentioned previously, we benchmarked the performance of RunInference using Dataflow on both CPU and GPU. You can see these metrics in the Google Cloud console, or you can print them by querying the metrics of the pipeline result.
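As a minimal sketch of printing the metrics in code, the helper below flattens a metrics query result into a plain dictionary. It assumes the dictionary shape returned by the Beam Python SDK's result.metrics().query(...), with "counters" and "distributions" lists whose entries expose key.metric.name, committed, and attempted. The summarize_metrics name and the _MEAN and _COUNT suffixes it appends are illustrative choices, not part of the example's source.

```python
def summarize_metrics(query_result):
    """Flatten a metrics query result into {metric_name: value}."""
    summary = {}
    for counter in query_result.get("counters", []):
        # Fall back to the attempted value if no committed value is reported.
        value = counter.committed if counter.committed is not None else counter.attempted
        summary[counter.key.metric.name] = value
    for dist in query_result.get("distributions", []):
        value = dist.committed if dist.committed is not None else dist.attempted
        name = dist.key.metric.name
        # Suffixes are added here to mirror names like *_MEAN and *_COUNT.
        summary[name + "_MEAN"] = value.mean
        summary[name + "_COUNT"] = value.count
    return summary


# Assumed usage once the pipeline has finished:
#   from apache_beam.metrics.metric import MetricsFilter
#   query_result = pipeline.result.metrics().query(MetricsFilter())
#   for name, value in sorted(summarize_metrics(query_result).items()):
#       print(name, value)
```

An empty MetricsFilter matches every metric; narrowing the filter by name would be a reasonable refinement when only the RunInference metrics are of interest.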
The following image shows a snapshot of different metrics in the Google Cloud console when using Dataflow on GPU:
Some metrics commonly used for benchmarking are:
- num_inferences: Represents the total number of elements passed to run_inference().
- inference_batch_latency_micro_secs_MEAN: Represents the average time taken to perform inference across all batches of examples, measured in microseconds.
- inference_request_batch_size_COUNT: Represents the total number of samples across all batches of examples (created from beam.BatchElements) to be passed to run_inference().
- inference_request_batch_byte_size_MEAN: Represents the average size of all elements for all samples in all batches of examples (created from beam.BatchElements) to be passed to run_inference(). This metric is measured in bytes.
- model_byte_size_MEAN: Represents the average memory consumed to load and initialize the model, measured in bytes.
- load_model_latency_milli_secs_MEAN: Represents the average time taken to load and initialize the model, measured in milliseconds.
You can also derive other relevant metrics, such as in the following example.
Total time taken for inference = num_inferences x inference_batch_latency_micro_secs_MEAN
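The formula above can be sketched as a small helper. The function name is hypothetical, and the numbers in the usage line are made-up placeholders, not measured results. The sketch applies the formula literally and converts the result from microseconds to seconds. Because the latency metric is a per-batch average, the product is likely an upper bound when batches hold more than one element.

```python
def total_inference_time_secs(num_inferences, mean_batch_latency_micro_secs):
    """Apply: total time = num_inferences x inference_batch_latency_micro_secs_MEAN."""
    total_micro_secs = num_inferences * mean_batch_latency_micro_secs
    # The latency metric is reported in microseconds; convert to seconds.
    return total_micro_secs / 1_000_000


# Placeholder inputs: 1000 inferences at a mean latency of 2500 microseconds.
print(total_inference_time_secs(1000, 2500.0))  # 2.5
```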
Last updated on 2025/01/19