apache_beam.ml.inference.huggingface_inference module
class apache_beam.ml.inference.huggingface_inference.HuggingFaceModelHandlerKeyedTensor(model_uri: str, model_class: Union[AutoModel, TFAutoModel], framework: str, device: str = 'CPU', *, inference_fn: Optional[Callable[..., Iterable[apache_beam.ml.inference.base.PredictionResult]]] = None, load_model_args: Optional[Dict[str, Any]] = None, inference_args: Optional[Dict[str, Any]] = None, min_batch_size: Optional[int] = None, max_batch_size: Optional[int] = None, large_model: bool = False, **kwargs)[source]

Bases: apache_beam.ml.inference.base.ModelHandler

Implementation of the ModelHandler interface for Hugging Face with Keyed Tensors for the PyTorch/TensorFlow backend.
Example Usage:

  pcoll | RunInference(HuggingFaceModelHandlerKeyedTensor(
      model_uri="bert-base-uncased",
      model_class=AutoModelForMaskedLM,
      framework='pt'))
Parameters:
- model_uri (str) – path to the pretrained model on the Hugging Face Models Hub.
- model_class – the model class used to load the model specified by model_uri.
- framework (str) – Framework to use for the model. 'tf' for TensorFlow and 'pt' for PyTorch.
- device – For torch tensors, the device on which to run the model. Defaults to CPU.
- inference_fn – the inference function to use during RunInference. Default is _run_inference_torch_keyed_tensor or _run_inference_tensorflow_keyed_tensor depending on the input type.
- load_model_args (Dict[str, Any]) – (Optional) Keyword arguments to provide load options while loading models from Hugging Face Hub. Defaults to None.
- inference_args (Dict[str, Any]) – (Optional) Non-batchable arguments required as inputs to the model’s inference function. Unlike Tensors in batch, these parameters will not be dynamically batched. Defaults to None.
- min_batch_size – the minimum batch size to use when batching inputs.
- max_batch_size – the maximum batch size to use when batching inputs.
- large_model – set to True if your model is large enough to run into memory pressure if you load multiple copies. Given a model that consumes N memory and a machine with W cores and M memory, you should set this to True if N*W > M. For example, a 5 GB model on a worker with 4 cores and 16 GB of memory gives N*W = 20 GB > 16 GB, so large_model should be True.
- kwargs – 'env_vars' can be used to set environment variables before loading the model.
Supported Versions: HuggingFaceModelHandler supports transformers>=4.18.0.
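A minimal end-to-end sketch of this handler is shown below. The tokenizer step, the max_length of 32, and the squeeze(0) call (which drops the per-example batch dimension so the handler can stack examples) are illustrative assumptions rather than part of this API:

  import apache_beam as beam
  from apache_beam.ml.inference.base import RunInference
  from apache_beam.ml.inference.huggingface_inference import (
      HuggingFaceModelHandlerKeyedTensor)
  from transformers import AutoModelForMaskedLM, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

  def tokenize(text):
    # return_tensors='pt' yields a dict of torch tensors keyed by
    # 'input_ids', 'attention_mask', etc., i.e. a keyed tensor.
    tokens = tokenizer(
        text, return_tensors='pt', padding='max_length', max_length=32)
    # Drop the per-example batch dimension so examples can be stacked.
    return {key: value.squeeze(0) for key, value in tokens.items()}

  model_handler = HuggingFaceModelHandlerKeyedTensor(
      model_uri="bert-base-uncased",
      model_class=AutoModelForMaskedLM,
      framework='pt')

  with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | beam.Create(["Apache Beam is a [MASK] processing model."])
        | beam.Map(tokenize)
        | RunInference(model_handler))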
run_inference(batch: Sequence[Dict[str, Union[tf.Tensor, torch.Tensor]]], model: Union[AutoModel, TFAutoModel], inference_args: Optional[Dict[str, Any]] = None) → Iterable[apache_beam.ml.inference.base.PredictionResult][source]

Runs inferences on a batch of Keyed Tensors and returns an Iterable of Tensor Predictions.
This method stacks the list of Tensors in a vectorized format to optimize the inference call.
Parameters:
- batch – A sequence of Keyed Tensors. These Tensors should be batchable, as this method will call tf.stack()/torch.stack() and pass in batched Tensors with dimensions (batch_size, n_features, etc.) into the model’s predict() function.
- model – A TensorFlow/PyTorch model.
- inference_args – Non-batchable arguments required as inputs to the model’s inference function. Unlike Tensors in batch, these parameters will not be dynamically batched.
Returns: An Iterable of type PredictionResult.
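The per-key stacking described above can be pictured with a small standalone snippet; this is a conceptual illustration of the vectorization, not Beam's internal implementation:

  import torch

  # Three keyed-tensor examples, each with per-key shape (32,).
  batch = [
      {'input_ids': torch.zeros(32, dtype=torch.long),
       'attention_mask': torch.ones(32, dtype=torch.long)}
      for _ in range(3)
  ]
  # Group by key and stack, producing one batched tensor per key.
  stacked = {
      key: torch.stack([example[key] for example in batch])
      for key in batch[0]
  }
  print({key: tuple(value.shape) for key, value in stacked.items()})
  # {'input_ids': (3, 32), 'attention_mask': (3, 32)}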
get_num_bytes(batch: Sequence[Union[tf.Tensor, torch.Tensor]]) → int[source]

Returns: The number of bytes of data for the Tensors batch.
class apache_beam.ml.inference.huggingface_inference.HuggingFaceModelHandlerTensor(model_uri: str, model_class: Union[AutoModel, TFAutoModel], device: str = 'CPU', *, inference_fn: Optional[Callable[..., Iterable[apache_beam.ml.inference.base.PredictionResult]]] = None, load_model_args: Optional[Dict[str, Any]] = None, inference_args: Optional[Dict[str, Any]] = None, min_batch_size: Optional[int] = None, max_batch_size: Optional[int] = None, large_model: bool = False, **kwargs)[source]

Bases: apache_beam.ml.inference.base.ModelHandler

Implementation of the ModelHandler interface for Hugging Face with Tensors for the PyTorch/TensorFlow backend. The model framework (PyTorch or TensorFlow) is determined automatically from the type of the input tensors.
Example Usage:

  pcoll | RunInference(HuggingFaceModelHandlerTensor(
      model_uri="bert-base-uncased",
      model_class=AutoModelForMaskedLM))
Parameters:
- model_uri (str) – path to the pretrained model on the Hugging Face Models Hub.
- model_class – the model class used to load the model specified by model_uri.
- device – For torch tensors, the device on which to run the model. Defaults to CPU.
- inference_fn – the inference function to use during RunInference. Default is _run_inference_torch_keyed_tensor or _run_inference_tensorflow_keyed_tensor depending on the input type.
- load_model_args (Dict[str, Any]) – (Optional) keyword arguments to provide load options while loading models from Hugging Face Hub. Defaults to None.
- inference_args (Dict[str, Any]) – (Optional) Non-batchable arguments required as inputs to the model’s inference function. Unlike Tensors in batch, these parameters will not be dynamically batched. Defaults to None.
- min_batch_size – the minimum batch size to use when batching inputs.
- max_batch_size – the maximum batch size to use when batching inputs.
- large_model – set to True if your model is large enough to run into memory pressure if you load multiple copies. Given a model that consumes N memory and a machine with W cores and M memory, you should set this to True if N*W > M.
- kwargs – 'env_vars' can be used to set environment variables before loading the model.
Supported Versions: HuggingFaceModelHandler supports transformers>=4.18.0.
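A minimal sketch of this handler follows. Feeding only the input_ids tensor to a masked language model, the 32-token padding, and the squeeze(0) call are illustrative assumptions; the PyTorch backend is inferred from the tensor type:

  import apache_beam as beam
  from apache_beam.ml.inference.base import RunInference
  from apache_beam.ml.inference.huggingface_inference import (
      HuggingFaceModelHandlerTensor)
  from transformers import AutoModelForMaskedLM, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

  def to_input_ids(text):
    # Emit a single unkeyed torch tensor per example; the handler infers
    # the framework from the tensor type.
    return tokenizer(
        text, return_tensors='pt', padding='max_length',
        max_length=32)['input_ids'].squeeze(0)

  model_handler = HuggingFaceModelHandlerTensor(
      model_uri="bert-base-uncased",
      model_class=AutoModelForMaskedLM)

  with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | beam.Create(["Apache Beam is a unified [MASK] processing model."])
        | beam.Map(to_input_ids)
        | RunInference(model_handler))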
run_inference(batch: Sequence[Union[tf.Tensor, torch.Tensor]], model: Union[AutoModel, TFAutoModel], inference_args: Optional[Dict[str, Any]] = None) → Iterable[apache_beam.ml.inference.base.PredictionResult][source]

Runs inferences on a batch of Tensors and returns an Iterable of Tensor Predictions.
This method stacks the list of Tensors in a vectorized format to optimize the inference call.
Parameters:
- batch – A sequence of Tensors. These Tensors should be batchable, as this method will call tf.stack()/torch.stack() and pass in batched Tensors with dimensions (batch_size, n_features, etc.) into the model’s predict() function.
- model – A TensorFlow/PyTorch model.
- inference_args (Dict[str, Any]) – Non-batchable arguments required as inputs to the model’s inference function. Unlike Tensors in batch, these parameters will not be dynamically batched.
Returns: An Iterable of type PredictionResult.
get_num_bytes(batch: Sequence[Union[tf.Tensor, torch.Tensor]]) → int[source]

Returns: The number of bytes of data for the Tensors batch.
class apache_beam.ml.inference.huggingface_inference.HuggingFacePipelineModelHandler(task: Union[str, apache_beam.ml.inference.huggingface_inference.PipelineTask] = '', model: str = '', *, inference_fn: Callable[[Sequence[str], Pipeline, Optional[Dict[str, Any]]], Iterable[apache_beam.ml.inference.base.PredictionResult]] = <function _default_pipeline_inference_fn>, load_pipeline_args: Optional[Dict[str, Any]] = None, inference_args: Optional[Dict[str, Any]] = None, min_batch_size: Optional[int] = None, max_batch_size: Optional[int] = None, large_model: bool = False, **kwargs)[source]

Bases: apache_beam.ml.inference.base.ModelHandler
Implementation of the ModelHandler interface for Hugging Face Pipelines.
Note: To specify which device to use (CPU/GPU), use load_pipeline_args with the same key-value pairs you would pass to a regular Hugging Face pipeline. Ex: load_pipeline_args={'device': 0}
Example Usage:

  pcoll | RunInference(HuggingFacePipelineModelHandler(
      task="fill-mask"))
Parameters:
- task (str or enum.Enum) – task supported by Hugging Face Pipelines. Accepts a string task or an enum.Enum from PipelineTask.
- model (str) – path to the pretrained model-id on the Hugging Face Models Hub, used to run the chosen task with a custom model. If the model already defines the task, there is no need to specify the task parameter. Use the model-id string rather than an actual model object here. Model-specific kwargs for from_pretrained(..., **model_kwargs) can be specified with model_kwargs using load_pipeline_args.

  Example Usage:

    model_handler = HuggingFacePipelineModelHandler(
        task="text-generation",
        model="meta-llama/Llama-2-7b-hf",
        load_pipeline_args={'model_kwargs': {'quantization_map': config}})
- inference_fn – the inference function to use during RunInference. Default is _default_pipeline_inference_fn.
- load_pipeline_args (Dict[str, Any]) – keyword arguments to provide load options while loading pipelines from Hugging Face. Defaults to None.
- inference_args (Dict[str, Any]) – Non-batchable arguments required as inputs to the model’s inference function. Defaults to None.
- min_batch_size – the minimum batch size to use when batching inputs.
- max_batch_size – the maximum batch size to use when batching inputs.
- large_model – set to True if your model is large enough to run into memory pressure if you load multiple copies. Given a model that consumes N memory and a machine with W cores and M memory, you should set this to True if N*W > M.
- kwargs – 'env_vars' can be used to set environment variables before loading the model.
Supported Versions: HuggingFacePipelineModelHandler supports transformers>=4.18.0.
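A minimal sketch of the pipeline handler inside a Beam pipeline follows; the fill-mask task and the printing of results are illustrative, and the device setting mirrors the note above:

  import apache_beam as beam
  from apache_beam.ml.inference.base import RunInference
  from apache_beam.ml.inference.huggingface_inference import (
      HuggingFacePipelineModelHandler)

  model_handler = HuggingFacePipelineModelHandler(
      task="fill-mask",
      # Optional: pin the underlying pipeline to GPU 0, per the note above.
      # Omit this on CPU-only workers.
      load_pipeline_args={'device': 0})

  with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | beam.Create(["Paris is the [MASK] of France."])
        | RunInference(model_handler)
        | beam.Map(print))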
run_inference(batch: Sequence[str], pipeline: Pipeline, inference_args: Optional[Dict[str, Any]] = None) → Iterable[apache_beam.ml.inference.base.PredictionResult][source]

Runs inferences on a batch of examples passed as string resources. These can be string sentences, or string paths to image or audio files.
Parameters:
- batch – A sequence of string resources.
- pipeline – A Hugging Face Pipeline.
- inference_args – Non-batchable arguments required as inputs to the model’s inference function.
Returns: An Iterable of type PredictionResult.
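Each PredictionResult pairs the original input (example) with the pipeline output (inference), so downstream transforms typically unwrap it; a small illustrative sketch:

  def extract_inference(result):
    # result.example holds the input string; result.inference holds the
    # Hugging Face pipeline output for that input.
    return result.inference

  # Continuing the pipeline sketch above (illustrative):
  #   ... | RunInference(model_handler) | beam.Map(extract_inference)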
update_model_path(model_path: Optional[str] = None)[source]

Updates the pretrained model used by the Hugging Face Pipeline task. Make sure that the new model performs the same task as the initial model.
Parameters: model_path (str) – (Optional) Path to the new trained model from Hugging Face. Defaults to None.
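As an illustrative sketch (the model id is an assumption, and in a running pipeline this hook is typically invoked by RunInference's model-update machinery rather than by user code):

  # Swap in a different model that performs the same task as the original
  # (here, another fill-mask model), continuing the example above.
  model_handler.update_model_path("distilroberta-base")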
get_num_bytes(batch: Sequence[str]) → int[source]

Returns: The number of bytes of input batch elements.