apache_beam.ml.inference.huggingface_inference module
class apache_beam.ml.inference.huggingface_inference.HuggingFaceModelHandlerKeyedTensor(model_uri: str, model_class: Union[AutoModel, TFAutoModel], framework: str, device: str = 'CPU', *, inference_fn: Optional[Callable[..., Iterable[apache_beam.ml.inference.base.PredictionResult]]] = None, load_model_args: Optional[Dict[str, Any]] = None, inference_args: Optional[Dict[str, Any]] = None, min_batch_size: Optional[int] = None, max_batch_size: Optional[int] = None, large_model: bool = False, **kwargs)[source]

Bases: apache_beam.ml.inference.base.ModelHandler

Implementation of the ModelHandler interface for Hugging Face with Keyed Tensors for the PyTorch/TensorFlow backend.
Example Usage:

  pcoll | RunInference(HuggingFaceModelHandlerKeyedTensor(
      model_uri="bert-base-uncased",
      model_class=AutoModelForMaskedLM,
      framework='pt'))
Parameters:
- model_uri (str) – path to the pretrained model on the Hugging Face Models Hub.
- model_class – the model class used to load the model specified by model_uri.
- framework (str) – Framework to use for the model. 'tf' for TensorFlow and 'pt' for PyTorch.
- device – For torch tensors, the device on which to run the model. Defaults to CPU.
- inference_fn – the inference function to use during RunInference. Default is _run_inference_torch_keyed_tensor or _run_inference_tensorflow_keyed_tensor depending on the input type.
- load_model_args (Dict[str, Any]) – (Optional) Keyword arguments to provide load options while loading models from Hugging Face Hub. Defaults to None.
- inference_args (Dict[str, Any]) – (Optional) Non-batchable arguments required as inputs to the model’s inference function. Unlike Tensors in batch, these parameters will not be dynamically batched. Defaults to None.
- min_batch_size – the minimum batch size to use when batching inputs.
- max_batch_size – the maximum batch size to use when batching inputs.
- large_model – set to True if your model is large enough to run into memory pressure if you load multiple copies. Given a model that consumes N memory and a machine with W cores and M memory, you should set this to True if N*W > M. For example, a 5 GB model on a worker with 4 cores and 16 GB of memory gives N*W = 20 GB > 16 GB, so large_model should be True.
- kwargs – 'env_vars' can be used to set environment variables before loading the model.
Supported Versions: HuggingFaceModelHandler supports transformers>=4.18.0.
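A minimal end-to-end sketch of this handler is shown below. The tokenizer step, the max_length of 32, and the squeeze(0) call (which drops the per-example batch dimension so the handler can stack examples) are illustrative assumptions rather than part of this API:

  import apache_beam as beam
  from apache_beam.ml.inference.base import RunInference
  from apache_beam.ml.inference.huggingface_inference import (
      HuggingFaceModelHandlerKeyedTensor)
  from transformers import AutoModelForMaskedLM, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

  def tokenize(text):
    # return_tensors='pt' yields a dict of torch tensors keyed by
    # 'input_ids', 'attention_mask', etc., i.e. a keyed tensor.
    tokens = tokenizer(
        text, return_tensors='pt', padding='max_length', max_length=32)
    # Drop the per-example batch dimension so examples can be stacked.
    return {key: value.squeeze(0) for key, value in tokens.items()}

  model_handler = HuggingFaceModelHandlerKeyedTensor(
      model_uri="bert-base-uncased",
      model_class=AutoModelForMaskedLM,
      framework='pt')

  with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | beam.Create(["Apache Beam is a [MASK] processing model."])
        | beam.Map(tokenize)
        | RunInference(model_handler))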
run_inference(batch: Sequence[Dict[str, Union[tf.Tensor, torch.Tensor]]], model: Union[AutoModel, TFAutoModel], inference_args: Optional[Dict[str, Any]] = None) → Iterable[apache_beam.ml.inference.base.PredictionResult][source]

Runs inferences on a batch of Keyed Tensors and returns an Iterable of Tensor Predictions.
This method stacks the list of Tensors in a vectorized format to optimize the inference call.
Parameters:
- batch – A sequence of Keyed Tensors. These Tensors should be batchable, as this method will call tf.stack()/torch.stack() and pass in batched Tensors with dimensions (batch_size, n_features, etc.) into the model’s predict() function.
- model – A TensorFlow/PyTorch model.
- inference_args – Non-batchable arguments required as inputs to the model’s inference function. Unlike Tensors in batch, these parameters will not be dynamically batched.
Returns: An Iterable of type PredictionResult.
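The per-key stacking described above can be pictured with a small standalone snippet; this is a conceptual illustration of the vectorization, not Beam's internal implementation:

  import torch

  # Three keyed-tensor examples, each with per-key shape (32,).
  batch = [
      {'input_ids': torch.zeros(32, dtype=torch.long),
       'attention_mask': torch.ones(32, dtype=torch.long)}
      for _ in range(3)
  ]
  # Group by key and stack, producing one batched tensor per key.
  stacked = {
      key: torch.stack([example[key] for example in batch])
      for key in batch[0]
  }
  print({key: tuple(value.shape) for key, value in stacked.items()})
  # {'input_ids': (3, 32), 'attention_mask': (3, 32)}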
get_num_bytes(batch: Sequence[Union[tf.Tensor, torch.Tensor]]) → int[source]

Returns: The number of bytes of data for the Tensors batch.
class apache_beam.ml.inference.huggingface_inference.HuggingFaceModelHandlerTensor(model_uri: str, model_class: Union[AutoModel, TFAutoModel], device: str = 'CPU', *, inference_fn: Optional[Callable[..., Iterable[apache_beam.ml.inference.base.PredictionResult]]] = None, load_model_args: Optional[Dict[str, Any]] = None, inference_args: Optional[Dict[str, Any]] = None, min_batch_size: Optional[int] = None, max_batch_size: Optional[int] = None, large_model: bool = False, **kwargs)[source]

Bases: apache_beam.ml.inference.base.ModelHandler

Implementation of the ModelHandler interface for Hugging Face with Tensors for the PyTorch/TensorFlow backend. The model framework (PyTorch or TensorFlow) is determined automatically from the type of the input tensors.
Example Usage:

  pcoll | RunInference(HuggingFaceModelHandlerTensor(
      model_uri="bert-base-uncased",
      model_class=AutoModelForMaskedLM))
Parameters:
- model_uri (str) – path to the pretrained model on the Hugging Face Models Hub.
- model_class – the model class used to load the model specified by model_uri.
- device – For torch tensors, the device on which to run the model. Defaults to CPU.
- inference_fn – the inference function to use during RunInference. Default is _run_inference_torch_keyed_tensor or _run_inference_tensorflow_keyed_tensor depending on the input type.
- load_model_args (Dict[str, Any]) – (Optional) keyword arguments to provide load options while loading models from Hugging Face Hub. Defaults to None.
- inference_args (Dict[str, Any]) – (Optional) Non-batchable arguments required as inputs to the model’s inference function. Unlike Tensors in batch, these parameters will not be dynamically batched. Defaults to None.
- min_batch_size – the minimum batch size to use when batching inputs.
- max_batch_size – the maximum batch size to use when batching inputs.
- large_model – set to True if your model is large enough to run into memory pressure if you load multiple copies. Given a model that consumes N memory and a machine with W cores and M memory, you should set this to True if N*W > M.
- kwargs – 'env_vars' can be used to set environment variables before loading the model.
Supported Versions: HuggingFaceModelHandler supports transformers>=4.18.0.
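A minimal sketch of this handler follows. Feeding only the input_ids tensor to a masked language model, the 32-token padding, and the squeeze(0) call are illustrative assumptions; the PyTorch backend is inferred from the tensor type:

  import apache_beam as beam
  from apache_beam.ml.inference.base import RunInference
  from apache_beam.ml.inference.huggingface_inference import (
      HuggingFaceModelHandlerTensor)
  from transformers import AutoModelForMaskedLM, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

  def to_input_ids(text):
    # Emit a single unkeyed torch tensor per example; the handler infers
    # the framework from the tensor type.
    return tokenizer(
        text, return_tensors='pt', padding='max_length',
        max_length=32)['input_ids'].squeeze(0)

  model_handler = HuggingFaceModelHandlerTensor(
      model_uri="bert-base-uncased",
      model_class=AutoModelForMaskedLM)

  with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | beam.Create(["Apache Beam is a unified [MASK] processing model."])
        | beam.Map(to_input_ids)
        | RunInference(model_handler))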
run_inference(batch: Sequence[Union[tf.Tensor, torch.Tensor]], model: Union[AutoModel, TFAutoModel], inference_args: Optional[Dict[str, Any]] = None) → Iterable[apache_beam.ml.inference.base.PredictionResult][source]

Runs inferences on a batch of Tensors and returns an Iterable of Tensor Predictions.
This method stacks the list of Tensors in a vectorized format to optimize the inference call.
Parameters:
- batch – A sequence of Tensors. These Tensors should be batchable, as this method will call tf.stack()/torch.stack() and pass in batched Tensors with dimensions (batch_size, n_features, etc.) into the model’s predict() function.
- model – A TensorFlow/PyTorch model.
- inference_args (Dict[str, Any]) – Non-batchable arguments required as inputs to the model’s inference function. Unlike Tensors in batch, these parameters will not be dynamically batched.
Returns: An Iterable of type PredictionResult.
get_num_bytes(batch: Sequence[Union[tf.Tensor, torch.Tensor]]) → int[source]

Returns: The number of bytes of data for the Tensors batch.
class apache_beam.ml.inference.huggingface_inference.HuggingFacePipelineModelHandler(task: Union[str, apache_beam.ml.inference.huggingface_inference.PipelineTask] = '', model: str = '', *, inference_fn: Callable[[Sequence[str], Pipeline, Optional[Dict[str, Any]]], Iterable[apache_beam.ml.inference.base.PredictionResult]] = <function _default_pipeline_inference_fn>, load_pipeline_args: Optional[Dict[str, Any]] = None, inference_args: Optional[Dict[str, Any]] = None, min_batch_size: Optional[int] = None, max_batch_size: Optional[int] = None, large_model: bool = False, **kwargs)[source]

Bases: apache_beam.ml.inference.base.ModelHandler
Implementation of the ModelHandler interface for Hugging Face Pipelines.
Note: To specify which device to use (CPU/GPU), use load_pipeline_args with the same key-value pairs you would pass to a regular Hugging Face pipeline. Ex: load_pipeline_args={'device': 0}
Example Usage:

  pcoll | RunInference(HuggingFacePipelineModelHandler(
      task="fill-mask"))
Parameters:
- task (str or enum.Enum) – task supported by Hugging Face Pipelines. Accepts a string task or an enum.Enum from PipelineTask.
- model (str) – path to the pretrained model-id on the Hugging Face Models Hub, used to run the chosen task with a custom model. If the model already defines the task, there is no need to specify the task parameter. Use the model-id string rather than an actual model object here. Model-specific kwargs for from_pretrained(..., **model_kwargs) can be specified with model_kwargs using load_pipeline_args.

  Example Usage:

    model_handler = HuggingFacePipelineModelHandler(
        task="text-generation",
        model="meta-llama/Llama-2-7b-hf",
        load_pipeline_args={'model_kwargs': {'quantization_map': config}})
- inference_fn – the inference function to use during RunInference. Default is _default_pipeline_inference_fn.
- load_pipeline_args (Dict[str, Any]) – keyword arguments to provide load options while loading pipelines from Hugging Face. Defaults to None.
- inference_args (Dict[str, Any]) – Non-batchable arguments required as inputs to the model’s inference function. Defaults to None.
- min_batch_size – the minimum batch size to use when batching inputs.
- max_batch_size – the maximum batch size to use when batching inputs.
- large_model – set to True if your model is large enough to run into memory pressure if you load multiple copies. Given a model that consumes N memory and a machine with W cores and M memory, you should set this to True if N*W > M.
- kwargs – 'env_vars' can be used to set environment variables before loading the model.
Supported Versions: HuggingFacePipelineModelHandler supports transformers>=4.18.0.
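A minimal sketch of the pipeline handler inside a Beam pipeline follows; the fill-mask task and the printing of results are illustrative, and the device setting mirrors the note above:

  import apache_beam as beam
  from apache_beam.ml.inference.base import RunInference
  from apache_beam.ml.inference.huggingface_inference import (
      HuggingFacePipelineModelHandler)

  model_handler = HuggingFacePipelineModelHandler(
      task="fill-mask",
      # Optional: pin the underlying pipeline to GPU 0, per the note above.
      # Omit this on CPU-only workers.
      load_pipeline_args={'device': 0})

  with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | beam.Create(["Paris is the [MASK] of France."])
        | RunInference(model_handler)
        | beam.Map(print))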
run_inference(batch: Sequence[str], pipeline: Pipeline, inference_args: Optional[Dict[str, Any]] = None) → Iterable[apache_beam.ml.inference.base.PredictionResult][source]

Runs inferences on a batch of examples passed as string resources. These can be string sentences, or string paths to image or audio files.
Parameters:
- batch – A sequence of string resources.
- pipeline – A Hugging Face Pipeline.
- inference_args – Non-batchable arguments required as inputs to the model’s inference function.
Returns: An Iterable of type PredictionResult.
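Each PredictionResult pairs the original input (example) with the pipeline output (inference), so downstream transforms typically unwrap it; a small illustrative sketch:

  def extract_inference(result):
    # result.example holds the input string; result.inference holds the
    # Hugging Face pipeline output for that input.
    return result.inference

  # Continuing the pipeline sketch above (illustrative):
  #   ... | RunInference(model_handler) | beam.Map(extract_inference)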
update_model_path(model_path: Optional[str] = None)[source]

Updates the pretrained model used by the Hugging Face Pipeline task. Make sure that the new model performs the same task as the initial model.
Parameters: model_path (str) – (Optional) Path to the new trained model from Hugging Face. Defaults to None.
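As an illustrative sketch (the model id is an assumption, and in a running pipeline this hook is typically invoked by RunInference's model-update machinery rather than by user code):

  # Swap in a different model that performs the same task as the original
  # (here, another fill-mask model), continuing the example above.
  model_handler.update_model_path("distilroberta-base")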
get_num_bytes(batch: Sequence[str]) → int[source]

Returns: The number of bytes of input batch elements.