apache_beam.ml.inference.vllm_inference module
- class apache_beam.ml.inference.vllm_inference.OpenAIChatMessage(role: str, content: str)[source]
Bases: object

Dataclass containing previous chat messages in a conversation. Role is the entity that sent the message (either 'user' or 'system'). Content is the text of the message.
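A minimal sketch of building a chat history with OpenAIChatMessage (the role and content values here are illustrative):

from apache_beam.ml.inference.vllm_inference import OpenAIChatMessage

messages = [
    OpenAIChatMessage(role='system', content='You are a helpful assistant.'),
    OpenAIChatMessage(role='user', content='What does Apache Beam do?'),
]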
- class apache_beam.ml.inference.vllm_inference.VLLMCompletionsModelHandler(model_name: str, vllm_server_kwargs: Dict[str, str] | None = None)[source]
Bases: ModelHandler[str, PredictionResult, _VLLMModelServer]

Implementation of the ModelHandler interface for vLLM using text as input.
Example Usage:
pcoll | RunInference(VLLMCompletionsModelHandler(model_name='facebook/opt-125m'))
A fuller pipeline sketch follows the run_inference documentation below.
- Parameters:
model_name – The vLLM model. See https://docs.vllm.ai/en/latest/models/supported_models.html for supported models.
vllm_server_kwargs – Any additional kwargs to be passed to your vLLM server when it is created. Will be invoked using python -m vllm.entrypoints.openai.api_server <beam provided args> <vllm_server_kwargs>. For example, you could pass {'echo': 'true'} to prepend new messages with the previous message. For a list of possible kwargs, see https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#extra-parameters-for-completions-api
- run_inference(batch: Sequence[str], model: _VLLMModelServer, inference_args: Dict[str, Any] | None = None) Iterable[PredictionResult] [source]
Runs inferences on a batch of text strings.
- Parameters:
batch – A sequence of examples as text strings.
model – A _VLLMModelServer containing info for connecting to the server.
inference_args – Any additional arguments for an inference.
- Returns:
An Iterable of type PredictionResult.
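A fuller pipeline sketch for the completions handler, assuming vLLM and the model's dependencies are installed in the execution environment (the prompts and transform labels are illustrative):

import apache_beam as beam
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.vllm_inference import VLLMCompletionsModelHandler

# Each element of the input PCollection is a prompt string; RunInference
# yields one PredictionResult per prompt.
with beam.Pipeline() as p:
    _ = (
        p
        | 'CreatePrompts' >> beam.Create([
            'Hello, my name is',
            'The capital of France is',
        ])
        | 'VLLMCompletions' >> RunInference(
            VLLMCompletionsModelHandler(model_name='facebook/opt-125m'))
        | 'Print' >> beam.Map(print))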
- class apache_beam.ml.inference.vllm_inference.VLLMChatModelHandler(model_name: str, chat_template_path: str | None = None, vllm_server_kwargs: Dict[str, str] | None = None)[source]
Bases: ModelHandler[Sequence[OpenAIChatMessage], PredictionResult, _VLLMModelServer]

Implementation of the ModelHandler interface for vLLM using previous messages as input.
Example Usage:
pcoll | RunInference(VLLMChatModelHandler(model_name='facebook/opt-125m'))
A fuller pipeline sketch follows the run_inference documentation below.
- Parameters:
model_name – The vLLM model. See https://docs.vllm.ai/en/latest/models/supported_models.html for supported models.
chat_template_path – Path to a chat template. This file must be accessible from your runner’s execution environment, so it is recommended to use a cloud based file storage system (e.g. Google Cloud Storage). For info on chat templates, see: https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#chat-template
vllm_server_kwargs – Any additional kwargs to be passed to your vLLM server when it is created. Will be invoked using python -m vllm.entrypoints.openai.api_server <beam provided args> <vllm_server_kwargs>. For example, you could pass {'echo': 'true'} to prepend new messages with the previous message. For a list of possible kwargs, see https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#extra-parameters-for-chat-api
- run_inference(batch: Sequence[Sequence[OpenAIChatMessage]], model: _VLLMModelServer, inference_args: Dict[str, Any] | None = None) Iterable[PredictionResult] [source]
Runs inferences on a batch of OpenAI chat messages.
- Parameters:
batch – A sequence of examples as OpenAI messages.
model – A _VLLMModelServer containing info for connecting to the server.
inference_args – Any additional arguments for an inference.
- Returns:
An Iterable of type PredictionResult.
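A fuller pipeline sketch for the chat handler, where each element is a whole conversation (a sequence of OpenAIChatMessage objects); the model name reuses the module's own example, and the messages and transform labels are illustrative:

import apache_beam as beam
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.vllm_inference import (
    OpenAIChatMessage,
    VLLMChatModelHandler,
)

conversation = [
    OpenAIChatMessage(role='system', content='You are a helpful assistant.'),
    OpenAIChatMessage(role='user', content='Summarize Apache Beam in one sentence.'),
]

# Each element of the input PCollection is one conversation; RunInference
# yields one PredictionResult per conversation.
with beam.Pipeline() as p:
    _ = (
        p
        | 'CreateConversations' >> beam.Create([conversation])
        | 'VLLMChat' >> RunInference(
            VLLMChatModelHandler(model_name='facebook/opt-125m'))
        | 'Print' >> beam.Map(print))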