apache_beam.ml.inference.vllm_inference module

class apache_beam.ml.inference.vllm_inference.OpenAIChatMessage(role: str, content: str)[source]

Bases: object

Dataclass containing previous chat messages in a conversation. Role is the entity that sent the message (either 'user' or 'system'). Content is the text of the message.

role: str
content: str
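
A minimal construction sketch (the message contents are illustrative):

from apache_beam.ml.inference.vllm_inference import OpenAIChatMessage

# role must be 'user' or 'system'; content is the message text.
conversation = [
    OpenAIChatMessage(role='system', content='You are a helpful assistant.'),
    OpenAIChatMessage(role='user', content='What is Apache Beam?'),
]
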
class apache_beam.ml.inference.vllm_inference.VLLMCompletionsModelHandler(model_name: str, vllm_server_kwargs: dict[str, str] | None = None, *, min_batch_size: int | None = None, max_batch_size: int | None = None, max_batch_duration_secs: int | None = None, max_batch_weight: int | None = None, element_size_fn: Callable[[Any], int] | None = None)[source]

Bases: ModelHandler[str, PredictionResult, _VLLMModelServer]

Implementation of the ModelHandler interface for vLLM using text as input.

Example Usage:

pcoll | RunInference(VLLMCompletionsModelHandler(model_name='facebook/opt-125m'))

A fuller pipeline sketch follows the parameter list below.

Parameters:
  • model_name – The vLLM model. See https://docs.vllm.ai/en/latest/models/supported_models.html for supported models.

  • vllm_server_kwargs – Any additional kwargs to be passed into your vLLM server when it is created. Will be invoked using python -m vllm.entrypoints.openai.api_server <beam provided args> <vllm_server_kwargs>. For example, you could pass {'echo': 'true'} to prepend new messages with the previous message. For a list of possible kwargs, see https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#extra-parameters-for-completions-api

  • min_batch_size – Optional. The minimum batch size to use when batching inputs.

  • max_batch_size – Optional. The maximum batch size to use when batching inputs.

  • max_batch_duration_secs – Optional. The maximum amount of time to buffer a batch before emitting; used in streaming contexts.

  • max_batch_weight – Optional. The maximum total weight of a batch, as measured by element_size_fn.

  • element_size_fn – Optional. A function that returns the size (weight) of an element, used with max_batch_weight.
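
An expanded pipeline sketch. The model name and batch size below are illustrative choices, not required values:

import apache_beam as beam
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.vllm_inference import VLLMCompletionsModelHandler

handler = VLLMCompletionsModelHandler(
    model_name='facebook/opt-125m',  # any model from the supported-models list
    max_batch_size=8)                # illustrative batching cap

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create(['Hello, my name is', 'The capital of France is'])
        | RunInference(handler)
        | beam.Map(print))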

load_model() → _VLLMModelServer[source]

run_inference(batch: Sequence[str], model: _VLLMModelServer, inference_args: dict[str, Any] | None = None) → Iterable[PredictionResult][source]

Runs inferences on a batch of text strings.

Parameters:
  • batch – A sequence of examples as text strings.

  • model – A _VLLMModelServer containing info for connecting to the server.

  • inference_args – Any additional arguments for an inference.

Returns:

An Iterable of type PredictionResult.
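Per-request options can be supplied via inference_args. A hedged sketch, assuming the server accepts standard OpenAI Completions parameters such as max_tokens and temperature (see the extra-parameters link above):

pcoll | RunInference(
    VLLMCompletionsModelHandler(model_name='facebook/opt-125m'),
    inference_args={'max_tokens': 64, 'temperature': 0.7})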

share_model_across_processes() → bool[source]

class apache_beam.ml.inference.vllm_inference.VLLMChatModelHandler(model_name: str, chat_template_path: str | None = None, vllm_server_kwargs: dict[str, str] | None = None, *, min_batch_size: int | None = None, max_batch_size: int | None = None, max_batch_duration_secs: int | None = None, max_batch_weight: int | None = None, element_size_fn: Callable[[Any], int] | None = None)[source]

Bases: ModelHandler[Sequence[OpenAIChatMessage], PredictionResult, _VLLMModelServer]

Implementation of the ModelHandler interface for vLLM using previous messages as input.

Example Usage:

pcoll | RunInference(VLLMChatModelHandler(model_name='facebook/opt-125m'))

A fuller pipeline sketch follows the parameter list below.

Parameters:
  • model_name – The vLLM model. See https://docs.vllm.ai/en/latest/models/supported_models.html for supported models.

  • chat_template_path – Path to a chat template. This file must be accessible from your runner's execution environment, so it is recommended to use a cloud-based file storage system (e.g. Google Cloud Storage). For info on chat templates, see: https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#chat-template

  • vllm_server_kwargs – Any additional kwargs to be passed into your vLLM server when it is created. Will be invoked using python -m vllm.entrypoints.openai.api_server <beam provided args> <vllm_server_kwargs>. For example, you could pass {'echo': 'true'} to prepend new messages with the previous message. For a list of possible kwargs, see https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#extra-parameters-for-chat-api

  • min_batch_size – Optional. The minimum batch size to use when batching inputs.

  • max_batch_size – Optional. The maximum batch size to use when batching inputs.

  • max_batch_duration_secs – Optional. The maximum amount of time to buffer a batch before emitting; used in streaming contexts.

  • max_batch_weight – Optional. The maximum total weight of a batch, as measured by element_size_fn.

  • element_size_fn – Optional. A function that returns the size (weight) of an element, used with max_batch_weight.
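
An expanded pipeline sketch. The model name, messages, and the commented-out template path are illustrative assumptions:

import apache_beam as beam
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.vllm_inference import (
    OpenAIChatMessage, VLLMChatModelHandler)

handler = VLLMChatModelHandler(
    model_name='facebook/opt-125m',
    # chat_template_path='gs://my-bucket/chat_template.jinja',  # optional, hypothetical path
)

# Each element is one conversation: a sequence of OpenAIChatMessage objects.
conversations = [
    [OpenAIChatMessage(role='system', content='You are a terse assistant.'),
     OpenAIChatMessage(role='user', content='Name one Apache Beam runner.')],
]

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create(conversations)
        | RunInference(handler)
        | beam.Map(print))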

load_model() → _VLLMModelServer[source]

run_inference(batch: Sequence[Sequence[OpenAIChatMessage]], model: _VLLMModelServer, inference_args: dict[str, Any] | None = None) → Iterable[PredictionResult][source]

Runs inferences on a batch of conversations, where each conversation is a sequence of OpenAIChatMessage objects.

Parameters:
  • batch – A sequence of examples, where each example is a sequence of OpenAIChatMessage objects.

  • model – A _VLLMModelServer containing info for connecting to the server.

  • inference_args – Any additional arguments for an inference.

Returns:

An Iterable of type PredictionResult.
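Each PredictionResult pairs the input with the model output. A minimal consumption sketch; the exact payload shape of result.inference depends on the handler and server response, so the snippet treats it as opaque:

def format_result(result):
    # result.example is the input conversation;
    # result.inference holds the model's response payload.
    return f'{result.example} -> {result.inference}'

results | beam.Map(format_result)  # results: the output PCollection of RunInference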

share_model_across_processes() → bool[source]