apache_beam.ml.inference.vllm_inference module

class apache_beam.ml.inference.vllm_inference.OpenAIChatMessage(role: str, content: str)[source]

Bases: object

Dataclass containing previous chat messages in conversation. Role is the entity that sent the message (either ‘user’ or ‘system’). Content is the contents of the message.

role: str
content: str
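
A minimal sketch of assembling one conversation from this dataclass; the message contents are purely illustrative:

from apache_beam.ml.inference.vllm_inference import OpenAIChatMessage

# One conversation is a sequence of messages; role is 'user' or 'system'.
conversation = [
    OpenAIChatMessage(role='system', content='You are a helpful assistant.'),
    OpenAIChatMessage(role='user', content='Summarize Apache Beam in one sentence.'),
]
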
class apache_beam.ml.inference.vllm_inference.VLLMCompletionsModelHandler(model_name: str, vllm_server_kwargs: Dict[str, str] | None = None)[source]

Bases: ModelHandler[str, PredictionResult, _VLLMModelServer]

Implementation of the ModelHandler interface for vLLM using text as input.

Example Usage:

pcoll | RunInference(VLLMCompletionsModelHandler(model_name='facebook/opt-125m'))
Parameters:
  • model_name – The name of the vLLM-compatible model to load (for example, 'facebook/opt-125m').

  • vllm_server_kwargs – Optional additional keyword arguments passed to the vLLM server when it is started.
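
A fuller pipeline sketch (assuming a local runner and the example model above; the prompts are illustrative):

import apache_beam as beam
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.vllm_inference import VLLMCompletionsModelHandler

prompts = [
    'The capital of France is',
    'A haiku about distributed data processing:',
]

with beam.Pipeline() as p:
  _ = (
      p
      | 'CreatePrompts' >> beam.Create(prompts)
      | 'Generate' >> RunInference(
          VLLMCompletionsModelHandler(model_name='facebook/opt-125m'))
      | 'Print' >> beam.Map(print))
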
load_model() → _VLLMModelServer[source]
run_inference(batch: Sequence[str], model: _VLLMModelServer, inference_args: Dict[str, Any] | None = None) → Iterable[PredictionResult][source]

Runs inferences on a batch of text strings.

Parameters:
  • batch – A sequence of examples as text strings.

  • model – A _VLLMModelServer containing info for connecting to the server.

  • inference_args – Any additional arguments for an inference.

Returns:

An Iterable of type PredictionResult.
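
RunInference accepts an inference_args dict and forwards it to this method with each batch. A minimal sketch, assuming the keys ('max_tokens', 'temperature') are OpenAI-style completion parameters accepted by your vLLM server version:

from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.vllm_inference import VLLMCompletionsModelHandler

# 'max_tokens' and 'temperature' are standard OpenAI-style completion
# parameters; confirm that your vLLM server version accepts them.
inference = RunInference(
    VLLMCompletionsModelHandler(model_name='facebook/opt-125m'),
    inference_args={'max_tokens': 32, 'temperature': 0.7})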

share_model_across_processes() → bool[source]
class apache_beam.ml.inference.vllm_inference.VLLMChatModelHandler(model_name: str, chat_template_path: str | None = None, vllm_server_kwargs: Dict[str, str] | None = None)[source]

Bases: ModelHandler[Sequence[OpenAIChatMessage], PredictionResult, _VLLMModelServer]

Implementation of the ModelHandler interface for vLLM using previous messages as input.

Example Usage:

pcoll | RunInference(VLLMChatModelHandler(model_name='facebook/opt-125m'))
Parameters:
  • model_name – The name of the vLLM-compatible model to load (for example, 'facebook/opt-125m').

  • chat_template_path – Optional path to a chat template file to apply to each conversation before it is sent to the model.

  • vllm_server_kwargs – Optional additional keyword arguments passed to the vLLM server when it is started.
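
A fuller pipeline sketch (assuming a local runner and the example model above; the conversation is illustrative):

import apache_beam as beam
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.vllm_inference import (
    OpenAIChatMessage, VLLMChatModelHandler)

# Each element is one conversation: a sequence of OpenAIChatMessage objects.
conversations = [
    [
        OpenAIChatMessage(role='system', content='You are a terse assistant.'),
        OpenAIChatMessage(role='user', content='Name one Apache Beam runner.'),
    ],
]

with beam.Pipeline() as p:
  _ = (
      p
      | 'CreateChats' >> beam.Create(conversations)
      | 'Chat' >> RunInference(
          VLLMChatModelHandler(model_name='facebook/opt-125m'))
      | 'Print' >> beam.Map(print))
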
load_model() → _VLLMModelServer[source]
run_inference(batch: Sequence[Sequence[OpenAIChatMessage]], model: _VLLMModelServer, inference_args: Dict[str, Any] | None = None) → Iterable[PredictionResult][source]

Runs inferences on a batch of OpenAI chat message sequences.

Parameters:
  • batch – A sequence of examples as OpenAI messages.

  • model – A _VLLMModelServer containing info for connecting to the vLLM server.

  • inference_args – Any additional arguments for an inference.

Returns:

An Iterable of type PredictionResult.

share_model_across_processes() → bool[source]
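
Downstream of either handler, each output element is a PredictionResult. A minimal sketch of pulling out the input and the raw server response; the exact structure of the response object depends on the server, so inspect it before parsing it further:

import apache_beam as beam

def format_result(result):
  # result.example is the original input (prompt or conversation) and
  # result.inference is the raw response returned by the vLLM server.
  return {'example': result.example, 'inference': result.inference}

# ... | RunInference(VLLMChatModelHandler(model_name='facebook/opt-125m'))
#     | 'Format' >> beam.Map(format_result)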