apache_beam.ml.inference.tensorrt_inference module

class apache_beam.ml.inference.tensorrt_inference.TensorRTEngine(engine: tensorrt.ICudaEngine)[source]

Bases: object

Wrapper class that handles the allocations associated with a TensorRT engine.

Example Usage:

TensorRTEngine(engine)

Parameters: engine – a trt.ICudaEngine object that contains the TensorRT engine.

get_engine_attrs()[source]: Returns TensorRT engine attributes.

class apache_beam.ml.inference.tensorrt_inference.TensorRTEngineHandlerNumPy(min_batch_size: int, max_batch_size: int, *, inference_fn: Callable[[Sequence[ndarray], TensorRTEngine, dict[str, Any] | None], Iterable[PredictionResult]] = <function _default_tensorRT_inference_fn>, large_model: bool = False, model_copies: int | None = None, max_batch_duration_secs: int | None = None, **kwargs)[source]

Bases: ModelHandler[ndarray, PredictionResult, TensorRTEngine]

Implementation of the ModelHandler interface for TensorRT.

Example Usage:

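from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.tensorrt_inference import TensorRTEngineHandlerNumPy

# pcoll is an existing PCollection of np.ndarray inputs.
pcoll | RunInference(
    TensorRTEngineHandlerNumPy(
        min_batch_size=1,
        max_batch_size=1,
        engine_path="my_uri"))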

NOTE: This API and its implementation are under development and do not provide backward compatibility guarantees.
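As a fuller illustration of the usage above, the following sketch wires the handler into a small pipeline. It assumes that Beam, NumPy and TensorRT are installed and that the worker has a CUDA-capable GPU. The function name run_tensorrt_pipeline, the default input_shape and the printing step are hypothetical and chosen only for illustration; the input shape must match whatever the engine actually expects.

```python
def run_tensorrt_pipeline(engine_path, input_shape=(3, 224, 224)):
    """Run a one-element pipeline against a serialized TensorRT engine."""
    # Imports are deferred so this sketch loads on machines without Beam or
    # TensorRT; calling the function still requires both.
    import numpy as np
    import apache_beam as beam
    from apache_beam.ml.inference.base import RunInference
    from apache_beam.ml.inference.tensorrt_inference import (
        TensorRTEngineHandlerNumPy)

    handler = TensorRTEngineHandlerNumPy(
        min_batch_size=1,
        max_batch_size=1,
        engine_path=engine_path)

    # input_shape is an assumption; it must match the engine's input binding.
    examples = [np.zeros(input_shape, dtype=np.float32)]

    with beam.Pipeline() as pipeline:
        _ = (
            pipeline
            | "CreateInputs" >> beam.Create(examples)
            | "RunInference" >> RunInference(handler)
            # Each element is a PredictionResult with example and inference.
            | "ExtractInference" >> beam.Map(lambda result: result.inference)
            | "Print" >> beam.Map(print))
```
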

Parameters:

min_batch_size – minimum accepted batch size.
max_batch_size – maximum accepted batch size.
inference_fn – the inference function to use on RunInference calls. default: _default_tensorRT_inference_fn
large_model – set to True if your model is large enough to cause memory pressure when multiple copies are loaded. Given a model that consumes N memory and a machine with W cores and M memory, set this to True if N*W > M.
model_copies – The exact number of models that you would like loaded onto your machine. This can be useful if you exactly know your CPU or GPU capacity and want to maximize resource utilization.
max_batch_duration_secs – the maximum amount of time to buffer a batch before emitting; used in streaming contexts.
kwargs – Additional keyword arguments. ‘engine_path’ and ‘onnx_path’ are currently supported. ‘env_vars’ can be used to set environment variables before loading the model.

See https://docs.nvidia.com/deeplearning/tensorrt/api/python_api/ for details.
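The large_model rule of thumb in the parameter list (N*W > M) is plain arithmetic, so it can be checked before configuring the handler. The sketch below is a minimal illustration; the helper name needs_large_model and the sizes are hypothetical and not part of the Beam API.

```python
def needs_large_model(model_bytes: int, cores: int, machine_bytes: int) -> bool:
    """Return True when one model copy per core would exceed machine memory."""
    return model_bytes * cores > machine_bytes


GIB = 1024 ** 3

# A 4 GiB model on an 8-core, 16 GiB machine would need 32 GiB in total.
print(needs_large_model(4 * GIB, 8, 16 * GIB))

# A 1 GiB model on the same machine needs 8 GiB, which fits.
print(needs_large_model(1 * GIB, 8, 16 * GIB))
```

When the check is true, passing large_model=True (or an explicit model_copies value) to the handler is the documented way to avoid loading one copy per core.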

batch_elements_kwargs()[source]: Sets min_batch_size and max_batch_size of a TensorRT engine.

load_model() → TensorRTEngine[source]: Loads and initializes a TensorRT engine for processing.

load_onnx() → tuple[tensorrt.INetworkDefinition, tensorrt.Builder][source]: Loads and parses an ONNX model for processing.

build_engine(network: tensorrt.INetworkDefinition, builder: tensorrt.Builder) → TensorRTEngine[source]: Builds an engine from the parsed or created network.
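The signatures of load_onnx() and build_engine() suggest that they chain: the tuple returned by the first supplies the two arguments of the second. The sketch below shows that flow under this assumption. The helper names engine_from_onnx and make_onnx_handler are hypothetical, and actually building an engine requires TensorRT and a GPU.

```python
def engine_from_onnx(handler):
    """Chain load_onnx() into build_engine() on a TensorRT model handler."""
    network, builder = handler.load_onnx()
    return handler.build_engine(network, builder)


def make_onnx_handler(onnx_path: str):
    """Create a handler configured with an ONNX file instead of an engine."""
    # Imported lazily so this sketch loads without Beam or TensorRT installed.
    from apache_beam.ml.inference.tensorrt_inference import (
        TensorRTEngineHandlerNumPy)

    return TensorRTEngineHandlerNumPy(
        min_batch_size=1,
        max_batch_size=1,
        onnx_path=onnx_path)
```

A call such as engine_from_onnx(make_onnx_handler("my_model.onnx")) would then yield a TensorRTEngine; the file name here is a placeholder.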

run_inference(batch: Sequence[ndarray], engine: TensorRTEngine, inference_args: dict[str, Any] | None = None) → Iterable[PredictionResult][source]

Runs inferences on a batch of Tensors and returns an Iterable of TensorRT Predictions.

Parameters:

batch – An np.ndarray, or an np.ndarray that represents multiple arrays concatenated into a batch.
engine – A TensorRT engine.
inference_args – Any additional arguments for an inference that are not applicable to TensorRT.

Returns:

An Iterable of type PredictionResult.
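The (batch, engine, inference_args) signature above is the same shape that the inference_fn constructor parameter expects, so a custom function can wrap another one. The sketch below is one possible pattern, not the library's own implementation: with_timing and on_timing are hypothetical names, and base_fn stands for any callable with the documented signature.

```python
import time
from typing import Any, Callable, Iterable, Optional, Sequence

# Same shape as the inference_fn parameter: (batch, engine, inference_args).
InferenceFn = Callable[[Sequence[Any], Any, Optional[dict]], Iterable[Any]]


def with_timing(
    base_fn: InferenceFn,
    on_timing: Callable[[int, float], None]) -> InferenceFn:
    """Wrap an inference function and report batch size and elapsed seconds."""

    def timed_fn(batch, engine, inference_args=None):
        start = time.perf_counter()
        # Materialize the results so the timing covers the whole batch.
        results = list(base_fn(batch, engine, inference_args))
        on_timing(len(batch), time.perf_counter() - start)
        return results

    return timed_fn
```

The wrapped function could then be supplied as inference_fn=with_timing(base_fn, on_timing) when constructing TensorRTEngineHandlerNumPy.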

get_num_bytes(batch: Sequence[ndarray]) → int[source]

Returns: The number of bytes of data for a batch of Tensors.

get_metrics_namespace() → str[source]: Returns a namespace for metrics collected by the RunInference transform.

share_model_across_processes() → bool[source]

model_copies() → int[source]