apache_beam.ml.inference.tensorrt_inference module
class apache_beam.ml.inference.tensorrt_inference.TensorRTEngine(engine: trt.ICudaEngine)

   Bases: object

   Implementation of the TensorRTEngine class, which handles the allocations associated with a TensorRT engine.

   Example Usage:

      TensorRTEngine(engine)

   Parameters:
      engine – trt.ICudaEngine object that contains the TensorRT engine.
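   In a typical flow, the engine argument is obtained by deserializing a prebuilt TensorRT plan file with the TensorRT Python API. The snippet below is a minimal sketch of that construction, assuming a hypothetical local plan file "model.trt"; it is illustrative only and not the handler's own loading logic.

      import tensorrt as trt

      from apache_beam.ml.inference.tensorrt_inference import TensorRTEngine

      # Deserialize a prebuilt TensorRT plan file into a trt.ICudaEngine.
      # "model.trt" is a hypothetical path used only for illustration.
      logger = trt.Logger(trt.Logger.WARNING)
      with open("model.trt", "rb") as f, trt.Runtime(logger) as runtime:
          cuda_engine = runtime.deserialize_cuda_engine(f.read())

      # Wrap the engine so the handler can manage its buffers and allocations.
      engine = TensorRTEngine(cuda_engine)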
class apache_beam.ml.inference.tensorrt_inference.TensorRTEngineHandlerNumPy(min_batch_size: int, max_batch_size: int, *, inference_fn: Callable[[Sequence[numpy.ndarray], apache_beam.ml.inference.tensorrt_inference.TensorRTEngine, Optional[Dict[str, Any]]], Iterable[apache_beam.ml.inference.base.PredictionResult]] = <function _default_tensorRT_inference_fn>, large_model: bool = False, max_batch_duration_secs: Optional[int] = None, **kwargs)

   Bases: apache_beam.ml.inference.base.ModelHandler

   Implementation of the ModelHandler interface for TensorRT.

   Example Usage:

      pcoll | RunInference(
          TensorRTEngineHandlerNumPy(
              min_batch_size=1,
              max_batch_size=1,
              engine_path="my_uri"))

   NOTE: This API and its implementation are under development and do not provide backward-compatibility guarantees.

   Parameters:
      - min_batch_size – minimum accepted batch size.
      - max_batch_size – maximum accepted batch size.
      - inference_fn – the inference function to use on RunInference calls. Default: _default_tensorRT_inference_fn.
      - large_model – set to True if your model is large enough to cause memory pressure when multiple copies are loaded. Given a model that consumes N memory and a machine with W cores and M memory, set this to True if N*W > M.
      - max_batch_duration_secs – the maximum amount of time to buffer a batch before emitting; used in streaming contexts.
      - kwargs – additional arguments; 'engine_path' and 'onnx_path' are currently supported. 'env_vars' can be used to set environment variables before loading the model.

   See https://docs.nvidia.com/deeplearning/tensorrt/api/python_api/ for details. A more complete pipeline sketch is shown below.
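   The following is a minimal end-to-end sketch of the usage above. The engine path "gs://my_bucket/model.trt", the batch sizes, and the input shape are assumptions chosen for illustration, not requirements of the API.

      import apache_beam as beam
      import numpy as np

      from apache_beam.ml.inference.base import RunInference
      from apache_beam.ml.inference.tensorrt_inference import TensorRTEngineHandlerNumPy

      # Hypothetical prebuilt engine; the input shape depends on your model.
      engine_handler = TensorRTEngineHandlerNumPy(
          min_batch_size=1,
          max_batch_size=4,
          engine_path="gs://my_bucket/model.trt")

      examples = [
          np.random.rand(3, 224, 224).astype(np.float32) for _ in range(8)
      ]

      with beam.Pipeline() as pipeline:
          _ = (
              pipeline
              | "CreateExamples" >> beam.Create(examples)
              | "RunInference" >> RunInference(engine_handler)
              # Each output element is a PredictionResult.
              | "Print" >> beam.Map(print))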
   load_model() → apache_beam.ml.inference.tensorrt_inference.TensorRTEngine

      Loads and initializes a TensorRT engine for processing.
   load_onnx() → Tuple[trt.INetworkDefinition, trt.Builder]

      Loads and parses an ONNX model for processing.
   build_engine(network: trt.INetworkDefinition, builder: trt.Builder) → apache_beam.ml.inference.tensorrt_inference.TensorRTEngine

      Builds an engine from the parsed/created network.
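      When only an ONNX model is available, these two methods can be used together to produce an engine. A minimal sketch, assuming a hypothetical local file "model.onnx" passed via the handler's onnx_path keyword argument:

         from apache_beam.ml.inference.tensorrt_inference import TensorRTEngineHandlerNumPy

         # Hypothetical ONNX model path, for illustration only.
         handler = TensorRTEngineHandlerNumPy(
             min_batch_size=1,
             max_batch_size=4,
             onnx_path="model.onnx")

         # Parse the ONNX file into a TensorRT network, then build an engine from it.
         network, builder = handler.load_onnx()
         engine = handler.build_engine(network, builder)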
   run_inference(batch: Sequence[numpy.ndarray], engine: apache_beam.ml.inference.tensorrt_inference.TensorRTEngine, inference_args: Optional[Dict[str, Any]] = None) → Iterable[apache_beam.ml.inference.base.PredictionResult]

      Runs inference on a batch of Tensors and returns an Iterable of TensorRT predictions.

      Parameters:
         - batch – a sequence of np.ndarrays, or an np.ndarray that represents a concatenation of multiple arrays as a batch.
         - engine – a TensorRT engine.
         - inference_args – any additional arguments for an inference; not applicable to TensorRT.

      Returns:
         An Iterable of type PredictionResult.
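      Outside of a pipeline, this method can be called directly, for example when debugging a handler. A minimal sketch, assuming a hypothetical prebuilt engine file "model.trt" and float32 inputs of shape (3, 224, 224):

         import numpy as np

         from apache_beam.ml.inference.tensorrt_inference import TensorRTEngineHandlerNumPy

         # Hypothetical engine path and input shape, for illustration only.
         handler = TensorRTEngineHandlerNumPy(
             min_batch_size=1,
             max_batch_size=4,
             engine_path="model.trt")

         engine = handler.load_model()
         batch = [np.random.rand(3, 224, 224).astype(np.float32) for _ in range(4)]

         for prediction in handler.run_inference(batch, engine):
             # Each PredictionResult pairs an input example with its inference output.
             print(prediction.example.shape, type(prediction.inference))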
   get_num_bytes(batch: Sequence[numpy.ndarray]) → int

      Returns:
         The number of bytes of data for a batch of Tensors.
   get_metrics_namespace() → str

      Returns a namespace for metrics collected by the RunInference transform.