apache_beam.ml.inference.tensorrt_inference module

class apache_beam.ml.inference.tensorrt_inference.TensorRTEngine(engine: trt.ICudaEngine)[source]

Bases: object

Implementation of the TensorRTEngine class, which handles the allocations associated with a TensorRT engine.

Example Usage:

TensorRTEngine(engine)
Parameters:
  • engine – a trt.ICudaEngine object that contains the TensorRT engine.
get_engine_attrs()[source]

Returns TensorRT engine attributes.

class apache_beam.ml.inference.tensorrt_inference.TensorRTEngineHandlerNumPy(min_batch_size: int, max_batch_size: int, *, inference_fn: Callable[[Sequence[numpy.ndarray], apache_beam.ml.inference.tensorrt_inference.TensorRTEngine, Optional[Dict[str, Any]]], Iterable[apache_beam.ml.inference.base.PredictionResult]] = <function _default_tensorRT_inference_fn>, large_model: bool = False, model_copies: Optional[int] = None, max_batch_duration_secs: Optional[int] = None, **kwargs)[source]

Bases: apache_beam.ml.inference.base.ModelHandler

Implementation of the ModelHandler interface for TensorRT.

Example Usage:

pcoll | RunInference(
    TensorRTEngineHandlerNumPy(
      min_batch_size=1,
      max_batch_size=1,
      engine_path="my_uri"))

NOTE: This API and its implementation are under development and do not provide backward compatibility guarantees.
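The usage above can be expanded into a complete pipeline. The following is a minimal sketch, not a definitive implementation: the function name build_pipeline, the step labels, and the examples argument are hypothetical, and the engine path is a placeholder you would replace with your own serialized engine. Imports are kept inside the function so the definition loads even on a machine without Beam or TensorRT installed.

```python
def build_pipeline(engine_path, examples, min_batch_size=1, max_batch_size=1):
    """Builds (but does not run) a Beam pipeline that applies RunInference.

    engine_path: location of a serialized TensorRT engine (placeholder).
    examples: an iterable of numpy.ndarray inputs (hypothetical test data).
    """
    # Imported lazily so that defining this function does not require
    # apache_beam to be installed.
    import apache_beam as beam
    from apache_beam.ml.inference.base import RunInference
    from apache_beam.ml.inference.tensorrt_inference import (
        TensorRTEngineHandlerNumPy)

    handler = TensorRTEngineHandlerNumPy(
        min_batch_size=min_batch_size,
        max_batch_size=max_batch_size,
        engine_path=engine_path)

    pipeline = beam.Pipeline()
    _ = (
        pipeline
        | "CreateExamples" >> beam.Create(examples)
        | "RunInference" >> RunInference(handler)
        | "PrintResults" >> beam.Map(print))
    return pipeline
```

Calling build_pipeline("my_uri", examples).run() would execute the pipeline on the default runner. Loading the engine requires a worker with a GPU and TensorRT available.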

Parameters:
  • min_batch_size – minimum accepted batch size.
  • max_batch_size – maximum accepted batch size.
  • inference_fn – the inference function to use on RunInference calls. default: _default_tensorRT_inference_fn
  • large_model – set to true if your model is large enough to run into memory pressure if you load multiple copies. Given a model that consumes N memory and a machine with W cores and M memory, you should set this to True if N*W > M.
  • model_copies – The exact number of models that you would like loaded onto your machine. This can be useful if you know your CPU or GPU capacity exactly and want to maximize resource utilization.
  • max_batch_duration_secs – the maximum amount of time to buffer a batch before emitting; used in streaming contexts.
  • kwargs – Additional arguments like ‘engine_path’ and ‘onnx_path’ are currently supported. ‘env_vars’ can be used to set environment variables before loading the model.

See https://docs.nvidia.com/deeplearning/tensorrt/api/python_api/ for details
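The large_model rule of thumb in the parameter list (set it to True if N*W > M) is plain arithmetic. The sketch below illustrates it; the helper name should_set_large_model and the sample sizes are hypothetical and are not part of the Beam API.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def should_set_large_model(model_bytes, worker_cores, machine_bytes):
    """Returns True when one model copy per core would exceed machine memory.

    Implements the documented rule: N * W > M, where N is the memory used
    by one model copy, W is the core count, and M is the machine memory.
    """
    return model_bytes * worker_cores > machine_bytes


# A 6 GiB model on an 8-core, 32 GiB machine: 48 GiB > 32 GiB.
print(should_set_large_model(6 * GIB, 8, 32 * GIB))  # True

# A 2 GiB model on the same machine: 16 GiB <= 32 GiB.
print(should_set_large_model(2 * GIB, 8, 32 * GIB))  # False
```

When the check returns True, passing large_model=True (or an explicit model_copies value) to the handler is the documented way to avoid loading one copy per process.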

batch_elements_kwargs()[source]

Sets min_batch_size and max_batch_size of a TensorRT engine.

load_model() → apache_beam.ml.inference.tensorrt_inference.TensorRTEngine[source]

Loads and initializes a TensorRT engine for processing.

load_onnx() → Tuple[trt.INetworkDefinition, trt.Builder][source]

Loads and parses an ONNX model for processing.

build_engine(network: trt.INetworkDefinition, builder: trt.Builder) → apache_beam.ml.inference.tensorrt_inference.TensorRTEngine[source]

Builds an engine from the parsed or created network.
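load_onnx() and build_engine() are meant to be chained when you start from an ONNX file rather than a prebuilt engine. The following is a minimal sketch of that flow. The wrapper name engine_from_onnx is hypothetical, and the unpacking order (network first, then builder) is an assumption based on the parameter order of build_engine(). Running it requires TensorRT and a GPU, so the imports are kept inside the function.

```python
def engine_from_onnx(onnx_path, max_batch_size=1):
    """Parses an ONNX model and builds a TensorRTEngine from it.

    onnx_path: location of the ONNX model file (placeholder).
    """
    # Imported lazily so that defining this function does not require
    # apache_beam or tensorrt to be installed.
    from apache_beam.ml.inference.tensorrt_inference import (
        TensorRTEngineHandlerNumPy)

    handler = TensorRTEngineHandlerNumPy(
        min_batch_size=1,
        max_batch_size=max_batch_size,
        onnx_path=onnx_path)

    # Assumed tuple order: (network, builder).
    network, builder = handler.load_onnx()
    return handler.build_engine(network, builder)
```

Building an engine can be slow, so in practice it may be preferable to build it once ahead of time and pass the serialized result to the handler through engine_path.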

run_inference(batch: Sequence[numpy.ndarray], engine: apache_beam.ml.inference.tensorrt_inference.TensorRTEngine, inference_args: Optional[Dict[str, Any]] = None) → Iterable[apache_beam.ml.inference.base.PredictionResult][source]

Runs inferences on a batch of Tensors and returns an Iterable of TensorRT Predictions.

Parameters:
  • batch – A np.ndarray or a np.ndarray that represents a concatenation of multiple arrays as a batch.
  • engine – A TensorRT engine.
  • inference_args – Any additional arguments for an inference that are not applicable to TensorRT.
Returns:

An Iterable of type PredictionResult.

get_num_bytes(batch: Sequence[numpy.ndarray]) → int[source]
Returns: The number of bytes of data for a batch of Tensors.
get_metrics_namespace() → str[source]

Returns a namespace for metrics collected by the RunInference transform.

share_model_across_processes() → bool[source]
model_copies() → int[source]