
Runtime environments

The NOS inference server supports custom runtime environments through the InferenceServiceRuntime class and the configurations defined within it. This class provides a high-level interface for defining new runtime environments that can be used with NOS.

⚡️ NOS Inference Runtime

We use Docker to define the worker configurations that run workloads in different runtime environments. The available runtime environments are specified in the InferenceServiceRuntime class, which wraps the generic DockerRuntime class. For convenience, we pre-build several runtime environments that can be used out-of-the-box (cpu, gpu, inf2, etc.).

This is the general flow of how the runtime environments are configured:

- Configure runtime environments (cpu, gpu, inf2, etc.) in the InferenceServiceRuntime configs dictionary.
- Start the server with the appropriate runtime environment via the --runtime flag, as shown in the sketch below.
- The Ray cluster is now configured within the appropriate runtime environment and has access to the relevant libraries and binaries.
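As a concrete sketch of this flow, assuming the Python client exposes an `nos.init` helper with a `runtime` argument (as in the NOS quickstart), starting the server under the GPU runtime looks roughly like:

```python
import nos

# Bring up the inference server inside the GPU runtime container.
# `runtime` selects one of the keys in InferenceServiceRuntime.configs
# ("cpu", "gpu", "inf2", ...); "auto" lets NOS pick based on available hardware.
nos.init(runtime="gpu")
```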

For custom runtime support, we use Ray to define per-worker configurations (custom conda environments, custom resource names) so that workers can run in different runtime environments (see below).
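NOS handles this wiring internally, but the underlying mechanism corresponds to Ray's `runtime_env` and custom-resource scheduling. Below is a minimal standalone Ray sketch; the `inf2` resource name and `torch-neuronx` conda environment are illustrative, not NOS defaults:

```python
import ray

# Start a local Ray cluster that advertises a custom "inf2" resource.
# In NOS, the cluster inside the runtime container is configured for you.
ray.init(resources={"inf2": 2})

# Schedule work onto a worker that holds one "inf2" resource and runs in a
# specific conda environment ("torch-neuronx" is a hypothetical env name here).
@ray.remote(resources={"inf2": 1}, runtime_env={"conda": "torch-neuronx"})
def infer(prompt: str) -> str:
    return prompt.upper()  # model execution would happen here

print(ray.get(infer.remote("hello")))
```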

🏃‍♂️ Supported Runtimes

The following runtimes are supported by NOS:

| Status | Name | PyTorch | HW | Base | Size | Description |
|--------|------|---------|----|------|------|-------------|
| | autonomi/nos:latest-cpu | 2.1.1 | CPU | debian:buster-slim | 1.1 GB | CPU-only runtime. |
| | autonomi/nos:latest-gpu | 2.1.1 | NVIDIA GPU | nvidia/cuda:11.8.0-base-ubuntu22.04 | 3.9 GB | GPU runtime. |
| | autonomi/nos:latest-inf2 | 1.13.1 | AWS Inferentia2 | debian:buster-slim | 1.7 GB | Inf2 runtime with torch-neuronx. |
| Coming Soon | autonomi/nos:latest-trt | 2.0.1 | NVIDIA GPU | nvidia/cuda:11.8.0-base-ubuntu22.04 | | GPU runtime with TensorRT (8.4.2.4). |
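The pre-built images listed above can be pulled directly with Docker, e.g. `docker pull autonomi/nos:latest-gpu`.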

🛠️ Adding a custom runtime

To define a new custom runtime environment, you can extend the InferenceServiceRuntime class and add new configurations to its configs class attribute.

nos.server._runtime.InferenceServiceRuntime.configs (class attribute)

configs = {
    'cpu': InferenceServiceRuntimeConfig(
        image=NOS_DOCKER_IMAGE_CPU,
        name=f'{NOS_INFERENCE_SERVICE_CONTAINER_NAME}-cpu',
        kwargs={
            'nano_cpus': int(6000000000.0),
            'mem_limit': '6g',
            'log_config': {'type': JSON, 'config': {'max-size': '100m', 'max-file': '10'}},
        },
    ),
    'gpu': InferenceServiceRuntimeConfig(
        image=NOS_DOCKER_IMAGE_GPU,
        name=f'{NOS_INFERENCE_SERVICE_CONTAINER_NAME}-gpu',
        device='gpu',
        kwargs={
            'nano_cpus': int(8000000000.0),
            'mem_limit': '12g',
            'log_config': {'type': JSON, 'config': {'max-size': '100m', 'max-file': '10'}},
        },
    ),
    'trt': InferenceServiceRuntimeConfig(
        image='autonomi/nos:latest-trt',
        name=f'{NOS_INFERENCE_SERVICE_CONTAINER_NAME}-trt',
        device='gpu',
        kwargs={
            'nano_cpus': int(8000000000.0),
            'mem_limit': '12g',
            'log_config': {'type': JSON, 'config': {'max-size': '100m', 'max-file': '10'}},
        },
    ),
    'inf2': InferenceServiceRuntimeConfig(
        image='autonomi/nos:latest-inf2',
        name=f'{NOS_INFERENCE_SERVICE_CONTAINER_NAME}-inf2',
        device='inf2',
        environment=_default_environment({'NEURON_RT_VISIBLE_CORES': 2}),
        kwargs={
            'nano_cpus': int(8000000000.0),
            'log_config': {'type': JSON, 'config': {'max-size': '100m', 'max-file': '10'}},
        },
    ),
}
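For example, a new runtime could be registered by adding another `InferenceServiceRuntimeConfig` entry. The sketch below assumes the names above (`InferenceServiceRuntimeConfig`, `NOS_INFERENCE_SERVICE_CONTAINER_NAME`) are importable from `nos.server._runtime`, and uses a hypothetical `myorg/nos:latest-custom` image:

```python
from nos.server._runtime import (
    NOS_INFERENCE_SERVICE_CONTAINER_NAME,
    InferenceServiceRuntime,
    InferenceServiceRuntimeConfig,
)

# Hypothetical "custom-gpu" runtime backed by your own Docker image.
InferenceServiceRuntime.configs["custom-gpu"] = InferenceServiceRuntimeConfig(
    image="myorg/nos:latest-custom",
    name=f"{NOS_INFERENCE_SERVICE_CONTAINER_NAME}-custom-gpu",
    device="gpu",
    kwargs={
        "nano_cpus": int(8e9),
        "mem_limit": "12g",
    },
)
```

Once registered, the new key should be selectable when starting the server (e.g. via the --runtime flag described above).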