Runtime environments
The NOS inference server supports custom runtime environments through the `InferenceServiceRuntime` class and the configurations defined within it. This class provides a high-level interface for defining new custom runtime environments that can be used with NOS.
## ⚡️ NOS Inference Runtime
We use Docker to configure different worker configurations to run workloads in different runtime environments. The configured runtime environments are specified in the `InferenceServiceRuntime` class, which wraps the generic `DockerRuntime` class. For convenience, we have pre-built several runtime environments that can be used out-of-the-box: `cpu`, `gpu`, `inf2`, etc.
This is the general flow of how the runtime environments are configured:

- Configure runtime environments, including `cpu`, `gpu`, `inf2`, etc., in the `InferenceServiceRuntime` `configs` dictionary.
- Start the server with the appropriate runtime environment via the `--runtime` flag (a Python example of selecting a runtime follows this list).
- The Ray cluster is now configured within the appropriate runtime environment and has access to the appropriate libraries and binaries.
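For example, selecting a runtime when bringing up the server from Python might look like the following. This is a minimal sketch that assumes the `nos.init(...)` client helper and that runtime names match the keys of `InferenceServiceRuntime.configs`; the exact entrypoint and accepted values may differ in your NOS version.

```python
import nos

# Start the inference server inside the GPU runtime container
# (assumption: runtime names mirror the keys of InferenceServiceRuntime.configs,
#  e.g. "cpu", "gpu", "inf2").
nos.init(runtime="gpu")
```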
For custom runtime support, we use Ray to configure different worker configurations (a custom conda environment, together with custom resource naming) so that workers can be scheduled onto the appropriate runtime environment (see below).
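As an illustration of this pattern, here is a bare Ray sketch; the conda environment name (`my-inf2-env`) and the custom resource label (`inf2`) are hypothetical examples, not names defined by NOS.

```python
import ray

ray.init()

# Run a worker inside a custom conda environment and pin it to nodes that
# advertise a custom "inf2" resource (the cluster must be started with that
# resource for scheduling to succeed).
@ray.remote(runtime_env={"conda": "my-inf2-env"}, resources={"inf2": 1})
def run_inference(prompt: str) -> str:
    # ... model execution inside the custom runtime ...
    return prompt.upper()

print(ray.get(run_inference.remote("hello")))
```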
## 🏃‍♂️ Supported Runtimes
The following runtimes are supported by NOS:
| Status | Name | PyTorch | Hardware | Base | Size | Description |
| --- | --- | --- | --- | --- | --- | --- |
| ✅ | `autonomi/nos:latest-cpu` | 2.1.1 | CPU | `debian:buster-slim` | 1.1 GB | CPU-only runtime. |
| ✅ | `autonomi/nos:latest-gpu` | 2.1.1 | NVIDIA GPU | `nvidia/cuda:11.8.0-base-ubuntu22.04` | 3.9 GB | GPU runtime. |
| ✅ | `autonomi/nos:latest-inf2` | 1.13.1 | AWS Inferentia2 | `debian:buster-slim` | 1.7 GB | Inf2 runtime with torch-neuronx. |
| Coming Soon | `trt` | 2.0.1 | NVIDIA GPU | `nvidia/cuda:11.8.0-base-ubuntu22.04` | – | GPU runtime with TensorRT (8.4.2.4). |
## 🛠️ Adding a custom runtime
To define a new custom runtime environment, extend the `InferenceServiceRuntime` class and add new configurations to the existing `configs` variable, as sketched below.
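A minimal sketch of registering an extra configuration follows; the image name, container name, and keyword arguments are hypothetical and simply mirror the fields of the built-in configs listed underneath (the exact `InferenceServiceRuntimeConfig` signature may differ across NOS versions).

```python
from nos.server._runtime import InferenceServiceRuntime, InferenceServiceRuntimeConfig

# Hypothetical custom GPU runtime built on top of your own Docker image.
InferenceServiceRuntime.configs["custom-gpu"] = InferenceServiceRuntimeConfig(
    image="myorg/nos:latest-custom-gpu",
    name="nos-inference-service-custom-gpu",
    device="gpu",
    kwargs={"nano_cpus": int(8e9), "mem_limit": "12g"},
)
```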
`nos.server._runtime.InferenceServiceRuntime.configs` *(class-attribute, instance-attribute)*

    configs = {
        'cpu': InferenceServiceRuntimeConfig(
            image=NOS_DOCKER_IMAGE_CPU,
            name=f'{NOS_INFERENCE_SERVICE_CONTAINER_NAME}-cpu',
            kwargs={
                'nano_cpus': int(6e9),
                'mem_limit': '6g',
                'log_config': {'type': JSON, 'config': {'max-size': '100m', 'max-file': '10'}},
            },
        ),
        'gpu': InferenceServiceRuntimeConfig(
            image=NOS_DOCKER_IMAGE_GPU,
            name=f'{NOS_INFERENCE_SERVICE_CONTAINER_NAME}-gpu',
            device='gpu',
            kwargs={
                'nano_cpus': int(8e9),
                'mem_limit': '12g',
                'log_config': {'type': JSON, 'config': {'max-size': '100m', 'max-file': '10'}},
            },
        ),
        'trt': InferenceServiceRuntimeConfig(
            image='autonomi/nos:latest-trt',
            name=f'{NOS_INFERENCE_SERVICE_CONTAINER_NAME}-trt',
            device='gpu',
            kwargs={
                'nano_cpus': int(8e9),
                'mem_limit': '12g',
                'log_config': {'type': JSON, 'config': {'max-size': '100m', 'max-file': '10'}},
            },
        ),
        'inf2': InferenceServiceRuntimeConfig(
            image='autonomi/nos:latest-inf2',
            name=f'{NOS_INFERENCE_SERVICE_CONTAINER_NAME}-inf2',
            device='inf2',
            environment=_default_environment({'NEURON_RT_VISIBLE_CORES': 2}),
            kwargs={
                'nano_cpus': int(8e9),
                'log_config': {'type': JSON, 'config': {'max-size': '100m', 'max-file': '10'}},
            },
        ),
    }
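The available runtime names are simply the keys of this dictionary, so a quick sanity check (using the module path shown above) looks like:

```python
from nos.server._runtime import InferenceServiceRuntime

# Prints the registered runtime names, e.g. ['cpu', 'gpu', 'trt', 'inf2'].
print(list(InferenceServiceRuntime.configs.keys()))
```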