HW-aware execution

The model manager is the main workhorse of the NOS server. It manages the lifecycle of models and executes inference requests across a pool of workers that may live on heterogeneous hardware devices. Specifically, the model manager is responsible for the following:

  • 📡 Remote Model Execution: The model manager is responsible for the remote inference / execution of models concurrently across a pool of workers. We use Ray to distribute the inference workloads. In addition, the manager maintains the functional specification of each model, allowing NOS to optionally trace and compile the code paths that are critical for low-latency execution. This lets developers further optimize their models without explicitly generating compiled or traced artifacts through ahead-of-time toolchains. In essence, the model manager provides a logical handle for remote model execution while leveraging the appropriate hardware devices available for inference (CPUs, GPUs, ASICs). A sketch of what this looks like with Ray appears after this list.

  • 🚀 Distributed Inference: On top of remote model execution, NOS leverages Ray to scale models across a large number of GPUs / HW accelerators, fully utilizing large compute clusters and auto-scaling models based on load. This allows NOS to scale models across multiple nodes and leverage the full compute capacity of the cluster (see the pool-of-replicas sketch after this list).

  • 💪 Automatic Batched Execution: By profiling a model, we can determine the batch size that maximizes its throughput, and batch incoming inference requests accordingly. The model manager takes care of batching requests implicitly, allowing users to simply submit tasks asynchronously while the model manager dispatches the batched inference tasks to an appropriately scaled pool of workers (a micro-batching sketch appears after this list).

  • 🔧 Tunable Performance: One of the difficulties with using PyTorch is that you have to manually tune your models for the underlying HW without much visibility into the memory profile of each model. The onus is on the developer to manually profile every model they run and allocate it appropriately in device memory. This becomes especially inconvenient when dealing with tens of models that may have different memory and execution profiles. The model manager makes this transparent to the user, so that they can fully utilize their underlying HW without having to manually tune each model for every deployment (a simple profiling sketch appears after this list).

  • 🧠 Device Memory Management: Large models can take up a lot of device memory, and it is important to free that memory when it is no longer needed. The model manager is responsible for garbage-collecting models that are no longer needed. We implement a FIFO (first-in, first-out) eviction policy for garbage collection, but this can be extended to other policies such as LRU, LFU, etc. While FIFO is simple and predictable, it is not necessarily the most effective policy in practice (a sketch of FIFO eviction appears after this list).

  • 🧰 Model Compilation: Coming Soon!
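
To make the remote-execution bullet concrete, below is a minimal sketch of how a model can be wrapped in a Ray actor and called remotely. The `ModelWorker` class, its `predict` method, and the toy `torch.nn.Linear` model are illustrative placeholders, not the NOS API; the only real dependencies are Ray and PyTorch.

```python
import ray
import torch

ray.init(ignore_reinit_error=True)

@ray.remote
class ModelWorker:
    """Hypothetical worker actor: each replica owns one model copy on its own device."""

    def __init__(self, model_fn):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = model_fn().to(self.device).eval()

    def predict(self, batch: torch.Tensor) -> torch.Tensor:
        with torch.inference_mode():
            return self.model(batch.to(self.device)).cpu()

# A logical handle for remote execution: requesting `num_gpus=1` pins the actor to a GPU.
num_gpus = 1 if torch.cuda.is_available() else 0
worker = ModelWorker.options(num_gpus=num_gpus).remote(lambda: torch.nn.Linear(8, 2))

# Callers get futures back and fetch results whenever they need them.
future = worker.predict.remote(torch.randn(4, 8))
print(ray.get(future).shape)  # torch.Size([4, 2])
```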
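
Extending the same sketch, a pool of such workers can be spread across whatever accelerators Ray sees in the cluster and requests fanned out over the pool. `ray.util.ActorPool` is a standard Ray utility; the replica-count heuristic and dispatch below stand in for NOS's own scheduling and auto-scaling logic.

```python
from ray.util import ActorPool

# Size the replica pool from the GPUs visible across the cluster (fall back to one CPU replica).
cluster_gpus = int(ray.cluster_resources().get("GPU", 0))
replicas = [
    ModelWorker.options(num_gpus=1 if cluster_gpus else 0).remote(lambda: torch.nn.Linear(8, 2))
    for _ in range(max(cluster_gpus, 1))
]

# Fan 32 requests out over the pool and gather the results.
pool = ActorPool(replicas)
inputs = [torch.randn(4, 8) for _ in range(32)]
outputs = list(pool.map(lambda actor, batch: actor.predict.remote(batch), inputs))
```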
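
The batched-execution behaviour can be pictured as a small micro-batcher: callers submit single inputs asynchronously, and a dispatcher groups whatever has arrived, up to a batch-size limit derived from profiling, into one forward pass. The class and parameter names below are illustrative, not the NOS implementation.

```python
import asyncio
import torch

class MicroBatcher:
    """Illustrative micro-batcher: groups queued requests into one forward pass."""

    def __init__(self, model, max_batch_size: int = 8, max_wait_s: float = 0.005):
        self.model = model
        self.max_batch_size = max_batch_size  # in practice, derived from profiling
        self.max_wait_s = max_wait_s
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, x: torch.Tensor) -> torch.Tensor:
        # Callers just submit a single input and await its result.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((x, fut))
        return await fut

    async def run(self):
        while True:
            # Block for the first request, then collect more until the batch is
            # full or the wait budget is exhausted.
            items = [await self.queue.get()]
            loop = asyncio.get_running_loop()
            deadline = loop.time() + self.max_wait_s
            while len(items) < self.max_batch_size and loop.time() < deadline:
                try:
                    items.append(await asyncio.wait_for(self.queue.get(), deadline - loop.time()))
                except asyncio.TimeoutError:
                    break
            xs, futs = zip(*items)
            with torch.inference_mode():
                ys = self.model(torch.stack(xs))
            for fut, y in zip(futs, ys):
                fut.set_result(y)

async def main():
    batcher = MicroBatcher(torch.nn.Linear(8, 2))
    dispatcher = asyncio.create_task(batcher.run())
    outputs = await asyncio.gather(*[batcher.submit(torch.randn(8)) for _ in range(32)])
    dispatcher.cancel()
    print(len(outputs), outputs[0].shape)  # 32 torch.Size([2])

asyncio.run(main())
```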
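
For the tunable-performance bullet, the kind of visibility the manager relies on can be approximated with PyTorch's CUDA memory statistics: load a model, run one representative forward pass, and record the peak device memory used. This sketch assumes a CUDA device is available; the function name and what NOS actually records per model are illustrative.

```python
import torch

def profile_peak_device_memory(model_fn, sample_input: torch.Tensor, device: str = "cuda") -> int:
    """Return the peak device memory (in bytes) used by one forward pass of the model."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats(device)
    model = model_fn().to(device).eval()
    with torch.inference_mode():
        model(sample_input.to(device))
    peak_bytes = torch.cuda.max_memory_allocated(device)
    del model
    torch.cuda.empty_cache()
    return peak_bytes

# e.g. profile_peak_device_memory(lambda: torch.nn.Linear(4096, 4096), torch.randn(64, 4096))
```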
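
Finally, the FIFO eviction policy described above is essentially an ordered cache with a device-memory budget: when loading a new model would exceed the budget, the oldest loaded model is evicted first. The class below is an illustrative sketch; the `unload()` hook stands in for whatever actually frees the evicted model's device memory.

```python
from collections import OrderedDict

class FIFOModelCache:
    """Illustrative FIFO eviction over a fixed device-memory budget."""

    def __init__(self, memory_budget_bytes: int):
        self.memory_budget = memory_budget_bytes
        self.used_bytes = 0
        self.models: OrderedDict = OrderedDict()  # insertion order == load order

    def add(self, name: str, handle, size_bytes: int):
        # Evict the oldest models until the new one fits within the budget.
        while self.models and self.used_bytes + size_bytes > self.memory_budget:
            _, (old_handle, old_size) = self.models.popitem(last=False)
            old_handle.unload()  # placeholder hook that frees the evicted model's device memory
            self.used_bytes -= old_size
        self.models[name] = (handle, size_bytes)
        self.used_bytes += size_bytes
```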