Running custom models
Advanced topic
This guide is for advanced users of the NOS server-side custom model registry. If you're looking for a way to quickly define your custom model and runtime for serving purposes, we recommend you go through the serving custom models guide first.
In this guide, we will walk through how to run custom models with NOS. We will load the OpenAI CLIP model with the popular HuggingFace transformers library, and then use NOS to wrap and execute the model at scale.
👩‍💻 Defining the custom model
Here, we use the popular OpenAI CLIP model to extract image and text embeddings, loading it with the HuggingFace transformers CLIPModel.
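Below is a minimal sketch of such a model: a plain Python class exposing the encode_image and encode_text methods referenced throughout this guide. The checkpoint name is an illustrative default.

```python
from typing import List, Union

import numpy as np
import torch
from PIL import Image


class CLIP:
    """Custom CLIP model wrapping the HuggingFace transformers CLIPModel."""

    def __init__(self, model_name: str = "openai/clip-vit-base-patch32"):
        from transformers import CLIPModel, CLIPProcessor

        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = CLIPModel.from_pretrained(model_name).to(self.device).eval()
        self.processor = CLIPProcessor.from_pretrained(model_name)

    @torch.inference_mode()
    def encode_image(self, images: Union[Image.Image, List[Image.Image]]) -> np.ndarray:
        """Embed one or more images into the CLIP feature space."""
        inputs = self.processor(images=images, return_tensors="pt").to(self.device)
        return self.model.get_image_features(**inputs).cpu().numpy()

    @torch.inference_mode()
    def encode_text(self, texts: Union[str, List[str]]) -> np.ndarray:
        """Embed one or more text prompts into the CLIP feature space."""
        inputs = self.processor(text=texts, return_tensors="pt", padding=True).to(self.device)
        return self.model.get_text_features(**inputs).cpu().numpy()
```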
📦 Wrapping the custom model
In the section below, we'll show you a straightforward way to wrap the CLIP model with nos and run it at scale. In theory, you can wrap any custom Python class that is serializable with cloudpickle. Models are wrapped with the ModelSpec class, which is a serializable specification of a model. In this example, we'll use the ModelSpec.from_cls method to wrap the CLIP model.
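A sketch of the wrapping step is shown below, assuming the CLIP class defined above. ModelSpec.from_cls and ModelHandle are the APIs named in this guide; the init_kwargs keyword and the ModelManager used to instantiate the handle are assumptions that may vary across NOS versions.

```python
from nos.common import ModelSpec  # import path may vary across NOS versions
from nos.managers import ModelHandle, ModelManager  # assumed location
from PIL import Image

# Wrap the custom CLIP class into a serializable model specification.
spec: ModelSpec = ModelSpec.from_cls(
    CLIP,
    init_kwargs={"model_name": "openai/clip-vit-base-patch32"},  # assumed keyword
)

# Instantiate a logical handle that proxies the wrapped class's methods.
manager = ModelManager()
handle: ModelHandle = manager.load(spec)  # assumed loader

# Call the underlying methods just like the original CLIP class.
image_embedding = handle.encode_image(images=[Image.open("test.jpg")])
text_embedding = handle.encode_text(texts=["two dogs playing in the snow"])
```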
As you can see, we can use the ModelHandle to call the underlying methods encode_image and encode_text just like we would with the original CLIP class. The ModelHandle is a logical handle for the model that allows us to run the model at scale without having to worry about the underlying details of the model.
🚀 Scaling the model
Once the model handle has been created, we can also use it to scale the model across multiple GPUs, or even multiple nodes. ModelHandle exposes a scale() method that lets you manually specify the number of replicas to scale the model to. Alternatively, with a more advanced NOS feature, the number of replicas can be inferred automatically from the memory overhead of the model via scale(replicas="auto").
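Both variants look like this (the replica count is illustrative):

```python
# Manually scale the model to 4 replicas (e.g. one per GPU).
handle.scale(replicas=4)

# Or let NOS infer the replica count from the model's memory overhead.
handle.scale(replicas="auto")
```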
Continuing with the example above, we scale the model to 4 replicas. To use all of the underlying replicas effectively, the calls to the underlying methods encode_image and encode_text must no longer block. In other words, the calls need to be asynchronous so that they can fully utilize the model replicas without blocking on each other. NOS provides a few convenience methods to submit tasks and retrieve results asynchronously via its handle.results API.
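The sketch below illustrates this pattern. The handle.scale(replicas=4) call and the handle.results API come from this guide; the submit() signature, the get_next() results accessor, and the video_frames() helper are assumptions for illustration and may differ from the actual NOS interface.

```python
from typing import Any, Iterable, Iterator

import cv2  # illustrative frame source; any image iterator works


def video_frames(path: str) -> Iterator[Any]:
    """Yield RGB frames from a video file (illustrative helper)."""
    cap = cv2.VideoCapture(path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        yield cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    cap.release()


def imap(handle, inputs: Iterable[Any], chunk: int = 8) -> Iterator[Any]:
    """Strawman imap: keep up to `chunk` tasks in flight and yield
    results as they become available, so replicas never sit idle."""
    in_flight = 0
    for item in inputs:
        handle.submit(method="encode_image", images=[item])  # assumed signature
        in_flight += 1
        if in_flight >= chunk:
            yield handle.results.get_next()  # assumed accessor
            in_flight -= 1
    while in_flight:  # drain the remaining in-flight tasks
        yield handle.results.get_next()
        in_flight -= 1


# Scale to 4 replicas and stream image embeddings from a video.
handle.scale(replicas=4)
for embedding in imap(handle, video_frames("input.mp4")):
    ...  # consume embeddings as they arrive
```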
In the example above, we load images from a video file and asynchronously submit encode_image tasks to the 4 replicas created with a single handle.scale(replicas=4) call. We also showed how to implement a strawman, yet performant, imap that asynchronously submits tasks to the underlying replicas and yields results as they become available. This lets us keep the replicas busy without blocking on each other, thereby fully utilizing the underlying hardware.
🛠️ Running models in a custom runtime environment
For custom models that require execution in a custom runtime environment (e.g. with TensorRT or other library dependencies), we can specify the runtime environment via the runtime_env argument in the ModelSpec.
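For instance, a sketch along these lines, assuming a RuntimeEnv helper for declaring extra pip packages (both the helper's name and its from_packages constructor are assumptions for illustration):

```python
from nos.common import ModelSpec, RuntimeEnv  # RuntimeEnv name/location is an assumption

# Wrap the model as before, but pin its execution to a custom runtime
# environment with additional library dependencies (e.g. TensorRT).
spec = ModelSpec.from_cls(
    CLIP,
    init_kwargs={"model_name": "openai/clip-vit-base-patch32"},  # assumed keyword
    runtime_env=RuntimeEnv.from_packages(["tensorrt"]),  # assumed constructor
)
```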
For more details about custom runtime environments, please see the runtime environments section.