Running custom models
Advanced topic
This guide is for advanced users of the NOS server-side custom model registry. If you're looking to quickly define a custom model and runtime for serving purposes, we recommend going through the serving custom models guide first.
In this guide, we will walk through how to run custom models with NOS. We will load the OpenAI CLIP model from the popular HuggingFace `transformers` library, and then use `nos` to wrap and execute the model at scale.
👩‍💻 Defining the custom model
Here, we're using the popular OpenAI CLIP model to extract embeddings, via the HuggingFace `transformers` `CLIPModel`.
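A minimal version of such a wrapper class might look like the sketch below. The checkpoint name and method signatures are illustrative assumptions; the key point is that `encode_image` and `encode_text` are plain Python methods on a plain Python class.

```python
from typing import List, Union

import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor


class CLIP:
    """Custom CLIP wrapper around the HuggingFace `transformers` CLIPModel."""

    def __init__(self, model_name: str = "openai/clip-vit-base-patch32"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = CLIPModel.from_pretrained(model_name).to(self.device).eval()
        self.processor = CLIPProcessor.from_pretrained(model_name)

    def encode_image(self, images: Union[Image.Image, List[Image.Image]]) -> np.ndarray:
        """Embed one or more images into the CLIP embedding space."""
        with torch.inference_mode():
            inputs = self.processor(images=images, return_tensors="pt").to(self.device)
            return self.model.get_image_features(**inputs).cpu().numpy()

    def encode_text(self, texts: Union[str, List[str]]) -> np.ndarray:
        """Embed one or more text prompts into the CLIP embedding space."""
        with torch.inference_mode():
            inputs = self.processor(text=texts, return_tensors="pt", padding=True).to(self.device)
            return self.model.get_text_features(**inputs).cpu().numpy()
```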
📦 Wrapping the custom model
In the section below, we'll show you a straightforward way to wrap the CLIP model with `nos` and run it at scale. In theory, you can wrap any custom Python class that is serializable with `cloudpickle`. Models are wrapped with the `ModelSpec` class, which is a serializable specification of a model. In this example, we'll use the `ModelSpec.from_cls` method to wrap the CLIP model.
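A sketch of the wrapping step is shown below. The `ModelSpec.from_cls` call is the method named above; the `ModelManager` import path and `load()` call used here to materialize a `ModelHandle` from the spec are assumptions, so consult the NOS API reference for the exact entry point.

```python
from PIL import Image

from nos.common import ModelSpec       # import path assumed
from nos.managers import ModelManager  # manager/entry point assumed

# Wrap the custom CLIP class in a serializable model specification.
spec: ModelSpec = ModelSpec.from_cls(CLIP)

# Materialize a logical ModelHandle for the model from its spec.
manager = ModelManager()
handle = manager.load(spec)

# Call the underlying methods just as you would on the original class.
image = Image.open("image.jpg")  # any PIL image works here
image_embedding = handle.encode_image(images=[image])
text_embedding = handle.encode_text(texts=["two dogs playing fetch"])
```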
As you can see, we can use the `ModelHandle` to call the underlying methods `encode_image` and `encode_text` just like we would with the original `CLIP` class. The `ModelHandle` is a logical handle for the model that allows us to run it at scale without having to worry about the model's underlying execution details.
🚀 Scaling the model
Once the model handle has been created, we can also use it to scale the model across multiple GPUs, or even multiple nodes. `ModelHandle` exposes a `scale()` method that allows you to manually specify the number of replicas for the model. Optionally, NOS can automatically infer the number of replicas from the memory overhead of the model via `scale(replicas="auto")`.
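For example, continuing with the handle created above:

```python
# Manually scale the model to 4 replicas across the available GPUs.
handle.scale(replicas=4)

# Alternatively, let NOS infer the replica count from the model's
# memory overhead.
handle.scale(replicas="auto")
```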
Continuing the example above, we scale the model to 4 replicas. To use all the underlying replicas effectively, we need to ensure that the calls to the underlying methods `encode_image` and `encode_text` are no longer blocking. In other words, the calls to the underlying methods must be asynchronous so that they can fully utilize the model replicas without blocking on each other. NOS provides a few convenience methods to `submit` tasks and retrieve results asynchronously using its `handle.results` API.
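The sketch below illustrates this pattern. The frame-decoding helper is ordinary OpenCV code, but the exact task-submission and result-retrieval calls (`handle.submit(...)` and `handle.results.get()`) are assumptions based on the `submit`/`handle.results` API described above; check the NOS API reference for the precise signatures.

```python
import cv2  # assumed here only for decoding video frames
from PIL import Image


def video_frames(path: str):
    """Yield frames from a video file as PIL images."""
    cap = cv2.VideoCapture(path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        yield Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()


def imap(handle, frames, max_in_flight: int = 8):
    """Strawman imap: submit tasks ahead of retrieval so every replica
    stays busy, yielding embeddings as they become available.
    NOTE: the submit/results calls below are assumed signatures."""
    in_flight = 0
    for frame in frames:
        handle.submit(images=[frame], method="encode_image")  # assumed
        in_flight += 1
        if in_flight >= max_in_flight:
            yield handle.results.get()  # assumed retrieval call
            in_flight -= 1
    while in_flight:
        yield handle.results.get()
        in_flight -= 1


# Scale to 4 replicas and stream image embeddings from a video file.
handle.scale(replicas=4)
for embedding in imap(handle, video_frames("video.mp4")):
    ...  # consume embeddings as they arrive
```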
In the example above, we load images from a video file and asynchronously submit `encode_image` tasks to the 4 replicas we trivially created with the `handle.scale(replicas=4)` call. It shows a strawman, yet performant, `imap` implementation that asynchronously submits tasks to the underlying replicas and yields the results as they become available. This lets us saturate the replicas without blocking on each other, thereby fully utilizing the underlying hardware.
🛠️ Running models in a custom runtime environment
For custom models that require execution in a custom runtime environment (e.g. with `TensorRT` or other library dependencies), we can specify the runtime environment via the `runtime_env` argument in the `ModelSpec`.
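A sketch of what this could look like is below; the `"trt"` runtime name is an illustrative placeholder, so refer to the runtime environments section for the runtimes that NOS actually ships with.

```python
# Specify a custom runtime for the model via `runtime_env`; the "trt"
# runtime name here is an illustrative placeholder.
spec = ModelSpec.from_cls(
    CLIP,
    runtime_env="trt",
)
```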
For more details about custom runtime environments, please see the runtime environments section.