

Introducing the NOS Inferentia2 Runtime

We are excited to announce the availability of the AWS Inferentia2 runtime on NOS - a.k.a. our inf2 runtime. This runtime is designed to easily serve models on AWS Inferentia2, a high-performance, purpose-built chip for inference. In this blog post, we introduce the AWS Inferentia2 runtime and show you how to trivially deploy a model on an AWS Inferentia2 device using NOS. If you have followed our previous tutorial on serving LLMs on a budget (on NVIDIA hardware), you will be pleasantly surprised by how easy it is to deploy a model on Inferentia2 using the pre-baked NOS inf2 runtime we provide.

⚑️ What is AWS Inferentia2?

AWS Inferentia2 (Inf2 for short) is the second-generation inference accelerator from AWS. Inf2 instances raise the performance of Inf1 (originally launched in 2019) by delivering 3x higher compute performance, 4x larger total accelerator memory, up to 4x higher throughput, and up to 10x lower latency. Inf2 instances are the first inference-optimized instances in Amazon EC2 to support scale-out distributed inference with ultra-high-speed connectivity between accelerators.

Relative to the AWS G5 instances (NVIDIA A10G), Inf2 instances promise up to 50% better performance-per-watt. Inf2 instances are ideal for applications such as natural language processing, recommender systems, image classification and recognition, speech recognition, and language translation that can take advantage of scale-out distributed inference.

| Instance Size | Inf2 Accelerators | Accelerator Memory (GB) | vCPUs | Memory (GiB) | On-Demand Price / hr | Spot Price / hr |
| --- | --- | --- | --- | --- | --- | --- |
| inf2.xlarge | 1 | 32 | 4 | 16 | $0.76 | $0.32 |
| inf2.8xlarge | 1 | 32 | 32 | 128 | $1.97 | $0.79 |
| inf2.24xlarge | 6 | 192 | 96 | 384 | $6.49 | $2.45 |
| inf2.48xlarge | 12 | 384 | 192 | 768 | $12.98 | $5.13 |

πŸƒβ€β™‚οΈ NOS Inference Runtime

The NOS inference server supports custom runtime environments through the use of the InferenceServiceRuntime class - a high-level interface for defining new containerized and hardware-aware runtime environments. NOS already ships with runtime environments for NVIDIA GPUs (gpu) and Intel/ARM CPUs (cpu). Today, we're adding the NOS Inferentia2 runtime (inf2) with the AWS Neuron drivers, the AWS Neuron SDK and NOS pre-installed. This allows developers to quickly develop applications for AWS Inferentia2, without wasting any precious time on the complexities of setting up the AWS Neuron SDK and the AWS Inferentia2 driver environments.

πŸ“¦ Deploying a PyTorch model on Inferentia2 with NOS

Deploying PyTorch models on AWS Inferentia2 chips presents a unique set of challenges, distinct from the experience with NVIDIA GPUs. This is primarily due to the static graph execution requirement of ASICs, which forces users to trace and compile models ahead of time and makes these chips less accessible to entry-level developers. In some cases, custom model tracing and compilation are essential steps to fully utilize the AWS Inferentia2 accelerators. This demands a deep understanding of the hardware-specific deployment/compiler toolchain (TensorRT, AWS Neuron SDK), the captured and data-dependent traced PyTorch graph, and the underlying hardware-specific kernel/op support, to name just a few challenges.
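To make the ahead-of-time requirement concrete, here is a rough sketch of what tracing and compiling a model for Inferentia2 looks like with the AWS Neuron SDK. This is an illustration only - it assumes an inf2 instance with torch-neuronx and transformers installed, and a fixed sequence length of 384 - and is not the exact tracing path NOS uses internally:

import torch
import torch_neuronx
from transformers import AutoModel, AutoTokenizer

model_id = "BAAI/bge-small-en-v1.5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, torchscript=True).eval()

# Neuron requires static shapes: pad every input to a fixed sequence length.
inputs = tokenizer(
    "fox jumped over the moon",
    padding="max_length", max_length=384, truncation=True, return_tensors="pt",
)
example = (inputs["input_ids"], inputs["attention_mask"])

# Compile the traced graph for the NeuronCores (this can take several minutes),
# then persist the artifact so it can be reloaded without recompiling.
neuron_model = torch_neuronx.trace(model, example)
torch.jit.save(neuron_model, "bge-small-en-v1.5-neuron.pt")

Every change in batch size or sequence length requires a new trace - exactly the kind of bookkeeping the NOS inf2 runtime is designed to hide.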

Simplifying AI hardware access with NOS

NOS aims to bridge this gap and streamline the deployment process, making it even more accessible for both entry-level and expert developers to leverage AWS Inferentia2 for their inference needs.

1. Define your custom inf2 model

In this example, we'll be using the inf2/embeddings sentence embedding tutorial on NOS. First, we'll define our custom EmbeddingServiceInf2 model in models/embeddings_inf2.py and a serve.yaml serve specification that will be used by NOS to serve our model on the AWS Inferentia2 device. The relevant files are shown below:

Directory structure of nos/examples/inf2/embeddings
$ tree .
β”œβ”€β”€ job-inf2-embeddings-deployment.yaml
β”œβ”€β”€ models
β”‚   └── embeddings_inf2.py  (1)
β”œβ”€β”€ README.md
β”œβ”€β”€ serve.yaml  (2)
└── tests
    β”œβ”€β”€ test_embeddings_inf2_client.py  (3)
    └── test_embeddings_inf2.py
  1. Main python module that defines the EmbeddingServiceInf2 model.
  2. The serve.yaml serve specification that defines the custom inf2 runtime and registers the EmbeddingServiceInf2 model with NOS.
  3. The pytest test for calling the EmbeddingServiceInf2 service via gRPC.

The embeddings interface is defined in the EmbeddingServiceInf2 module in models/embeddings_inf2.py, where the __call__ method returns the embedding of the text prompt using the BAAI/bge-small-en-v1.5 embedding model.
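For reference, a minimal sketch of what such a module might look like is shown below. This is an illustrative stand-in that calls the sentence-transformers API directly; the actual models/embeddings_inf2.py additionally handles Neuron compilation and on-disk caching of the compiled model:

from typing import List, Union

import numpy as np
from sentence_transformers import SentenceTransformer


class EmbeddingServiceInf2:
    """Illustrative sketch of the embeddings interface registered with NOS."""

    def __init__(self, model_name: str = "BAAI/bge-small-en-v1.5"):
        # The real module compiles the model for the NeuronCores on first load.
        self.model = SentenceTransformer(model_name)

    def __call__(self, texts: Union[str, List[str]]) -> np.ndarray:
        """Return one embedding vector per input text."""
        if isinstance(texts, str):
            texts = [texts]
        return self.model.encode(texts)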

2. Define the custom inf2 runtime with the NOS serve specification

The serve.yaml serve specification defines the custom embedding model, and a custom inf2 runtime that NOS uses to execute our model. Follow the annotations below to understand the different components of the serve specification.

serve.yaml
images:
  embeddings-inf2:
    base: autonomi/nos:latest-inf2  (1)
    env:
      NOS_LOGGING_LEVEL: DEBUG
      NOS_NEURON_CORES: 2
    run:
      - python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
      - pip install sentence-transformers  (2)

models:
  BAAI/bge-small-en-v1.5:
    model_cls: EmbeddingServiceInf2
    model_path: models/embeddings_inf2.py
    default_method: __call__
    runtime_env: embeddings-inf2  (3)
  1. Specifies the base runtime image to use - we use the pre-baked autonomi/nos:latest-inf2 runtime image to build our custom runtime image. This custom NOS runtime comes pre-installed with the AWS Neuron drivers and the AWS Neuron SDK.
  2. Installs the sentence-transformers library, which is used to embed the text prompt using the BAAI/bge-small-en-v1.5 model.
  3. Specifies the custom runtime environment to use for the specific model deployment - embeddings-inf2 - which is used to execute the EmbeddingServiceInf2 model.

In this example, we'll be using the Huggingface Optimum library to help us simplify the deployment process to the Inf2 chip. However, for custom model architectures and optimizations, we have built our own PyTorch tracer and compiler for a growing list of popular models on the Huggingface Hub.

Need support for custom models on AWS Inferentia2?

If you're interested in deploying a custom model on the AWS Inferentia2 chip with NOS, please reach out to us on our GitHub Issues page or at support@autonomi.ai, and we'll be happy to help you out.

3. Deploy the embedding service on AWS inf2.xlarge with SkyPilot

Now that we have defined our custom model, let's deploy this service on AWS Inferentia2 using SkyPilot. In this example, we're going to use SkyPilot's sky launch command to deploy our NOS service on an AWS inf2.xlarge on-demand instance.

Before we launch the service, let's look at the job-inf2-embeddings-deployment.yaml file that we will use to provision the inf2 instance and deploy the EmbeddingServiceInf2 model.

job-inf2-embeddings-deployment.yaml
file_mounts: (1)
  /app: .

resources:
  cloud: aws
  region: us-west-2
  instance_type: inf2.xlarge (2)
  image_id: ami-096319086cc3d5f23 # us-west-2 (3)
  disk_size: 256
  ports: 8000

setup: |
  sudo apt-get install -y docker-compose-plugin

  cd /app
  cd /app && python3 -m venv .venv && source .venv/bin/activate
  pip install git+https://github.com/autonomi-ai/nos.git pytest (4)

run: |
  source /app/.venv/bin/activate
  cd /app && nos serve up -c serve.yaml --http (5)
  1. Mounts the local working directory to /app on the remote instance, so that the serve.yaml, models/ and tests/ directories are available there.
  2. Specifies the AWS Inferentia2 instance type to use - we use the inf2.xlarge instance type.
  3. Specifies the Amazon Machine Image (AMI) to use, which comes pre-installed with the AWS Neuron drivers.
  4. We simply need pytest for testing the client-side logic in tests/test_embeddings_inf2_client.py.
  5. Starts the NOS server with the serve.yaml specification. The runtime flag --runtime inf2 is optional, and automatically detected by NOS as illustrated here.

Provisioning inf2.xlarge instances

To provision an inf2.xlarge instance, you will need to have an AWS account and the necessary service quotas set for the inf2 instance nodes. For more information on service quotas, please refer to the AWS documentation.

Using SkyPilot with inf2 instances

Due to a job submission bug in the SkyPilot CLI for inf2 instances, you will need to use the skypilot-nightly[aws] (pip install skypilot-nightly[aws]) package to provision inf2 instances correctly with the sky launch command below.

Let's deploy the inf2 embeddings service using the following command:

sky launch -c inf2-embeddings-service job-inf2-embeddings-deployment.yaml

sky launch output

You should see the following output from the sky launch command:

(nos-infra-py38) inf2/embeddings spillai-desktop [ sky launch -c inf2-embeddings-service job-inf2-embeddings-deployment.yaml
Task from YAML spec: job-inf2-embeddings-deployment.yaml
I 01-31 21:48:06 optimizer.py:694] == Optimizer ==
I 01-31 21:48:06 optimizer.py:717] Estimated cost: $0.8 / hour
I 01-31 21:48:06 optimizer.py:717]
I 01-31 21:48:06 optimizer.py:840] Considered resources (1 node):
I 01-31 21:48:06 optimizer.py:910] ------------------------------------------------------------------------------------------
I 01-31 21:48:06 optimizer.py:910]  CLOUD   INSTANCE      vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN
I 01-31 21:48:06 optimizer.py:910] ------------------------------------------------------------------------------------------
I 01-31 21:48:06 optimizer.py:910]  AWS     inf2.xlarge   4       16        Inferentia:1   us-west-2     0.76          βœ”
I 01-31 21:48:06 optimizer.py:910] ------------------------------------------------------------------------------------------
I 01-31 21:48:06 optimizer.py:910]
Launching a new cluster 'inf2-embeddings-service'. Proceed? [Y/n]: y
I 01-31 21:48:18 cloud_vm_ray_backend.py:4389] Creating a new cluster: 'inf2-embeddings-service' [1x AWS(inf2.xlarge, {'Inferentia': 1}, image_id={'us-west-2': 'ami-096319086cc3d5f23'}, ports=['8000'])].
I 01-31 21:48:18 cloud_vm_ray_backend.py:4389] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 01-31 21:48:18 cloud_vm_ray_backend.py:1386] To view detailed progress: tail -n100 -f /home/spillai/sky_logs/sky-2024-01-31-21-48-06-108390/provision.log
I 01-31 21:48:19 provisioner.py:79] Launching on AWS us-west-2 (us-west-2a,us-west-2b,us-west-2c,us-west-2d)
I 01-31 21:49:37 provisioner.py:429] Successfully provisioned or found existing instance.
I 01-31 21:51:03 provisioner.py:531] Successfully provisioned cluster: inf2-embeddings-service
I 01-31 21:51:04 cloud_vm_ray_backend.py:4418] Processing file mounts.
I 01-31 21:51:05 cloud_vm_ray_backend.py:4450] To view detailed progress: tail -n100 -f ~/sky_logs/sky-2024-01-31-21-48-06-108390/file_mounts.log
I 01-31 21:51:05 backend_utils.py:1286] Syncing (to 1 node): . -> ~/.sky/file_mounts/app
I 01-31 21:51:06 cloud_vm_ray_backend.py:3158] Running setup on 1 node.
...
(task, pid=23904) βœ“ Launching docker compose with command: docker compose -f
(task, pid=23904) /home/ubuntu/.nos/tmp/serve/docker-compose.app.yml up
(task, pid=23904)  Network serve_default  Creating
(task, pid=23904)  Network serve_default  Created
(task, pid=23904)  Container serve-nos-server-1  Creating
(task, pid=23904)  Container serve-nos-server-1  Created
(task, pid=23904)  Container serve-nos-http-gateway-1  Creating
(task, pid=23904)  Container serve-nos-http-gateway-1  Created
(task, pid=23904) Attaching to serve-nos-http-gateway-1, serve-nos-server-1
(task, pid=23904) serve-nos-http-gateway-1  | WARNING:  Current configuration will not reload as not all conditions are met, please refer to documentation.
(task, pid=23904) serve-nos-server-1        |  βœ“ InferenceExecutor :: Backend initializing (as daemon) ...
(task, pid=23904) serve-nos-server-1        |  βœ“ InferenceExecutor :: Backend initialized (elapsed=2.9s).
(task, pid=23904) serve-nos-server-1        |  βœ“ InferenceExecutor :: Connected to backend.
(task, pid=23904) serve-nos-server-1        |  βœ“ Starting gRPC server on [::]:50051
(task, pid=23904) serve-nos-server-1        |  βœ“ InferenceService :: Deployment complete (elapsed=0.0s)
(task, pid=23904) serve-nos-server-1        | (EmbeddingServiceInf2 pid=404) 2024-01-31 21:53:58.566 | INFO     | nos.neuron.device:setup_environment:36 - Setting up neuron env with 2 cores
...
(task, pid=23904) serve-nos-server-1        | (EmbeddingServiceInf2 pid=404) 2024-02-01T05:54:36Z Compiler status PASS
(task, pid=23904) serve-nos-server-1        | (EmbeddingServiceInf2 pid=404) 2024-01-31 21:54:46.928 | INFO     | EmbeddingServiceInf2:__init__:61 - Saved model to /app/.nos/cache/neuron/BAAI/bge-small-en-v1.5-bs-1-sl-384
(task, pid=23904) serve-nos-server-1        | (EmbeddingServiceInf2 pid=404) 2024-01-31 21:54:47.037 | INFO     | EmbeddingServiceInf2:__init__:64 - Loaded neuron model: BAAI/bge-small-en-v1.5
...
(task, pid=23904) serve-nos-server-1        | 2024-01-31 22:25:43.710 | INFO     | nos.server._service:Run:360 - Executing request [model=BAAI/bge-small-en-v1.5, method=None]
(task, pid=23904) serve-nos-server-1        | 2024-01-31 22:25:43.717 | INFO     | nos.server._service:Run:362 - Executed request [model=BAAI/bge-small-en-v1.5, method=None, elapsed=7.1ms]

Once complete, you should see the following (trimmed) output from the sky launch command:

βœ“ InferenceExecutor :: Backend initializing (as daemon) ...
βœ“ InferenceExecutor :: Backend initialized (elapsed=2.9s).
βœ“ InferenceExecutor :: Connected to backend.
βœ“ Starting gRPC server on [::]:50051
βœ“ InferenceService :: Deployment complete (elapsed=0.0s)
Setting up neuron env with 2 cores
...
Compiler status PASS
Saved model to /app/.nos/cache/neuron/BAAI/bge-small-en-v1.5-bs-1-sl-384
Loaded neuron model: BAAI/bge-small-en-v1.5

4. Test your custom model on the AWS Inf2 instance

Once the service is deployed, you should be able to simply make a cURL request to the inf2 instance to test the server-side logic of the embeddings model.

export IP=$(sky status --ip inf2-embeddings-service)

curl \
-X POST http://${IP}:8000/v1/infer \
-H 'Content-Type: application/json' \
-d '{
    "model_id": "BAAI/bge-small-en-v1.5",
    "inputs": {
        "texts": ["fox jumped over the moon"]
    }
}'

Optionally, you can also test the gRPC service using the provided tests/test_embeddings_inf2_client.py. For this test, however, you'll need to SSH into the inf2 instance and run the following command.

ssh inf2-embeddings-service

Once you're on the inf2.xlarge instance, you can run pytest -sv tests/test_embeddings_inf2_client.py to exercise the embeddings service end-to-end via the gRPC client.

$ pytest -sv tests/test_embeddings_inf2_client.py

Here's a simplified version of the test to execute the embeddings model.

from nos.client import Client

# Create the client
client = Client("[::]:50051")
assert client.WaitForServer()

# Load the embeddings model
model = client.Module("BAAI/bge-small-en-v1.5")

# Embed text with the model
texts = "What is the meaning of life?"
response = model(texts=texts)

πŸ€‘ What's it going to cost me?

The table below shows the costs of deploying one of these latency-optimized (bsize=1) embedding services on a single Inf2 instance on AWS. While costs are only one part of the equation, it is important to note that the AWS Inf2 instances are ~25% cheaper than the NVIDIA A10G instances, and offer a more cost-effective solution for inference workloads on AWS. In the coming weeks, we'll be digging into the performance of the Inf2 instances relative to their NVIDIA GPU counterparts on inference metrics such as latency/throughput and cost metrics such as number of requests / $, monthly costs, etc.

| Model | Cloud Instance | Spot | Cost / hr | Cost / month | # of Req. / $ |
| --- | --- | --- | --- | --- | --- |
| BAAI/bge-small-en-v1.5 | inf2.xlarge | - | $0.75 | ~$540 | ~685K / $1 |
| BAAI/bge-small-en-v1.5 | inf2.xlarge | βœ… | $0.32 | ~$230 | ~1.6M / $1 |
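As a rough sanity check on these figures, the ~7.1 ms per-request latency reported in the server logs above, combined with the hourly prices, lands within a few percent of the table (a back-of-envelope estimate that ignores batching, network overhead and partial-hour billing):

hours_per_month = 730
print(f"on-demand: ~${0.75 * hours_per_month:.0f} / month")    # ~$548 (vs ~$540 in the table)
print(f"spot:      ~${0.32 * hours_per_month:.0f} / month")    # ~$234 (vs ~$230 in the table)

latency_s = 7.1e-3                        # per-request latency at batch size 1, from the logs
requests_per_hour = 3600 / latency_s      # ~507K requests / hour
print(f"~{requests_per_hour / 0.75 / 1e3:.0f}K requests / $")  # ~676K (vs ~685K in the table)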

🎁 Wrapping up

In this post, we introduced the new NOS inf2 runtime that allows developers to easily develop and serve models on the AWS Inferentia2 chip. With more cost-efficient, inference-optimized chips coming to market (Google TPUs, Groq, Tenstorrent, etc.), we believe it is important for developers to be able to easily access and deploy models on these devices. The specialized NOS Inference Runtime aims to do just that - a fast and frictionless way to deploy models onto any AI accelerator, be it NVIDIA GPUs or AWS Inferentia2 chips, in the cloud or on-prem.

Thanks for reading, and we hope you found this post useful - and finally, give NOS a try. If you have any questions, or would like to learn more about the NOS inf2 runtime, please reach out to us on our GitHub Issues page or join us on Discord.

Serving LLMs on a budget

Deploying Large Language Models (LLMs) and Mixture of Experts (MoE) models is all the rage today, and for good reason: they are the most powerful open-source models available, and the closest in performance to OpenAI's GPT-3.5. However, it turns out that deploying these models can still be somewhat of a lift for most ML engineers and researchers, both in terms of engineering work and operational costs. For example, the recently announced Mixtral 8x7B requires 2x NVIDIA A100-80G GPUs, which can cost upwards of $5000 / month (on-demand) on CSPs.

With recent advancements in model compression, quantization and model mixing, we are now seeing an exciting race unfold to deploy these expert models on a budget, without sacrificing significantly on performance. In this blog post, we'll show you how to deploy the mlabonne/phixtral-4x2_8 model on a single NVIDIA L4 GPU for under $160 / month and easily scale-out a dirt-cheap, dedicated inference service of your own. We'll be using SkyPilot to deploy and manage our NOS service on spot (pre-emptible) instances, making them especially cost-efficient.

🧠 What is Phixtral?

Inspired by the mistralai/Mixtral-8x7B-v0.1 architecture, mlabonne/phixtral-4x2_8 is the first Mixture of Experts (MoE) made with 4 microsoft/phi-2 models, which were recently MIT-licensed. The general idea behind mixture-of-experts is to combine the capabilities of multiple models to achieve better performance than each individual model. MoEs are also significantly more memory-efficient for inference, but that's a post for a later date. In this case, we combine the capabilities of 4 microsoft/phi-2 models to achieve better performance than each of the individual 2.7B parameter models it's composed of.

Breakdown of the mlabonne/phixtral-4x2_8 model

Here's the breakdown of the 4 models that make up the mlabonne/phixtral-4x2_8 model:

base_model: cognitivecomputations/dolphin-2_6-phi-2
gate_mode: cheap_embed
experts:
  - source_model: cognitivecomputations/dolphin-2_6-phi-2
    positive_prompts: [""]
  - source_model: lxuechen/phi-2-dpo
    positive_prompts: [""]
  - source_model: Yhyu13/phi-2-sft-dpo-gpt4_en-ep1
    positive_prompts: [""]
  - source_model: mrm8488/phi-2-coder
    positive_prompts: [""]

You can go to the original model card here for more details on how the model was merged using mergekit.

Now, let's take a look at the performance of the mlabonne/phixtral-4x2_8 model on the Nous Suite compared to other models in the 2.7B parameter range.

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
| --- | --- | --- | --- | --- | --- |
| mlabonne/phixtral-4x2_8 | 33.91 | 70.44 | 48.78 | 37.82 | 47.78 |
| dolphin-2_6-phi-2 | 33.12 | 69.85 | 47.39 | 37.2 | 46.89 |
| phi-2-dpo | 30.39 | 71.68 | 50.75 | 34.9 | 46.93 |
| phi-2 | 27.98 | 70.8 | 44.43 | 35.21 | 44.61 |

πŸ’Έ Serving Phixtral on a budget with SkyPilot and NOS

Let's now see how we can deploy the mlabonne/phixtral-4x2_8 model on a single NVIDIA L4 GPU for under $160 / month. We'll be using SkyPilot to deploy and manage our NOS service on spot (pre-emptible) instances, making them especially cost-efficient.

What's SkyPilot?

If you're new to SkyPilot, we recommend you go through our NOS x SkyPilot integration page first to familiarize yourself with the tool.

1. Define your custom model and serve specification

In this example, we'll be using the llm-streaming-chat tutorial from the NOS playground. First, we'll define our custom phixtral chat model in phixtral_chat.py and a serve.phixtral.yaml serve specification that will be used by NOS to serve our model. The relevant files are shown below:

(nos-py38) nos-playground/examples/llm-streaming-chat $ tree .
β”œβ”€β”€ models
β”‚   └── phixtral_chat.py
β”œβ”€β”€ serve.phixtral.yaml

The entire chat interface is defined in the StreamingChat module in phixtral_chat.py, where the chat method returns a string iterable for the gRPC / HTTP server to stream back model predictions to the client.
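For reference, here is a minimal sketch of what such a streaming chat module could look like, assuming the Hugging Face transformers TextIteratorStreamer API; the actual phixtral_chat.py in the NOS playground may differ in its details:

from threading import Thread
from typing import Iterable

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer


class StreamingChat:
    """Illustrative sketch of a streaming chat module served by NOS."""

    def __init__(self, model_name: str = "mlabonne/phixtral-4x2_8"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
        )

    def chat(self, message: str, max_new_tokens: int = 256) -> Iterable[str]:
        """Yield generated text chunks as they are produced."""
        inputs = self.tokenizer(message, return_tensors="pt").to(self.model.device)
        streamer = TextIteratorStreamer(self.tokenizer, skip_prompt=True, skip_special_tokens=True)
        # Run generation in a background thread and stream tokens back to the caller.
        Thread(
            target=self.model.generate,
            kwargs=dict(**inputs, streamer=streamer, max_new_tokens=max_new_tokens),
        ).start()
        for chunk in streamer:
            yield chunk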

The serve.phixtral.yaml serve specification defines the custom chat model, and a custom runtime that NOS uses to execute our model. Follow the annotations below to understand the different components of the serve specification.

serve.phixtral.yaml
images: (1)
  llm-py310-cu121: (2)
    base: autonomi/nos:latest-py310-cu121 (3)
    pip: (4)
      - bitsandbytes
      - transformers
      - einops
      - accelerate

models: (5)
  mlabonne/phixtral-4x2_8: (6)
    model_cls: StreamingChat (7)
    model_path: models/phixtral_chat.py (8)
    init_kwargs:
      model_name: mlabonne/phixtral-4x2_8
    default_method: chat
    runtime_env: llm-py310-cu121
    deployment: (9)
      resources:
        device: auto
        device_memory: 7Gi (10)
  1. Specifies the custom runtime images that will be used to serve our model.
  2. Specifies the name of the custom runtime image (referenced below in runtime_env).
  3. Specifies the base NOS image to use for the custom runtime image. We provide a few pre-built images on dockerhub.
  4. Specifies the pip dependencies to install in the custom runtime image.
  5. Specifies all the custom models we intend to serve.
  6. Specifies the unique name of the custom model (model identifier).
  7. Specifies the model class to use for the custom model.
  8. Specifies the path to the model class definition.
  9. Specifies the deployment resources needed for the custom model.
  10. Specifies the GPU memory to allocate for the custom model.

2. Test your custom model locally with NOS

In order to start the NOS server locally, we can simply run the following command:

nos serve up -c serve.phixtral.yaml --http

This will build the custom runtime image and start the NOS server locally, exposing an OpenAI-compatible HTTP proxy on port 8000 that lets you chat with your custom LLM endpoint using any OpenAI API compatible client.
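As a quick sanity check before deploying to the cloud, you can point any OpenAI-compatible client at the local endpoint. A minimal (non-streaming) example, assuming the server is running on localhost:8000:

import openai

# The NOS HTTP gateway speaks the OpenAI chat-completions protocol; no API key is required.
client = openai.OpenAI(api_key="no-key-required", base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
    model="mlabonne/phixtral-4x2_8",
    messages=[{"role": "user", "content": "Tell me a joke"}],
)
print(response.choices[0].message.content)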

3. Deploy your NOS service with SkyPilot

Now that we have defined our serve YAML specification, let's deploy this service on GCP using SkyPilot. In this example, we're going to use SkyPilot's sky serve command to deploy our NOS service on spot (pre-emptible) instances on GCP.

Deploy on any cloud provider (AWS, Azure, GCP, OCI, Lambda Labs, etc.)

SkyPilot supports deploying NOS services on any cloud provider. In this example, we're going to use GCP, but you can easily deploy on AWS, Azure, or any other cloud provider of your choice. You can override gcp by providing the --cloud flag to sky serve up.

Let's define a serving configuration in a service-phixtral.sky.yaml file. This YAML specification will be used by SkyPilot to deploy and manage our NOS service on pre-emptible instances, automatically provisioning new instances and recovering from failovers whenever spot instances are pre-empted.

service-phixtral.sky.yaml
name: service-phixtral

file_mounts:
  /app: ./app (1)

resources:
  cloud: gcp
  accelerators: L4:1
  use_spot: True (2)
  ports:
    - 8000

service:
  readiness_probe: (3)
    path: /v1/health 
  replicas: 2 (4)

setup: |
  sudo apt-get install -y docker-compose-plugin
  pip install torch-nos

run: |
  cd /app && nos serve up -c serve.phixtral.yaml --http (5)
  1. Set up file mounts for the local ./app directory so that the serve.phixtral.yaml and models/ directories are available on the remote instance.
  2. Use spot (pre-emptible) instances instead of on-demand instances.
  3. Define the readiness probe path for the service. This allows the SkyPilot controller to check the health of the service and recover from failures if needed.
  4. Define the number of replicas to deploy.
  5. Define the run command to execute on each replica. In this case, we're simply starting the NOS server with the phixtral model deployed on init.

To deploy our NOS service, we can simply run the following command:

sky serve up -n service-phixtral service-phixtral.sky.yaml

SkyPilot will automatically pick the cheapest region and zone to deploy our service, and provision the necessary cloud resources to deploy the NOS server. In this case, you'll notice that SkyPilot provisioned two NVIDIA L4 GPU instances on GCP in the us-central1-a availability zone.

You should see the following output:

(nos-infra-py38) deployments/deploy-llms-with-skypilot $ sky serve up -n service-phixtral service-phixtral.sky.yaml
Service from YAML spec: service-phixtral.sky.yaml
Service Spec:
Readiness probe method:           GET /v1/health
Readiness initial delay seconds:  1200
Replica autoscaling policy:       Fixed 2 replicas
Replica auto restart:             True
Each replica will use the following resources (estimated):
I 01-19 16:01:58 optimizer.py:694] == Optimizer ==
I 01-19 16:01:58 optimizer.py:705] Target: minimizing cost
I 01-19 16:01:58 optimizer.py:717] Estimated cost: $0.2 / hour
I 01-19 16:01:58 optimizer.py:717]
I 01-19 16:01:58 optimizer.py:840] Considered resources (1 node):
I 01-19 16:01:58 optimizer.py:910] ----------------------------------------------------------------------------------------------------
I 01-19 16:01:58 optimizer.py:910]  CLOUD   INSTANCE              vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN
I 01-19 16:01:58 optimizer.py:910] ----------------------------------------------------------------------------------------------------
I 01-19 16:01:58 optimizer.py:910]  GCP     g2-standard-4[Spot]   4       16        L4:1           us-central1-a   0.22          βœ”
I 01-19 16:01:58 optimizer.py:910] ----------------------------------------------------------------------------------------------------
I 01-19 16:01:58 optimizer.py:910]
Launching a new service 'service-phixtral'. Proceed? [Y/n]: y
Launching controller for 'service-phixtral'
...
I 01-19 16:02:14 cloud_vm_ray_backend.py:1912] Launching on GCP us-west1 (us-west1-a)
I 01-19 16:02:30 log_utils.py:45] Head node is up.
I 01-19 16:03:03 cloud_vm_ray_backend.py:1717] Successfully provisioned or found existing VM.
I 01-19 16:03:05 cloud_vm_ray_backend.py:4558] Processing file mounts.
...
I 01-19 16:03:20 cloud_vm_ray_backend.py:3325] Setup completed.
I 01-19 16:03:29 cloud_vm_ray_backend.py:3422] Job submitted with Job ID: 11

Service name: service-phixtral
Endpoint URL: XX.XXX.X.XXX:30001
To see detailed info:           sky serve status service-phixtral [--endpoint]
To teardown the service:        sky serve down service-phixtral

To see logs of a replica:       sky serve logs service-phixtral [REPLICA_ID]
To see logs of load balancer:   sky serve logs --load-balancer service-phixtral
To see logs of controller:      sky serve logs --controller service-phixtral

To monitor replica status:      watch -n10 sky serve status service-phixtral
To send a test request:         curl -L XX.XX.X.XXX:30001

Once the service is deployed, you can get the IP address of the SkyPilot service via:

sky serve status service-phixtral --endpoint

We'll refer to <sky-serve-ip> as the load balancer's IP address; the full endpoint takes the form <sky-serve-ip>:30001. You should now be able to ping the load-balancer endpoint directly with cURL and see the following output:

$ curl -L http://<sky-serve-ip>:30001/v1/health
{"status":"ok"}

πŸ’¬ Chatting with your custom Phixtral service

You're now ready to chat with your hosted, custom LLM endpoint! Here's a quick demo of the mlabonne/phixtral-4x2_8 model served with NOS across 2 spot (pre-emptible) instances.

On the top, you'll see the logs from both serve replicas, with the corresponding chats happening concurrently on the bottom. SkyPilot handles the load-balancing and routing of requests to the replicas, while NOS handles the custom model serving and streaming inference. Below, we show how you can chat with your hosted LLM endpoint using cURL, an OpenAI API compatible client, or the OpenAI Python client.

curl \
-X POST -L http://<sky-serve-ip>:30001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "mlabonne/phixtral-4x2_8",
    "messages": [{"role": "user", "content": "Tell me a joke in 300 words"}],
    "temperature": 0.7, "stream": true
  }'

Below, we show how you can use any OpenAI API compatible client to chat with your hosted LLM endpoint. We will use the popular llm CLI tool from Simon Willison to chat with our hosted LLM endpoint.

# Install the llm CLI tool
$ pip install llm

# Install the llm-nosrun plugin to talk to your service
$ llm install llm-nosrun

# List the models
$ llm models list

# Chat with your endpoint
$ NOSRUN_API_BASE=http://<sky-serve-ip>:30001/v1 llm -m mlabonne/phixtral-4x2_8 "Tell me a joke in 300 words"

Below, we show how you can use the OpenAI Python Client to chat with your hosted LLM endpoint.

import openai

# Create a stream and print the output
client = openai.OpenAI(api_key="no-key-required", base_url="http://<sky-serve-ip>:30001/v1")
stream = client.chat.completions.create(
    model="mlabonne/phixtral-4x2_8",
    messages=[{"role": "user", "content": "Tell me a joke in 300 words"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")

Info

In these examples, we use SkyPilot's load-balancer port 30001 which redirects HTTP traffic to one of the many NOS replicas (on port 8000) in a round-robin fashion. This allows us to scale-out our service to multiple replicas, and load-balance requests across them.

πŸ€‘ What's it going to cost me?

In the example above, we were able to deploy the mlabonne/phixtral-4x2_8 model on a single NVIDIA L4 GPU for $0.22 / hour / replica, or ~$160 / month / replica. This is a ~45x improvement over the cost of deploying the mistralai/Mixtral-8x7B-v0.1 model on 2x NVIDIA A100-80G GPUs, which can cost upwards of $7000 / month (on-demand) on CSPs. As advancements in model compression, quantization and model mixing continue to improve, we expect more users to be able to fine-tune, distill and deploy these expert small-language models on a budget, without sacrificing significantly on performance.

The table below shows the costs of deploying one of these popular MoE LLM models on a single GPU server on GCP. As you can see, the cost of deploying a single model can range from $500 to $7300 / month, depending on the model and, of course, the CSP (kept fixed here).

| Model | Cloud Provider | GPU | VRAM | Spot | Cost / hr | Cost / month |
| --- | --- | --- | --- | --- | --- | --- |
| mistralai/Mixtral-8x7B-v0.1 | GCP | 2x NVIDIA A100-80G | ~94 GB | - | $10.05 | ~$7236 |
| TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ | GCP | NVIDIA A100-40G | ~25 GB | - | $3.67 | ~$2680 |
| mlabonne/phixtral-4x2_8 | GCP | NVIDIA L4 | ~9 GB | - | $0.70 | ~$500 |
| mlabonne/phixtral-4x2_8 | GCP | NVIDIA L4 | ~9 GB | βœ… | $0.22 | ~$160 |

However, the onus is on the developer to figure out the right instance type, spot instance strategy, and the right number of replicas to deploy to ensure that the service is both cost-efficient and performant. In the coming weeks, we're going to be introducing some exciting tools to help developers alleviate this pain and provide transparency to make the right infrastructure decisions for their services. Stay tuned!

🎁 Wrapping up

In this blog post, we showed you how to deploy the mlabonne/phixtral-4x2_8 model on a single NVIDIA L4 GPU for under $160 / month and scale-out a dirt-cheap inference service of your own. We used SkyPilot to deploy and manage our NOS service on spot (pre-emptible) instances, making them especially cost-efficient.

In our next blog post, we'll take it one step further. We'll explore how you can serve multiple models on the same GPU so that your infrastructure costs don't have to scale with the number of models you serve. The TL;DR is that you will soon be able to serve multiple models with fixed and predictable pricing, making model serving more accessible and cost-efficient than ever before.


πŸ“š Getting started with NOS tutorials

We are thrilled to announce a new addition to our resources - the NOS Tutorials! This series of tutorials is designed to empower users with the knowledge and tools needed to leverage NOS for serving models efficiently and effectively. Whether you're a seasoned developer or just starting out, our tutorials offer insights into various aspects of using NOS, making your journey with model serving a breeze.

Over the next few weeks, we'll walk you through the process of using NOS to serve models, from the basics to more advanced topics. We'll also cover how to use NOS in a production environment, ensuring you have all the tools you need to take your projects to the next level. Finally, keep yourself updated on NOS by giving us a 🌟 on Github.

Can't wait? Show me the code!

If you can't wait to get started, head over to our tutorials page on Github to dive right in to the code!

🌟 What’s Inside the NOS Tutorials?

The NOS Tutorials encompass a wide range of topics, each focusing on different facets of model serving. Here's a sneak peek into what you can expect:

1. Serving custom models: 01-serving-custom-models

Dive into the world of custom GPU models with NOS. This tutorial shows you how easy it is to wrap your PyTorch code with NOS, and serve them via a REST / gRPC API.

2. Serving multiple methods: 02-serving-multiple-methods

Learn how to expose several custom methods of a model for serving. This tutorial is perfect for those looking to tailor their model's functionality to specific requirements, enhancing its utility and performance.

3. Serve LLMs with streaming support: 03-llm-streaming-chat

Get hands-on with serving an LLM with streaming support. This tutorial focuses on using TinyLlama/TinyLlama-1.1B-Chat-v0.1, showcasing how to implement streaming capabilities with NOS for smoother, more efficient language model interactions.

4. Serve multiple models on the same GPU: 04-serving-multiple-models

Step up your game by serving multiple models on the same GPU. This tutorial explores the integration of models like TinyLlama/TinyLlama-1.1B-Chat-v0.1 and distil-whisper/distil-small.en, enabling multi-modal applications such as audio transcription combined with summarization on a single GPU.

5. Serving models in production with Docker: 05-serving-with-docker

Enter the realm of production environments with our Docker tutorial. This guide is essential for anyone looking to use NOS in a more structured, scalable environment. You'll learn how to deploy your production NOS images with Docker and Docker Compose, ensuring your model serving works with existing ML infrastructure as reliably as possible.

Stay tuned!

πŸ”— Stay tuned, as we'll continuously update the section with more tutorials and resources to keep you ahead in the ever-evolving world of model serving!

Happy Model Serving!


This blog post is brought to you by the NOS Team - committed to making model serving fast, efficient, and accessible to all!

Introducing NOS Blog!

At Autonomi AI, we build infrastructure tools to make AI fast, easy and affordable. We're in the early development years of the "Linux OS for AI", where the commoditization of open-source models and tools will be critical to the safe and ubiquitous use of AI. Needless to say, it's the most exciting and ambitious infrastructure project our generation is going to witness in the coming decade.

A few weeks back, we open-sourced NOS - a fast and flexible inference server for PyTorch that can run a whole host of open-source AI models (LLMs, Stable Diffusion, CLIP, Whisper, Object Detection, etc.) all under one roof. Today, we're finally excited to launch the NOS blog.

🎯 Why are we building yet another AI inference server?

Most inference API implementations today deeply couple the API framework (FastAPI, Flask) with the modeling backend (PyTorch, TF, etc.) - in other words, they don't let you separate the concerns of the AI backend (e.g. AI hardware, drivers, model compilation, execution runtime, scale out, memory efficiency, async/batched execution, multi-model management) from your AI application (e.g. auth, observability, telemetry, web integrations), especially if you're looking to build a production-ready application.

We’ve made it very easy for developers to host new PyTorch models as APIs and take them to production without having to worry about any of the backend infrastructure concerns. We build on some awesome projects like FastAPI, Ray, Hugging Face, transformers and diffusers.

We've been big believers in multi-modal AI from the very beginning, and you can do all of it with NOS today. Give us a 🌟 on Github if you're stoked -- NOS can run locally on your Linux desktop (with a gaming GPU), on any cloud GPU (NVIDIA L4, A100s, etc.) and even on CPUs. Very soon, we'll support running models on Apple Silicon and custom AI accelerators such as Inferentia2 from Amazon Web Services (AWS).

What's coming?

Over the coming weeks, we’ll be announcing some awesome features that we believe will make the power of large foundation models more accessible, cheaper and easy-to-use than ever before.

πŸ₯œ NOS, in a nutshell

NOS was built from the ground-up, with developers in mind. Here are a few things we think developers care about:

  • πŸ₯· Flexible: Support for OSS models with custom runtimes with pip, conda and cuda/driver dependencies.
  • πŸ”Œ Pluggable: Simple API over a high-performance gRPC or REST API that supports batched requests, and streaming.
  • πŸš€ Scalable: Serve multiple custom models simultaneously on a single or multi-GPU instance, without having to worry about memory management and model scaling.
  • πŸ›οΈ Local: Local execution means that you control your data, and you’re free to build NOS for domains that are more restrictive with data-privacy.
  • ☁️ Cloud-agnostic: Fully containerized means that you can develop, test and deploy NOS locally, on-prem, on any cloud or AI CSP.
  • πŸ“¦ Extensible: Written entirely in Python so it’s easily hackable and extensible with an Apache-2.0 License for commercial use.

Go ahead and check out our playground, and try out some of the more recent models with NOS.