SkyPilot

In this guide we'll show you how you can deploy the NOS inference server using SkyPilot on any of the popular Cloud Service Providers (CSPs) such as AWS, GCP or Azure. We'll use GCP as an example, but the steps are similar for other CSPs.

What is SkyPilot?

"SkyPilot is a framework for running LLMs, AI, and batch jobs on any cloud, offering maximum cost savings, highest GPU availability, and managed execution." (SkyPilot Documentation)

👩‍💻 Prerequisites

You'll first need to install SkyPilot in your virtual or conda environment. Before getting started, we recommend going through their quickstart to familiarize yourself with the SkyPilot tool.

$ pip install "skypilot[gcp]"

If you're installing SkyPilot for use with other cloud providers, you can install the relevant extras, e.g. pip install "skypilot[aws,gcp,azure]".

[OPTIONAL] Configure cloud credentials

Run sky check to verify that your cloud credentials are configured correctly; it also prints per-cloud installation instructions for any missing setup.
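For GCP, the credential setup typically looks like the following. This is a sketch assuming the gcloud CLI is already installed; consult the SkyPilot installation docs for your exact environment:

```shell
# Install the GCP extras for SkyPilot
$ pip install "skypilot[gcp]"

# Authenticate with GCP (requires the gcloud CLI)
$ gcloud init
$ gcloud auth application-default login

# Verify that SkyPilot can use your GCP credentials
$ sky check
```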

📦 Deploying NOS on GCP

1. Define your SkyPilot deployment YAML

First, let's create a sky.yaml YAML file with the following configuration.

# NOS GPU server deployment on T4
# Usage: sky launch -c nos-server sky.yaml

name: nos-server

resources:
  accelerators: T4:1
  cloud: gcp
  ports:
    - 8000
    - 50051

setup: |
  # Setup conda environment
  conda init bash
  conda create -n nos python=3.8 -y
  conda activate nos

  # Install docker compose plugin
  sudo apt-get install -y docker-compose-plugin

  # Install torch-nos
  pip install torch-nos

run: |
  # Activate conda environment
  conda activate nos

  # Run the server (gRPC + HTTP)
  nos serve up --http

Here, we are going to provision a single GPU server on GCP with an NVIDIA T4 GPU and expose ports 8000 (REST) and 50051 (gRPC) for the NOS server.

🚀 2. Launch your NOS server

Now, we can launch our NOS server on GCP with the following command:

$ sky launch -c nos-server sky.yaml

That's it! You should see the following output:

$ sky launch -c nos-server sky.yaml --cloud gcp
Task from YAML spec: sky.yaml
I 01-16 09:41:18 optimizer.py:694] == Optimizer ==
I 01-16 09:41:18 optimizer.py:705] Target: minimizing cost
I 01-16 09:41:18 optimizer.py:717] Estimated cost: $0.6 / hour
I 01-16 09:41:18 optimizer.py:717]
I 01-16 09:41:18 optimizer.py:840] Considered resources (1 node):
I 01-16 09:41:18 optimizer.py:910] ---------------------------------------------------------------------------------------------
I 01-16 09:41:18 optimizer.py:910]  CLOUD   INSTANCE       vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN
I 01-16 09:41:18 optimizer.py:910] ---------------------------------------------------------------------------------------------
I 01-16 09:41:18 optimizer.py:910]  GCP     n1-highmem-4   4       26        T4:1           us-central1-a   0.59          ✔
I 01-16 09:41:18 optimizer.py:910] ---------------------------------------------------------------------------------------------
I 01-16 09:41:18 optimizer.py:910]
Launching a new cluster 'nos-server'. Proceed? [Y/n]: y
I 01-16 09:41:25 cloud_vm_ray_backend.py:4508] Creating a new cluster: 'nos-server' [1x GCP(n1-highmem-4, {'T4': 1}, ports=['8000', '50051'])].
I 01-16 09:41:25 cloud_vm_ray_backend.py:4508] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 01-16 09:41:26 cloud_vm_ray_backend.py:1474] To view detailed progress: tail -n100 -f /home/spillai/sky_logs/sky-2024-01-16-09-41-16-157615/provision.log
I 01-16 09:41:29 cloud_vm_ray_backend.py:1912] Launching on GCP us-central1 (us-central1-a)
I 01-16 09:44:36 log_utils.py:45] Head node is up.
I 01-16 09:45:43 cloud_vm_ray_backend.py:1717] Successfully provisioned or found existing VM.
I 01-16 09:46:00 cloud_vm_ray_backend.py:4558] Processing file mounts.
I 01-16 09:46:00 cloud_vm_ray_backend.py:4590] To view detailed progress: tail -n100 -f ~/sky_logs/sky-2024-01-16-09-41-16-157615/file_mounts.log
I 01-16 09:46:00 backend_utils.py:1459] Syncing (to 1 node): ./app -> ~/.sky/file_mounts/app
I 01-16 09:46:05 cloud_vm_ray_backend.py:3315] Running setup on 1 node.
...
...
...
(nos-server, pid=12112) Status: Downloaded newer image for autonomi/nos:0.1.4-gpu
(nos-server, pid=12112) docker.io/autonomi/nos:0.1.4-gpu
(nos-server, pid=12112) 2024-01-16 17:49:09.415 | INFO     | nos.server:_pull_image:235 - Pulled new server image: autonomi/nos:0.1.4-gpu
(nos-server, pid=12112)  Successfully generated docker-compose file
(nos-server, pid=12112) (filename=docker-compose.sky_workdir.yml).
(nos-server, pid=12112)  Launching docker compose with command: docker compose -f
(nos-server, pid=12112) /home/gcpuser/.nos/tmp/serve/docker-compose.sky_workdir.yml up
(nos-server, pid=12112)  Container serve-nos-server-1  Creating
(nos-server, pid=12112)  Container serve-nos-server-1  Created
(nos-server, pid=12112)  Container serve-nos-http-gateway-1  Creating
(nos-server, pid=12112)  Container serve-nos-http-gateway-1  Created
(nos-server, pid=12112) Attaching to serve-nos-http-gateway-1, serve-nos-server-1
(nos-server, pid=12112) serve-nos-server-1        | Starting server with OMP_NUM_THREADS=4...
(nos-server, pid=12112) serve-nos-http-gateway-1  | WARNING:  Current configuration will not reload as not all conditions are met, please refer to documentation.
(nos-server, pid=12112) serve-nos-server-1        |   InferenceExecutor :: Connected to backend.
(nos-server, pid=12112) serve-nos-server-1        |   Starting gRPC server on [::]:50051
(nos-server, pid=12112) serve-nos-server-1        |   InferenceService :: Deployment complete (elapsed=0.0s)
(nos-server, pid=12112) serve-nos-http-gateway-1  | INFO:     Started server process [1]
(nos-server, pid=12112) serve-nos-http-gateway-1  | INFO:     Waiting for application startup.
(nos-server, pid=12112) serve-nos-http-gateway-1  | INFO:     Application startup complete.
(nos-server, pid=12112) serve-nos-http-gateway-1  | INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

🔋 3. Check the status of your NOS server

You can check the status of your NOS server with the following command:

$ sky status

You should see the following output:

NAME                            LAUNCHED     RESOURCES                                                                  STATUS   AUTOSTOP  COMMAND
nos-server                      1 min ago    1x GCP(n1-highmem-4, {'T4': 1}, ports=[8000, 50051])                       UP       -         sky launch -c nos-server-...

Congratulations! You've successfully deployed your NOS server on GCP. You can now access the NOS server from your local machine at <ip>:8000 or <ip>:50051. In a new terminal, let's check the health of our NOS server with the following command:

$ curl http://$(sky status --ip nos-server):8000/v1/health

You should see the following output:

{"status": "ok"}
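If you'd rather check the health endpoint from Python, here is a minimal sketch. The is_healthy helper is our own name, not part of NOS; the commented-out request mirrors the curl command above:

```python
import json
import urllib.request


def is_healthy(payload: dict) -> bool:
    """Return True if a NOS health response reports an 'ok' status."""
    return payload.get("status") == "ok"


# Against a live server (address from `sky status --ip nos-server`):
# with urllib.request.urlopen(f"http://{address}:8000/v1/health") as resp:
#     assert is_healthy(json.load(resp))

# The healthy payload shown in the output above:
print(is_healthy({"status": "ok"}))    # True
print(is_healthy({"status": "down"}))  # False
```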

💬 4. Chat with your hosted LLM endpoint

You can now chat with your hosted LLM endpoint. Since NOS exposes an OpenAI-compatible API via its /v1/chat/completions route, you can use any OpenAI-compatible client to chat with your hosted LLM endpoint.

curl \
    -X POST http://$(sky status --ip nos-server):8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "messages": [{"role": "user", "content": "Tell me a joke in 300 words"}],
        "temperature": 0.7, "stream": true
      }'

Below, we show how you can use any OpenAI API compatible client to chat with your hosted LLM endpoint. We will use the popular llm CLI tool from Simon Willison to chat with our hosted LLM endpoint.

# Install the llm CLI tool
$ pip install llm

# Install the llm-nosrun plugin to talk to your service
$ llm install llm-nosrun

# List the models
$ llm models list

# Chat with your endpoint
$ NOSRUN_API_BASE=http://$(sky status --ip nos-server):8000/v1 llm -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 "Tell me a joke in 300 words."

Note: You can also change the NOSRUN_API_BASE to http://localhost:8000/v1 to talk to your local NOS server.

Below, we show how you can use the OpenAI Python Client to chat with your hosted LLM endpoint.

import subprocess

import openai


# Get the output of `sky status --ip nos-server` with subprocess
address = subprocess.check_output(["sky", "status", "--ip", "nos-server"]).decode("utf-8").strip()
print(f"Using address: {address}")

# Create a stream and print the output
client = openai.OpenAI(api_key="no-key-required", base_url=f"http://{address}:8000/v1")
stream = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    messages=[{"role": "user", "content": "Tell me a joke in 300 words"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")

On the first call to the server, the model weights are downloaded from Hugging Face, cached locally, and loaded onto the GPU. Subsequent calls avoid this overhead because the model stays pinned in GPU memory.
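You can observe this cold-start overhead yourself by timing the first and a subsequent request. Below is a minimal timing helper (a sketch of our own; the commented-out calls reuse the client from the OpenAI example above):

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Against a live endpoint (client as in the OpenAI example above):
# _, cold = timed(client.chat.completions.create, model="...", messages=[...])
# _, warm = timed(client.chat.completions.create, model="...", messages=[...])
# print(f"cold: {cold:.1f}s  warm: {warm:.1f}s")

# Stand-in demonstration with a local function:
result, elapsed = timed(sum, range(1000))
print(result)          # 499500
print(elapsed >= 0.0)  # True
```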

🛑 5. Stop / Terminate your NOS server

Once you're done using your server, you can stop it with the following command:

$ sky stop nos-server

Alternatively, you can terminate your server with the following command:

$ sky down nos-server

This will terminate the server and all associated resources (e.g. VMs, disks, etc.).
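Tip: instead of stopping the cluster manually, SkyPilot can stop or tear it down automatically after a period of idleness via sky autostop. A sketch (the 10-minute window is an arbitrary choice):

```shell
# Stop the cluster automatically after 10 idle minutes
$ sky autostop nos-server -i 10

# Or tear it down entirely after 10 idle minutes
$ sky autostop nos-server -i 10 --down
```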